SGML and software development


From:     Steve Heaney
Date: 20 Jan 1994
A few thoughts that I would appreciate some input on, please. It occurs to me that SGML (the Standard Generalized Markup Language) provides many possibilities for supporting a software development environment. Because SGML provides the means of defining and verifying the content and structure of a "document" it offers, for example, the potential for coupling the documentation produced by the software process (requirements, spec, design docs) with program source and beyond to maintenance manuals.

This could take the form of a literate programming tool such as Web which combines source and documentation into one file. Alternatively it could form the basis of a more comprehensive CASE environment. Once you start thinking about it, many opportunities to exploit SGML occur. How about:

- being better able to track requirements through specification, design and implementation.
- supporting hypertext in your "source code".
- generating or validating interfaces directly from your documentation.
- formal validation of a program against assertions in the documentation.
- defining a high level programming language which can be transformed into a number of target languages.

Of course, none of these suggestions is immediately provided as an intrinsic feature of SGML - but SGML has the potential to provide an extensible, vendor neutral, language independent environment for software development with a few carefully thought out DTD's and the use of some of the software available on the net. Maybe someone has tried some of this already. I would be very interested if anyone has DTD's or DTD fragments for program specification or design, especially if these have been coupled with tools for generating or validating the resultant source code.


From:     C. M. Sperberg-McQueen
Date: 20 Jan 1994
Steve Heaney writes: Maybe someone has tried some of this already. I would be very interested if anyone has DTD's or DTD fragments for program specification or design, especially if these have been coupled with tools for generating or validating the resultant source code.

There is an SGML DTD fragment for the specification language Z, which is part of the Z Interchange Format (contact John.Nicholls@prg.oxford.ac.uk for more information) and there are a few people, myself included, who have thought and sometimes talked about writing a DTD or DTD fragment for use in an SGML-based literate programming system. But I don't know of any finished systems for general programming.

(I do use a finished system, inspired by Web, to document SGML document type declarations. But the next revision will, I think, be much more like nuweb and noweb, and the scrap mechanism will look a lot like what Mark Carroll describes, with the addition of a 'lang' attribute to identify the language of a scrap, so that language-specific routines can reliably be invoked, to handle indexing or pretty-printing or what have you.)
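As a rough illustration of the 'lang' attribute idea, the sketch below dispatches each scrap to a language-specific routine. The element name `scrap`, the handler functions and the registry are all hypothetical; this is not the system described above, and the regular expression is only a stand-in for a real SGML parser.

```python
import re

# Hypothetical scrap syntax (an assumption): <scrap lang="c"> ... </scrap>
SCRAP_RE = re.compile(r'<scrap\s+lang="([^"]+)">(.*?)</scrap>', re.DOTALL)

def index_c(code):
    """Toy indexer for C: collect names that are followed by '('."""
    return sorted(set(re.findall(r'\b([A-Za-z_]\w*)\s*\(', code)))

def index_sgml(code):
    """Toy indexer for DTD text: collect declared element names."""
    return sorted(set(re.findall(r'<!ELEMENT\s+([A-Za-z][\w.-]*)', code)))

# Registry of language-specific routines, keyed by the 'lang' value.
HANDLERS = {"c": index_c, "sgml": index_sgml}

def index_scraps(document):
    """Return (lang, index entries) for each scrap, in document order."""
    results = []
    for lang, code in SCRAP_RE.findall(document):
        handler = HANDLERS.get(lang.lower())
        # Scraps in a language with no registered routine get no entries.
        results.append((lang, handler(code) if handler else []))
    return results
```

Because the routine is chosen from the attribute value rather than guessed from the scrap's content, adding a pretty-printer or indexer for a new language only means registering one more entry.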

It is not necessarily the case, of course, that the entire evolution of a system, from specification to code, should be documented in a single document. But of course it would be handy to do specifications documents, user manuals, and literate programming using the same document processing tools. Those who use TeX or LaTeX for everything can already do this; those of us who prefer SGML still need the tools. The first cut, of course, can be a simple translation from the SGML tag set into the existing web system of one's choice.


From:     Eliot Kimber
Date: 21 Jan 1994
Steve Heaney writes: A few thoughts that I would appreciate some input on, please. It occurs to me that SGML (the Standard Generalized Markup Language) provides many possibilities for supporting a software development environment.... Maybe someone has tried some of this already. I would be very interested if anyone has DTD's or DTD fragments for program specification or design, especially if these have been coupled with tools for generating or validating the resultant source code.

There are several products that are at least potentials for this, including Passage Systems' PassagePRO, MicroStar's Computer Aided Document Engineering (CADE) system, InfoDesign's WorkSmart, and Documentum's integration platform.

Certainly the application of web-type approaches to code and documentation management is a natural for SGML. In fact we did this (sort of) to develop the IBMIDDoc DTD by making the source for the DTD primarily the source for the IBMIDDoc Reference and then creating a process that builds the working DTD from the documentation source (unfortunately, we didn't have the necessary SGML tools in place when we started so right now it's a DCF/Script process, not an SGML process).

When I talk about designing SGML applications for technical information, I present the task as essentially one of relating information objects (the SGML elements) to objects in the data model for the system being documented (all real systems have a data model, whether it's been formalized or not, and since technical documentation is always about real systems (possibly systems of concepts), all technical documentation can be thought of as documenting objects in data models). The IBMIDDoc language, for example, uses and supports this approach explicitly.

Once you get this far, that the information is always mapped to a data model, I go one more step and suggest that the information objects are in fact objects in the data model itself. In other words, the information about the system is part of the system itself. This should be obvious in the case of integrated online information like help or tutorials, but I think it works for all information, regardless of media. If you've gotten this far, I think it follows that in a CASE environment, there is no meaningful distinction between code and information and therefore no distinction (at a high level) between the tools used to create and manage code and information. And since SGML is, by its nature, generally the richest information source form (because of its generality), it makes sense that it would be the primary encoding for all data in the system, because everything else can be derived from it.


From:     Jeffrey McArthur
Date: 21 Jan 1994
The more I deal with SGML the more I hate SGML. I like the concept of SGML. If Charles Goldfarb had taken just a few classes in language design then SGML would be a wonderful idea. The problem I have is that SGML, in its current form, is such an awful hack. The syntax for DTD's is needlessly complex. SGML has the problem of creeping featurism. Consider the discussions in comp.text.sgml about ambiguous DTD's. Any language that allows you to create a "grammar", aka DTD, that is ambiguous has serious problems.

One thing I really hate about SGML is the way end-of-line is treated. Instead of being "white space" its meaning is context dependent. This means any tool that you use on an SGML file must understand the context. Unfortunately, SGML appears to be here to stay. Nothing else seems to match what SGML does. I prefer to work with a small subset of SGML; not the entire hideous beast.

Literate programming is supposed to make programs easier to understand. Can anyone honestly say that the following is easy to understand?

<chunk name="a sample chunk" file="sample.c" index="mainindex">
#include <stdio.h>

<d>main</d>() {
   int <d>i</d>;
   for (i=0;i<10;i++)
      <cr ref="another chunk">
}
</chunk>
Besides being ugly it is not even good ANSI C. There is no reason that < and > should be used for tagging. SGML allows you to use other characters. C/C++ programs generally only use the characters in the range [1-127]. This leaves 128 characters that could be used for tag delimiters. Pick any two and use them instead. Personally, I prefer literate programming tools that understand and can pretty print the language.


From:     Robert Lougher
Date: 21 Jan 1994
There have been quite a few articles posted recently about the use of HTML and SGML in literate programming. It seems as if there is a hell of a lot of interest in using hypertext etc. as a means of augmenting existing literate programming systems. For example, a number of people (sorry I can't remember who) have mentioned using hypertext links to link definitions and usages together. There have also been mentions of the importance of supporting maintenance, as this is the most costly part of the life-cycle. This is where I come in! I have been working on this area towards my PhD and have built a prototype tool to support what can be called a hypertext approach to documenting and browsing software. As many of the issues are related to what's been talked about, I thought people might be interested in it...

First of all, the major interest of our work was documentation support for software maintenance, especially as a means of supporting collaboration between team members. Maintenance programmers spend much of their time understanding other people's code and it is for them that documentation should be produced. However, as anybody who has ever tried to maintain code knows, a lot of code has documentation which is inadequate or non-existent. A major concern was therefore that the tool should support the process of redocumentation during maintenance. There's no reason, however, why the system can't be used during development.

We use the term "rationale" to illustrate the fact that a wide variety of information should be able to be recorded, not just traditional stuff such as what this function is used for. For example, this might be comments from an inspection. A number of papers have been published describing the work: "Group Support for the Recording and Sharing of Maintenance Rationale", Robert Lougher and Tom Rodden, IEE/BCS Software Engineering Journal, November, 1993. "Supporting Long-term Collaboration in Software Maintenance", Robert Lougher and Tom Rodden, presented at the Conference on Organizational Computing Systems (COOCS'93), Nov. 1-5, Milpitas, CA, ACM.

Here's the abstract from the first paper. "This paper presents a system which supports maintenance by allowing members of a maintenance team to easily construct and structure a shared pool of maintenance information. Through the provision of a range of unconstrained documentation facilities, documentation can be constructed incrementally by any member of the maintenance team. The general model adopted exploits hypertext technology to allow documentation to be integrated into the source code, by linking comments to the appropriate source component. Structuring facilities are provided which allow the source code and documentation to be seen at different levels of abstraction, in addition to a comment typing mechanism which allows different types of information to be explicitly represented."

The system is essentially an interactive program editor in which hypertext links are displayed by highlighting. Different comment types are displayed using different styles of highlighting, e.g., foreground/background color, bold etc. Documentation is created by annotation, e.g., the use of a variable might be explained by linking a "Variable Usage" comment to it.

The editor in addition supports a wide range of facilities. These include cross-referencing (variables, functions etc.), folding, whereby sections of code can be grouped and folded away, display filters and graphical overviews of the network. Filtering allows complex queries to be formulated, e.g. show me all the "Bug Reports" created by Fred since last week. The overviews include views of the fold structure and graphs of the hypertext net. The system supports multiple files.
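To make the filtering idea concrete, here is a minimal sketch of such a query over typed annotations. The `Annotation` record and the `select` function are hypothetical names invented for illustration; the posting does not describe how the tool stores or queries its comments.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Annotation:
    kind: str      # comment type, e.g. "Bug Report" (assumed field)
    author: str    # who created the annotation (assumed field)
    created: date  # creation date (assumed field)
    text: str

def select(annotations, kind=None, author=None, since=None):
    """Return the annotations matching every criterion that is not None."""
    return [
        a for a in annotations
        if (kind is None or a.kind == kind)
        and (author is None or a.author == author)
        and (since is None or a.created >= since)
    ]

# "Show me all the Bug Reports created by Fred since last week."
today = date(1994, 1, 21)
notes = [
    Annotation("Bug Report", "Fred", date(1994, 1, 19), "off-by-one in loop"),
    Annotation("Bug Report", "Fred", date(1993, 12, 1), "old report"),
    Annotation("Variable Usage", "Fred", date(1994, 1, 20), "loop counter"),
]
recent = select(notes, kind="Bug Report", author="Fred",
                since=today - timedelta(days=7))
```

Each criterion is optional, so the same function covers simpler queries such as "everything by Fred" or "all Bug Reports".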

Here's the bit relevant to the current discussion. The documentation is stored in the source code files via a markup language. Rather than use an existing standard, I developed my own. As has been mentioned in previous posts, the syntax of SGML is complex. I think one of the reasons literate programming hasn't been as successful as it could have been is that, although the print-out may be pretty, the code is less readable than it was before. This would be even worse with SGML. My markup syntax is simple. It has been designed to alter the appearance of the source code as little as possible. All information is commented out according to the convention of the language being used. This means the source code is still compilable (i.e. no tangling) and it may still be edited in a normal editor.
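The posting does not show the actual markup, so the following is only a sketch of the general technique of hiding annotations inside ordinary comments. The `/*@ "type" text */` convention is invented here for illustration and is not the syntax of the system described.

```python
import re

# Hypothetical convention (an assumption): an annotation is a C comment that
# opens with "/*@", followed by a comment type in double quotes, then text.
ANNOTATION_RE = re.compile(r'/\*@\s*"([^"]+)"\s*(.*?)\s*\*/', re.DOTALL)

def extract_annotations(source):
    """Return (line number, comment type, text) for each annotation."""
    found = []
    for m in ANNOTATION_RE.finditer(source):
        line = source.count("\n", 0, m.start()) + 1
        found.append((line, m.group(1), m.group(2)))
    return found

def strip_annotations(source):
    """Remove annotation comments, leaving the code itself untouched."""
    return ANNOTATION_RE.sub("", source)
```

Since every annotation is a legal comment, the file can go straight to the compiler, and a browsing tool only needs `extract_annotations` to recover the documentation.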

A previous poster mentioned the problem of linking definitions and uses together. I take the same approach and allow the user to indicate which should be linked. This was to keep the system language independent. The markup language's syntax is simple enough however, to allow the source code to be automatically marked up where suitable cross-referencing tools are available (i.e. cxref for C). The papers are also available as technical reports from the department. I am currently writing my thesis so I would welcome any comments.


From:     Andrew Dobrowolski
Date: 22 Jan 1994
Jeffrey McArthur writes: Literate programming is supposed to make programs easier to understand. Can anyone honestly say that the following is easy to understand?
<chunk name="a sample chunk" file="sample.c" index="mainindex">
#include <stdio.h>

<d>main</d>() {
   int <d>i</d>;
   for (i=0;i<10;i++)
      <cr ref="another chunk">
}
</chunk>
Besides being ugly it is not even good ANSI C. There is no reason that < and > should be used for tagging. SGML allows you to use other characters. C/C++ programs generally only use the characters in the range [1-127]. This leaves 128 characters that could be used for tag delimiters. Pick any two and use them instead. Personally, I prefer literate programming tools that understand and can pretty print the language.

I agree that editing SGML with an ASCII editor is not for the faint of heart. To emphasize your point one more time... the less than sign in the "for" loop of the above code fragment would have to be represented by the ASCII character sequence "&lt;". Fortunately with a customizable SGML editor and a good design for the literate programming application the code would not look anything like that. It would look more like an average page from "TeX: The Program". And the less than sign would look like a less than sign.


From:     C. M. Sperberg-McQueen
Date: 22 Jan 1994
Jeffrey McArthur writes (among other things): Consider the discussions in comp.text.sgml about ambiguous DTD's. Any language that allows you to create a "grammar", aka DTD, that is ambiguous has serious problems.

Hmm. Does this mean we can all forget about regular expressions, Backus Normal Form, Generalized Phrase Structure Grammar, and every other grammar formalism known to linguistics or computer science? They all allow the construction of ambiguous grammars, or did last time I read a book on parsing theory.

Unfortunately, SGML appears to be here to stay. Nothing else seems to match what SGML does. I prefer to work with a small subset of SGML; not the entire hideous beast.

Fair enough; many of us who work some or all the time with tools which are not SGML-aware do use subsets. I understand your frustration with some aspects of SGML qua formal language; you even mention one of my pet hobby horses (record-end treatment). But the statement you make with which I agree most strongly is "Nothing else seems to match what SGML does."

Nothing else even comes close. For serious work with textual information, SGML is head, shoulders, chest, belly, and thighs above all of its competitors. But perhaps a discussion of SGML and its strengths and weaknesses belongs on comp.text.sgml, not on comp.programming.literate -- sorry for the digression, literati!


From:     Jeffrey McArthur
Date: 22 Jan 1994
I agree that editing SGML with an ASCII editor is not for the faint of heart. To emphasize your point one more time... the less than sign in the "for" loop of the above code fragment would have to be represented by the ASCII character sequence "&lt;".

Part of my philosophical disagreement is with using < and > for something they were not intended. < and > have a defined meaning in most computer languages: C, C++, Algol, Pascal, Modula-2, Ada, and so on. Forcing the less-than to be represented by "&lt;" is counter-intuitive. This is counter to the whole concept of literate programming. If you show the SGML file to someone who has never seen a literate program they will not be able to understand it at all.

With one of the WEB descendants that is not quite as bad a problem. The WEB programs were designed with formatting computer programs in mind. SGML was designed for tagging text. In conventional text less than and greater than are quite infrequent. In computer programming they are very common. Let me illustrate my point. Consider the following example:

   8chunk name="a sample chunk" file="sample.c" index="mainindex"9
   #include 8stdio.h9

   8d9main8/d9() {
      int 8d9i8/d9;
      for (i=0;i<10;i++)
         8cr ref="another chunk"9
   }
   8/chunk9

This is the same example but using 8 instead of < and 9 instead of >. This code is unreadable. You could, in theory, set up your SGML parser to accept this. 8, and 9 occur relatively infrequently and when they do you could type &eight; and &nine;. Do you see how ludicrous this is?

On another point, most WEB tools are publicly available. They are well documented and easy to use. It is not that difficult to write your own. This is not true of SGML. SGML was designed by a lawyer. And it shows! The more I deal with SGML the more I hate it. It is a good idea gone very, very bad. I think SGML has the same problems as Algol-68. The full standard has everything but the kitchen sink included.

Here is the final telling point. Write out a BNF grammar of the DTD you are using. (Unless you have defined an ambiguous grammar this should be possible.) Now obtain a BNF grammar of your favorite programming language. Compare the size and complexity. In most cases the DTD will have a more complex grammar. That says a lot about the complexity of SGML. (Of course you could do extreme examples like comparing Oberon to something like CALS.)


From:     Mark Carroll
Date: 23 Jan 1994
Jeffrey McArthur writes: Literate programming is supposed to make programs easier to understand. Can anyone honestly say that the following is easy to understand?
<chunk name="a sample chunk" file="sample.c" index="mainindex">
#include <stdio.h>

<d>main</d>() {
   int <d>i</d>;
   for (i=0;i<10;i++)
      <cr ref="another chunk">
}
</chunk>

I rather disagree here... one of the biggest problems that I have found with literate programs is that the source code, in non-formatted form is an incredible mess for a human reader. The reason that I decided to implement the tool is precisely because I don't like reading literate code online. When I view it in pre-formatted form, there's so much extra noise that it's difficult to read. After formatting, I can only view it through a dvi viewer. What this ends up meaning, in my experience, is that you've got a computer system sitting in front of you, and you're completely unable to take advantage of the abilities of that computer to help you understand the code that you're reading. You can print the program on paper, or you can view it on virtual paper. I wanted to be able to experiment with a tool that would allow me to see my program in a useful form online, and to take some advantage from the fact that I was using a computer to view it.

Given that goal, and given that I am not at all sure that this is really going to be advantageous, I don't want to end up wasting too much time. So I am working with what I have available. Realistically, what I have available to work with, without implementing a complete new hypertext system is either texinfo, or HTML. Given that choice, I did some experimenting, and decided that HTML is preferable.

So I am working within HTML syntax. Which is, admittedly rather ugly. But is it really all that much more ugly than a program marked up with TeX? Most of what makes the program above ugly is going to have to be present in some form. For my purposes, I cannot write a language specific tool (because I need to be able to work with Perl, Icon, C, C++, Eiffel, and perhaps Sather). So I need to provide some way of allowing the programmer to provide markup him/herself within the program to generate use/def links. Is the following really so much less ugly?

<<A Sample Chunk>>=
#include <stdio.h>

[[main]]() {
   int <d>i</d>;
   for (i=0;i<10;i++)
      <<another chunk>>
}
@
Personally, I don't really think so. Especially since, using abbreviations (and a minor syntactic change that I have made), it could be written:
<chunk name="a sample chunk" file="sample.c" index="mainindex">
#include <stdio.h>

<d/main/() {
   int <d>i</d>;
   for (i=0;i<10;i++)
      <cr/another chunk/
}
</chunk>

Besides being ugly it is not even good ANSI C.

Sorry. I was just trying to throw together a quick example. It wasn't intended to be a paragon of fine programming!

There is no reason that < and > should be used for tagging. SGML allows you to use other characters. C/C++ programs generally only use the characters in the range [1-127]. This leaves 128 characters that could be used for tag delimiters. Pick any two and use them instead.

Well, there is a reason: a lot of the HTML parsers that exist are rather sensitive about syntax. In order to be fairly certain that they aren't going to choke on my markup, I am trying to avoid anything that doesn't commonly occur in HTML documents. I have yet to see any HTML using anything as markup other than <>, so I am sticking with the standard.

Personally, I prefer literate programming tools that understand and can pretty print the language.

I have already explained why I am avoiding language specific features. But there's really no reason why the code from chunks in this HTML syntax can't be piped through a formatter to do prettyprinting and definition marking automatically.
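As a sketch of what pulling the code out of chunks might involve before handing it to a compiler or formatter, the following expands chunk references in the `<chunk>`/`<cr>` markup quoted earlier in this message. It is not the tool under discussion: the function names are hypothetical, the regular expressions stand in for a real parser, and each `<cr ref="...">` is assumed to sit on a line of its own.

```python
import re

CHUNK_RE = re.compile(r'<chunk\s+name="([^"]+)"[^>]*>\n?(.*?)</chunk>', re.DOTALL)
REF_RE = re.compile(r'^([ \t]*)<cr\s+ref="([^"]+)">', re.MULTILINE)
DEF_RE = re.compile(r'</?d>')

def read_chunks(document):
    """Map chunk name -> body, with the <d>...</d> definition tags removed."""
    return {name: DEF_RE.sub("", body)
            for name, body in CHUNK_RE.findall(document)}

def tangle(chunks, name, seen=()):
    """Expand <cr ref="..."> references recursively into plain source text."""
    if name in seen:
        raise ValueError("circular chunk reference: " + name)

    def expand(match):
        indent, ref = match.groups()
        body = tangle(chunks, ref, seen + (name,)).rstrip("\n")
        # Re-indent the referenced chunk to the position of the reference.
        return "\n".join(indent + line for line in body.splitlines())

    return REF_RE.sub(expand, chunks[name])
```

The text returned by `tangle` is ordinary source code, so it could be written to the file named in the chunk's `file` attribute or piped through any language-specific pretty-printer.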


From:     C. M. Sperberg-McQueen
Date: 23 Jan 1994
If it had just been one posting, I would have ignored it, but two postings claiming that in an SGML-based literate programming system the line
      for (i=0;i<10;i++)
would have to be represented
      for (i=0;i&lt;10;i++)
are too many for me. Even in the reference concrete syntax of SGML, the less-than sign in this line will not be mis-recognized as a start-tag delimiter, because it is not followed by a letter. If a variable name is substituted, the line can still be rewritten without resorting to an entity reference:
      for (i = startpoint; i < endpoint; i++)
will cause no problems for any conforming parser. Some people will say white space, properly used, actually makes it more legible. If you don't like SGML for literate programming, don't use it. But before you pontificate on its syntax in public, perhaps you could be persuaded to learn a bit more about that syntax. The lawyer is several steps ahead of you.
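The recognition rule described here can be approximated in a few lines. This is a deliberately simplified sketch, not a conforming parser: it only asks whether a "<" is followed by something that could begin a tag, a declaration or a processing instruction, and it errs on the side of flagging too much. The function name is hypothetical.

```python
import re

# Simplified approximation (an assumption, not the full rule set): "<" is
# treated as markup only when followed by a letter, "/" plus a letter,
# "!", "?", ">" or "/>".
MARKUP_OPEN_RE = re.compile(r'<(?=[A-Za-z]|/[A-Za-z]|[!?>]|/>)')

def risky_less_thans(line):
    """Return the offsets at which '<' would be read as a markup delimiter."""
    return [m.start() for m in MARKUP_OPEN_RE.finditer(line)]

print(risky_less_thans("for (i=0;i<10;i++)"))                       # digit follows
print(risky_less_thans("for (i = startpoint; i < endpoint; i++)"))  # space follows
print(risky_less_thans("for (i=0;i<n;i++)"))                        # letter follows
```

The first two calls report nothing, while the third flags the "<" because a letter follows it directly; that is the case the added white space avoids. By the same test, a "<" immediately followed by a letter elsewhere in a chunk, as in an include line, would be flagged too.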


From:     Tony Coates
Date: 24 Jan 1994
There have been a number of posts recently discussing the use of SGML/HTML mark-up techniques in constructing literate programming tools. However, it seems to me that some recent posts for and against are arguing at cross-purposes. Particularly, I refer to the argument that ASCII SGML/HTML sources are difficult to read, at least compared to existing WEB sources. I won't deny that this point may be true; I would question though whether it is relevant.

Virtually all the current WEB tools are batch oriented, taking ASCII sources and producing TeX output, which can later be viewed. Batch-mode operation is common with new types of programs, avoiding the complications of real-time interface handling, but personally I look forward to the day when the literate programming tools will be interactive, actively formatting the code as it is input, much like a modern word processor. Much as I like working with LaTeX for my sources, I don't see a lot being done by way of creating any kind of "real-time" TeX, which would work in other than batch mode. On the other hand, at least with the HTML extensions to SGML, hypertext offers interactive browsing of code, and in principle the marked-up code should be able to be formatted on the fly as required, I believe.

So it seems to me that on that basis, SGML/HTML markups may offer useful advantages in generating a way of coding the literate programs so as to be able to exploit the features of HTML/SGML. Still, it is clear that if users were to be editing ASCII SGML sources, the experience might be off-putting.

However, since SGML sources are meant to be for the machine, and in principle SGML editors which shield the user from such source code are supposed to be possible or even available, is the better question not that of how well the required mark-ups for literate programming could be added using a proper SGML editor, and what this would look like and feel like for the end-user? I have no experience with such editors, but would be interested to hear from someone who does. I sometimes imagine opening up a WYSIWYG editor, filled with documentation and code, and then highlighting a paragraph of code, selecting "Code" from the menu, and then defining the name of this code block, with the editor providing the appropriate formatting. If SGML editors could offer this, I for one would consider that building literate programming systems using SGML mark-ups as the base could be a worthwhile step forward.

Anyway, this is long, but I did feel that too much was made of what SGML sources would look like, when this did not seem to me to be the question.


From:     Eric Skinner
Date: 24 Jan 1994
Steve Heaney writes: It occurs to me that SGML (the Standard Generalized Markup Language) provides many possibilities for supporting a software development environment. Because SGML provides the means of defining and verifying the content and structure of a "document" it offers, for example, the potential for coupling the documentation produced by software process (requirements, spec, design docs) with program source and beyond to maintenance manuals.

[...] Maybe someone has tried some of this already. I would be very interested if anyone has DTD's or DTD fragments for program specification or design, especially if these have been coupled with tools for generating or validating the resultant source code.

Exoterica's OmniMark product contains our commercial SGML parser, the SGML Kernel. The Kernel is built using an approach similar to what you described in your posting. The Kernel is in fact coded in an SGML markup language, combining engineering documentation with source code in a language close to C. The coding language is extended using SGML-defined constructs to provide a notation specifically useful for writing SGML parsers. It's a Web-like approach, but more powerful as code and documentation are more closely integrated. Our OmniMark product is then used to read this SGML document and generate C code, or to generate various types of documentation. OmniMark can also perform a variety of consistency checks.

We also have an internal highly-detailed marked up version of ISO8879 which we use for many purposes, including the generation of our online hypertext version of the standard, part of the "Complete SGML". The markup in this document is rich enough that it allows OmniMark to generate some C code for the Kernel directly from the markup of ISO8879's productions. The advantages we found in this type of development were many, including greater ease of coding, documentation, and faster development of high-quality code.


From:     Glenn Vanderburg
Date: 24 Jan 1994
Jeffrey McArthur writes: Literate programming is supposed to make programs easier to understand. Can anyone honestly say that the following is easy to understand? [Silly example of C code marked up with SGML tags deleted] Personally, I prefer literate programming tools that understand and can pretty print the language.

So do I. Which is why I am very interested in a literate programming tool based on SGML. When I read your posting, I spotted a couple of assumptions you made which are bogus. Here are my contrary views:

1. Most users of decent SGML applications will not need to read the SGML source. I know SGML, and I actually like the choice of `<' and `>' as tag delimiters. But I have no desire to work with the SGML source in a text editor, and I consider any SGML browser/editor which requires that to be crippled. I want a literate programming tool which will permit me to work online with something that resembles the formatted, printed result of the WEAVE process. I think that SGML makes that an easier task.

2. Everything in an SGML document does not have to be marked up with SGML tags for an SGML processor to understand its structure. Learn about SGML's notion of `notations'. C is a perfect example of a data notation which is not defined in terms of SGML. A notation parser which knows C will not need SGML tags to recognize declarations. That information is already present in the C notation, and the C parser can cooperate with the SGML parser, informing it of the structure. If I were designing an SGML literate programming application, your example would be marked up like this:

<chunk name="a sample chunk">
#include <stdio.h>

main() {
   int i;
   for (i=0;i<10;i++)
      <cr ref="another chunk">
}
</chunk>

That's much more reasonable, and as I have already said, users of the application should see this presented in an even nicer way. I can't criticize you too much for making those false assumptions, because SGML applications that really work this way are rare and expensive. And you do make some good points. SGML is feature-laden, and some of the features (like data tags, short references, and some of the other minimization features) are useful primarily to people who will be working directly with the SGML encoding. But I think that soon, applications will be available which hide that from users.


From:     Kayvan Sylvan
Date: 24 Jan 1994
Tony Coates writes: There have been a number of posts recently discussing the use of SGML/HTML mark-up techniques in constructing Literate Programming Tools. However, it seems to me that some recent posts for and against are arguing at cross-purposes. Particularly, I refer to the argument that ASCII SGML/HTML sources are difficult to read, at least compared to existing WEB sources. I won't deny that this point may be true; I would question though whether it is relevant. [...]

Exactly. It seems to me analogous to arguing that one should not write computer programs because the underlying sequences of ones and zeros are hard to understand. The SGML/HTML markup is for use by tools that would hide the ugly details from the users. I program in WEB because I want some protection from the ugly details of the programming language. I would welcome a chance to program in an interactive HTML editor that would protect me from the ugly details of WEB.


From:     Robert Lougher
Date: 25 Jan 1994
Tony Coates writes: There have been a number of posts recently discussing the use of SGML/HTML mark-up techniques in constructing literate programming tools.

Yes, including mine which seemed to fall on deaf ears! I believe I made a number of interesting points which haven't been picked up on. Maybe I didn't make my post as relevant to the current discussion as I could have... If you remember, I mentioned the system I have developed for my PhD which allows programs to be documented and browsed using hypertext (I also put in rather a lot of blurb - I am writing my thesis!).

Virtually all the current WEB tools are batch oriented, taking ASCII sources and producing TeX output, which can later be viewed. Batch-mode operation is common with new types of programs, avoiding the complications of real-time interface handling, but personally I look forward to the day when the literate programming tools will be interactive ... ... with the HTML extensions to SGML, hypertext offers interactive browsing of code, and in principle the marked-up code should be able to be formatted on the fly as required, I believe.

In a nutshell, my system allows a program to be (re-)documented by attaching annotations to the source code using hypertext links. These can be attached to identifiers, program lines and code segments. Code segments are specified using folds. These allow a number of program lines to be grouped together and folded away. Annotations can be made on annotations and more than one link can be made to the same annotation (i.e. a graph structure). Hyper-links can also be used to browse the code, i.e. links between variable/function use and definition. This is all done in a WYSIWYG manner with windows. The code may be edited, folded and annotated interactively, and graphical views can be generated of the hypertext network. Unfortunately, to make the system language independent no formatting of the source code is performed.
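A minimal sketch of the kind of structure described above, not the actual system: anchors (identifier, line or fold) link to annotations, annotations may link to further annotations, and one annotation may be the target of several links, so the whole thing is a graph rather than a tree. The names Annotation, Anchor and reachable are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A note that can itself carry links to further annotations."""
    text: str
    links: list["Annotation"] = field(default_factory=list)


@dataclass
class Anchor:
    """A place in the source: an identifier, a single line, or a fold."""
    kind: str          # "identifier", "line", or "fold" (assumed labels)
    first_line: int
    last_line: int
    links: list[Annotation] = field(default_factory=list)


def reachable(anchor: Anchor) -> list[str]:
    """Collect the text of every annotation reachable from an anchor,
    visiting each one once even when several links lead to it."""
    seen: set[int] = set()
    out: list[str] = []
    stack = list(reversed(anchor.links))
    while stack:
        note = stack.pop()
        if id(note) in seen:
            continue
        seen.add(id(note))
        out.append(note.text)
        stack.extend(reversed(note.links))
    return out


# One annotation shared by a fold, an identifier, and another annotation.
shared = Annotation("i counts loop iterations")
detail = Annotation("loop body is a separate fold", links=[shared])
fold = Anchor("fold", 3, 5, links=[detail, shared])
ident = Anchor("identifier", 2, 2, links=[shared])
```

Because the traversal tracks what it has already visited, shared annotations (and cycles, should they occur) are reported once.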

So it seems to me that on that basis, SGML/HTML markups may offer useful advantages in generating a way of coding the literate programs so as to be able to exploit the features of HTML/SGML. Still, it is clear that if users were to be editing ASCII SGML sources, the experience might be off-putting.

I agree.

However, since SGML sources are meant to be for the machine, and in principle SGML editors which shield the user from such source code are supposed to be possible or even available, is the better question not that of how well the required mark-ups for literate programming could be added using a proper SGML editor, and what this would look like and feel like for the end-user?

When developing my system, most of the time was taken up with the development of the editor. This included issues of how the link anchors should be presented, the traversal of links, the creation of annotations, the support for browsing etc. A MAJOR consideration, however, was the underlying storage. It is far easier to save the code+documentation in some unreadable format (as a word processor does). This, however, locks the source code into the system, as it can only be modified using the editor (which may not always be available).

I took the mark-up approach instead. However, rather than use a standard markup, I developed my own. This was because: a) it was 2.5 years ago b) you can develop a markup which is suited (i.e. readable) to marking up programs. HTML is angled towards standard text (i.e. lists, paragraphs etc.). I believe a markup for programs should alter the appearance of the source code as little as possible (i.e. indentation). It should be possible to edit the program occasionally in a conventional editor using the markup, or in an interactive WYSIWYG manner using the hypertext system. While my markup isn't perfect, it goes a long way to achieving that.


From:     Lee Wittenberg
Date: 25 Jan 1994
Jeffrey McArthur writes: Literate programming is supposed to make programs easier to understand. Can anyone honestly say that the following is easy to understand?
<chunk name="a sample chunk" file="sample.c" index="mainindex">
#include <stdio.h>

<d>main</d>() {
   int <d>i</d>;
   for (i=0;i<10;i++)
      <cr ref="another chunk">
}
</chunk>

I have mentioned before and feel compelled to mention again that the purpose of the web (i.e. markup) is to generate readable versions rather than to be read directly. Tangle generates a file to be read by a compiler, and weave generates a document to be read by humans. Presumably, other filters could generate stuff for other purposes. I should probably stop ranting about this, but I feel it is important.

Besides being ugly it is not even good ANSI C. There is no reason that < and > should be used for tagging. SGML allows you to use other characters. C/C++ programs generally only use the characters in the range [1-127]. This leaves 128 characters that could be used for tag delimiters. Pick any two and use them instead.

Not knowing SGML, I had assumed that < and > were traditionally used (or required) for tags. If Jeffrey is right about SGML allowing other characters, then I agree with him about choosing something that doesn't interfere with programming languages. I believe that's why DEK chose @< and @>.
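A minimal sketch of why the choice of delimiters matters, assuming a toy scanner rather than a real SGML or WEB parser. The function find_refs and its parameters are hypothetical; it simply collects whatever sits between an opening and a closing delimiter on one line.

```python
import re


def find_refs(source: str, open_delim: str = "@<", close_delim: str = "@>") -> list[str]:
    """Return the text found between each opening and closing delimiter."""
    pattern = re.escape(open_delim) + r"(.*?)" + re.escape(close_delim)
    return [m.strip() for m in re.findall(pattern, source)]


# A line of C that uses < and > as operators and also names another chunk.
c_line = "if (a<b && c>d) @<another chunk@>"

# With @< and @> the operators are left alone.
web_style = find_refs(c_line)

# With bare < and > the comparison operators are mistaken for a tag.
angle_style = find_refs(c_line, "<", ">")
```

With the two-character delimiters only the chunk name is found; with bare angle brackets the scanner also picks up "b && c" from the C expression.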

Personally, I prefer literate programming tools that understand and can pretty print the language.

Me too, but I am getting to where the free-form style of the document in noweb is more important to me than pretty-printing. On the other hand, I still dream of a filter (still notes on paper) that will pretty-print in noweb.


From:     Lee Wittenberg
Date: 25 Jan 1994
Robert Lougher writes: In a nutshell, my system allows a program to be (re-)documented by attaching annotations to the source code using hypertext links. These can be attached to identifiers, program lines and code segments. Code segments are specified using folds. These allow a number of program lines to be grouped together and folded away. Annotations can be made on annotations and more than one link can be made to the same annotation (i.e. a graph structure). Hyper-links can also be used to browse the code, i.e. links between variable/function use and definition. This is all done in a WYSIWYG manner with windows. The code may be edited, folded and annotated interactively, and graphical views can be generated of the hypertext network. Unfortunately, to make the system language independent no formatting of the source code is performed.

This all sounds very nice, but can it print out a human-readable document as well? The reason I ask is that I spent much of the last year struggling with visual programming systems where the only way to find out which bit of code was attached to an "object" (the system can best be described as widget-oriented rather than object-oriented, but had delusions of grandeur) was to click on the object with the mouse. Code maintenance became next to impossible, because there was no way, short of clicking on every object, to find out where variables and functions (and objects) were accessed.

We solved the problem by building a "decompiler," that produced a text representation of the complete program (in a home-grown mini-language), but this is not possible in all visual programming systems. The text representation was also useful when we wanted to change all the blue objects to green, or some such (we're also working on a compiler to reverse the process, but the VP system we're using is not very helpful in that regard). In any event, we can use all the lovely Unix text tools (diff, grep, etc.) to find out information about our visual programs.

In any event, may I request that you include in your system some way to a) produce human-readable (typeset in some way) documents from your hypertext base, and b) produce some text version that can be analyzed by the tools we all know and love (so we don't have to go out and buy more tools).


From:     Lee Wittenberg
Date: 25 Jan 1994
Robert Lougher says (among other things I happen to agree with): I think one of the reasons literate programming hasn't been as successful as it could have is that, while the print-out may be pretty, the code is less readable than it was before. This would be even worse with SGML. My markup syntax is simple. It has been designed to alter the appearance of the source code as little as possible. All information is commented out according to the convention of the language being used. This means the source code is still compilable (i.e. no tangling) and it may still be edited in a normal editor.

I disagree with this. I find literate code more readable. Remember that what is traditionally source is designed to be filtered before being read. Through tangle, if the reader is a compiler; through weave, if the reader is human. In addition, I find that even without explanations, code written as code chunks is much easier to digest, particularly when it's someone else's code. I am experimenting with this with my CS1 students (more on this in a later message), and they seem to find untangled webs easier to understand than the complete ("tangled") program, primarily because the chunks remain small and complete in themselves.
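A minimal sketch of the tangle step mentioned above, assuming chunks are already held in a dictionary and that a reference is written on its own line as <<name>> (the noweb convention; other tools use other delimiters). The function tangle and the sample chunks are hypothetical, and no check for circular references is made.

```python
import re

# A line consisting only of optional indentation and a chunk reference.
REF = re.compile(r"^(\s*)<<(.+?)>>\s*$")


def tangle(chunks: dict[str, list[str]], root: str, indent: str = "") -> list[str]:
    """Expand the chunk named root into compiler-ready lines,
    carrying the indentation of each reference into the expansion."""
    out: list[str] = []
    for line in chunks[root]:
        m = REF.match(line)
        if m:
            out.extend(tangle(chunks, m.group(2), indent + m.group(1)))
        else:
            out.append(indent + line)
    return out


chunks = {
    "main": ["int main(void) {", "    <<count to ten>>", "    return 0;", "}"],
    "count to ten": ["for (int i = 0; i < 10; i++)", "    <<use i>>"],
    "use i": ['printf("%d\\n", i);'],
}

program = tangle(chunks, "main")
```

Each chunk stays small and complete in itself, which is the property the students seem to benefit from; the expanded program is produced only for the compiler.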


From:     John Ramsdell
Date: 15 Apr 1994
SGML enthusiasts claim that the logical structure of a document can be captured using SGML in a fashion that allows layout decisions to be made at a later date. In particular, with a carefully designed Document Type Definition (DTD), one could encode literate programs which can be displayed as printed text and hypertext.

For literate programming, one would start with a DTD that encodes technical documents, presumably including facilities for tables and equations. The DTD would be modified so as to allow the inclusion of code scraps and their references. Documents using the literate programming DTD could be translated directly into LaTeX or HTML, or they could be translated into input for your favorite WEB program. For example, for C programs, the target of your translation could be CWEB, FWEB, or Nuweb. Has anyone thought about using SGML for literate programming by producing a DTD tuned just for this purpose?


From:     Tommy McGuire
Date: 15 Apr 1994
John Ramsdell writes: In particular, with a carefully designed Document Type Definition (DTD), one could encode literate programs which can be displayed as printed text and hypertext.

As an aside, this is not as simple as it appears. One of the recent flame-war topics on the SGML mailing list has concerned the problem of getting both printed docs and hypertext from the same source. Doing this apparently requires compromises in at least one of the two products.

For literate programming, one would start with a DTD that encodes technical documents, presumably including facilities for tables and equations. The DTD would be modified so as to allow the inclusion of code scraps and their references. Documents using the literate programming DTD could be translated directly into LaTeX or HTML, or they could be translated into input for your favorite WEB program. For example, for C programs, the target of your translation could be CWEB, FWEB, or Nuweb. Has anyone thought about using SGML for literate programming by producing a DTD tuned just for this purpose?

The QWERTZ DTD and application includes a simplified literate programming system. QWERTZ (also known as <mumble>FORMAT<mumble> at some FTP sites) is nearly a direct translation of LaTeX tags into an SGML DTD. It provides nearly direct translation into LaTeX as well as somewhat indirect translation into troff and HTML. (So I hear. I haven't actually seen the HTML output.)

The literate programming environment involves a "code" tag. The code scraps are concatenated in the order that they appear in the source when producing a machine-usable file and are formatted similarly to nuweb in a LaTeX file. This works pretty well for things that are not structurally decomposed, such as the QWERTZ DTD itself. Also, it wouldn't be too hard to add an attribute or three to the code element for a scrap name, enabling reordering and embedding of scraps. I haven't tried this since I am rather fond of getting correct line numbers along with error messages from compilers, for example, and I haven't figured out a way to do that. Disclaimer: I haven't done much with SGML or HTML. I just like the flame wars on the SGML mailing list.
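A minimal sketch of the concatenation step, with one possible approach to the line-number problem: emit a C preprocessor #line directive before each scrap so that compiler messages point back into the document. This is an assumption about how it could be done, not how QWERTZ does it. The function extract, the file name doc.sgml and the regular expression are hypothetical, and a regular expression is not a real SGML parser.

```python
import re

# Assumes each scrap is written as <code> on its own line, then the code,
# then </code>; the real QWERTZ syntax may differ.
CODE = re.compile(r"<code>\n(.*?)</code>", re.DOTALL)


def extract(doc: str, filename: str) -> str:
    """Concatenate code scraps in document order, prefixing each with a
    #line directive giving its position in the source document."""
    out: list[str] = []
    for m in CODE.finditer(doc):
        # Line number (1-based) of the first line of the scrap body.
        first_line = doc.count("\n", 0, m.start(1)) + 1
        out.append(f'#line {first_line} "{filename}"\n')
        out.append(m.group(1))
    return "".join(out)


doc = (
    "<p>Print a greeting.</p>\n"
    "<code>\n"
    "#include <stdio.h>\n"
    "</code>\n"
    "<p>The entry point.</p>\n"
    "<code>\n"
    'int main(void) { puts("hi"); return 0; }\n'
    "</code>\n"
)

c_source = extract(doc, "doc.sgml")
```

The same idea would carry over to named, reordered scraps: as long as each scrap records where it started in the document, a directive can be emitted wherever the scrap is expanded.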