Transportable webs in SGML

From:	Edward Keith
Date:	1 Jul 1993

If the purpose of literate programming is to communicate what we have done, then this is a clear failure! Marcus has a WEB that is undecipherable to me, and even if I manage to figure out this problem, that simply postpones the pain until next time. So, the question is, in general, how do we make webs (and other TeX files) transportable?

SGML? It is an ISO standard.

From:	Chris Flatters
Date:	1 Jul 1993

SGML is changing the playing field a little from TeX. SGML is a markup language whereas TeX is a typesetting language. SGML describes the structure of a document (e.g. the following word is an identifier) rather than its appearance (e.g. the following word should appear in italic type). It would certainly make sense to have a WEB system that wove to SGML since we should really be more concerned with the content of a woven document than its appearance. It might be even more useful if it generated HTML (a version of SGML with hypertext extensions used for the World-Wide Web project).

From:	Karl Vogel
Date:	2 Jul 1993

Chris Flatters writes: It would certainly make sense to have a WEB system that wove to SGML since we should really be more concerned with the content of a woven document than its appearance.

This definitely gets my vote. The nicest thing about SGML is the fact that parsers for it are freely available. I can write something to turn valid SGML into Texinfo or Troff a lot more easily than I can write something to manipulate (say) a given style of WEB code. The best thing about SGML is that it can be used to generate several representations of any document. You can have a driver that reads an SGML document and generates output suitable for printing, or ready-to-compile code, or a hypertext representation of your program intended for easy browsing.
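
A minimal sketch of that "one source, several drivers" idea, in Python. Everything named here is hypothetical: the element names (title, para, code), the event list, and the three driver functions are invented for illustration, and a real system would take its input from an SGML parser rather than a hand-written list.

```python
from html import escape

# Hypothetical parsed document: (element, text) pairs, roughly what a parser
# front end might hand to a driver. The element names are made up.
DOCUMENT = [
    ("title", "Sorting"),
    ("para", "We sort the input before printing it."),
    ("code", "items.sort()"),
]


def to_troff(doc):
    """Driver 1: troff requests suitable for printing."""
    out = []
    for element, text in doc:
        if element == "title":
            out.append(".SH " + text)
        elif element == "para":
            out.extend([".PP", text])
        elif element == "code":
            out.extend([".nf", text, ".fi"])
    return "\n".join(out)


def to_code(doc):
    """Driver 2: keep only the program text, as a tangle-like tool would."""
    return "\n".join(text for element, text in doc if element == "code")


def to_html(doc):
    """Driver 3: a hypertext rendering for browsing."""
    tags = {"title": "h1", "para": "p", "code": "pre"}
    return "\n".join(
        "<%s>%s</%s>" % (tags[element], escape(text), tags[element])
        for element, text in doc
    )


if __name__ == "__main__":
    for driver in (to_troff, to_code, to_html):
        print(driver(DOCUMENT), end="\n\n")
```

The point of the sketch is only that the three drivers share one structured input and know nothing about each other.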

It might be even more useful if it generated HTML (a version of SGML with hypertext extensions used for the World-Wide Web project).

I have one minor quibble with this idea. How stable is the HTML document definition? It won't help us if the rules for creating an HTML document are too fluid. We might be better off creating a document definition which is solely intended for literate programming types. This way, we have more control over our own destiny; the HTML folks can do what they like, and it won't hurt us a bit. I am not saying that we couldn't borrow ideas from them; I just don't think we should tie our destiny to a style of document representation that's rooted in information retrieval rather than programming.

From:	Trevor Jenkins
Date:	2 Jul 1993

Back in the days before ISO 8879 (SGML) was published I was a member of the British committee that participated in the work. For the technical report on using SGML I had proposed that an example SGML DTD be written which would allow WEB files to be analyzed. I even volunteered to write it. My fellow committee members didn't see WEB as sophisticated enough---they wanted a DTD for Z instead. Not being a Z expert I dropped the idea. Their interest in Z was such that they didn't do anything either. On reflection I wish that I had persevered with the WEB one.

From:	Jeffrey McArthur
Date:	2 Jul 1993

Just a few comments about SGML. First, there are a lot of misconceptions about SGML. I know, I have had to learn a tremendous amount in the past few weeks since I am now working on a massive SGML job. TeX and SGML go together very nicely. I am currently feeding raw SGML files into TeX and typesetting them. If you are willing to write some macros and play with catcodes you can relatively easily typeset SGML documents with TeX without the need for any pre-processor. The much more serious problem with SGML is that it is almost impossible to edit. Let me give you an idea. This is from a real SGML document that I am printing via TeX:

<INGRDNTS ID="S2" NUMBER="2"><TITLE>HAZARDOUS INGREDIENTS
</TITLE><SUBSECT><TITLE>HAZARDOUS INGREDIENTS</TITLE><TABLE>
<TBLHEAD><TBLBODY><TBLCOLS><TBLCOL HALIGN="Justify"><TBLCOL
 HALIGN="Justify"><TBLCOL HALIGN="Justify"><TBLCOL
 HALIGN="Justify"></TBLCOLS><TBLROW><TBLCELL COLSTART="1"
 HALIGN="Center">Hazardous Components</TBLCELL><TBLCELL
 COLSTART="2" HALIGN="Center">OSHA PEL</TBLCELL><TBLCELL
 COLSTART="3" HALIGN="Center">ACGIH TLV</TBLCELL><TBLCELL
 COLSTART="4" HALIGN="Center">CAS Number</TBLCELL></TBLROW>
</TBLBODY></TBLHEAD><TBLBODY><TBLCOLS><TBLCOL HALIGN="Justify">
<TBLCOL HALIGN="Justify"><TBLCOL HALIGN="Justify"><TBLCOL
 HALIGN="Justify"></TBLCOLS><TBLROW><TBLCELL
 COLSTART="1">Calcium Carbonate</TBLCELL><TBLCELL
 COLSTART="2">3.75 mg/m<SUP>3</SUP> 1.4 mg/m<SUP>3</SUP> resp.
 dust</TBLCELL><TBLCELL
 COLSTART="3">10 mg/m<SUP>3</SUP></TBLCELL><TBLCELL COLSTART="4">1317-65-3
</TBLCELL></TBLROW><TBLROW><TBLCELL COLSTART="1">Pyrophyllite
</TBLCELL><TBLCELL COLSTART="2">50 mppcf</TBLCELL><TBLCELL
 COLSTART="3"></TBLCELL><TBLCELL COLSTART="4">12269-78-2
</TBLCELL></TBLROW><TBLROW><TBLCELL COLSTART="1">Muscovite
 (MICA)</TBLCELL><TBLCELL COLSTART="2">20 mppcf</TBLCELL>
<TBLCELL COLSTART="3"></TBLCELL><TBLCELL COLSTART="4">1318-94-1
</TBLCELL></TBLROW><TBLROW><TBLCELL COLSTART="1">Kaolinite
</TBLCELL><TBLCELL COLSTART="2">50 mppcf</TBLCELL><TBLCELL
 COLSTART="3">0.1 mg/m<SUP>3</SUP></TBLCELL><TBLCELL COLSTART="4">1332-58-7
</TBLCELL></TBLROW><TBLROW><TBLCELL COLSTART="1">Quartz (total)
</TBLCELL><TBLCELL COLSTART="2">30 mg/m<SUP>3</SUP> /(% quartz +2)
</TBLCELL><TBLCELL COLSTART="3">0.1 mg/m<SUP>3</SUP></TBLCELL><TBLCELL
 COLSTART="4">14808-60-7</TBLCELL></TBLROW></TBLBODY></TABLE>
</SUBSECT></INGRDNTS>

This is an extreme example. Part of the problem is that SGML does not normally break the file into separate lines. It is not unusual to run into a 200K SGML file without a single line feed (or carriage return, or cr/lf, or carriage control, all depending on what OS you are running under). This breaks a lot of tools (including TeX). Another thing to realize is that it is possible to have commands for italic, bold, and so on in an SGML file. <ITALIC> is not an uncommon tag. The more verbose <EMPH TYPE="ITALIC"> is also found. Now the first example was a bit nasty because it had a table in it. So here is a second example, from the same document, that is a bit easier to deal with:

 
<OTHER ID="S8" NUMBER="8"><TITLE>GENERAL CONTROL MEASURES
</TITLE><SUBSECT ID="S8-1"><TITLE>Ventilation</TITLE><PARA>None
 other than normal with ordinary use.</PARA></SUBSECT><SUBSECT
 ID="S8-2"><TITLE>Respiratory Protection</TITLE><PARA>None with
 ordinary use. If prolonged exposure, wear a MSHA/NIOSH
 approved dust/pesticide respirator. Avoid breathing dust.
</PARA></SUBSECT><SUBSECT ID="S8-3"><TITLE>Protective Gloves
</TITLE><PARA>None with ordinary use. If handling spill, wear
 impervious gloves.</PARA></SUBSECT><SUBSECT ID="S8-4">
<TITLE>Eye Protection</TITLE><PARA>None with normal use. Wear
 safety glasses with side shields or goggles if eye contact is
 possible.</PARA></SUBSECT><SUBSECT ID="S8-5"><TITLE>Other
 Protective Clothing and Equipment</TITLE><PARA>To avoid contact
 with skin, wear protective apparel during application.</PARA>
</SUBSECT></OTHER>

This is much more code intensive than TeX. This is why SGML tools try to let you edit without the tags (Author/Editor for example). One final thought. SGML can be viewed as a meta-markup-language. You can define your own mark-up scheme in SGML if you are willing to change the character set (allowed in SGML) and so on. I bet, if you were seriously ambitious, you could almost write a DTD for WEB (classic Pascal WEB).

From:	Trevor Jenkins
Date:	3 Jul 1993

Just a few comments about SGML. First, there are a lot of misconceptions about SGML. I know, I have had to learn a tremendous amount in the past few weeks since I am now working on a massive SGML job.

SGML has always been plagued by this problem. During my stint in the international standards work the problem came from the office people (ODA/ODIF, ISO 8613) who felt threatened that SGML was encroaching upon their remit. It doesn't, in that SGML is aimed at high-end publishing whereas ODA/ODIF was aimed at the typical character-cell based office environment (personally I always felt that whatever the ODA crew were adding into their architecture SGML already did, and better).

The much more serious problem with SGML is that it is almost impossible to edit. Let me give you an idea.

That isn't really a problem with the Standard Generalized Markup Language. If it were then one could make the same complaint about programming languages, e.g. C, Pascal or even (dare I say it) Literate Code. The work that Mike Cowlishaw (IBM) did for the OED project demonstrated that an SGML oriented editing system can be created and used with a very complex document.

Part of the problem is caused because SGML does not normally break the file into separate lines.

Again it doesn't need to. Consider the output from TANGLE where there is the same problem. If you automate text generation then you will end up with output that is impossible for human beings to read.

It is not unusual to run into a 200K SGML file without a single line feed (or carriage return, or cr/lf, or carriage control, all depending on what OS you are running under). This breaks a lot of tools (including TeX).

I have received mail and news messages which break my mailer/news-reader. That is not the fault of the message (which by the way didn't exceed the minimum limits imposed by the appropriate RFCs) but rather laziness on the part of the programmer. (I didn't really mean to insult DEK by that comment.)

Another thing to realize is that it is possible to have commands for italic, bold, and so on in an SGML file. <ITALIC> is not an uncommon tag.

Then it is wrong.

The more verbose <EMPH TYPE="ITALIC"> is also found.

This is the purist-style and is indeed what the text of ISO 8879 mandates. Not that it is possible to check for it.

One final thought. SGML can be viewed as a meta-markup-language. You can define your own mark-up scheme in SGML if you are willing to change the character set (allowed in SGML) and so on.

SGML includes both a mechanism for specifying the structure of a document (your meta-markup) and a specification of how a document should be encoded so as to conform to that structure. With public entity texts it is possible to create an SGML document that doesn't include the specific DTD being used, though the DTD must be available when the document is processed.

I bet, if you were seriously ambitious, you could almost write a DTD for WEB (classic Pascal WEB).

I don't think that the task is that "ambitious". Nor do I think that it needs to be limited to classic Pascal WEB. Now I have to go off and prove it, don't I?

From:	Joachim Schrod
Date:	5 Jul 1993

[This is a long response, sorry. I tried to cut it down, but I cannot make it shorter without losing clarity.]

SGML is changing the playing field a little from TeX. SGML is a markup language whereas TeX is a typesetting language. SGML describes the structure of a document (e.g. the following word is an identifier) rather than its appearance (e.g. the following word should appear in italic type).
Jeffrey wrote: Just a few comments about SGML. First, there are a lot of misconceptions about SGML. I know, I have had to learn a tremendous amount in the past few weeks since I am now working on a massive SGML job.

Please, don't take this mail personally -- but it's my impression that you have more to do. Your mail gives a completely wrong presentation of SGML. In addition,

TeX and SGML go together very nicely. I am currently feeding raw SGML files into TeX and typesetting them.

Yes, that can be done for particular document types -- but it's really not the "canonical" way. TeX isn't a programming language that's really suited for this type of task. Let me get a bit more structured, so that you can criticize me, too: First I will give an outline of what SGML is. Then I will attack your statement that SGML documents are unreadable, both in principle and with an example. Finally I will sketch the `usual' connection between SGML and TeX.

WHAT IS SGML?

SGML is an acronym, it stands for Standard Generalized Markup Language. The important word herein to distinguish it from other markup languages is ``Generalized''. We can distinguish four categories of markup:

1. presentational markup - The document's structure is shown by laying out the content on the page/screen. This might seem trivial (introducing spaces and lines) and might be more (itemized lists, etc.) This is what I am doing manually now.
2. procedural markup - The text is interspersed with formatting commands, which explain how the document is to be formatted. (plain) TeX and [nt]roff are typical examples of this.
3. generic or descriptive markup - The document is tagged to show its structure explicitly. The tags are defined externally. Scribe is the canonical example for this category. By its intent, LaTeX belongs here, too. But one still has full access to the procedural facilities of TeX. (This is the greatest strength and the greatest weakness of LaTeX.)
4. generalized markup - The document is described on three levels, which together form a complete document. The first level tells what the input will look like. It describes the character set, what interpretations are associated with characters, how tags are created, etc. I.e., one explains the lexical conventions, how the lexemes that describe the document are built. The second level defines the potential structure of the document. One explains explicitly which structural elements may occur and which relationships and attributes they have to one another (consists-of, is-optional, etc.) I.e., one defines a grammar which explains the document structure. The third level is the document's content, tagged according to the conventions introduced on levels 1 and 2. This part is called the document instance in SGML terminology.
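
The relationship between the second and third level can be sketched in a few lines of Python. This is an illustration under loud assumptions, not SGML: the element names and the CONTENT_MODEL table are invented, the grammar only records which children an element may contain (a real DTD also expresses order, repetition, optionality and attributes), and the first level (lexical conventions) is not modelled at all because the instance is given as an already-parsed tree.

```python
# Level 2 (hypothetical): for each element, the set of elements it may contain.
CONTENT_MODEL = {
    "manpage": {"name", "synopsis", "description"},
    "name": set(),
    "synopsis": {"em"},
    "description": {"em"},
    "em": set(),
}


def validate(node, model=CONTENT_MODEL, path=""):
    """Check a level-3 instance against the level-2 grammar.

    A node is an (element, children) pair; a child is either another node
    or a plain string of character data. Returns a list of error messages.
    """
    name, children = node
    here = path + "/" + name
    if name not in model:
        return [here + ": undeclared element"]
    errors = []
    for child in children:
        if isinstance(child, str):
            continue  # character data is allowed everywhere in this sketch
        if child[0] not in model[name]:
            errors.append("%s: <%s> not allowed here" % (here, child[0]))
        else:
            errors.extend(validate(child, model, here))
    return errors


# Level 3: two document instances, one conforming and one not.
GOOD = ("manpage", [("name", ["sman"]),
                    ("description", ["See ", ("em", ["the tutorial"])])])
BAD = ("manpage", [("name", [("em", ["sman"])])])

if __name__ == "__main__":
    print(validate(GOOD))
    print(validate(BAD))
```

The grammar lives outside the document instance, which is what lets one generic tool check any document type.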

SGML belongs to category 4. Insofar as it contains a language to define markup languages, it is more than "yet another markup language." That's the reason why it's called a meta-language by some people. That's arguable, though I won't subscribe to this view -- we describe full documents after all, and full documents have no `meta'-ness whatsoever.

READABILITY OF DOCUMENT INSTANCES

Jeffrey presented horrible examples of SGML input. The problem here is that one has to distinguish between two ways of working with SGML documents: the "wealthy way" (with appropriate tools) and the "poor man's way" (by hand). If the example concerned the former way it was simply wrong; if it concerned the latter, it was plain bad.

The wealthy way uses context-sensitive editors, embedded in author systems. The author will (should) never see the tagged text. As somebody already noted, it's like not looking at the output of TANGLE. So the readability of the internal [sic!] document representation -- as presented by Jeffrey -- is not of any concern here. The readability of the document as presented by the editor is what matters. And here you don't see that mess, you'll see a nice presentation of your structure, with outlining possibilities and all kinds of things you dream of (querying for the contents of specific elements and similar things).

The poor man's way uses a simple editor; one types in the markup declaration and the document instance. Then he or she will add shortcuts which make typing easier. Tags can be omitted from the document instance; the SGML system will insert them for your convenience. As a real life example, here is a document instance taken (almost verbatim, minus the DTD and some omitted text declaration) from a file here:

<itiman>

<headline>TUD/ITI
<name>sman
<chapter>1
<whatis>convert SGML itiman manual page to nroff man format

<synopsis>
<synline>sman [-esis] <em>file[.sgm]</>
</synopsis>

<description>

<mref>sman</> provides an easy way for converting manual pages in
SGML itiman format to nroff (conversion to TeXinfo is planned).
Either SUN and IBM/HP man package format is created automatically.

Two different kinds of document structures are supported: one for
command man pages that consist of sections like synopsis,
description, options, etc., and one for miscellaneous ones which only
consist of sections with arbitrary names.

For both, the text can contain markups for emphasized text,
description and option lists, verbatim mode, and more. A detailed
description of the markup can be found in the tutorial ``<em>How to
write a SGML itiman Manpage</>''.

</description>

<options>
<optionlist>
    <option>     -esis
    <optiondesc> generate only the intermediate representation as created by
        the sgmls frontend
</optionlist>
</options>

[...]

<seeAlso>
<mref>sgmls(1)</>, <mref>nroff(1)</>
</seeAlso>

</itiman>

To cite Jeffrey: The much more serious problem with SGML is that it is almost impossible to edit. Let me give you an idea.

I find the example above neither unreadable nor impossible to edit. My idea is obviously different from yours -- the readers of this mail should judge for themselves. (If anything, the generated nroff source is much more unreadable... [this is only a partial extract])

.st "sman" 1 "TUD/ITI" \*(]W
.SH NAME
sman \- convert SGML manual page to nroff man format
.SH SYNOPSIS
.B sman [-esis] \fIfile[.sgm]\fR
.br
.SH DESCRIPTION
\fBsman\fR provides an easy way for converting manual pages in
SGML itiman format to nroff (conversion to TeXinfo is planned).
Either SUN and IBM/HP man package format is created automatically.
.PP
Two different kinds of document structures are supported: one for
command man pages that consist of sections like synopsis,
description, options, etc., and one for miscellaneous ones which only
consist of sections with arbitrary names.
.PP
For both, the text can contain markups for emphasized text,
description and option lists, verbatim mode, and more. A detailed
description of the markup can be found in the tutorial ``\fIHow to
write a SGML itiman Manpage\fR''.
.SH OPTIONS
.TP
.B      -esis
generate only the intermediate representation as created by
the sgmls frontend

PROCESSING SGML DOCUMENTS OR THE CONNECTION TO TeX

If you are willing to write some macros and play with catcodes you can relatively easily typeset SGML documents with TeX without the need for any pre-processor.

The pre-processor -- usually called SGML parser -- is exactly the tool which makes SGML so valuable. It delivers a canonical form of the document instance, where all omitted tags and all shortcuts are expanded. It checks the validity of the markup, i.e., one is sure afterwards that the document is correctly tagged. It is then easy to transform this canonical form into valid TeX markup. To program this validation and this `normalizing' in TeX itself is a nightmare. (So much for the statement of Dominique that companies are more concerned about maintenance, modularity, and the ability to support it by other people, than universities.) That it is doable doesn't mean the work is well invested. (But I have been programming in TeX for 11 years now, and know its limitations quite well.)
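
To make `normalizing' concrete, here is a rough Python sketch of the expansion step only. It is not how any real SGML parser works: a real parser derives the rules from the DTD, whereas here they are two hard-coded, hypothetical sets (OMISSIBLE_END and INLINE, loosely named after the itiman example above). Attributes, entities and validation are left out, and stray characters that do not form a tag are silently skipped.

```python
import re

# End tag (name optional, so "</>" matches), start tag, or character data.
TOKEN = re.compile(
    r"</(?P<end>[A-Za-z]\w*)?>|<(?P<start>[A-Za-z]\w*)>|(?P<text>[^<]+)"
)

# Hypothetical rules that a real parser would read from the DTD.
OMISSIBLE_END = {"name", "whatis", "option", "optiondesc"}
INLINE = {"em", "mref"}  # may occur inside any element without closing it


def normalize(source):
    """Expand "</>" and supply omitted end tags, returning canonical markup."""
    out, stack = [], []

    def close_top():
        out.append("</%s>" % stack.pop())

    for m in TOKEN.finditer(source):
        token = m.group(0)
        if m.group("text") is not None:
            out.append(token)
        elif token == "</>":  # short end tag: closes the innermost element
            if not stack:
                raise ValueError("</> with no open element")
            close_top()
        elif token.startswith("</"):
            name = m.group("end")
            # Supply omitted end tags for elements still open inside `name`.
            while stack and stack[-1] != name and stack[-1] in OMISSIBLE_END:
                close_top()
            if not stack or stack[-1] != name:
                raise ValueError("unexpected end tag " + token)
            close_top()
        else:
            name = m.group("start")
            if name not in INLINE:
                # A new structural element ends any element left open.
                while stack and stack[-1] in OMISSIBLE_END:
                    close_top()
            stack.append(name)
            out.append(token)

    while stack and stack[-1] in OMISSIBLE_END:
        close_top()
    if stack:
        raise ValueError("unclosed element <%s>" % stack[-1])
    return "".join(out)


if __name__ == "__main__":
    print(normalize("<options><option>-esis<optiondesc>ESIS only</options>"))
```

Once every element has an explicit start and end tag, mapping elements to TeX macros becomes a simple table lookup, which is the part that is unpleasant to do inside TeX itself.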

If you have enough money and work under UNIX systems -- have a look at the SGML Publisher of Arbortext. (I have no connection to this company except knowing a few people there personally.) It uses an extended version of TeX as the publishing engine for SGML documents. It's a great piece of software for professional document preparation. The table and math support is especially good (where traditionally SGML systems were weak). Author/Editor -- mentioned by Jeffrey already -- is a nice system. DynaText from Electronic Book Technologies is often mentioned as outstanding, though I haven't had the chance to look at it yet. On the freely distributable (poor man's) side: There are some converters available, most notably gf and Format. A large archive of freely distributable material is accessible by anonymous FTP at "ftp.th-darmstadt.de [130.83.55.75] directory pub/text/sgml". In the subdir documentation/ there's also the c.t.sgml FAQ and the famous SGML bibliography of Robin Cover.

From:	Jeffrey McArthur
Date:	7 Jul 1993

Let me respond to a few points Joachim raised.

TeX isn't a programming language that's really suited for this type of tasks.

Actually I find TeX to be extremely well suited for this task. It is much better than any alternative I can find. Let me give you a bit of history about the data I presented. The data was given to us in this fashion. We received over 5000 separate SGML files. Each of them coded this way. Only 10% of the files parsed. This job has been a nightmare. I have had to work for several weeks just trying to get the data to parse and resemble something reasonable. (To give you an idea, as a parting shot the previous contractor replaced all the occurrences of "oc" in all the files with the hex character A1 followed by an uppercase C.)

Your arguments are fine if you are working with rational people who present you rational data. We are not. We received data that has MANY problems, including entire sections miscoded and so on. This is in spite of the fact that the original DTD was very, very sloppy. We have done a lot of work re-writing the DTD so that it has some semblance of order. Let me explain it this way. If you are writing from scratch and using good tools, SGML is a wonderful tool. If you are handed 50 Meg of data which the client claims to be SGML (but only 10% of it parses) and you have to deal with it in whatever way you can, well it is not a pleasant task.

What would you do if you were given 5000 files, of which only 500 parsed, and you had to edit the data? For example, Author/Editor will not read in many of the files because they are coded wrong. You have no choice but to use the ``poor man's way''. You have to look at the coding. You also need a parser which will show you the problems.
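
As an illustration of what "show you the problems" can mean for data that does not parse, here is a small Python sketch. It is not a parser and knows nothing about the DTD: it only checks that tags nest, keeps going after the first error, and reports line numbers. It assumes every element needs an explicit end tag unless it is named in the hypothetical `empty` parameter, and that attribute values contain no angle brackets.

```python
import re

# A start or end tag; attributes are skipped without being checked.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def report_problems(text, empty=frozenset()):
    """Return (line, message) pairs for tags that do not nest properly.

    `empty` names (in upper case) elements that never take an end tag.
    """
    problems, stack = [], []
    for m in TAG.finditer(text):
        line = text.count("\n", 0, m.start()) + 1
        closing, name = m.group(1) == "/", m.group(2).upper()
        if not closing:
            if name not in empty:
                stack.append((name, line))
            continue
        if name not in [open_name for open_name, _ in stack]:
            problems.append(
                (line, "end tag </%s> has no matching start tag" % name))
            continue
        # Recover by closing everything opened after the matching start tag.
        while stack[-1][0] != name:
            unclosed, opened_at = stack.pop()
            problems.append(
                (line, "<%s> opened on line %d is not closed before </%s>"
                 % (unclosed, opened_at, name)))
        stack.pop()
    for name, opened_at in stack:
        problems.append((opened_at, "<%s> is never closed" % name))
    return problems


SAMPLE = "<PARA>one</PARA>\n<SUBSECT><TITLE>Gloves\n</SUBSECT>\n</OTHER>"

if __name__ == "__main__":
    for line, message in report_problems(SAMPLE):
        print("line %d: %s" % (line, message))
```

Because it never gives up at the first mismatch, a checker like this can be run over thousands of files to sort them by how badly they are broken before anyone opens an editor.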

Context-sensitive editors work great if the data parses. Consider a common problem we run into. We need to create an SGML document from a hard copy source. There is no electronic copy of the data. We have two options: scanning and keying. Scanning is fine, but you still have to add all the tags. With keying, both can be done at the same time. Most keying houses can easily add the SGML tags to a document. However, in both cases you will end up with documents that probably will not parse on the first pass. Also with keying you can run into some structure problems, that is, the tags are wrong in such a way that they conflict with the DTD. In that case, there is no option but the "poor man's way". One of the serious problems with all the editing tools for SGML is that they assume that the data is tagged in reasonable accordance with the DTD. This is what you want to end up with, but you may not start with anything resembling that.

One of the serious problems with SGML is that the name has almost become a "buzzword". Not quite as bad as "object-oriented" but it may get there. We have another client who is moving in the direction of SGML. The data is not there yet. They want to spend neither the time nor the money to try to validate the 100 or so Meg of data that they have. But the tags are now consistent with SGML style, and it may be possible to write a DTD for what they have. But we know that the data would not parse. There are too many inconsistencies in the data. It would take a lot of time and effort to get all the data to pass a parser. Eventually they will do that, but not this year, nor next year.

So what do you do if you receive a Pseudo-SGML document? In our case we must deal with it. We try to point out the problems with the document to our clients. We will try to fix them, if we can. So if you live in a world where you only deal with real SGML documents that always parse, you have it easy. I do not. I must deal with documents that don't parse, that are mis-tagged, and have structural errors in them. When I am through with them, they parse.