ETD Guide/Students/Preparing for conversion to SGML\XML


Section SGML/XML Overview defines SGML and XML.

The concept of Document Type Definitions (DTDs)

A document type definition (DTD), in the sense of XML, defines rules or templates that are used to produce similarly structured documents. A DTD describes the content model of a class of documents. It consists of:

  • An element declaration, which is the main part of a DTD and provides the structural definition. Elements can contain other elements, characters, or nothing. An element declaration defines the name of the element and its logical content (sub-elements). (See [10].) An important part of the element declaration is the content model. It is here that the document architect indicates the order and occurrence of other elements or character data.
  • A notation declaration, which defines a notation for external formats that cannot be coded directly in XML, e.g., graphics (GIF, JPEG), mathematics (TeX, LaTeX), and 3D objects (VRML).
  • An entity declaration, which defines character sets and replacement objects for characters. Anything from a single character upward can be defined as an entity. There are two basic types of entities: general and parameter. Parameter entities are only allowed in declarations and are usually used to make a DTD more readable or to control processing. General entities are used in the document instance, that is, in the documents built upon the DTD.
  • An attribute list declaration, where the attributes and their values for the different element types defined in the element type declarations are listed.
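
The four kinds of declaration listed above can be illustrated with a small sketch. The DTD here is hypothetical (the element names thesis, title, author, and chapter, the lang attribute, and the uni entity are assumptions made for illustration and do not come from any of the DTDs discussed in this guide). It is embedded as an internal subset so that Python's standard-library parser can read it. Note that this parser is non-validating: it reads the declarations and expands the general entity, but it does not enforce the content model. Parameter entities are left out because they cannot be referenced inside declarations within an internal subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-DTD plus a document instance that uses it.
DOCUMENT = """<!DOCTYPE thesis [
  <!-- Element declarations: element name plus content model -->
  <!ELEMENT thesis (title, author+, chapter*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT chapter (#PCDATA)>
  <!-- Attribute list declaration for the thesis element -->
  <!ATTLIST thesis lang CDATA #REQUIRED>
  <!-- Notation declaration for a format that is not XML -->
  <!NOTATION gif SYSTEM "image/gif">
  <!-- General entity: replacement text used in the document instance -->
  <!ENTITY uni "Example University">
]>
<thesis lang="en">
  <title>A Study Submitted to &uni;</title>
  <author>Joe Miller</author>
  <chapter>Introduction</chapter>
</thesis>
"""

# The parser expands &uni; while reading the instance.
root = ET.fromstring(DOCUMENT)
title = root.findtext("title")
lang = root.get("lang")
print(title)
print(lang)
```

Checking the instance against the content model (for example, that title precedes author) would require a validating parser, which the Python standard library does not provide.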

Defining a DTD requires a special declaration syntax that differs from the usual XML syntax, in which a document contains elements enclosed in "tags": a start tag (e.g., <author>) and an end tag (e.g., </author>), producing code like this: <author> Joe Miller </author>. A DTD declaration for the same element, by contrast, looks like this: <!ELEMENT author (#PCDATA)>

DTDs for electronic dissertations used worldwide

Because the currently available authoring systems for XML have not yet won wide recognition, different universities have adopted different strategies for XML documents. Most of these projects were started between 1995 and 1997, at a time when XML was already alive but no tools or standardized DTDs were available. Viewed from today's perspective, those projects illustrate the need to rethink and redesign the approaches in order to arrive at a standard.

All the presented DTDs are built upon similar principles. A classical dissertation (which can be seen as a monograph) consists of three main components: an extensible title page with abstracts, declarations, etc.; the dissertation corpus, which includes text, pictures, audio, video, tables, and so on; and the appendices, which contain data sheets, bibliographies, acknowledgements, and other material.
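
As a rough sketch of this three-part structure, the following builds an empty dissertation skeleton. The element names (thesis, front, body, back, and their children) are hypothetical placeholders chosen for illustration; each of the DTDs listed in this section defines its own names.

```python
import xml.etree.ElementTree as ET

def build_skeleton(title, author):
    """Build a hypothetical three-part dissertation skeleton."""
    thesis = ET.Element("thesis")

    # 1. Title page with abstracts, declarations, etc.
    front = ET.SubElement(thesis, "front")
    ET.SubElement(front, "title").text = title
    ET.SubElement(front, "author").text = author
    ET.SubElement(front, "abstract")

    # 2. Dissertation corpus: text, pictures, tables, and so on.
    body = ET.SubElement(thesis, "body")
    ET.SubElement(body, "chapter")

    # 3. Appendices: bibliography, acknowledgements, data sheets.
    back = ET.SubElement(thesis, "back")
    ET.SubElement(back, "bibliography")
    ET.SubElement(back, "acknowledgements")
    return thesis

skeleton = build_skeleton("Sample Title", "Joe Miller")
print(ET.tostring(skeleton, encoding="unicode"))
```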

The following DTDs are currently in use at different institutions:

  • ETD-ML.DTD: Virginia Polytechnic Institute and State University (Virginia Tech)
  • DiML.DTD: German Dissertationen Online project
  • TDM.DTD: University of Iowa
  • HutPubl.DTD: Technical University Helsinki
  • TEI-Light.DTD: Ann Arbor and Lyon
  • ISOBook.DTD: University of Oslo
  • TEI-based DTD with extensions for natural sciences: Swedish University of Agricultural Sciences Uppsala

All of these document type definitions are so-called author DTDs. This means that they are used primarily to support the authoring and conversion process and do not primarily address document archiving and preservation issues. One may ask why all these different DTDs have prevailed. The main reason is that the scientific orientation of the universities mentioned varies considerably. Lyon, Oslo, and Michigan, which use TEI-Light.DTD, mainly serve students in the arts and humanities. Problems with using TEI.DTD or DocBook.DTD are recognized at universities that support a strong natural science community, such as Berlin, Helsinki, or Uppsala. Often a dissertation is a cumulative work, e.g., in Lyon or Helsinki.

Preparing for Conversion

Converting from word processing forms to SGML or XML requires more planning in advance, different tools, and broader learning about document processing concepts than does working with PDF. In return, the end result is a representation that is easier to preserve, more reusable, and supportive of more powerful and effective schemes for searching and browsing. All of these advantages, however, must be weighed against the facts that fewer people are knowledgeable about these matters, that the tools are often more expensive and less mature, and that the process may be complicated, difficult, and time-consuming. As of 2000, there are tens of thousands of ETDs created by scanning (mostly by UMI, but also at sites like MIT and the National Document Center in Greece), thousands converted from word processors into PDF, and hundreds in SGML or XML, which illustrates the relative effort required of students to prepare ETDs in each of these forms.

Simple word processing emphasizes layout, or what-you-see-is-what-you-get (WYSIWYG) editing. Emphasizing what documents look like is quite distinct from focusing on their logical structure, for which markup schemes are best suited. Shifting from word processing representations to XML requires a different way of thinking and a different approach. The problem is harder than producing HTML by exporting from a word processor: it is not enough to have a document that looks like the original, since the marked-up version itself must be correctly tagged.

Some word processors have been extended to facilitate such an approach. Microsoft produced SGML Author for Word as an add-on package for Word 95, and new versions of WordPerfect can export content according to markup schemes. Eventually, most popular word processors are likely to export to XML. The resulting markup can surround document sections, headings, paragraphs, lists, figures, tables, citations, footnotes, hyperlinks, and other obvious constructs. In addition, regions with the same style can be tagged. Thus, easy conversion from word processing to markup schemes requires choosing a target DTD and then consistently using document objects and styles, so that there is a clear mapping from them to tags.
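
The idea of a clear mapping from styles to tags can be sketched as follows. This is a minimal illustration, not the behavior of any particular word processor or export tool: the style names, the tag names, and the (style, text) input format are all assumptions. Paragraphs whose style has no entry in the mapping are reported, which shows why consistent use of styles matters.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from word-processor style names to target tags.
STYLE_TO_TAG = {
    "Heading 1": "heading",
    "Normal": "p",
    "Caption": "caption",
}

def convert(paragraphs):
    """Convert (style, text) pairs to XML; collect styles with no mapping."""
    section = ET.Element("section")
    unmapped = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            # Keep the text as a plain paragraph, but flag it for review.
            unmapped.append(style)
            tag = "p"
        ET.SubElement(section, tag).text = text
    return section, unmapped

section, unmapped = convert([
    ("Heading 1", "Introduction"),
    ("Normal", "This thesis examines markup."),
    ("My Custom Bold", "Ad hoc formatting with no structural meaning."),
])
print(ET.tostring(section, encoding="unicode"))
print("Styles needing manual review:", unmapped)
```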

Conversion from LaTeX is slightly simpler since the TeX approach involves using formatting commands that can be mapped to tags in XML. However, LaTeX does not require strict nesting of commands, so it may not be clear where to place end-tags. Further, LaTeX users may not consistently use the same sequences to designate changes in structure, making translation more complex. Finally, LaTeX coding of mathematical expressions is very difficult to translate to markup schemes for mathematics, like MathML.
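
A minimal sketch of such a command-to-tag mapping is shown below. It handles only a few commands of the form \command{text} without nested braces, and the tag names are hypothetical. Real LaTeX sources need a proper parser, and mathematics is not handled at all.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from LaTeX commands to XML tags.
COMMAND_TO_TAG = {"section": "heading", "emph": "emph", "textbf": "bold"}

# Matches \command{text} where text contains no further braces.
PATTERN = re.compile(r"\\(section|emph|textbf)\{([^{}]*)\}")

def latex_to_xml(source):
    """Replace simple, non-nested LaTeX commands with XML tags."""
    parts = []
    pos = 0
    for match in PATTERN.finditer(source):
        # Copy the text before the command, escaping XML special characters.
        parts.append(escape(source[pos:match.start()]))
        tag = COMMAND_TO_TAG[match.group(1)]
        parts.append(f"<{tag}>{escape(match.group(2))}</{tag}>")
        pos = match.end()
    parts.append(escape(source[pos:]))
    return "".join(parts)

simple = latex_to_xml(r"An \emph{important} result")
nested = latex_to_xml(r"\emph{\textbf{x}}")
print(simple)
print(nested)
```

In the nested case only the inner command is converted and the outer \emph{...} is left untouched, which is a small instance of the nesting problem described above.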

Because of the inherent complexity of converting from word processing schemes to markup representations, it is necessary to include steps for checking and correcting converted forms. Parsers can ensure syntactic correctness, so detecting syntax problems is often simple. To ensure semantic correctness, however, manual inspection may be required. A further test would involve rendering the marked-up document, for example to a printed or PDF form, and ensuring that the result suitably matches the output of the original word processing version. In any case, human labor is likely to be needed to correct conversion errors, which presupposes that students understand enough about the process and the desired output to accomplish this with facility.
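
The syntactic check can be sketched with Python's standard-library parser. Note the assumption: this parser tests only well-formedness (matching tags, proper nesting), not validity against a DTD, for which a validating parser would be needed. The function name check_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, message) if xml_text parses, else (False, error location)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"
    return True, "well-formed"

good = check_well_formed("<thesis><author>Joe Miller</author></thesis>")
bad = check_well_formed("<thesis><author>Joe Miller</thesis>")
print(good)
print(bad)
```

A document that passes this check may still be tagged wrongly, for example with a heading marked as a paragraph, which is why manual inspection remains part of the process.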






[5] Edward Fox: Networked Digital Library of Theses and Dissertations, Web matters, Aug. 12th, 1999

[6] Website of the standards committee of NDLTD

[8] Tad Lane: Scalable Vector Graphics - Web Graphics with Original-Quality Artwork, in: BITS, November 1999

[9] Neill Kipp: Beyond the Paper Paradigm: XML and the Case for Markup, in: Part II "Guideline for Writing and Designing ETDs", ETD Sourcebook, Weisser, Moxley and Fox editors, 1999

[10] B. Travis, D. Waldt: The SGML Implementation Guide, Springer, Berlin-Heidelberg-New York, 1995

[11] Ed Dumbill: The State of XML, June 16th, 2000
