compiler theory question

Tom Bradford plug-discuss@lists.PLUG.phoenix.az.us
Wed, 02 May 2001 18:00:48 -0700


Trent Shipley wrote:
> The parser engine reads the DTD.
> It reads the XML document.
> It produces a parse tree based on the DTD and XML inputs.

The actual order is:

- The parser reads in the XML document
- The parser sees a reference to a DTD and tries to resolve it from
wherever it may be.
- As the parser parses into the Document Element, if there is a DTD, it
uses the DTD rules to determine if the nestings and attributes it's
parsing are valid.
- It produces SAX events or a DOM tree based on the validly parsed XML.
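
As a rough illustration of that order, here is a minimal sketch using
Python's standard xml.sax module.  The language, the sample document and
the Recorder/sax_events names are my own assumptions, not anything from
the thread.  Note that the expat-based parser bundled with Python is
non-validating: it reads the internal DTD subset but does not enforce
it, so this only shows the document-driven flow of SAX events.

```python
import xml.sax

# Hypothetical sample document with an internal DTD subset.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE list [
  <!ELEMENT list (item+)>
  <!ELEMENT item (#PCDATA)>
]>
<list><item>one</item><item>two</item></list>
"""


class Recorder(xml.sax.ContentHandler):
    """Collects start/end element events in the order the parser emits them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(data):
    """Parse the bytes in `data` and return the recorded element events."""
    handler = Recorder()
    xml.sax.parseString(data, handler)
    return handler.events


events = sax_events(DOC)
for kind, name in events:
    print(kind, name)
```

The events come out as the document is read, and the DTD only gets a
say along the way; nothing is generated from the DTD up front.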

In the case of a programming language, the language syntax drives the
parsing of the stream.  In the case of XML, the parsing of the document
drives the validation process.  So you can't really compare the two.

The idea behind XML is that it should be very simple to write a parser
to read an XML document.  An XML document can be well-formed, but still
not be valid (if it even requires validation).  The problem with DTDs is
that they can change the canonical value of a Document (the content the
parser actually reports, including expanded entities and defaulted
attributes), which means that they can potentially break systems to a
great degree.

For example, Netscape yanking the RSS 0.9 DTD broke content syndication
for a lot of people.  But this breaks things beyond simply validating
the Documents... even if I decided that the original DTD was gone, and
that I would just not validate it anymore, the parser might still have
to resolve the DTD if my document had any entity references in it that
the DTD defined.  Worse, if I had been creating elements, but was
leaving out attributes that had defaulted values in the DTD, I'd
completely lose that data if the DTD were lost.  
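
A small sketch of both failure modes, using Python's
xml.etree.ElementTree.  Assumptions on my part: the feed/item document,
the "status" attribute and the "co" entity are invented, and an internal
DTD subset stands in for the external DTD so the example is
self-contained.

```python
import xml.etree.ElementTree as ET

# Document as originally published: the DTD supplies a default attribute
# value and defines an entity.
WITH_DTD = """<!DOCTYPE feed [
  <!ATTLIST item status CDATA "active">
  <!ENTITY co "Acme Corp">
]>
<feed><item>&co;</item></feed>"""

# The same body once the DTD is gone.
BODY_ONLY = "<feed><item>&co;</item></feed>"

# The same body without the entity reference, still with no DTD.
BODY_NO_ENTITY = "<feed><item>Acme Corp</item></feed>"

# With the DTD, the parser reports the default and expands the entity.
item = ET.fromstring(WITH_DTD).find("item")
print(item.attrib, item.text)

# Without the DTD, the defaulted attribute silently disappears.
bare = ET.fromstring(BODY_NO_ENTITY).find("item")
print(bare.attrib)

# Without the DTD, the entity reference can no longer be resolved at all.
try:
    ET.fromstring(BODY_ONLY)
    entity_resolved = True
except ET.ParseError as err:
    entity_resolved = False
    print("parse failed:", err)
```

The first loss is silent and the second is a hard parse failure, which
is why just switching off validation doesn't get you out of trouble.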

Another problem area is someone changing the default value of an
attribute.  Default column values in the relational database world can
be changed without much fear of breaking a system because the values are
filled in as they are inserted into the system and are retained from
then on.  In XML, to use a default value, you simply don't assign the
attribute, and the parser will report to you its default value as if it
had actually been in your document.  So if you approached creating a
document like you would approach creating a relational database record,
you'd be SOL when the value assumptions you made at creation time come
back to you completely differently the next time you retrieve the
document.
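
To make that concrete, a minimal sketch in Python (again with an
internal DTD subset standing in for an external one; the order/line
document, the "currency" attribute and the read_currency helper are
hypothetical).  The document body never changes, only the default
declared in the DTD.

```python
import xml.etree.ElementTree as ET

# The stored document body: the "currency" attribute is deliberately left
# off so that the DTD default applies.
BODY = "<order><line sku='A1'/></order>"


def read_currency(default):
    """Parse BODY under a DTD declaring `default` for line/@currency."""
    doc = (
        f'<!DOCTYPE order [<!ATTLIST line currency CDATA "{default}">]>'
        f"{BODY}"
    )
    return ET.fromstring(doc).find("line").get("currency")


at_creation = read_currency("USD")     # DTD as it stood when the document was written
after_dtd_edit = read_currency("EUR")  # same document, DTD default changed later

print(at_creation, after_dtd_edit)
```

Nothing in the stored document records which default was in force when
it was written, so the only safe habit is to write out every attribute
you care about explicitly.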

Issues like this are why projects like Minimal XML have been started,
and why XML Infosets and Canonical XML are so heavily debated.  Schemas
are definitely a better approach than DTDs, but are far more complex,
and still have some of the same failings as DTDs (like defaulted
values).

-- 
Tom Bradford --- The dbXML Project --- http://www.dbxml.org/
We store your XML data a hell of a lot better than /dev/null