You can’t parse XML DTDs
Another surprising (to me) observation about the XML grammar for XML.
I was thinking some more about parsing DTDs with the specification grammar for XML and I realized that you can’t. I mean, obviously, you can parse XML DTDs, but you can’t do it by using the grammar and off-the-shelf tools to build a parser from the grammar. (I don’t think this is news. And maybe I knew it 30 years ago and have just forgotten. In any event, iXML has rekindled my interest in parsers, so I’m coming at this with a different perspective now than I would have had back then.)
This problem, like the ambiguity, arises in the definition of conditional sections. But this one is insoluble.
Consider the example from the specification:
<!ENTITY % draft 'INCLUDE' >
<!ENTITY % final 'IGNORE' >
<![%draft;[
<!ELEMENT book (comments*, title, body, supplements?)>
]]>
<![%final;[
<!ELEMENT book (title, body, supplements?)>
]]>
It just flat out doesn’t parse according to the grammar because the grammar for conditional sections only allows INCLUDE or IGNORE and we’ve got parameter entity references (%draft; and %final;) in that example.
To be fair to the XML specification, this is called out explicitly:
If the keyword of the conditional section is a parameter-entity reference, the parameter entity MUST be replaced by its content before the processor decides whether to include or ignore the conditional section.
That means the parser has to keep track of the parameter entity declarations it’s seen and replace references to them in the keyword of a conditional section before attempting to parse that section. At that point, you’re so far outside the scope of what an off-the-shelf grammar-based parser can do that you might just as well special-case the whole task of parsing conditional sections.
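To make that bookkeeping concrete, here is a rough sketch in Python of the kind of pre-pass a processor might run before handing the DTD to a grammar-based parser. The function name resolve_conditional_keywords is hypothetical, and the sketch assumes only simple, quoted, internal parameter entity declarations. It makes no attempt to handle comments, external entities, or declarations hidden inside ignored sections.

```python
import re

# One pattern, two alternatives, so matches are visited in document order:
#   1. a simple internal parameter entity declaration, e.g.
#        <!ENTITY % draft 'INCLUDE' >
#   2. the opening of a conditional section whose keyword is a
#      parameter entity reference, e.g.  <![%draft;[
TOKEN = re.compile(
    r"""<!ENTITY\s+%\s+(?P<decl>[\w.:-]+)\s+(?P<q>["'])(?P<value>.*?)(?P=q)\s*>"""
    r"""|<!\[\s*%(?P<ref>[\w.:-]+);\s*\[""",
    re.S,
)


def resolve_conditional_keywords(dtd: str) -> str:
    """Replace %name; keywords of conditional sections with their values.

    Hypothetical helper: a sketch under the assumptions stated above,
    not a conforming XML processor.
    """
    entities = {}  # parameter entities seen so far, in document order

    def handle(match):
        if match.group("decl") is not None:
            # The first declaration of an entity is the binding one.
            entities.setdefault(match.group("decl"), match.group("value"))
            return match.group(0)  # leave the declaration untouched
        name = match.group("ref")
        if name not in entities:
            raise ValueError(f"undeclared parameter entity: %{name};")
        return "<![" + entities[name].strip() + "["

    return TOKEN.sub(handle, dtd)
```

Run over the example from the specification, this turns <![%draft;[ into <![INCLUDE[ and <![%final;[ into <![IGNORE[, which is text the grammar can at least recognize. Note that the state it carries (the table of declarations) is exactly the part a context-free grammar has no way to express.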
I expect this is why no one noticed that the grammar was ambiguous.
And note that fixing the grammar so that includeSect and ignoreSect could parse a parameter entity reference in addition to the literal words INCLUDE and IGNORE wouldn’t help. The content model of the section depends on what the parameter entity resolves to!
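A small sketch may show why the keyword matters so much. Once the keyword is known, the two kinds of section get completely different treatment: the body of an INCLUDE section is ordinary declarations, while the body of an IGNORE section is skipped, counting only nested <![ and ]]> delimiters. The Python below is an illustration under simplifying assumptions (the keywords are already literal, and ]]> never appears inside a comment or literal in an included section). The function names are hypothetical.

```python
def skip_ignored(dtd: str, i: int) -> int:
    """Return the index just past the ]]> that closes an IGNORE section.

    `i` points just after the section's opening '['. Inside an ignored
    section only the delimiters <![ and ]]> matter, so we count them.
    """
    depth = 1
    while depth > 0:
        open_at = dtd.find("<![", i)
        close_at = dtd.find("]]>", i)
        if close_at == -1:
            raise ValueError("unterminated conditional section")
        if open_at != -1 and open_at < close_at:
            depth += 1          # a nested section opens first
            i = open_at + 3
        else:
            depth -= 1          # a section closes first
            i = close_at + 3
    return i


def apply_conditional_sections(dtd: str) -> str:
    """Unwrap INCLUDE sections and drop IGNORE sections (sketch only)."""
    out = []
    i = 0
    include_depth = 0
    while i < len(dtd):
        if dtd.startswith("<![", i):
            bracket = dtd.index("[", i + 3)   # end of the keyword
            keyword = dtd[i + 3:bracket].strip()
            i = bracket + 1
            if keyword == "INCLUDE":
                include_depth += 1            # body is parsed normally
            elif keyword == "IGNORE":
                i = skip_ignored(dtd, i)      # body is thrown away
            else:
                raise ValueError(f"unresolved keyword: {keyword!r}")
        elif include_depth > 0 and dtd.startswith("]]>", i):
            include_depth -= 1
            i += 3
        else:
            out.append(dtd[i])
            i += 1
    return "".join(out)
```

The branch on keyword is the point: which of the two code paths applies is not known until the parameter entity has been resolved, so no fixed production for the section body can be right for both.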
In SGML, you could put conditional sections in your document. They were sometimes quite useful, especially because, unlike comments, they can be nested. But in XML, they can only occur in the DTD, and I bet they are very nearly unused.
I’m just saying up front, XMLn’t ain’t gonna parse ’em if they’re defined with parameter entities.