Improved XMLn’t

Volume 9, Issue 21; 06 Oct 2025

XMLn’ter? I decided I could support marked sections identified with entity references.

XMLn’t is my “XML parser” built directly from the XML 1.0 fifth edition grammar. It’s non-conforming but it has two features you won’t get in a conforming parser: it doesn’t do attribute value normalization and it can preserve entity references without expanding them. It was fun to write, might be useful, and reminded me of some things about XML and grammars and parsing.

I have previously observed that there’s an ambiguity in the published grammar for XML and that there’s some XML you can’t parse with it. It was necessary to fix the ambiguity, and not too difficult. I left the parsing problem to one side, having concluded that it was a very tiny edge case and too difficult to fix.

And I’m sure I was right, about the tiny edge case part anyway, but the small part of my brain that wants to get it right, even for the edge cases, was unhappy.

Recall that the problem is that this is allowed in a DTD. (In SGML, you could put this in documents as well, where it was sometimes useful because an ignored section is a comment that can be nested. But it’s not allowed in XML documents, only DTDs.)

<!ENTITY % draft 'INCLUDE' >
<!ENTITY % final 'IGNORE' >

<![%draft;[
<!ELEMENT book (comments*, title, body, supplements?)>
]]>
<![%final;[
<!ELEMENT book (title, body, supplements?)>
]]>

But the grammar for XML says:

[61] conditionalSect ::= includeSect | ignoreSect
[62] includeSect
         ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'
[63] ignoreSect
         ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'

In the grammar, the literal string INCLUDE or IGNORE is required. The string <![%draft;[ simply doesn’t parse.
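
To make the failure concrete, here is a minimal sketch that renders just the opening of productions [62] and [63] as a regular expression. The names `SECTION_START` and `section_keyword` are invented for illustration; this is not how XMLn’t is implemented, only a small demonstration that a grammar requiring the literal keyword cannot accept a parameter entity reference in that position.

```python
import re

# Hypothetical rendering of the start of productions [62] and [63]:
#   '<![' S? ('INCLUDE' | 'IGNORE') S? '['
# where S is XML whitespace (space, tab, CR, LF).
S = r"[ \t\r\n]+"
SECTION_START = re.compile(rf"<!\[(?:{S})?(INCLUDE|IGNORE)(?:{S})?\[")


def section_keyword(text):
    """Return 'INCLUDE' or 'IGNORE' if text opens a conditional section
    according to the literal grammar, otherwise None."""
    match = SECTION_START.match(text)
    return match.group(1) if match else None


print(section_keyword("<![INCLUDE["))   # keyword is a literal: accepted
print(section_keyword("<![ IGNORE ["))  # optional whitespace: accepted
print(section_keyword("<![%draft;["))   # entity reference: not accepted
```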

If you’re writing a real parser, this isn’t all that problematic. It must be the case that you’ve already seen the declaration for %draft; (or that’s a different error), so you know what its value is and you can do the right thing.

The XMLn’t parser is just running a grammar parser. The grammar parser knows from nothing about what entities have been declared.

But in point of fact, I have a couple of different grammars and I do a sort of piecemeal parse of the input. It occurred to me that I could approach it this way:

1. Extend the grammar so that a marked section with a parameter entity boundary is treated exactly like an IGNORE marked section for the purpose of grammar parsing. This means that it’s properly nested with other marked sections, but makes no effort to recognize other structures in the section.
2. When my bit of code that’s recording the output from the parser sees the end of a parameter entity declaration, it already records the value of that entity.
3. When it sees a completed marked section, if that section was marked with a parameter entity, and the entity value is IGNORE, just throw it away. But if that entity value is INCLUDE, strip off the start and end section marks and reparse the content. (If the entity value is something else, that’s an error.)
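
The three steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not XMLn’t’s actual code: it uses regular expressions and string scanning instead of a grammar parser, and the names `ENTITY_DECL`, `PE_SECTION_START`, `find_section_end`, and `resolve_sections` are all hypothetical. It also ignores comments, literals that happen to contain `]]>`, and literal INCLUDE/IGNORE sections, all of which a real grammar would handle.

```python
import re

# Assumption: parameter entity declarations with a quoted literal value only.
ENTITY_DECL = re.compile(r"<!ENTITY\s+%\s+([\w.-]+)\s+(['\"])(.*?)\2\s*>")
# A marked section whose keyword is a parameter entity reference.
PE_SECTION_START = re.compile(r"<!\[\s*%([\w.-]+);\s*\[")


def find_section_end(text, start):
    """Step 1: scan the section the way an IGNORE section is scanned,
    counting nested '<![' and ']]>' only. Returns the index just past
    the ']]>' that closes the section whose content begins at start."""
    depth, i = 1, start
    while depth > 0:
        open_at = text.find("<![", i)
        close_at = text.find("]]>", i)
        if close_at == -1:
            raise ValueError("unterminated marked section")
        if open_at != -1 and open_at < close_at:
            depth, i = depth + 1, open_at + 3
        else:
            depth, i = depth - 1, close_at + 3
    return i


def resolve_sections(text, entities=None):
    """Return text with entity-marked sections either dropped or replaced
    by their (recursively resolved) content."""
    entities = {} if entities is None else entities
    out, pos = [], 0
    while True:
        sect = PE_SECTION_START.search(text, pos)
        limit = sect.start() if sect else len(text)
        # Step 2: record entity values seen before the next section.
        # The first declaration of a name wins, as in XML.
        for decl in ENTITY_DECL.finditer(text, pos, limit):
            entities.setdefault(decl.group(1), decl.group(3))
        out.append(text[pos:limit])
        if sect is None:
            return "".join(out)
        end = find_section_end(text, sect.end())
        name = sect.group(1)
        if name not in entities:
            raise ValueError(f"undeclared parameter entity %{name};")
        value = entities[name].strip()
        # Step 3: discard, or strip the section marks and reparse.
        if value == "INCLUDE":
            out.append(resolve_sections(text[sect.end():end - 3], entities))
        elif value != "IGNORE":
            raise ValueError(f"%{name}; is {value!r}, not INCLUDE or IGNORE")
        pos = end
```

Calling `resolve_sections` on the DTD shown earlier keeps the `book` declaration that allows `comments*` and drops the other one. The recursive call plays the role of the second parse, so nested entity-marked sections are rescanned once per level.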

Reparsing has to be done with a second instantiation of the grammar parser. There’s a little bookkeeping to make sure the reparsed content appears to have been where the marked section was, but that’s not too difficult. (In the case of multiply nested marked sections identified with parameter entities, parsing and reparsing may happen more than once. If you came here looking for high performance, I’m afraid you may leave disappointed.)
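
What that bookkeeping involves isn’t spelled out here. One plausible piece of it, offered purely as a guess, is translating positions reported by the second parser back into positions in the original document. In the sketch below, `to_original_position` and its parameters are hypothetical names: `content_start` is assumed to be the offset in the document where the section’s content begins (just after the opening `[`), and `offset` is a position within the reparsed content.

```python
def to_original_position(document, content_start, offset):
    """Map an offset within reparsed section content to a 1-based
    (line, column) pair in the original document."""
    absolute = content_start + offset
    # Lines are counted by the newlines that precede the position.
    line = document.count("\n", 0, absolute) + 1
    # rfind returns -1 when there is no earlier newline, which makes
    # the column 1-based on the first line as well.
    last_newline = document.rfind("\n", 0, absolute)
    column = absolute - last_newline
    return line, column


document = "<!ENTITY % d 'INCLUDE'>\n<![%d;[\n<!ELEMENT x ANY>\n]]>\n"
content_start = document.index("[\n") + 1   # just past the opening '['
content = document[content_start:document.index("]]>")]
where = to_original_position(document, content_start, content.index("<!ELEMENT"))
print(where)
```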

Is this ever going to be used in practice? Probably not. But there’s a test in the test suite so it does get used!

I also recently fixed a bug where the XMLn’t parser wasn’t happy about names in the xml: namespace because there’s no declaration for it. Sorry about that one.