The XML grammar is ambiguous

Volume 9, Issue 17; 31 Aug 2025

The grammar for XML grammars is infinitely ambiguous. That came as a surprise to me.

In the course of hacking at XMLn’t, I was reminded that the XML grammar for conditional sections is as follows:

[61] conditionalSect
       ::= includeSect | ignoreSect
[62] includeSect        
       ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'
[63] ignoreSect         
       ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'
[64] ignoreSectContents 
       ::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
[65] Ignore             
       ::= Char* - (Char* ('<![' | ']]>') Char*)

Observe that ignoreSect contains 0 or more occurrences of ignoreSectContents and ignoreSectContents matches Ignore followed by 0 or more occurrences of a properly nested section followed by Ignore. And, finally, Ignore matches 0 or more characters.

Given this input:

<![IGNORE[this]]>

At the point where t is the next token in the input, the grammar can decide to match nothing. And then match nothing again. And then again. Ad infinitum.

I translated this bit of the grammar into iXML, just for fun:

conditionalSect
  = ignoreSect, S? .
ignoreSect         
  = -'<![', S?, -'IGNORE', S?, -'[', ignoreSectContents*, -']]>' .
ignoreSectContents 
  = Ignore, (-'<![', ignoreSectContents, -']]>', Ignore)* .
-Ignore
  = Char* .

-S      = [#9|#a|#d|#20] .
-Char   = ~[] .

I’ve elided includeSect because it’s not relevant and the grammar for extSubsetDecl is substantial. I’ve also added some ignore marks and added dummy productions for S and Char. In iXML you can’t (today, at least) express subtraction, so the rule for Ignore doesn’t actually handle nested sections correctly.

Given

<![ IGNORE [this]]>

CoffeePot finds 11,135 possible parses of infinitely many before spitting out:

<conditionalSect xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <ignoreSect>
      <ignoreSectContents>this</ignoreSectContents>
   </ignoreSect>
</conditionalSect>

which is one of the infinitely many correct parses. (In fact all infinitely many of them are identical.) But that only succeeds because CoffeePot won’t traverse a loop indefinitely.

The lovely parsers produced by Gunther Rademacher’s REx parser generator weren’t happy with this situation, so I rewrote the grammar to avoid the ambiguity.

I don’t recall discussion of this back in the XML Core Working Group days, nor do I have access to those archives now. I think it must have gone unnoticed because I think we would have fixed it, but I could be wrong.

It was almost 30 years ago.