so

The XML grammar is ambiguous

Volume 9, Issue 17; 31 Aug 2025

The grammar for XML grammars is infinitely ambiguous. That came as a surprise to me.

In the course of hacking at XMLn’t, I was reminded that the XML grammar for conditional sections is as follows:

[61] conditionalSect
       ::= includeSect | ignoreSect
[62] includeSect        
       ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'
[63] ignoreSect         
       ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'
[64] ignoreSectContents 
       ::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
[65] Ignore             
       ::= Char* - (Char* ('<![' | ']]>') Char*)

Observe that ignoreSect contains 0 or more occurrences of ignoreSectContents and ignoreSectContents matches Ignore followed by 0 or more occurrences of a properly nested section followed by Ignore. And, finally, Ignore matches 0 or more characters.

Given this input:

<![IGNORE[this]]>

At the point where t is the next token in the input, the grammar can decide to match nothing. And then match nothing again. And then again. Ad infinitum.

I translated this bit of the grammar into iXML, just for fun:

conditionalSect
  = ignoreSect, S? .
ignoreSect         
  = -'<![', S?, -'IGNORE', S?, -'[', ignoreSectContents*, -']]>' .
ignoreSectContents 
  = Ignore, (-'<![', ignoreSectContents, -']]>', Ignore)* .
-Ignore
  = Char* .

-S      = [#9|#a|#d|#20] .
-Char   = ~[] .

I’ve elided includeSect because it’s not relevant and the grammar for extSubsetDecl is substantial. I’ve also added some ignore marks and added dummy productions for S and Char. In iXML you can’t (today, at least) express subtraction, so the rule for Ignore doesn’t actually handle nested sections correctly.

Given

<![ IGNORE [this]]>

CoffeePot finds 11,135 possible parses of infinitely many before spitting out:

<conditionalSect xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
   <ignoreSect>
      <ignoreSectContents>this</ignoreSectContents>
   </ignoreSect>
</conditionalSect>

which is one of the infinitely many correct parses. (In fact all infinitely many of them are identical.) But that only succeeds because CoffeePot won’t traverse a loop indefinitely.

The lovely parsers produced by Gunther Rademacher’s REx parser generator weren’t happy with this situation, so I rewrote the grammar to avoid the ambiguity.

I don’t recall discussion of this back in the XML Core Working Group days, nor do I have access to those archives now. I think it must have gone unnoticed because I think we would have fixed it, but I could be wrong.

It was almost 30 years ago.

#XML

Please provide your name and email address. Your email address will not be displayed and I won’t spam you, I promise. Your name and a link to your web address, if you provide one, will be displayed.

Your name:

Your email:

Homepage:

Do you comprehend the words on this page? (Please demonstrate that you aren't a mindless, screen-scraping robot.)

What is three times two?   (e.g. six plus two is 8)

Enter your comment in the box below. You may style your comment with the CommonMark flavor of Markdown.

All comments are moderated. I don’t promise to preserve all of your formatting and I reserve the right to remove comments for any reason.