The XML grammar is ambiguous
The grammar for XML grammars is infinitely ambiguous. That came as a surprise to me.
In the course of hacking at XMLn’t, I was reminded that the XML grammar for conditional sections is as follows:
[61] conditionalSect
::= includeSect | ignoreSect
[62] includeSect
::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'
[63] ignoreSect
::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'
[64] ignoreSectContents
::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
[65] Ignore
::= Char* - (Char* ('<![' | ']]>') Char*)
Observe that ignoreSect
contains 0 or more occurrences of ignoreSectContents
and ignoreSectContents
matches Ignore
followed by 0 or more occurrences of a
properly nested section followed by Ignore
. And, finally, Ignore
matches 0
or more characters.
Given this input:
<![IGNORE[this]]>
At the point where t
is the next token in the input, the grammar can decide to
match nothing. And then match nothing again. And then again. Ad infinitum.
I translated this bit of the grammar into iXML, just for fun:
conditionalSect
= ignoreSect, S? .
ignoreSect
= -'<![', S?, -'IGNORE', S?, -'[', ignoreSectContents*, -']]>' .
ignoreSectContents
= Ignore, (-'<![', ignoreSectContents, -']]>', Ignore)* .
-Ignore
= Char* .
-S = [#9|#a|#d|#20] .
-Char = ~[] .
I’ve elided includeSect
because it’s not relevant and the grammar for
extSubsetDecl
is substantial. I’ve also added some ignore marks and added
dummy productions for S
and Char
. In iXML you can’t (today, at least)
express subtraction, so the rule for Ignore
doesn’t actually handle nested
sections correctly.
Given
<![ IGNORE [this]]>
CoffeePot finds 11,135 possible parses of infinitely many before spitting out:
<conditionalSect xmlns:ixml='http://invisiblexml.org/NS' ixml:state='ambiguous'>
<ignoreSect>
<ignoreSectContents>this</ignoreSectContents>
</ignoreSect>
</conditionalSect>
which is one of the infinitely many correct parses. (In fact all infinitely many of them are identical.) But that only succeeds because CoffeePot won’t traverse a loop indefinitely.
The lovely parsers produced by Gunther Rademacher’s REx parser generator weren’t happy with this situation, so I rewrote the grammar to avoid the ambiguity.
I don’t recall discussion of this back in the XML Core Working Group days, nor do I have access to those archives now. I think it must have gone unnoticed because I think we would have fixed it, but I could be wrong.
It was almost 30 years ago.