Is it really ambiguous?
Exploring the intersection of ambiguity and serialization in iXML. A, uh, tagging Tuesday post.
I’ve missed a few markup Mondays. I’ve been…busy. I’ll try to catch up. This post is about ambiguity in iXML and how it appears to interact with serialization. (I say “appears to” because technically, it doesn’t, but we don’t need to be that pedantic for now.)
Consider the following grammar, example1.ixml:
S: A | B .
A: "a" .
B: "a" .
That grammar is ambiguous: it parses the sentence “a” in two different ways. I can ask CoffeePot to show me both of them:
coffeepot -g:example1.ixml a --parse-count:2
for the unsurprising result:
<ixml parses="2" totalParses="2" xmlns:ixml='http://invisiblexml.org/NS'>
<S ixml:state='ambiguous'><A>a</A></S>
<S ixml:state='ambiguous'><B>a</B></S>
</ixml>
If S goes to A goes to “a”, you get one parse; if S goes to B goes to “a”, you get the other.
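To make the two derivations concrete, here’s a minimal sketch in Python that enumerates every parse of a toy model of example1.ixml. This is not how CoffeePot parses anything; the `GRAMMAR` dictionary, the tuple-shaped trees, and the function names are all assumptions made up for illustration, and the naive recursion only works for small grammars without left recursion.

```python
# Toy model of example1.ixml (an assumption for illustration, not CoffeePot's
# data model). Each nonterminal maps to a list of alternatives; any symbol
# that is not a key in GRAMMAR is treated as a literal terminal.
GRAMMAR = {
    "S": [["A"], ["B"]],
    "A": [["a"]],
    "B": [["a"]],
}


def parses(symbol, text):
    """Yield (tree, rest) for every way `symbol` can match a prefix of `text`."""
    if symbol not in GRAMMAR:
        # Terminal: match it literally at the start of the text.
        if text.startswith(symbol):
            yield symbol, text[len(symbol):]
        return
    for alternative in GRAMMAR[symbol]:
        for children, rest in match_sequence(alternative, text):
            yield (symbol, children), rest


def match_sequence(symbols, text):
    """Yield (trees, rest) for every way the symbols can match in order."""
    if not symbols:
        yield [], text
        return
    for tree, rest in parses(symbols[0], text):
        for trees, final in match_sequence(symbols[1:], rest):
            yield [tree] + trees, final


def all_parses(text):
    """Return every tree rooted at S that consumes the whole input."""
    return [tree for tree, rest in parses("S", text) if rest == ""]


for tree in all_parses("a"):
    print(tree)
# ('S', [('A', ['a'])])
# ('S', [('B', ['a'])])
```

The two trees differ only in the middle node, which is exactly the difference between the two `<S>` elements in the output above.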
By default, CoffeePot is deterministic: it will always give you the “A” parse
first and the “B” parse second. If you want a random one, --axe:random is the
option you’ve been missing.
With a small change to the grammar, we’ll get different results.
Consider example2.ixml:
S: A | B .
A > C: "a" .
B > C: "a" .
From a parsing perspective, that’s exactly the same grammar. It parses the same sentences. It has the exact same ambiguity. But if we ask for both parses now, we get something perhaps surprising:
<ixml parses="2" totalParses="2" xmlns:ixml='http://invisiblexml.org/NS'>
<S ixml:state='ambiguous'><C>a</C></S>
<S ixml:state='ambiguous'><C>a</C></S>
</ixml>
They’re the same!
That trick with > is called a renaming, and it changes the name a nonterminal gets in the serialization. Here, both A and B are serialized as C.
If you look at the ambiguity of the results only through the lens of the XML serialization, you can assert that it isn’t ambiguous. (You’re wrong; it is exactly as ambiguous from a parsing perspective, but I understand where you’re coming from.)
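Here’s a small Python sketch of why the two parses collapse. It models a parse tree as a nested tuple and applies the renaming only at serialization time. The `RENAMES` table, the tuple representation, and the `serialize` function are hypothetical names for illustration; this is not NineML’s implementation.

```python
from xml.sax.saxutils import escape

# Serialized names for example2.ixml: A and B are both renamed to C.
# (A hypothetical lookup table; a real processor reads this from the grammar.)
RENAMES = {"A": "C", "B": "C"}


def serialize(tree):
    """Serialize a (name, children) tuple as XML, applying renamings."""
    if isinstance(tree, str):
        return escape(tree)  # terminal text
    name, children = tree
    tag = RENAMES.get(name, name)
    body = "".join(serialize(child) for child in children)
    return f"<{tag}>{body}</{tag}>"


# The two parse trees are still different: one goes through A, one through B.
parse_via_a = ("S", [("A", ["a"])])
parse_via_b = ("S", [("B", ["a"])])

print(parse_via_a != parse_via_b)  # True
print(serialize(parse_via_a))      # <S><C>a</C></S>
print(serialize(parse_via_b))      # <S><C>a</C></S>
```

The trees differ, the strings don’t. The ambiguity lives in the parse; the renaming just hides it on the way out.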
Another way to get this effect is with suppressed names. The example3.ixml
grammar:
S: A .
A: B | C .
B: "a" .
C: "a" .
is ambiguous:
<ixml parses="2" totalParses="2" xmlns:ixml='http://invisiblexml.org/NS'>
<S ixml:state='ambiguous'><A><B>a</B></A></S>
<S ixml:state='ambiguous'><A><C>a</C></A></S>
</ixml>
but if we add “-” marks to suppress the B and C nonterminals in the serialization, we get example4.ixml:
S: A .
A: B | C .
-B: "a" .
-C: "a" .
and once again, the results don’t look ambiguous:
<ixml parses="2" totalParses="2" xmlns:ixml='http://invisiblexml.org/NS'>
<S ixml:state='ambiguous'><A>a</A></S>
<S ixml:state='ambiguous'><A>a</A></S>
</ixml>
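The same thing can be sketched in Python for suppressed names: a suppressed nonterminal contributes its children to the output but no element of its own. As before, the `SUPPRESSED` set, the tuple-shaped trees, and `serialize` are assumptions for illustration, not how NineML is written.

```python
from xml.sax.saxutils import escape

# Nonterminals marked with "-" in example4.ixml (hypothetical lookup set).
SUPPRESSED = {"B", "C"}


def serialize(tree):
    """Serialize a (name, children) tuple, omitting suppressed element names."""
    if isinstance(tree, str):
        return escape(tree)  # terminal text
    name, children = tree
    body = "".join(serialize(child) for child in children)
    if name in SUPPRESSED:
        return body  # keep the content, drop the wrapper element
    return f"<{name}>{body}</{name}>"


parse_via_b = ("S", [("A", [("B", ["a"])])])
parse_via_c = ("S", [("A", [("C", ["a"])])])

print(parse_via_b != parse_via_c)  # True
print(serialize(parse_via_b))      # <S><A>a</A></S>
print(serialize(parse_via_c))      # <S><A>a</A></S>
```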
Starting in NineML version 3.3.10, you can use the --ignore-same-serializations option on CoffeePot to
ignore ambiguity if it isn’t visible in the serialization. For example:
coffeepot -g:example4.ixml a --ignore-same-serializations
produces:
<S><A>a</A></S>
I hasten to add that this behavior doesn’t conform to the specification. That grammar is ambiguous.
The algorithm for detecting “the same serialization” is conservative. It recognizes a couple of patterns in the output that guarantee the serializations will be the same. There are lots of more complex ways to arrive at the same place, but NineML won’t detect those.
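For contrast, the brute-force version of the question is easy to state: serialize every parse and see whether more than one distinct string comes out. The Python sketch below does exactly that on already-serialized strings. To be clear, this is not NineML’s algorithm, which looks for patterns instead; both function names are made up for illustration, and this approach falls over as soon as a grammar has a huge or infinite number of parses.

```python
def distinct_serializations(serializations):
    """Return the unique serialized strings, in first-seen order."""
    return list(dict.fromkeys(serializations))


def visibly_ambiguous(serializations):
    """True if the parses produce more than one distinct serialization."""
    return len(distinct_serializations(serializations)) > 1


# The two parses from example3.ixml differ in the output...
example3 = ["<S><A><B>a</B></A></S>", "<S><A><C>a</C></A></S>"]
# ...while the two parses from example4.ixml serialize identically.
example4 = ["<S><A>a</A></S>", "<S><A>a</A></S>"]

print(visibly_ambiguous(example3))  # True
print(visibly_ambiguous(example4))  # False
```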
If you have a grammar that always produces the same serialization and you think it would be nice if NineML could tell, feel free to open an issue about it. No promises though.