Towards NineML 3.0.0

Volume 7, Issue 34; 04 Jul 2023

The first beta release of what will be NineML 3.0.0 is available. Feedback most welcome.

What happened was, I was working on my Balisage paper. The paper is about ambiguity in Invisible XML grammars. Ambiguity arises when there’s more than one way to match an input against a grammar.

Here’s a trivial example:

S = A | B .
A = 'a' .
B = 'a' .

That grammar says, essentially, an “S” is either an “A” or a “B”. An “A” is exactly a lowercase “a”. A “B” is also exactly a lowercase “a”. That grammar only accepts one input, no points for guessing it’s a lowercase “a”. But it can produce two different, and equally valid, parses for it: <S><A>a</A></S> or <S><B>a</B></S>.

When you’re standing on “A”, you don’t have any choices. An “A” matches if what you’re looking at in the input is an “a” and it doesn’t match if you aren’t. But when you’re standing at “S”, you have a choice. You can “choose A” and proceed from there or you can “choose B” and proceed from there. In practical terms, the parser has to do both. It has to check if “A” works and it has to check if “B” works⊕Claus Brabrand, Robert Giegerich, and Anders Möller have demonstrated that there are two different ways that this can arise in a grammar: “horizontally” or “vertically”. More about that in the paper. and the parse is ambiguous if they both “work”.

For the paper, I was some exploring some examples. I had one grammar with a bunch of ambiguity (not this kind of ambiguity, but a bunch). I asked CoffeePot to list all of the possible parses for some short bit of input and it produced a dozen or so. Except, two of them were the same.

Two…of…them…were…the…same.

That is not supposed to happen. So I did a little investigating.⊕“No, no, no, don't tug on that. You never know what it might be attached to.” —Buckaroo Banzai Then I spent a long time down the rabbit hole, “what does it mean to enumerate all of the parses in a cyclic graph?” An added complication is to return them sequentially: return the first, then the second, then the third, etc. Standard graph-theoretic approaches that do a recursive walk of the graph and then return all the paths at the end aren’t so helpful.

Consider the easy case first. To enumerate all the choices in an acyclic graph, keep track of the choices you’ve made and make one different choice every time. There’s a small amount of bookkeeping necessary to make sure you make the different choices in the right order to find all the paths, but it’s not too hard.

In the cyclic case, there’s more bookkeeping. Sometimes you have to repeat paths, but you also have to make sure that you don’t get stuck in an infinite loop. My solution is imperfect. It will find some of the cycles, but it won’t necessarily find them all. Roughly, it keeps track of the paths it hasn’t taken and keeps going until it runs out of them.

Here’s a small example:

A = B | C .
B = D .
C = D .
D = A | 'a' .

My algorithm finds ABDa, ABDABDa, ACDa, and ACDABDa. It misses ACDACDa because it’s “run out” of trees that don’t have any repeated paths.

This task feels like a problem with a clever solution that I failed to find in the afternoon I devoted to it. Pointers most welcome. But the iXML use case is grammars that make XML out of non-XML inputs. In such cases, grammars with loops are probably an error anyway. Infinite ambiguity is not helpful in generating usable markup.

One thing leads to another enters the chat.

Implementing the new algorithm meant updating the tree construction code. When I went to look at that code again (for the first time in maybe a year) my immediate reaction was “what the 🤬?”

So I rewrote all of that. Then the next bit, then the next bit, and finally, the last bit. The largest, most visible change is that I completely redesigned how CoffeeSacks chooses among alternatives. I think it’s better.

Along the way, a bunch of things got simpler and more robust. And more testable. And I changed a whole bunch of public APIs, so this will be 3.0.0 when it’s ready. Is it ready? Try it and let me know!

Postscript: there’s a proposal to support Invisible XML in XPath 4.0. If you think that’s a good idea, what do you think the specification should say about resolving ambiguity? I’ve had two different APIs in the space of about as many months and I’m not convinced I’m done. I don’t think I’d want to forbid an implementation from offering some escape hatch for resolving ambiguity. But what about interoperability?