NineML tools version 2.2.0
Finding the right nonterminals and implementing the priority pragma.
CoffeeGrinder, CoffeeFilter, CoffeeSacks, and CoffeePot versions 2.2.0 have been published. Most of the heavy lifting was done in CoffeeGrinder.
There have been a few things bothering me about how CoffeeGrinder works. In the course of trying to understand them, I discovered some things that really disturbed me. I’ll spare you the details.
In brief, when a result was ambiguous, the parser was failing to locate the correct nonterminals on the “right hand side.” An example will probably help:
S = A, B, C | A, @B, C .
It’s not really important what A, B, and C are; the point is that you have to know exactly which B tokens you’re looking at. One has the default mark and one is marked @. It absolutely isn’t sufficient to just know that they’re both “B”s.
I’ve fixed that and again I’ll spare you (most of) the details. In the parser, marks are actually a special case of a more general attribute mechanism. To solve the problem of distinguishing between nonterminals with different attributes, I create new nonterminals with different names. That makes it easy to identify them correctly in the rules. (The process of turning an iXML grammar into a plain BNF for the Earley or GLL parser adds new nonterminals anyway, so I don’t think I’ve introduced any new problems.)
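The renaming idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not CoffeeGrinder’s actual code: the `Symbol` type, the `rewrite` function, and the `B[mark=@]` naming scheme are all hypothetical, invented for this example.

```python
from collections import namedtuple

# A symbol on the right-hand side: a name plus its mark ("" is the default mark).
# Hypothetical representation, for illustration only.
Symbol = namedtuple("Symbol", ["name", "mark"])


def distinct_name(symbol):
    """Return a name that encodes the mark, so differently marked uses never collide."""
    if symbol.mark == "":
        return symbol.name
    return f"{symbol.name}[mark={symbol.mark}]"  # assumed naming scheme


def rewrite(rules):
    """Give every non-default-marked nonterminal its own name.

    `rules` maps a nonterminal name to a list of alternatives, each a list
    of Symbols. Returns the renamed rules and a map from each new name back
    to the original nonterminal it stands for.
    """
    renamed = {}
    aliases = {}
    for lhs, alternatives in rules.items():
        renamed[lhs] = []
        for alternative in alternatives:
            new_alternative = []
            for symbol in alternative:
                new_name = distinct_name(symbol)
                if new_name != symbol.name:
                    aliases[new_name] = symbol.name
                new_alternative.append(new_name)
            renamed[lhs].append(new_alternative)
    return renamed, aliases


# S = A, B, C | A, @B, C .
rules = {
    "S": [
        [Symbol("A", ""), Symbol("B", ""), Symbol("C", "")],
        [Symbol("A", ""), Symbol("B", "@"), Symbol("C", "")],
    ]
}
renamed, aliases = rewrite(rules)
```

After the rewrite, the two alternatives for S mention different symbols (`B` and `B[mark=@]`), so a tree builder that only compares names can no longer confuse them.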
I also persuaded myself that I understand how the parse tree construction process differs for Earley and GLL parsers and I have renewed confidence that the results are correct.
One practical outcome of this effort is that the “priority” pragma now works correctly. (Priorities are an attribute of nonterminals, so finding the right ones is critical.) Consider this grammar:
number: hex | decimal .
hex: hex-digit+ .
decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .
It’s ambiguous: if you ask it to parse “42”, there are two possible ways to parse the number, as either a hex or a decimal. This problem doesn’t arise if you parse the number “cafe”. That has to be hex because it contains characters that can’t be in decimal numbers. But 42 could be either.
(There are lots of ways you could fix this grammar to make it unambiguous. You could require 0x in front of the hexadecimal numbers, for example, or require d or h after them. Don’t write ambiguous grammars if you can avoid it! This is just a simple example of ambiguity.)
Recall that Invisible XML doesn’t consider ambiguity an error, but it also doesn’t provide any mechanism for controlling it. All parses are considered equal and the processor’s only obligation is to provide one of them.
The priority pragma lets the grammar author identify which alternative should be selected:
{[+pragma nineml "https://nineml.org/ns/pragma/"]}
number: hex | {[nineml priority 2]} decimal .
hex: hex-digit+ .
decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .
The grammar is still ambiguous (though there’s some temptation to elide that in the output if there were explicit priorities to govern every case of ambiguity), but now it will always prefer decimal:
$ coffeepot -pp -g:hex.ixml 42
Found 2 possible parses.
<number xmlns:ixml="http://invisiblexml.org/NS"
ixml:state="ambiguous">
<decimal>42</decimal>
</number>
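Conceptually, the selection above amounts to ordering the competing alternatives by priority and taking the first. Here is a minimal Python sketch of that idea. It assumes that alternatives without a pragma default to priority 0 and that the larger number wins, which is consistent with the example but is an assumption. The `Alternative` class and `order_by_priority` function are hypothetical names, not part of any NineML API.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    name: str
    priority: int = 0  # assumed default when no priority pragma is present


def order_by_priority(alternatives):
    """Order alternatives so the preferred one comes first.

    Python's sort is stable, even with reverse=True, so alternatives with
    equal priority keep their original relative order.
    """
    return sorted(alternatives, key=lambda alt: alt.priority, reverse=True)


# number: hex | {[nineml priority 2]} decimal .
choices = [Alternative("hex"), Alternative("decimal", priority=2)]
ordered = order_by_priority(choices)
selected = ordered[0]
```

With these inputs `selected` is the decimal alternative. If no priorities were given at all, the stable sort would leave the original order unchanged and the first alternative would be selected.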
I also improved the CoffeeSacks API for resolving ambiguity. Because the function may get called on alternatives that have already been processed (for example, by the priority mechanism), the first element in the list of alternatives will always be the currently selected alternative. (That’s true even if there aren’t any priorities, because selecting the first is always acceptable, no matter which one is first.)