Announcing NineML

Volume 6, Issue 1; 29 Mar 2022

A suite of tools for working with Invisible XML.

So. What happened was, back in December, I’d spent several months working intensely on XML Calabash 3. I made a lot of progress, but I was starting to feel a little burned out. At the same time, I’d really enjoyed Steven’s Invisible XML tutorial at Declarative Amsterdam and I’d started hanging out in the Invisible XML community group (CG). I think Invisible XML is important.

(If you’re not familiar with Invisible XML, I wrote an introduction for XML.com. I’ve also written a follow-on article about how to write Invisible XML grammars.)

No one was doing an implementation for the JVM and Achim and I had discussed the possibility of collaborating on an implementation that we could use in our respective XProc 3.0 implementations.

I thought what I’d do was, I’d spend just a weekend or so hacking out a quick Invisible XML implementation as a relaxing break from XML Calabash. [Narrator voice: it did not take “just a weekend or so”.]

In my defense, I didn’t really think I’d finish in a weekend, but I thought it was going to be fairly quick. What you need to implement Invisible XML is a parser that can deal with ambiguity. An Earley parser is one common example. I thought I’d pick some existing open source implementation of an Earley parser on the JVM, wrap a little Invisible XML facade around it, and be off to the races.

Ha!

If there’s an example of such a parser on the JVM, I failed to find it. I found a few recognizers, which will tell you if a sentence is in the grammar, but can’t tell you what the derivation was. That’s not sufficient. I found a couple of parsers, but they exhibited catastrophically poor performance on simple grammars.
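To make the recognizer/parser distinction concrete, here is a minimal sketch of an Earley recognizer. It is written in Python for brevity rather than for the JVM, and it is neither CoffeeGrinder’s code nor Scott’s algorithm; the names `Item` and `recognize`, the dictionary representation of the grammar, and the toy grammar itself are all assumptions made for illustration. It does not handle empty alternatives.

```python
from collections import namedtuple

# An Earley item: the rule `head -> body`, a dot position inside the body,
# and the input position where matching of this rule started.
Item = namedtuple("Item", "head body dot origin")


def recognize(grammar, start, tokens):
    """Return True if `tokens` is a sentence of `grammar`. No tree is built.

    `grammar` maps each nonterminal to a list of alternatives; each
    alternative is a non-empty tuple of symbols. Any symbol that is not a
    key of `grammar` is treated as a terminal.
    """
    chart = [[] for _ in range(len(tokens) + 1)]

    def add(position, item):
        if item not in chart[position]:
            chart[position].append(item)

    for body in grammar[start]:
        add(0, Item(start, body, 0, 0))

    for i in range(len(tokens) + 1):
        n = 0
        # chart[i] grows while it is processed, so walk it by index.
        while n < len(chart[i]):
            item = chart[i][n]
            n += 1
            if item.dot == len(item.body):
                # Completer: advance every item that was waiting for item.head.
                for waiting in chart[item.origin]:
                    if (waiting.dot < len(waiting.body)
                            and waiting.body[waiting.dot] == item.head):
                        add(i, waiting._replace(dot=waiting.dot + 1))
            else:
                symbol = item.body[item.dot]
                if symbol in grammar:
                    # Predictor: start every alternative of the nonterminal here.
                    for body in grammar[symbol]:
                        add(i, Item(symbol, body, 0, i))
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scanner: the next token matches the expected terminal.
                    add(i + 1, item._replace(dot=item.dot + 1))

    # Accept if a start rule spanning the whole input has been completed.
    return any(item.head == start
               and item.dot == len(item.body)
               and item.origin == 0
               for item in chart[len(tokens)])


# A deliberately ambiguous toy grammar: S -> S "+" S | "a"
ambiguous = {"S": [("S", "+", "S"), ("a",)]}
print(recognize(ambiguous, "S", "a+a+a"))  # True
print(recognize(ambiguous, "S", "a+"))     # False
```

The input `a+a+a` can be grouped as `(a+a)+a` or as `a+(a+a)`. The recognizer answers `True` for it and keeps no record of either grouping, which is exactly why a recognizer alone is not enough for Invisible XML: the output XML is built from the derivation.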

Unfortunately (or fortunately, I suppose, depending on your perspective), by the time I concluded that the hard part hadn’t been done for me, I felt like I’d put in enough time that I didn’t want to give up. “I don’t like to lose,” as Kirk famously says to Saavik in The Wrath of Khan.

So I wrote one. Well, let’s be honest here: Elizabeth Scott did all the heavy lifting; I just implemented the algorithm she described in “SPPF-Style Parsing From Earley Recognisers”. (I like to imagine that I added a bit of novelty in my implementation, such as the ability to store arbitrary metadata with each token, which is how I implement Invisible XML “marks” and some pragma extensions.)

Thus, CoffeeGrinder was born. CoffeeGrinder is an implementation of Scott’s algorithm for an Earley parser that returns a shared packed parse forest representation of all possible parses of a sentence against a grammar.
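A small worked example may show why “all possible parses” calls for a packed representation instead of a list of trees. The sketch below is Python, has nothing to do with CoffeeGrinder’s internals, and uses a hypothetical helper named `count_parses`. It counts the distinct parse trees of `a+a+...+a` under the ambiguous toy grammar `S: S, "+", S; "a".`

```python
from functools import lru_cache


def count_parses(operands):
    """Count parse trees of a+a+...+a with `operands` copies of "a",
    under the ambiguous grammar  S: S, "+", S; "a".
    """
    @lru_cache(maxsize=None)
    def trees(n):
        if n == 1:
            return 1  # a single "a" has exactly one tree
        # Pick the top-level "+": k operands go left, n - k go right.
        return sum(trees(k) * trees(n - k) for k in range(1, n))

    return trees(operands)


for n in (3, 4, 10):
    print(n, count_parses(n))
# 3 2
# 4 5
# 10 4862
```

The counts are the Catalan numbers, so they grow very quickly with the length of the input. A forest that shares common subtrees can represent all of those parses without writing each one out separately.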

CoffeeGrinder is the first NineML (“ix”-ml, get it?) project. (Oh, also, NineML goes all in on Java/Coffee jokes. That’s just the price of admission.) There’s nothing about CoffeeGrinder that’s directly related to Invisible XML. It is a general purpose Earley parser that operates on a sequence of tokens. It supports an extensible set of token classes. Out of the box, you get characters, strings, character classes (of the Invisible XML variety), and regular expressions.

Invisible XML parsing is provided by CoffeeFilter, built on top of CoffeeGrinder. CoffeeFilter provides a convenient API for loading Invisible XML grammars and processing strings or files with them. It returns a document abstraction that encompasses all possible parses of the input against the grammar. It supports prefix parsing and has options to relax the constraints of Invisible XML on “grammar hygiene”: it will optionally accept grammars with undefined, unused, and unproductive symbols as well as grammars that have multiple definitions for a given nonterminal.

CoffeeFilter accepts a number of pragmas, a feature not yet defined by the Invisible XML specification. These give you the ability to rename XML elements and to match input with regular expressions. There is a “pedantic” mode which disables all non-conformant behavior.

At the moment, CoffeeFilter passes all of the tests in the Invisible XML test suite. (Well, okay, it skips two tests, but that’s because of problems with the test suite itself. The correct output for the expr1 test requires a modification to the test suite and there’s a bug in the range test.)

CoffeeGrinder and CoffeeFilter are JVM APIs. If you program in Java (or Scala or Kotlin, or any other JVM language), you can use them in your programs. They’re not immediately useful otherwise.

CoffeePot is a command line application for Invisible XML processing. If you just want to play with Invisible XML, this is the place to start. You give CoffeePot a grammar and an input, and it gives you back XML. For example, given game.ixml:

game: duck+' ', ' ', goose .
duck: -"duck" .
goose: -"goose" .

You can run:

coffeepot -g:game.ixml duck duck duck duck goose

To get:

<game>
   <duck/>
   <duck/>
   <duck/>
   <duck/>
   <goose/>
</game>

(That one’s for you, Syd. Thanks for the detailed review of my XML.com article about writing Invisible XML grammars.)

A command line tool is nice as far as it goes, but there are other places where you’d like to process XML.

CoffeeSacks is a set of Saxon extension functions for processing Invisible XML in an XSLT stylesheet.

CoffeePress is an XProc 3.0 extension step for processing Invisible XML with XML Calabash 3. (I also plan to publish a version for XML Calabash 1, “real soon now”.)

Comments, questions, etc. most welcome.

“Share and enjoy.”

PostScript: On 22 March, the Invisible XML CG agreed to publish some changes to the specification. For the most part, they don’t affect what grammars match, so I’m not planning to rush out an update to the suite until a couple of minor kinks have been worked out (the CG has changed its mind about how to achieve one of the changes and I think the new grammar has a bug).