so

Make it go faster

Volume 7, Issue 44; 05 Nov 2023 (updated 09 Nov 2023)

Experiments in optimization of my iXML processor.

I’ve basically torn my iXML implementation apart and put it back together again over the past month or two. This was motivated largely, if not exclusively, by a desire to improve performance.

My original implementation used an Earley parser lifted straight from the paper SPPF-style Parsing from Earley Recognisers. It worked, but the performance was…mediocre. Hoping to do better, I implemented a GLL parser, lifted from Derivation representation using binary subtree sets (BSRs).

That had better performance, but generated a different data structure to represent the parse graph. Since all of my tree construction code was based on walking the SPPF, I had to build an SPPF from BSRs. Along the way, I discovered that one of the larger parses took a few hundred milliseconds to parse the input and then several seconds to construct the SPPF from the BSR.

Dang it.

So most of the work of the past couple of months has been rewriting the tree construction to work directly from the BSRs and then rewriting the Earley parser to generate BSRs.Bethan turns out to be remarkably good at this stuff: looking at grammars, looking at BSRs, understanding what they mean, and recognizing what’s needed. This would have been much harder without her help! (The BSR paper claims to include a description of a BSR-generating Earley parser. I couldn’t get it to work, but I don’t know if that’s a mistake in the paper or a failure on my part. In any event, I gave up and rolled my own.)

That’s what version 4.x will be based on. I hope to get an initial release out in the next week or two.

Having got what I thought was a parser with better performance, I turned my attention to the performance tests (they’re not linked from the main test suite catalog and, I confess, I’d largely forgotten about them).

Let’s look at a particular example, the a-star tests. They’re about as simple as you can get. The grammar is:

S = 'a'*.

And the input is a sequence of zero or more a characters. The current test suite has two sets of tests: one that doubles the number of a characters in each successive test and one that multiplies the number by a factor of ten. John Lumley observed recently that a linear sequence would be more useful for performance measurements and I’ve constructed one. (I’ll make a pull request to get it into the test suite in the next day or two.)

In the linear test set, the first test has 0 a characters, the second has 20, the third 40, and so on. Here’s how Earley does on the first 100 tests. (Time, on the Y axis, is in milliseconds.)

[Photo]
Earley parse time

Oh, dear; not really that well. I hasten to add that, in general, Earley performs reasonably well, this test just happens to be problematic for it.

The GLL parser does better. So much better that I’ve plotted the entire set of 1,000 tests, none of which breaks the 30ms mark.

[Photo]
GLL parse time

That’s pretty much linear which isn’t unreasonable. (I haven’t worked out what causes the odd stepped nature of the plot; garbage collection maybe? I really don’t know. A kind reader pointed out that truncating to integer milliseconds was probably the reason. I checked. That's the reason.)

But, damn it, I bet every single reader who got this far is thinking something along the lines of “for the love of all things, a regular expression would do that in nothing flat!”

And I thought that too. My parser already allows you, the grammar author, to nominate a regular expression for a nonterminal. You could rewrite this grammar to use a regular expression and it would be a lot faster

How much faster? If you use a regular expression, none of the first 1,000 tests takes more than about 1ms to parse. That much faster.

Two things, though: first, if you nominate a regular expression, the parser uses it, whether it’s correct or not. That tool is sharp enough to give you a nasty cut. And second, the parser can figure this out for you. Consider this grammar rule:

S = X, A, Z.

If the parser can recognize that A, whatever that is, can be matched by a regular expression, and if Z, whatever that is, never begins with any character that could be matched by the regular expression, you can safely replace the iXML rule for A with the regular expression.

That turns out to be not too difficult. The parser already knows (for other reasons) what characters Z can begin with and some iXML expressions are easy to translate into regular expressions. (Some are easier than others and there are some subtleties related to Unicode character classes, versions of Unicode, and the like.)

That optimization will also available in verison 4.x.

A work in progress.

#Invisible XML