Record-oriented iXML parsing

Volume 6, Issue 9; 04 Sep 2022

Sometimes it’s easier to solve a problem if you break it up into smaller problems.

As far as I can tell, with current parsing technologies, parsing really big files is not going to be Invisible XML’s strongest suit. Because parsers for iXML have to be prepared for ambiguity, the data structures they build tend to be quite complicated. Earley parsers build a big state chart. Other parsers manage multiple stacks of states in parallel. In the worst case, you can imagine parsing the entire file up to the very last character, discovering that the last character doesn’t work, and having to go all the way back to the beginning of the file to try a different parse. In practice, most grammars don’t work like this and most parsers manage to preserve intermediate states so they don’t really have to parse the whole file twice. But, in principle…

I haven’t begun to scratch the surface of optimizing my implementation, so I expect it will improve. But that’s hard work and I have cough other projects demanding my attention. In the meantime, I’ve seen three separate use cases where the input file is logically a collection of records.

Consider the UnicodeData.txt file published by the Unicode Consortium:

0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
…

Each line consists of a set of fields separated by semicolons and there are 34,626 lines in the version I have. I can write a grammar that will recognize the whole file:

{ https://www.unicode.org/L2/L1999/UnicodeData.html }

UnicodeData = record++LF, LF .
record = codepoint, -';', name, -';', category, -';', combining, -';',
         bidi, -';', decomposition, -';', decimal, -';', digit, -';',
         numeric, -';', mirrored, -';', name_1_0, -';', comment, -';',
         uppercase, -';', lowercase, -';', titlecase .

-LF = -#a.

codepoint = ["0"-"9";"A"-"F"]+ .
name = ~[';']+ .
category = ~[';']* .
combining = ~[';']* .
bidi = ~[';']* .
decomposition = ~[';']* .
decimal = ~[';']* .
digit = ~[';']* .
numeric = ~[';']* .
mirrored = ~[';']* .
name_1_0 = ~[';']* .
comment = ~[';']* .
uppercase = ~[';']* .
lowercase = ~[';']* .
titlecase = ~[';']* .

But it does not perform well. It’s not clear that my Earley parser would finish before the heat death of the Universe. I think my GLL parser would manage it, but it would take days. Alternatively, I could break the file into lines and run an iXML parse over each line. The grammar for that just requires replacing UnicodeData and record with a single character rule (and deleting the rule for LF if you wish):

character = codepoint, -';', name, -';', category, -';', combining, -';',
         bidi, -';', decomposition, -';', decimal, -';', digit, -';',
         numeric, -';', mirrored, -';', name_1_0, -';', comment, -';',
         uppercase, -';', lowercase, -';', titlecase .

With this approach, I can process the whole file in 30 seconds. That seems quite tractable.⊕The point of iXML isn’t that it’s the fastest way to parse any particular file. The point is that writing a simple grammar is often the easiest way to parse a file and get a useful XML structure out of it. I know this because I’ve implemented a couple of new options in CoffeePot to support record-based parsing.

In brief, you specify a delimiter for records in the file, CoffeePot breaks the file into records, and your grammar is applied separately to each record. You can specify that records begin or end with a particular regular expression. If you ask for record-oriented parsing without specifying a delimiter, the default separator is “newline at the end” which makes each line a record.

A couple of more interesting examples will help explain how it works, I think. Suppose you have a file like this one:

This is a line \
that continues
This is a new line
So is this
But this one \
continues \
for \
several lines.

The convention here is that a line that ends with a “\” logically continues on the next line. A line that doesn’t end with a slash marks the end of a “record”.

First, let’s observe that you can parse this whole file with iXML. Here’s one grammar that will do it:

records = record+ .
record = simple-value | extended-value .
-simple-value = ~[#a]*, ~["\"], -#a .
-extended-value = ~[#a]*, -"\", -#a, +" ", -record .

Parsing the input above will produce:

<records>
   <record>This is a line that continues</record>
   <record>This is a new line</record>
   <record>So is this</record>
   <record>But this one  continues  for  several lines.</record>
</records>

Alternatively, and perhaps more efficiently if the file is very large or the grammar for recognizing what’s on a line is more complex, you could change the grammar so that it expected only a single line at a time:

record = simple-value | extended-value .
-simple-value = ~[#a]*, ~["\"] .
-extended-value = ~[#a]*, -"\", -#a, +" ", -record .

And use the --record-end option on CoffeePot to specify the regular expression that identifies the end of each record:

$ coffeepot -g:line.ixml -i:lines.txt -pp --record-end:"[^\\\\]\\n"
<records>
<record>This is a line that continue</record>
<record>This is a new lin</record>
<record>So is thi</record>
<record>But this one  continues  for  several lines</record>
</records>

A couple of things here. First, double-escaping the “\” characters in the regular expression because it’s first being interpolated by the shell makes it extra hard to read. Second, we’ve lost the “non-backslash” character at the end of each record!

The first problem we can solve by putting the regular expression in the grammar using a pragma. The second problem we can solve with a capture group. The new grammar is:

{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-end "([^\\])\n"]}

record = simple-value | extended-value .
-simple-value = ~[#a]*, ~["\"] .
-extended-value = ~[#a]*, -"\", -#a, +" ", -record .

CoffeePot will detect the “record-end” option and automatically select record-oriented parsing with that separator. The semantics of separators is that anything in a capture group (or capture groups) is preserved, everthing else is dropped. So this grammar will produce:

$ coffeepot -g:line.ixml -i:lines.txt -pp
<records>
<record>This is a line that continues</record>
<record>This is a new line</record>
<record>So is this</record>
<record>But this one  continues  for  several lines.</record>
</records>

Just what we want.

For an example of a starting separator, consider this input:

This is a line
  that continues
This is a new line
So is this
But this one
  continues
  for
  several lines.

Here the convention is that a new record begins on any line that doesn’t start with whitespace. This grammar:

{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-start "\n([^ ])"]}

record = simple-value, extended-value?, -#a? .
-simple-value = ~[" "], ~[#a]* .
-extended-value = -#a, -" "+, (+" ", ~[" "], ~[#a]*)?, -extended-value? .

Will parse it:

$ coffeepot -g:start.ixml -i:lines2.txt -pp
<records>
<record>This is a line that continues</record>
<record>This is a new line</record>
<record>So is this</record>
<record>But this one continues for several lines.</record>
</records>

A couple of notes:

Anything that precedes the first separator (whether it’s a starting or ending separator), is considered a record.
If that extra -#a? at the end of record confuses you, it’s because a newline following the very last record in the file won’t match the separator so it’ll be part of the last record.

In an iXML grammar for an input, you have to account for every character in the input. When using record-oriented processing, your grammar must account for every character in every record.

Because each record is parsed separately, ixml:state attributes representing failure or ambiguity will appear on each record (where appropriate). If any record fails, the ixml:state="failed" attribute will appear on the root records element.

I still consider this experimental code, but feedback is welcome.