Unicode in Java, part 2
Making Java work with a specific version of Unicode, at least for NineML.
As I observed yesterday, Java (11, at least) doesn’t support a specific version of Unicode; one might even say it lies about it. Instead, it supports an extended version. (A cursory examination of some JavaDocs suggests that later versions of Java support specific versions of Unicode, but I don’t have control over the version of the JVM used to run NineML and I can’t be sure the JVM won’t play the same game again in the future.)
I’m sure that 99.999…% of the time, this is just fine. Even for Invisible XML, it’s probably fine well over 99% of the time. But for the specific case of using Unicode character categories to suss out the version of Unicode supported, it is a problem, and in principle it could be a problem in any grammar that uses Unicode character classes.
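Concretely, the version-sniffing trick depends on the JVM’s own category tables. Here’s a minimal sketch of the idea (the class name and the choice of probe character are mine, not NineML’s): U+32FF SQUARE ERA NAME REIWA was added in Unicode 12.1, so a runtime that assigns it a category has at least the 12.1 tables.

```java
public class UnicodeVersionProbe {
    public static void main(String[] args) {
        // U+32FF SQUARE ERA NAME REIWA was added in Unicode 12.1.
        // If the runtime assigns it a category, its tables are at
        // least that new; if it reports UNASSIGNED, they are older.
        boolean atLeast121 =
            Character.getType(0x32FF) != Character.UNASSIGNED;
        System.out.println("tables are 12.1 or later: " + atLeast121);
    }
}
```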
It’s a very tiny edge case and the obvious thing to do is simply document the fact and move on. But it’s exactly the kind of edge case that makes part of my brain itch. I try to leave it alone, but I just can’t.
So I coded up a solution that reads the Unicode character database and does character class comparisons against the actual rules in the database. Done and dusted.
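For context, here’s a rough sketch of what reading the database and looking up categories might look like. The format details (semicolon-delimited fields, code point in hex in field 0, general category in field 2) come from UnicodeData.txt, but the class and method names are made up for illustration; NineML’s actual code surely differs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnicodeDataSketch {
    // UnicodeData.txt is semicolon-delimited: field 0 is the code
    // point in hex, field 2 is the general category (Lu, Ll, Nd, ...).
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> categories = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            categories.put(Integer.parseInt(fields[0], 16), fields[2]);
        }
        return categories;
    }

    public static void main(String[] args) {
        // Two real lines from UnicodeData.txt, inlined for the sketch.
        Map<Integer, String> cats = parse(List.of(
            "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
            "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"));
        System.out.println(cats.get(0x41)); // Lu
        System.out.println(cats.get(0x61)); // Ll
    }
}
```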
Except.
Except I’ve recently introduced an optimization where the processor detects when a nonterminal can safely be replaced by a regular expression. (Regular expressions are fast. Consuming a thousand “a” characters with the regular expression “a*”: instantaneous. With the parser, not so much.) This optimization has to handle character classes, and the obvious regular expression that matches [Ll]+ (the character class lower-case letters, one or more times) is \p{Ll}+. You see the problem, right?
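To make the problem concrete, here’s a tiny demonstration (the class name is mine): the pattern compiles and matches fine, but which characters count as Ll is decided by the category tables baked into the running JVM, not by anything the application chooses.

```java
import java.util.regex.Pattern;

public class CategoryPattern {
    public static void main(String[] args) {
        // The obvious translation of the character class [Ll]+.
        Pattern ll = Pattern.compile("\\p{Ll}+");
        System.out.println(ll.matcher("abc").matches()); // true
        System.out.println(ll.matcher("ABC").matches()); // false
        // Whether a borderline code point matches depends entirely
        // on the Unicode version the JVM was built with.
    }
}
```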
The problem is that Java’s regular expression engine is going to rely on its own understanding of character classes to match lower-case letters. And, as we’ve established, it doesn’t support an explicit version of Unicode.
So I fussed some more. Rather than using \p{Ll} to match a lower-case letter, I use a regular expression that explicitly enumerates all of the lower-case letters: [\x61-\x7a\xb5\xdf-\xf6\xf8-\xff…\x{1e922}-\x{1e943}]+.
The enumeration comes from reading the Unicode database, so it’s always correct for any particular version of Unicode. (I was a little worried about the performance of regular expressions with very long lists of characters, but they seem to be fine, and they are only constructed once, when the grammar is compiled.)
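A sketch of the construction, assuming the code point ranges have already been read from the database (the helper name is hypothetical; NineML’s actual code surely differs):

```java
import java.util.regex.Pattern;

public class EnumeratedCategory {
    // Turn a list of {first, last} code point ranges into a
    // character-class regex like [\x{61}-\x{7a}...]+.
    static String rangesToPattern(int[][] ranges) {
        StringBuilder sb = new StringBuilder("[");
        for (int[] r : ranges) {
            sb.append(String.format("\\x{%x}", r[0]));
            if (r[1] != r[0]) {
                sb.append(String.format("-\\x{%x}", r[1]));
            }
        }
        return sb.append("]+").toString();
    }

    public static void main(String[] args) {
        // The first few Ll ranges quoted in the post.
        int[][] ll = {{0x61, 0x7a}, {0xb5, 0xb5}, {0xdf, 0xf6}, {0xf8, 0xff}};
        Pattern p = Pattern.compile(rangesToPattern(ll));
        System.out.println(p.matcher("stra\u00dfe").matches()); // true
        System.out.println(p.matcher("ABC").matches());         // false
    }
}
```

Because the ranges come from a specific copy of the database, the compiled pattern means the same thing on every JVM.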
Done and dusted.
Except. Well. If you can do that with one version of the Unicode character database, you can do it for any version, right?
for f in 10 11 12 12.1 13 14 15 15.1; \
do java -Dorg.nineml.unicode-version=$f -cp $CP \
org.nineml.coffeepot.Main -g:unicode-version-diagnostic.ixml \
-i:unicode-version-diagnostic.txt; done
<unicode-version><unicode-10.0/></unicode-version>
<unicode-version><unicode-11.0/></unicode-version>
<unicode-version><unicode-12.0/></unicode-version>
<unicode-version><unicode-12.1/></unicode-version>
<unicode-version><unicode-13.0/></unicode-version>
<unicode-version><unicode-14.0/></unicode-version>
<unicode-version><unicode-15.0/></unicode-version>
<unicode-version><unicode-15.1/></unicode-version>
Right.