Towards XML Resolver 3.0.0!

Volume 5, Issue 5; 03 Jun 2021

I’ve pushed a snapshot release of XML Resolver 3.0.0. No, really, I actually mean it.

Shortly after I did the 2.0.0 release, I was motivated to do a bunch more work on the XML Resolver. (This is partly in support of a couple of projects for my day job; more about those in the near future, I hope.)

I did a serious cleanup of the way catalog files are actually managed. The “clever idea” I had way back when I forked from the Apache resolver, to just load the catalogs as XML DOM instances and navigate around to find catalog entries, was not, I think, actually very clever. So I’ve replaced that with a proper back-end data structure.

I could almost make that work without changing the public API, but it was kind of lame. Instead, I’m just going to take the hit and admit that my 2.0.0 release was premature. There’s nothing wrong with it, but I’ve changed the API again. I thought about just sticking with 2.x and making the breaking change in 2.2.0 (who, if anyone, would notice?) but that doesn’t sit well philosophically and there are a lot of integers. What’s another one between friends?

Once I started writing tests for the new release, I decided I wanted to write a “getting started” repository to demonstrate the new features. And once I started doing that, I got all sorts of ideas for things that should be possible. Most of them were easily supported by the new data structures, so I feel pretty good about everything, really.

Here’s what’s new, in a nutshell. (These are all features; you can disable them if you wish.)

Loading catalogs from the classpath. The work I did on classpath: and jar: URIs made it easy to package up some schemas (DocBook, say) in a JAR file, stick them on the classpath, and point to a catalog in that JAR file. No more unpacking schema distributions; just point to the resources in the JAR file!

But then, I thought, if you can point into the JAR file, what about if the resolver just automatically found the catalog? Stick the JAR file on your classpath (e.g., declare a dependency in your build tool) and you’re done. Can it all just work, fast and seamlessly?

Yes, I think it can. The XML Resolver now automatically adds any file with the name /org/xmlresolver/catalog.xml on your classpath to the end of your catalog list.
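To make the idea concrete, here is a minimal sketch of how that kind of discovery can be done with nothing but the JDK. It is not the resolver’s actual code: the class name ClasspathCatalogs and the method withClasspathCatalogs are hypothetical, and the only detail taken from above is the resource name. (ClassLoader lookups take names relative to the classpath root, so the leading slash is dropped.)

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the XML Resolver API.
class ClasspathCatalogs {
    // The well-known resource name, without the leading slash.
    static final String CATALOG_RESOURCE = "org/xmlresolver/catalog.xml";

    /**
     * Returns the explicitly configured catalogs followed by every
     * resource named CATALOG_RESOURCE that the class loader can see.
     */
    static List<String> withClasspathCatalogs(List<String> explicit, ClassLoader loader) {
        List<String> catalogs = new ArrayList<>(explicit);
        try {
            // getResources returns one URL per JAR or directory that
            // contains the resource, so several JARs can each contribute.
            for (URL url : Collections.list(loader.getResources(CATALOG_RESOURCE))) {
                catalogs.add(url.toString());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return catalogs;
    }

    public static void main(String[] args) {
        List<String> catalogs = withClasspathCatalogs(
                Arrays.asList("/path/to/catalog.xml"),
                ClasspathCatalogs.class.getClassLoader());
        // Explicit catalogs come first; discovered ones are appended.
        catalogs.forEach(System.out::println);
    }
}
```

Appending rather than prepending means anything you configure explicitly still takes priority over whatever happens to be on the classpath.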

Compare http: and https: catalog entries transparently. The web used to be http:, then the villains moved in and we all switched to https:. Trouble is, it’s easy to copy and paste the old http: URIs in catalogs and it’s easy to copy and paste the new https: URIs into documents.

For a few years, I’ve been conscientiously creating catalog entries for both:

<uri name="http://example.com/thing" uri="/path/to/thing"/>
<uri name="https://example.com/thing" uri="/path/to/thing"/>

That’s just annoying. And not a good general solution anyway since there are plenty of read-only catalogs around (on the web, published in standards, etc.) that only use the http: URIs.

So now I just ignore the distinction between http: and https: when comparing URIs in catalogs. Yes, it’s technically possible for those to be different documents (and if you’ve done that, you can turn this feature off!), but it’s overwhelmingly the case that they’re just aliases and that http: redirects to https: anyway.

I want to be clear: this has no impact on the actual retrieval of documents. This is just about how the system identifier or URI in your document is compared against the system identifier or URI entry in the catalog.
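One way to picture the comparison is as a normalization step applied to both sides before they are compared. The sketch below is an assumption about the general technique, not the resolver’s implementation; MergeHttps, normalize, and sameEntry are made-up names.

```java
// Hypothetical illustration; not part of the XML Resolver API.
class MergeHttps {
    /**
     * Rewrites a leading "http:" to "https:" so that both spellings
     * compare equal. Any other URI is returned unchanged.
     */
    static String normalize(String uri) {
        // Compare only the first five characters, ignoring case,
        // so "https:" (whose fifth character is 's') does not match.
        if (uri.regionMatches(true, 0, "http:", 0, 5)) {
            return "https:" + uri.substring(5);
        }
        return uri;
    }

    /** True if the document's URI and the catalog's URI differ at most in http: vs https:. */
    static boolean sameEntry(String documentUri, String catalogUri) {
        return normalize(documentUri).equals(normalize(catalogUri));
    }

    public static void main(String[] args) {
        // Prints "true": only the scheme differs.
        System.out.println(sameEntry("https://example.com/thing", "http://example.com/thing"));
        // Prints "false": the paths differ.
        System.out.println(sameEntry("https://example.com/thing", "http://example.com/other"));
    }
}
```

Nothing here touches the network. The normalized string is only used for matching, which is the point made above.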

Mask jar URIs. Most entity resolver APIs are defined to say that if resolution succeeds, the base URI of the resource returned is the base URI of the actual, local resource. This greatly simplifies things because subsequent relative URIs can be resolved against the local resource directly.

For example, resolve http://example.com/thing to /path/to/thing. Now if thing makes a relative URI reference to otherthing, that gets resolved to /path/to/otherthing automatically, no catalog entry required.

However, the Java URI class does not treat jar: or classpath: URI schemes as hierarchical, so any subsequent attempts to resolve relative URIs will fail. (And even if it did, I’m not sure the relevant RFCs support resolution of jar: URIs in the way that you need for them to work as hierarchical URIs anyway. No hating on java.net.URI here.) To fix this, the XML Resolver lies. If the URI is a jar: or classpath: URI, it returns the locally resolved resource, but leaves the base URI unchanged.

This does mean that you’ll need a more complete catalog. If you want the relative reference to otherthing to work, you’ll have to have a catalog entry for http://example.com/otherthing because that’s what the process will attempt to retrieve. (In most cases, I’ve found that the “rewrite” catalog entry makes this pretty easy.)
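The java.net.URI behavior behind all this is easy to check for yourself. The sketch below uses only the JDK; the class name JarUriDemo and the JAR path are made up for illustration. It resolves the same relative reference against a file: base, a jar: base, and the original http: base that the resolver leaves in place.

```java
import java.net.URI;

// Illustration only; the JAR path below is a made-up example.
class JarUriDemo {
    /** Resolves a relative reference against a base URI using java.net.URI. */
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // A hierarchical base: the last path segment is replaced.
        // Prints "file:/path/to/otherthing".
        System.out.println(resolve("file:/path/to/thing", "otherthing"));

        // java.net.URI parses a jar: URI as opaque. Prints "true".
        System.out.println(URI.create("jar:file:/path/to/schemas.jar!/thing").isOpaque());

        // Resolving against an opaque base just hands back the relative
        // reference, so the location inside the JAR is lost.
        // Prints "otherthing".
        System.out.println(resolve("jar:file:/path/to/schemas.jar!/thing", "otherthing"));

        // With the base URI left as the original http: URI, resolution
        // works, and the result is what has to be in the catalog.
        // Prints "http://example.com/otherthing".
        System.out.println(resolve("http://example.com/thing", "otherthing"));
    }
}
```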

Support alternate catalog loaders. By design, the resolver doesn’t report errors or raise exceptions for invalid or missing catalogs. You don’t want your app crashing in production because someone made a typo in a catalog.

On the other hand, it’s really easy to make typos and telling folks they ought to validate the catalogs they publish only gets you so far. Now there’s a property you can set which tells the resolver to use a validating loader. It raises an exception if there’s an error. That’s the first thing to try if you think catalog resolution isn’t working!

(I’ve also reworked the logging so that it’s easier to get logs out and the log messages are, I hope, a little clearer about what the resolver looked for in the catalog and what it decided to return.)

Actually get the RDDL document parsing right. I don’t know why I care. I don’t think RDDL ever got that widely deployed, but I think it’s a neat idea. I wrote more tests and fixed more bugs. I think it actually works now, for what it’s worth.

More tests. There are now seven hundred some odd tests instead of, I dunno, eleven or something. I’m a lot more confident that this release is doing the right thing. And the “getting started” repository functions as a nice set of integration tests.

The XML Resolver 3.0.0 release includes a “data” JAR file that provides a lot of common W3C resources. It’s a separate JAR file so that you don’t have to download it or put it on your classpath, but if you do, it’ll just work automatically. This should make it really easy to avoid the ten second delay imposed by www.w3.org if you attempt to get popular DTDs or schemas from there. (This is, not coincidentally, the same as, or a very small superset of, the W3C resources that have historically been included with Saxon.)

As I said, I’m really pleased with how this has come together. I’m working on a new DocBook schemas release to take advantage of these features and finishing up the “getting started” repository that lets you play with it in a “real” application.

Here’s hoping it makes things easier for you!