Updated XML Resolver for Java

Volume 7, Issue 17; 13 Feb 2023 (updated 19 Feb 2023)

Making the resolver do the heavy lifting (to avoid errors caused by HTTP redirects).

Updated 19 February 2023: I've pushed a bug fix. You want 5.1.0 or later!

I’ve pushed version 5.0.0 of the XML Resolver for Java. This version introduces a new “always resolve” feature, enabled by default.

The problem I found was that some (perhaps most) documents have DTDs and schemas identified with http: URIs. Consider this test:

<!DOCTYPE book SYSTEM "http://xmlresolver.org/ns/sample/sample2.dtd">
<book>
<title>DTD/redirect test</title>
<article>
<p>DTD/redirect test</p>
</article>
</book>

If you don’t have “sample2.dtd” in any catalog, the XML Resolver won’t find it. The entity resolver API (in Java) says that the resolver returns null if it can’t find the resource. In principle, this gives the parser an opportunity to try resolving it some other way. In practice, all I’ve ever seen parsers try to do is read the resource from the URI.
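To make that contract concrete, here’s a minimal sketch using only the JDK’s SAX interfaces. The class name MapEntityResolver and the in-memory map standing in for a catalog are my own assumptions for illustration; this is not how the XML Resolver is implemented.

```java
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical stand-in for a catalog-backed resolver: maps system
// identifiers to DTD text held in memory.
final class MapEntityResolver implements EntityResolver {
    private final Map<String, String> catalog;

    MapEntityResolver(Map<String, String> catalog) {
        this.catalog = catalog;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) {
            return null;
        }
        String text = catalog.get(systemId);
        if (text == null) {
            // A miss: returning null leaves it to the parser, which in
            // practice tries to read systemId itself.
            return null;
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
```

Installed with XMLReader.setEntityResolver or DocumentBuilder.setEntityResolver, a resolver like this returns null for “sample2.dtd” unless the map has an entry for its URI, and the parser is then on its own.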

Trouble is, an HTTP GET request on that URI will return a 301 response code and the literal text:

Redirecting to https://xmlresolver.org/ns/sample/sample2.dtd

If the parser naïvely turns around and tries to use that text as a DTD for validation, bad happens. Instead the parser needs to follow the redirect to get the https version of the resource. And if it doesn’t, you’re kind of hosed.
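For what it’s worth, following the redirect by hand isn’t much code. The sketch below uses only java.net; the class name RedirectFollower, the helper names, and the limit of ten hops are assumptions of mine, and it is not the code the resolver uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper: opens a URI and follows HTTP redirects manually,
// including ones that switch from http: to https:.
final class RedirectFollower {
    static final int MAX_REDIRECTS = 10; // arbitrary limit, an assumption

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // The Location header may be relative, so resolve it against the
    // URI that produced the redirect.
    static URI nextLocation(URI current, String locationHeader) {
        return current.resolve(locationHeader);
    }

    static InputStream open(URI start) throws IOException {
        URI current = start;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            URLConnection conn = current.toURL().openConnection();
            if (!(conn instanceof HttpURLConnection)) {
                return conn.getInputStream(); // file:, jar:, and so on
            }
            HttpURLConnection http = (HttpURLConnection) conn;
            // Handle redirects here; the built-in handling does not
            // cross from http: to https:.
            http.setInstanceFollowRedirects(false);
            int status = http.getResponseCode();
            String location = http.getHeaderField("Location");
            if (!isRedirect(status) || location == null) {
                return http.getInputStream();
            }
            http.disconnect();
            current = nextLocation(current, location);
        }
        throw new IOException("Too many redirects starting from " + start);
    }
}
```

The point is that the body of a 301 response is never handed back as if it were the DTD; only the response at the end of the redirect chain is.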

Maybe you could change the system identifier in your document, but what if the DTD contains

<!ENTITY % blocks SYSTEM "http://xmlresolver.org/ns/sample/blocks2.dtd">
%blocks;

What then? You can’t change that URI because it’s in a file on the server.

I punted. I added an “always resolve” feature that tells the resolver, if it can’t find the resource in a catalog, to read it from the URI and return that. The code in the resolver that reads the URI will attempt to follow redirects and return the “final” resource.
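The idea can be sketched as a wrapper around any SAX EntityResolver. Everything named here (AlwaysResolveSketch, the Fetcher hook) is hypothetical and only illustrates the behavior; the resolver’s real implementation and configuration are not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical wrapper: never returns null for a non-null system
// identifier; on a catalog miss it reads the resource itself.
final class AlwaysResolveSketch implements EntityResolver {
    // Assumed hook for "read this URI, following redirects".
    interface Fetcher {
        InputStream fetch(String uri) throws IOException;
    }

    private final EntityResolver catalogLookup;
    private final Fetcher fetcher;

    AlwaysResolveSketch(EntityResolver catalogLookup, Fetcher fetcher) {
        this.catalogLookup = catalogLookup;
        this.fetcher = fetcher;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        InputSource found = catalogLookup.resolveEntity(publicId, systemId);
        if (found != null || systemId == null) {
            return found; // catalog hit, or nothing to fetch
        }
        // Catalog miss: fetch the resource so the parser never has to.
        InputSource fetched = new InputSource(fetcher.fetch(systemId));
        fetched.setPublicId(publicId);
        fetched.setSystemId(systemId);
        return fetched;
    }
}
```

Because the parser always gets a source back, whether it knows how to follow redirects no longer matters.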

It’s worth noting that this is the API contract in the .NET version of the resolver. The System.Xml API requires the resolver to always return a resource. I’ve ranted against that design, but at least it avoids this problem.

Two more things of note.

If you really just want to interrogate the catalog, you can use the CatalogManager API directly. That is, in fact, what Saxon does in its standard URI resolver. The new feature has no effect on the catalog manager; that really is just about looking things up in the catalog.

I’ve augmented the various sources returned by the entity resolving methods so that they can tell you the headers and response code returned. (See if they’re instances of ResolvedResourceInfo; I can’t think of a better way.) For file: and other schemes that don’t have headers and response codes, the headers are empty and the response code is always 200.

Feedback welcome.