Updated XML Resolver for Java
Making the resolver do the heavy lifting (to avoid errors caused by HTTP redirects).
Updated 19 February 2023: I’ve pushed a bug fix. You want 5.1.0 or later!
I’ve pushed version 5.0.0 of the XML Resolver for Java. This version introduces a new “always resolve” feature, enabled by default.
The problem I found was that some (perhaps most) documents have DTDs and schemas
identified with http: URIs. Consider this test:
<!DOCTYPE book SYSTEM "http://xmlresolver.org/ns/sample/sample2.dtd">
<book>
<title>DTD/redirect test</title>
<article>
<p>DTD/redirect test</p>
</article>
</book>
If you don’t have “sample2.dtd” in any catalog, the XML Resolver won’t
find it. The entity resolver API (in Java) says that the resolver
returns null if it can’t find the resource. In principle, this gives
the parser an opportunity to try resolving it some other way. In
practice, all I’ve ever seen parsers try to do is read the resource
from the URI.
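To make that contract concrete, here is a minimal sketch using only the JDK’s SAX API. The MapCatalogResolver class and its in-memory map are hypothetical stand-ins for a real catalog, not code from the XML Resolver; the only point is the null return on a miss.

```java
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for a catalog: maps system identifiers to local DTD text. */
class MapCatalogResolver implements EntityResolver {
    private final Map<String, String> catalog;

    MapCatalogResolver(Map<String, String> catalog) {
        this.catalog = catalog;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String text = (systemId == null) ? null : catalog.get(systemId);
        if (text == null) {
            // Not in the catalog: returning null hands the problem back to the
            // parser, which will typically try to open the system identifier itself.
            return null;
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setSystemId(systemId);
        return source;
    }
}
```

A parser configured with a resolver like this only ever sees local content for identifiers in the map. For everything else it is on its own, which is where the redirect trouble starts.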
Trouble is, an HTTP GET request on that URI will return a 301
response code and the literal text:
Redirecting to https://xmlresolver.org/ns/sample/sample2.dtd
If the parser naïvely turns around and tries to use that text as a DTD
for validation, bad happens. Instead, the parser needs to follow the
redirect to get the https version of the resource. And if it
doesn’t, you’re kind of hosed.
Maybe you could change the system identifier in your document, but what if the DTD contains
<!ENTITY % blocks SYSTEM "http://xmlresolver.org/ns/sample/blocks2.dtd">
%blocks;
What then? You can’t change that URI because it’s in a file on the server.
I punted. I added an “always resolve” feature that tells the resolver, if it can’t find the resource in a catalog, to read it from the URI and return that. The code in the resolver that reads the URI will attempt to follow redirects and return the “final” resource.
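For a rough idea of what “follow redirects and return the final resource” involves, here is a sketch that uses only JDK classes. It is not the resolver’s actual implementation: the RedirectFollower class, its method names, and the limit of five hops are all assumptions made for illustration, and it only handles http: and https: URIs. As far as I know, HttpURLConnection won’t follow a redirect from http to https by itself, so the loop does it by hand.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper: fetch an http(s) resource, following redirects by hand. */
class RedirectFollower {
    static final int MAX_REDIRECTS = 5; // arbitrary limit for this sketch

    /** True for the status codes this sketch treats as redirects. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    /** Resolve a Location header (possibly relative) against the requested URI. */
    static URI nextUri(URI requested, String location) {
        return requested.resolve(location);
    }

    /** Open the resource, following redirects, including http to https. */
    static InputStream open(URI uri) throws IOException {
        URI current = uri;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            // Assumes an http: or https: URI; other schemes would fail the cast.
            HttpURLConnection conn =
                (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we follow redirects ourselves
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn.getInputStream(); // the "final" resource
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location from " + current);
            }
            current = nextUri(current, location);
        }
        throw new IOException("Too many redirects starting from " + uri);
    }
}
```

Calling RedirectFollower.open on the sample2.dtd URI above would, under these assumptions, hand back the stream for the https version rather than the “Redirecting to” text.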
It’s worth noting that this is the API contract in the .NET version
of the resolver. The System.Xml resolver API requires the resolver to
always return a resource. I’ve ranted against that design, but at
least it avoids this problem.
Two more things of note.
- If you really just want to interrogate the catalog, you can use the
  CatalogManager API directly. That is, in fact, what Saxon does in its
  standard URI resolver. The new feature has no effect on the catalog
  manager directly; that really is just about looking things up in the
  catalog.
- I’ve augmented the various sources returned by the entity resolving
  methods so that they can tell you the headers and response code
  returned. (See if they’re instances of ResolvedResourceInfo; I can’t
  think of a better way.) The headers are empty and the response code is
  always 200 for file: and other schemes that don’t have headers and
  response codes.
Feedback welcome.