Ring any bells?
When Xerces resolves a URI on Windows, it does it…differently than on Linux and MacOS. And incorrectly.
I’m actually just going to ignore this bug. As far as I can tell, it only happens in the unit test. It doesn’t seem to happen when the code is running in integration. (I can’t explain that either.)
The situation is, I’m parsing a document that has a parameter entity declaration in its internal subset. (There’s another bit of weirdness here that has to do with the way base URIs are being passed around in the parser. Also, apparently, only in the unit test. If some of the relative URIs look a little odd, that’s why.) The parameter entity loads some general entities:
<!ENTITY % gls.entities SYSTEM "src\test\iss0184\master\glossary\gls.ent">
The problem is that there are backslashes in that “URI”. That causes java.net.URI to throw an exception. I found the problem and fixed it; that’s not where things get weird.
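For the record, the backslash failure is easy to reproduce in isolation. What follows is a minimal sketch, not the fix that actually went into the resolver: the class name and the `normalizeSeparators` helper are made up for illustration, and replacing every backslash with a forward slash is just one plausible way to get a string that java.net.URI will accept.

```java
import java.net.URI;
import java.net.URISyntaxException;

class BackslashUri {
    // Hypothetical helper: treat backslashes as path separators.
    // This is an assumption about a reasonable repair, not the resolver's code.
    static String normalizeSeparators(String systemId) {
        return systemId.replace('\\', '/');
    }

    // True if java.net.URI accepts the string as written.
    static boolean parses(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException ex) {
            // A backslash is not a legal URI character, so we land here.
            return false;
        }
    }

    public static void main(String[] args) {
        String systemId = "src\\test\\iss0184\\master\\glossary\\gls.ent";
        System.out.println("raw parses:        " + parses(systemId));
        System.out.println("normalized parses: " + parses(normalizeSeparators(systemId)));
    }
}
```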
With some debugging messages added to the test, on Linux and MacOS, we get:
OPEN: src/test/iss0184/src/SBBVT0T-Deployment-Flat-mod.xml
BASE: file:///Volumes/Projects/xmlresolver/resolver/
ABS: file:/Volumes/Projects/xmlresolver/resolver/src/test/iss0184/src/SBBVT0T-Deployment-Flat-mod.xml
What that says is, a request was made for the XML file, it was resolved against the base URI, and the absolute URI shown was returned. All well and good.
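That resolution step can be sketched with plain java.net.URI. This is only an illustration using the values from the log above; the class and method names are invented, and the resolver's real code path may well differ.

```java
import java.net.URI;

class ResolveAgainstBase {
    // Resolve a relative reference against a base URI.
    static URI absolute(String base, String relative) {
        return URI.create(base).resolve(relative);
    }

    public static void main(String[] args) {
        URI abs = absolute(
            "file:///Volumes/Projects/xmlresolver/resolver/",
            "src/test/iss0184/src/SBBVT0T-Deployment-Flat-mod.xml");
        // Print the whole URI and just its path component.
        System.out.println(abs);
        System.out.println(abs.getPath());
    }
}
```

The base URI ends with a slash, so the relative path is simply appended to it. If the base did not end with a slash, its last path segment would be dropped before the relative path was appended.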
The next request from the parser to the entity resolver is to resolve the system identifier in the parameter entity:
RE: src\test\iss0184\master\glossary\gls.ent
That’s the URI of the parameter entity and it resolves fine against the same base URI. Still all well and good.
I injected an exception so that I could get a stack trace. Here’s how we get to “RE:”:
java.lang.RuntimeException: bang
at org.xmlresolver.Resolver.resolveEntity(Resolver.java:222)
at org.xmlresolver.tools.ResolvingXMLFilter.resolveEntity(ResolvingXMLFilter.java:140)
at org.apache.xerces.util.EntityResolverWrapper.resolveEntity(Unknown Source)
at org.apache.xerces.impl.XMLEntityManager.resolveEntity(Unknown Source)
at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
…
The Xerces utility class calls resolveEntity with the relative URI. That’s what I consider the expected behavior.
On Windows, not so much.
We start about the same:
OPEN: src/test/iss0184/src/SBBVT0T-Deployment-Flat-mod.xml
BASE: file:///C:/tmp/ab/xmlresolver/
ABS: file:/C:/tmp/ab/xmlresolver/src/test/iss0184/src/SBBVT0T-Deployment-Flat-mod.xml
The file paths are different, but like before, we resolve the document URI against the base URI and get back an absolute URI that correctly resolves to the XML document.
But the next request from the parser is:
RE: file:///C:/tmp/ab/xmlresolver/src/test/iss0184/src/src/test/iss0184/master/glossary/gls.ent
The stack trace is the same, so once again this is what the Xerces utility class is passing to the entity resolver.
Why in the name of all things has the Xerces parser on Windows done some URI resolution before calling the resolver? I don’t think it’s non-conformant for the parser to do this, but:
- It got it wrong. Look at that URI, it has an extra src/test/iss0184/ in the path!
- Why does it only do it on Windows?
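One thing the logged values do allow is a bit of arithmetic on where the extra path segments could come from. The sketch below resolves the system identifier (with its backslashes turned into slashes) against two candidate bases: the BASE: value and the document's own ABS: URI. It is only an observation about the strings in the log. It assumes nothing about what Xerces does internally, the class and method names are invented, and it does nothing to explain why only Windows is affected.

```java
import java.net.URI;

class WhichBase {
    // Normalize backslashes (java.net.URI rejects them), then resolve.
    static URI resolve(String base, String systemId) {
        return URI.create(base).resolve(systemId.replace('\\', '/'));
    }

    public static void main(String[] args) {
        String systemId = "src\\test\\iss0184\\master\\glossary\\gls.ent";
        // The BASE: value from the Windows log.
        String projectBase = "file:///C:/tmp/ab/xmlresolver/";
        // The document's own URI, built from the ABS: value in the Windows log.
        String documentUri =
            "file:///C:/tmp/ab/xmlresolver/src/test/iss0184/src/SBBVT0T-Deployment-Flat-mod.xml";

        System.out.println(resolve(projectBase, systemId).getPath());
        System.out.println(resolve(documentUri, systemId).getPath());
    }
}
```

The second path has the same shape as the one in the “RE:” line from Windows, doubled src/ and all. So the value handed to the resolver is at least consistent with the system identifier having been resolved against the document's URI rather than against the project directory.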
Tracing through the Xerces codebase is much too tedious for this time on a Monday afternoon. And tracing through anything on Windows is too tedious to begin with!
I wrote this weblog post half in the hope that it would function as a rubber duck. It did not. Anyone seen this before? Am I being completely stupid in some way that’s obvious to everyone except me? To anyone!?