More accurate locations

Volume 9, Issue 34; 08 Dec 2025

When it all goes wrong, can I tell you where?

When you’re working with a tool like XProc that can perform a lot of different tasks, all other things being equal, you’d like all of the tasks to be implemented on the same data model. All other things are not equal.

XML Calabash is built on top of Saxon. Parsing an XML document (or a JSON map, or basically anything that isn’t binary) builds a Saxon data model. That’s great because it’s easy to evaluate XPath expressions on it, query or transform it, etc.

But what happens when you ask for something not implemented natively on the Saxon data model, for example RELAX NG validation? I could implement RELAX NG validation natively on top of the Saxon data model, but I haven’t. I rely on Jing to do the validation. Jing expects an input source, so effectively the document gets parsed again for validation. Performance consequences aside, this is still a problem.

Consider:

<?xml version="1.0"?>
<document status="checked
                  rechecked
                  checked-again
                  final">
  <para id="nope">The attribute name should be xml:id.</para>
</document>

With this schema:

start = document
document = element document {
  attribute status { text },
  para+
}
para = element para {
  attribute xml:id { xsd:ID },
  text
}

If you run validation on the command line with Jing, you’ll get a report that there’s a problem on line 6, column 19.

Put the validation in a pipeline:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                name="main" version="3.1">
  <p:output port="result"/>

  <p:validate-with-relax-ng>
    <p:with-input port="source" href="document.xml"/>
    <p:with-input port="schema" href="document.rnc"/>
  </p:validate-with-relax-ng>
</p:declare-step>

What happens now? In the past (versions before 3.0.31), you’d get a report that there’s an error on line 2. What!?

Why the difference?

To resolve href="document.xml", XML Calabash parses the document. This happens before the p:validate-with-relax-ng runs. Unfortunately, what’s happened here is that parsing has removed the XML declaration and performed attribute value normalization. (In principle, the XMLn’t parser could also be used here, but that’s not a conformant parser, so I wouldn’t use it except where you really need it. It’s possible that validation or some other downstream process depends on attribute value normalization!) The document is effectively:

<document status="checked rechecked checked-again final">
  <para id="nope">The attribute name should be xml:id.</para>
</document>

Consequently, the para is on line 2. The longer the document, and the more attributes there are that change when normalized, the further the locations will drift.
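
The drift is easy to reproduce outside of XML Calabash. The following is a minimal sketch that uses Python's standard library in place of Saxon and Jing, which is an assumption made purely for illustration. The names LineFinder and element_lines are hypothetical helpers. It parses the document, serializes the parsed tree back to text, and asks a SAX parser where the para starts in each version.

```python
import xml.sax
import xml.etree.ElementTree as ET
from xml.sax.handler import ContentHandler

# The example document from above, exactly as it appears on disk.
ORIGINAL = """<?xml version="1.0"?>
<document status="checked
                  rechecked
                  checked-again
                  final">
  <para id="nope">The attribute name should be xml:id.</para>
</document>
"""


class LineFinder(ContentHandler):
    """Hypothetical helper: record the line where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.lines = {}

    def setDocumentLocator(self, locator):
        self._locator = locator

    def startElement(self, name, attrs):
        self.lines[name] = self._locator.getLineNumber()


def element_lines(text):
    """Parse text with SAX and return a map of element name to line."""
    finder = LineFinder()
    xml.sax.parseString(text.encode("utf-8"), finder)
    return finder.lines


# Parsing drops the XML declaration and replaces the line breaks inside
# the status attribute with spaces (attribute value normalization).
root = ET.fromstring(ORIGINAL)

# Stand-in for the old approach: serialize the tree and parse it again.
reserialized = ET.tostring(root, encoding="unicode")

print(element_lines(ORIGINAL)["para"])      # 6
print(element_lines(reserialized)["para"])  # 2
```

The parser here reports where the start tag begins, so only the line numbers are comparable with the Jing report, not the columns.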

For as long as XML Calabash has existed, all the way back to the 1.0 implementation, there’s been a utility method to turn a Saxon tree back into an input source for APIs that require one. It has always been preceded by this comment:

// FIXME: THIS METHOD IS A GROTESQUE HACK!

Not my finest work. But it got the job done and there were a lot of other jobs to do. What that method did was serialize the tree back into a character stream and set up an input source on that. (I already confessed it was a hack!)

I’ve been meaning to fix that for a long time. A recent bug report about exactly the problem I described above was the encouragement I needed to take a different approach. I wrote an adapter that walks the tree, directly generating SAX events, including location events based on the locations from the original parse. Now XML Calabash knows the error was on line 6 as well!
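
To give a feel for the shape of that adapter, here is a minimal sketch of the idea in Python. It is not the XML Calabash code: the Node, TreeBuilder, StoredLocator, replay, and Checker names are all hypothetical, the tiny tree stands in for the Saxon data model, and the Checker is only a stand-in for a real validator. The tree remembers the line of each element from the original parse, and the walker updates a locator before firing each SAX event.

```python
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl, Locator

ORIGINAL = """<?xml version="1.0"?>
<document status="checked
                  rechecked
                  checked-again
                  final">
  <para id="nope">The attribute name should be xml:id.</para>
</document>
"""


class Node:
    """Hypothetical stand-in for a tree node that remembers its source line."""

    def __init__(self, name, attrs, line):
        self.name, self.attrs, self.line = name, attrs, line
        self.children = []  # a mix of Node and str


class TreeBuilder(ContentHandler):
    """Build the tree once, recording where each element started."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []

    def setDocumentLocator(self, locator):
        self._locator = locator

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs), self._locator.getLineNumber())
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        if self._stack:
            self._stack[-1].children.append(content)


class StoredLocator(Locator):
    """A locator that reports whatever line the walker last set."""

    def __init__(self):
        self.line = -1

    def getLineNumber(self):
        return self.line


def replay(root, handler):
    """Walk the tree and fire SAX events, with locations from the first parse."""
    locator = StoredLocator()
    handler.setDocumentLocator(locator)
    handler.startDocument()

    def walk(node):
        locator.line = node.line  # original position, not a reserialized one
        handler.startElement(node.name, AttributesImpl(node.attrs))
        for child in node.children:
            if isinstance(child, str):
                handler.characters(child)
            else:
                walk(child)
        # Only start positions are stored, so end events reuse the last one.
        handler.endElement(node.name)

    walk(root)
    handler.endDocument()


class Checker(ContentHandler):
    """Toy consumer: note the line of any para that lacks xml:id."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def setDocumentLocator(self, locator):
        self._locator = locator

    def startElement(self, name, attrs):
        if name == "para" and "xml:id" not in attrs:
            self.problems.append(self._locator.getLineNumber())


builder = TreeBuilder()
xml.sax.parseString(ORIGINAL.encode("utf-8"), builder)
checker = Checker()
replay(builder.root, checker)
print(checker.problems)  # [6]
```

Because the consumer never sees reserialized text, the line it reports is the one from the file on disk.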

I can’t decide if that approach should be packaged up for independent reuse. It’s not especially difficult, but it did take me a dozen years or more to get around to writing it.