XMLn’t

Volume 9, Issue 16; 31 Aug 2025

Could :: Couldn’t. Should :: Shouldn’t. Would :: Wouldn’t. XML :: XMLn’t.

Last week, John Dziurlaj organized the first “XProc user group” meeting. (Thanks, John!) The idea is that folks interested in XProc will get together quarterly and have a chat (online). Sign-up details are on the xproc-dev mailing list. Agendas are T.B.D., but not strictly necessary.

One topic of conversation last week was how possible or practical it would be to transform XProc 1.0 pipelines to XProc 3.1 pipelines mechanically. In principle, you could write an XSLT transformation to do some of the conversion. This post isn’t really about how practical that would be.

The problem with transforming XProc (or XSLT or Schematron or XML Schema or basically any XML vocabulary that has XPath expressions in attribute values) is that the XML parser performs attribute value normalization.

You give the parser this:

<doc>
   <if test="significant
             line
             breaks">
      …
   </if>
</doc>

and you get back this:

<doc>
   <if test="significant              line              breaks">
      …
   </if>
</doc>

This is not an uncommon source of irritation. (Erik Siegel had the clever idea that you might be able to recover the line breaks by analyzing the pattern of repeated spaces. And you might. But I didn’t try that.)
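For the curious, here is a minimal Python sketch of both halves of that parenthetical: what a stock parser (the standard library's ElementTree, which sits on expat) does to the attribute, and one naive reading of the "repeated spaces" idea. The function name guess_line_breaks and the two-space threshold are my own assumptions for illustration, not anything Erik proposed in detail, and the heuristic will happily "recover" line breaks that were never there if the original value contained a genuine run of spaces.

```python
import re
import xml.etree.ElementTree as ET

# Same shape as the example above: 13 spaces of indentation on each
# continuation line of the attribute value.
indent = " " * 13
source = '<doc><if test="significant\n' + indent + 'line\n' + indent + 'breaks"/></doc>'

# A conforming parser replaces each newline in the attribute value with
# one space; the indentation that followed the newline is left alone.
normalized = ET.fromstring(source).find("if").get("test")


def guess_line_breaks(value, min_run=2):
    """Hypothetical heuristic: assume any run of min_run or more spaces
    was originally a newline followed by indentation."""
    def restore(match):
        # The first space of the run stands in for the lost newline.
        return "\n" + " " * (len(match.group(0)) - 1)
    return re.sub(r" {%d,}" % min_run, restore, value)


recovered = guess_line_breaks(normalized)
print(repr(normalized))
print(repr(recovered))
```

It works on this input only because the value contains no single spaces that were once newlines and no deliberate runs of spaces, which is exactly why a guess like this is not a substitute for a parser that keeps the newlines in the first place.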

Long story short: I hacked up a non-conforming XML parser (it’s “XMLn’t”) that doesn’t do attribute value normalization. You give it XML with newlines in the attribute values, and it constructs a data model with newlines in the attribute values.

And since I’d persuaded myself that this was what I should spend Saturday morning hacking, I decided to address another common frustration with XML parsing: entity references. Specifically, the fact that the XML parser expands them all.

You give the parser this:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE doc [
<!ENTITY spoon "Spoon!">
]>
<doc>
   <if test="significant
             line
             breaks">
      &spoon;
   </if>
</doc>

and you get back this:

<doc>
   <if test="significant              line              breaks">
      Spoon!
   </if>
</doc>

If you ask it to, xmlnt will preserve entity references. It does this by mapping each distinct entity reference to a distinct single character (by default, from the private use area, but you can pick any starting point you like). It builds a character map so that the serializer will put back the entity references.

(This has to be done with characters and not, for example, processing instructions because entity references can occur in attribute values.)
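The mapping idea can be sketched in a few lines of Python. To be clear, this is not how xmlnt does it: xmlnt makes the substitution inside its own parser and relies on a serializer character map, whereas the sketch below fakes both ends with a regular-expression pre-pass and a string replace around the standard library's ElementTree. The names hide_entities and restore_entities are made up for illustration, and the regular expression is naive (it would also rewrite references inside comments or CDATA sections).

```python
import re
import xml.etree.ElementTree as ET

PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
# Matches general entity references such as &spoon; but not character
# references such as &#10; (the name cannot start with '#').
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9._-]*);")


def hide_entities(text, start=0xE000):
    """Replace each distinct entity reference with a distinct character,
    counting up from `start` (the private use area by default)."""
    mapping = {}  # entity name -> placeholder character

    def swap(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # leave &amp; and friends to the parser
        if name not in mapping:
            mapping[name] = chr(start + len(mapping))
        return mapping[name]

    return ENTITY_REF.sub(swap, text), mapping


def restore_entities(text, mapping):
    """Stand-in for a serializer character map: put the references back."""
    for name, char in mapping.items():
        text = text.replace(char, "&" + name + ";")
    return text


# The DOCTYPE is omitted here: the parser never sees the reference, so it
# never needs the declaration.
source = '<doc><if test="&spoon; here">&spoon;</if></doc>'
hidden, mapping = hide_entities(source)

root = ET.fromstring(hidden)
root.set("wrapper", "testing")  # some edit made on the parsed tree

result = restore_entities(ET.tostring(root, encoding="unicode"), mapping)
print(result)
```

Because the placeholder is an ordinary character, it survives equally well in the attribute value and in the element content, which is the point of using characters rather than processing instructions.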

In other words, you can do things like this:

<p:add-attribute attribute-name="wrapper" attribute-value="testing">
  <p:with-input>
    <p:document href="/path/to/test.xml"
                parameters="map { 'cx:xmlnt': 'entities' }"/>
  </p:with-input>
</p:add-attribute>

where /path/to/test.xml is the document above, and the serialized result is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE doc [
<!ENTITY spoon "Spoon!">
]>
<doc wrapper="testing">
   <if test="significant
             line
             breaks">
      &spoon;
   </if>
</doc>

Note that the attribute has been added without otherwise changing the serialization!

The parser is built directly from the XML 1.0 fifth edition grammar. It is absolutely not a conformant processor. It makes little to no effort to check well-formedness constraints. The expectation is that you will feed it well-formed markup.

At the same time, it tries to be a useful parser. It reads the internal and external subsets, tries to resolve encodings correctly, attempts to handle parameter entity expansion, deals with include and ignore marked sections, etc. But it’s a morning’s hack. YMMV. Bug reports welcome.

It’s worth pointing out that the parser can’t actually preserve markup that is explicitly insignificant in XML, such as whitespace inside start and end tags or the choice of quote character for attribute values.

You give it this:

<doc
>
   <if double-quoted="this one"
    test='significant
          line
          breaks'
   >…</if
   >
</doc>

and you get back this:

<doc>
   <if double-quoted="this one"
       test="significant
          line
          breaks">…</if>
</doc>
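The same loss is easy to see with any ordinary parser, because the data model simply has nowhere to record these details. A small Python sketch using the standard library's ElementTree (not xmlnt, and with a single-line attribute value so that normalization does not muddy the picture):

```python
import xml.etree.ElementTree as ET

# Whitespace inside the tags and a single-quoted attribute value.
source = """<doc
>
   <if double-quoted="this one"
    test='single quoted'
   >…</if
   >
</doc>"""

root = ET.fromstring(source)
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The serializer has to pick its own layout inside the tags and its own quote character, so both attributes come back double-quoted on one line; only the whitespace between elements, which is part of the content, survives.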

At the moment, xmlnt makes no attempt to preserve CDATA sections, though it’s easy to imagine that it might be useful if it could. That would require even deeper changes in the serializer, which was enough of a nuisance that I decided not to bother. There’s enough weirdness in overriding bits of the serializer as it is.

I had thought to release this as a stand-alone parser, but for it to work, you need to address both parsing and serialization. It’ll be in the next release of XML Calabash, where I have easy access to both ends of the process.