XProc 3.0: Ready or Not

Volume 4, Issue 13; 18 Feb 2020

My presentation from XML Prague about XProc 3.0.

This weblog post is adapted from the presentation that I gave at XML Prague on 15 February, 2020. I’ve redrafted it to be more readable in a prose format.

The top-level agenda for this talk was:

  • Am I ready?
  • Are the specifications ready?
  • Are the implementations ready?
  • Are you ready?

Experimentally, I composed and presented this talk in Emacs using Org-mode; I’ll post my ~/.emacs file and the setup I used for the presentation “real soon” now. (Yes, I can get XML out of Org.) I thought it worked pretty well.

Am I ready?

Yes, I am. Born ready. The XML: bring it.

Are the specifications ready?

XProc 3.0: An XML Pipeline Language went into last call just a few days before this conference last year. We started a second last call in December. This was motivated by:

  • Useful feedback from implementors and early adopters.
  • Improved semantics for p:if/p:choose.
  • Removal of p:document-properties-document().
  • Tightening the constraint on base URI so that it must always be a legal URI per RFC 3986.
  • Improvements to the semantics of p:viewport.
  • Clarification of a number of details in the specification.

XProc 3.0: Standard Step Library defines the core XProc steps. If you’re familiar with the steps in XProc 1.0, you’ll see some familiar faces in this list:

p:add-attribute       p:json-merge          p:text-head
p:add-xml-base        p:label-elements      p:text-join
p:archive             p:load                p:text-replace
p:archive-manifest    p:make-absolute-uris  p:text-sort
p:cast-content-type   p:namespace-delete    p:text-tail
p:compare             p:namespace-rename    p:unarchive
p:compress            p:pack                p:uncompress
p:count               p:rename              p:unwrap
p:delete              p:replace             p:uuid
p:error               p:set-attributes      p:wrap-sequence
p:filter              p:set-properties      p:wrap
p:hash                p:sink                p:www-form-urldecode
p:http-request        p:split-sequence      p:www-form-urlencode
p:identity            p:store               p:xinclude
p:insert              p:string-replace      p:xquery
p:json-join           p:text-count          p:xslt

We plan to move XProc 3.0: Standard Step Library into last call in February of this year.

In fact, shall we just do that now!?

«Minor theatrical interlude where Norm makes a show of creating the pull requests that publish the last call draft!»

Additional step libraries will also be published, but we aren’t holding up “the 3.0 release” for them.

  • Dynamic pipeline evaluation (p:run)
  • Text-, file-, OS-, and mail-related steps
  • Paged media steps (XSL-FO + CSS)
  • RDF/Semantic Web steps
  • Validation steps
    • Extensible Validation Report Language

We’ll continue to work on these as fast as we can after we have the specification and standard step libraries finished.

In fact: you can help. We’d be delighted if domain experts in some areas, such as linked data, would help us construct a library of useful steps.

Are the implementations ready?

There are two implementations:

  • MorganaXProc-III by Achim Berndzen
  • XML Calabash 3.0 by Norman Tovey-Walsh

Both run on the JVM and we plan to have plug-and-play interoperability between them at the API level.

Check out the test suite to track our progress!

  • MorganaXProc-III: 2 February 2020
    • Passed 1784/1785 tests, 99.94%
      • 1 failure

Achim’s failing test is pretty minor: it’s the wrong error code from an exception. Now that p:try can have multiple p:catch statements that select on the error code, we’re trying to be much more rigorous about error codes.
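
For example, a pipeline can now handle “couldn’t load the document” separately from every other failure. Here’s a small sketch of mine, not from the talk; I believe err:XD0011 is the code p:load raises when a document can’t be read, but treat the code and the fallback documents as illustrative:

<p:try xmlns:err="http://www.w3.org/ns/xproc-error">
  <p:load href="config.xml"/>

  <!-- Catch only the "document could not be loaded" error;
       fall back to an empty configuration. -->
  <p:catch code="err:XD0011">
    <p:identity>
      <p:with-input><config/></p:with-input>
    </p:identity>
  </p:catch>

  <!-- Catch anything else. -->
  <p:catch>
    <p:identity>
      <p:with-input><config failed="true"/></p:with-input>
    </p:identity>
  </p:catch>
</p:try>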

  • XML Calabash: 16 December 2019
    • Passed 1203/1608 tests, 74.82%
      • 405 failures, 2 skipped

Those are my results from December. I was going to update them, but I couldn’t improve on the 74.82%, so I didn’t! This is mostly because of a lot of new tests related to p:http-request, most of which I still fail.

Are you ready?

XProc doesn’t exist in a vacuum. You have to use it within the ecosystem of tools and processes you already have. If I had to distill the XProc 3.0 changes down to just two highlights, they’d be: support for non-XML formats as first-class citizens and ease of use.

I want to talk about both of those.

But first, an aside about formats. Once upon a time, there were XML APIs. Back in those days, a web service might have been described like this (I’m not saying this is an especially good description, semantically, but that’s not relevant to the point I’m trying to make):

<webservice xmlns="http://nwalsh.com/ns/webservice">
  <host>localhost</host>
  <protocol>http</protocol>
  <service>
    <crew-list>/cgi-bin/demo</crew-list>
    <crew-bio>/cgi-bin/demo/{serialNumber}</crew-bio>
  </service>
</webservice>

But XML was too hard:

  • Namespaces are confusing.
  • Making sure that start and end tags match is too hard.

«Minor theatrical interlude where Norm attempts to make a joke.»

There’s an obvious solution, yes? The solution is obvious: S-expressions!

(webservice (host "localhost") (protocol "http")
            (service
             (crew-list "/cgi-bin/demo")
             (crew-bio "/cgi-bin/demo/{serialNumber}")))

S-expressions? No? No. Nevermind. You’re not ready.

«End of “joke”.»

But seriously, the solution was JSON:

{"host": "localhost",
 "protocol": "http",
 "service": {
   "crew-list": "/cgi-bin/demo",
   "crew-bio": "/cgi-bin/demo/{serialNumber}"
 }
}

JSON is easier and better because all you have to do is match up the curly braces.

Just…match…up…the…curly…braces…

The error in this payload is obvious, right?

{"AWSTemplateFormatVersion":"2010-09-09","Parameters":
{"KeyName":{"Type":"AWS::EC2::KeyPair::KeyName"},
"InstanceType":{"Type":"String","Default":"t2.small",
"AllowedValues":["t1.micro","t2.nano","t2.micro",
"t2.small"]},"SSHLocation":{"Type": "String","MinLength":
"9","MaxLength": "18","Default": "0.0.0.0/0",
"AllowedPattern":
"(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})"
}},"Mappings":{"AWSInstanceType2Arch":{"t1.micro":
{"Arch":"HVM64"},"t2.nano":{"Arch":"HVM64"},"t2.micro":
{"Arch":"HVM64"},"t2.small":{"Arch":"HVM64"},
"AWSInstanceType2NATArch":{"t1.micro":{"Arch":"NATHVM64"},
"t2.nano":{"Arch":"NATHVM64"},"t2.micro":{"Arch":"NATHVM64"},
"t2.small":{"Arch":"NATHVM64"}},"AWSRegionArch2AMI":
{"us-east-1":{"HVM64":"ami-0080e4c5bc078760e","HVMG2":
"ami-0aeb704d503081ea6"},"us-west-2":{"HVM64":
"ami-01e24be29428c15b2","HVMG2":"ami-0fe84a5b4563d8f27"},
"us-west-1":{"HVM64":"ami-0ec6517f6edbf8044","HVMG2":
"ami-0a7fc72dc0e51aa77"}}},"Resources":{"EC2Instance":
{"Type":"AWS::EC2::Instance","Properties":{"InstanceType":
{"Ref":"InstanceType"},"SecurityGroups":[{"Ref":
"InstanceSecurityGroup"}],"KeyName":{"Ref":"KeyName"},
"ImageId":{"Fn::FindInMap":["AWSRegionArch2AMI",{"Ref":
"AWS::Region"},{"Fn::FindInMap":["AWSInstanceType2Arch",
{"Ref":"InstanceType"},"Arch"]}]}}},"InstanceSecurityGroup":
{"Type":"AWS::EC2::SecurityGroup","Properties":
{"SecurityGroupIngress":[{"IpProtocol":"tcp","FromPort":"22",
"ToPort":"22","CidrIp":{"Ref":"SSHLocation"}}]}}},"Outputs":
{"InstanceId":{"Value":{"Ref":"EC2Instance"}},"AZ":{"Value":
{"Fn::GetAtt":["EC2Instance","AvailabilityZone"]}},
"PublicDNS":{"Value":{"Fn::GetAtt":["EC2Instance",
"PublicDnsName"]}},"PublicIP":{"Value":{"Fn::GetAtt":
["EC2Instance","PublicIp"]}}}}

Maybe you think I’m cheating by showing you a badly formatted example. It’s obvious here, then, yes?

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Parameters": {
    "KeyName": {
      "Type": "AWS::EC2::KeyPair::KeyName"
    },
    "InstanceType": {
      "Type": "String",
      "Default": "t2.small",
      "AllowedValues": [
        "t1.micro",
        "t2.nano",
        "t2.micro",
        "t2.small"
      ]
    },
    "SSHLocation": {
      "Type": "String",
      "MinLength": "9",
      "MaxLength": "18",
      "Default": "0.0.0.0/0",
      "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})"
    }
  },
  "Mappings": {
    "AWSInstanceType2Arch": {
      "t1.micro": {
        "Arch": "HVM64"
      },
      "t2.nano": {
        "Arch": "HVM64"
      },
      "t2.micro": {
        "Arch": "HVM64"
      },
      "t2.small": {
        "Arch": "HVM64"
    },
    "AWSInstanceType2NATArch": {
      "t1.micro": {
        "Arch": "NATHVM64"
      },
      "t2.nano": {
        "Arch": "NATHVM64"
      },
      "t2.micro": {
        "Arch": "NATHVM64"
      },
      "t2.small": {
        "Arch": "NATHVM64"
      }
    },
    "AWSRegionArch2AMI": {
      "us-east-1": {
        "HVM64": "ami-0080e4c5bc078760e",
        "HVMG2": "ami-0aeb704d503081ea6"
      },
      "us-west-2": {
        "HVM64": "ami-01e24be29428c15b2",
        "HVMG2": "ami-0fe84a5b4563d8f27"
      },
      "us-west-1": {
        "HVM64": "ami-0ec6517f6edbf8044",
        "HVMG2": "ami-0a7fc72dc0e51aa77"
      }
    }
  },
  "Resources": {
    "EC2Instance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "InstanceType": {
          "Ref": "InstanceType"
        },
        "SecurityGroups": [
          {
            "Ref": "InstanceSecurityGroup"
          }
        ],
        "KeyName": {
          "Ref": "KeyName"
        },
        "ImageId": {
          "Fn::FindInMap": [
            "AWSRegionArch2AMI",
            {
              "Ref": "AWS::Region"
            },
            {
              "Fn::FindInMap": [
                "AWSInstanceType2Arch",
                {
                  "Ref": "InstanceType"
                },
                "Arch"
              ]
            }
          ]
        }
      }
    },
    "InstanceSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "SecurityGroupIngress": [
          {
            "IpProtocol": "tcp",
            "FromPort": "22",
            "ToPort": "22",
            "CidrIp": {
              "Ref": "SSHLocation"
            }
          }
        ]
      }
    }
  },
  "Outputs": {
    "InstanceId": {
      "Value": {
        "Ref": "EC2Instance"
      }
    },
    "AZ": {
      "Value": {
        "Fn::GetAtt": [
          "EC2Instance",
          "AvailabilityZone"
        ]
      }
    },
    "PublicDNS": {
      "Value": {
        "Fn::GetAtt": [
          "EC2Instance",
          "PublicDnsName"
        ]
      }
    },
    "PublicIP": {
      "Value": {
        "Fn::GetAtt": [
          "EC2Instance",
          "PublicIp"
        ]
      }
    }
  }
}

Yes, I’m making fun of JSON. Because it’s fun.

To avoid the problems of randomly nested “{”, “[”, “}”, and “]”, another modern solution is YAML. I will spare you the long CloudFormation example, which is what we’ve been looking at in JSON; here’s the service description for the demo in YAML:

host: localhost
protocol: http
service:
  crew-list: "/cgi-bin/demo"
  crew-bio: "/cgi-bin/demo/{serialNumber}"

YAML is an alternate serialization of (a superset of) JSON. It trades all those curly braces and square brackets for significant whitespace, indented “-” characters, and other tricks. What could possibly go wrong?
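
One famous example (mine, not from the talk): YAML 1.1 guesses the types of unquoted scalars, which occasionally backfires:

# These all look like strings, but a YAML 1.1 parser disagrees:
country: NO        # the boolean false (the "Norway problem")
version: 3.10      # the number 3.1
start: 12:30       # the sexagesimal integer 750

Quoting the values avoids the surprises, of course, but then you’re back to matching up delimiters.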

Non-XML formats · Ok, enough poking fun. The real point here is to show how an XProc pipeline can process a variety of non-XML sources in interesting ways.

At MarkupUK, Achim demonstrated a sophisticated pipeline that dealt with complex JSON from a web service. I wanted something a little simpler for this presentation. I’m going to load the YAML description of a web service, convert that YAML to JSON, use that JSON to construct a call to the web service, and transform the JSON returned into CSV. Look ma, no angle brackets at all!

Load a YAML service description

In XProc 1.0, you can only load and process XML documents. (There are a few edge cases where you can bend the rules with base64 encoded documents, but generally, support is limited to XML.) In XProc 3.0, you can load any kind of document you like.

The p:load step will happily load YAML:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                version="3.0">
  <p:output port="result"/>

  <p:load href="examples/webservice.yaml"/>

</p:declare-step>

This produces:

host: localhost
protocol: http
service:
  crew-list: "/cgi-bin/demo"
  crew-bio: "/cgi-bin/demo/{serial-number}"

(My implementation just dumps out the octets of a non-XML document, so we’re lucky this one is text.)

Convert YAML to JSON

Next, we can use p:cast-content-type to convert it to JSON. Casting between content types is a convenient way to do a selection of “trivial” conversions: image/svg+xml to application/xml, for example, which doesn’t actually change the data at all. It will also do some identity-like transformations such as text to JSON, by attempting to parse the text as JSON.

Implementations are free to support additional castings and I decided that YAML to JSON was a reasonable case.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                version="3.0">
  <p:output port="result"/>

  <p:load href="examples/webservice.yaml"/>

  <p:cast-content-type content-type="application/json"/>

</p:declare-step>

This time, the output is, as I hope you expected, JSON:

{"host": "localhost",
 "protocol": "http",
 "service": {"crew-list": "/cgi-bin/demo",
 "crew-bio": "/cgi-bin/demo/{serial-number}"}}

Call the web service described

Now the context item is JSON. JSON is represented in XProc using an XPath map. We can easily access the properties of the map in an attribute value template to construct the href URI on a p:http-request step.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                version="3.0">
  <p:output port="result"/>

  <p:load href="examples/webservice.yaml"/>

  <p:cast-content-type content-type="application/json"/>

  <p:http-request
     href="{?protocol}://{?host}{?service?crew-list}"/>

</p:declare-step>

The result of this pipeline is whatever the service produces. In this case, it produces JSON:

[{"name": "James T. Kirk",
"rank": "Captain",
"serialNumber": "SC937-0176 CEC"}, {"name": "Spock",
"rank": "Captain, retired",
"serialNumber": "S179-276SP"}, {"name": "Leonard H. McCoy",
"rank": "Admiral, retired"}, {"name": "Montgomery Scott",
"rank": "Captain",
"serialNumber": "SE-197-54T"}]

Format as CSV

The last step I promised was to convert this JSON to CSV. I think a vocabulary of CSV-related steps would be a good idea, but we don’t have any at the moment. What to do instead?

  • Overload p:cast-content-type even further?
  • Define a new step?
  • Hack it with XSLT?

I could have extended p:cast-content-type further, but that doesn’t feel right. In practice, only a very small subset of possible JSON documents can reasonably be represented in CSV, so it doesn’t feel like casting between near-equals.

A new step would obviously have been possible, but I thought it would be interesting to do it with XSLT.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                version="3.0">
  <p:output port="result"
            serialization="map{'method': 'text'}"/>

  <p:load href="examples/webservice.yaml"/>

  <p:cast-content-type content-type="application/json"/>

  <p:http-request
     href="{?protocol}://{?host}{?service?crew-list}"/>

  <p:xslt template-name="convert">
    <p:with-option name="global-context-item" select="."/>
    <p:with-input port="source"><p:empty/></p:with-input>
    <p:with-input port="stylesheet" href="json2csv.xsl"/>
  </p:xslt>

</p:declare-step>

This produces the expected result:

"serialNumber","name","rank"
"SC937-0176 CEC","James T. Kirk","Captain"
"S179-276SP","Spock","Captain, retired"
"","Leonard H. McCoy","Admiral, retired"
"SE-197-54T","Montgomery Scott","Captain"

Yay!

If you haven’t used XSLT 3.0 much yet, the stylesheet may be interesting:

  • It expects to start at a named template: convert
  • It has no primary input document. Note, in particular, that I couldn’t pass the map to the step as its primary input. Maps aren’t nodes, so you can’t do that.
  • Instead it processes the global-context-item which is specified as an option.

Here’s the json2csv.xsl stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:map="http://www.w3.org/2005/xpath-functions/map"
 xmlns:array="http://www.w3.org/2005/xpath-functions/array"
 exclude-result-prefixes="xs"
 version="3.0">

<xsl:output method="text" encoding="utf-8"
            indent="no"/>

<xsl:template name="convert">
  <xsl:variable name="results" select="."/>

  <xsl:variable name="keys"
                select="map:keys(array:get(.,1))"/>
  <xsl:for-each select="$keys">
    <xsl:if test="position() gt 1">,</xsl:if>
    <xsl:text>"</xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>"</xsl:text>
  </xsl:for-each>
  <xsl:text>&#10;</xsl:text>

  <xsl:for-each select="1 to array:size(.)">
    <xsl:variable name="result"
                  select="array:get($results, .)"/>
    <xsl:for-each select="$keys">
      <xsl:if test="position() gt 1">,</xsl:if>
      <xsl:text>"</xsl:text>
      <xsl:value-of select="map:get($result, .)"/>
      <xsl:text>"</xsl:text>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

(I make no assertions about the quality or generality of that stylesheet; it works for my demo!)

In short: a complete, useful transformation with no angle bracketed data at all!

Easier to use · In preparation for the pre-conference day, Achim converted the pipeline that transforms DocBook into other formats from XProc 1.0 to XProc 3.0. To highlight some of the ease-of-use improvements in XProc, I thought I’d walk briefly through some of the changes. And thank you, Achim!

Parameters

The 1.0 way:

<p:input port="parameters" kind="parameter"/>

XProc 1.0 has a complicated mechanism for dealing with parameters that involves a special kind of “parameter” input port. It’s all very complex and messy. The challenge is to deal with name/value pairs (i.e., parameters) whose names are not known until runtime.

In XProc 3.0, we can just use a map!

The 3.0 way:

<p:option name="parameters" select="map{}"/>

One tiny bit of magic behavior having to do with parameters passed into the top-level pipeline has been lost, but it was nowhere near valuable enough to justify all of the complexity.
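
Calling a pipeline (or step) with parameters is correspondingly simple. Here’s a hypothetical sketch of mine, with a made-up step type and parameter names; because the option is declared as a map rather than a string, the attribute value is an XPath expression (just like parameters="$all-parameters" later in this post):

<ex:format-docbook xmlns:ex="http://example.com/steps"
                   parameters="map{'toc': true(),
                                   'base-dir': 'build/html/'}"/>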

Output and serialization

The 1.0 way:

<p:output port="result" sequence="true" primary="true">
  <p:pipe step="process" port="result"/>
</p:output>

<p:serialization port="result" method="html"
                 encoding="utf-8" indent="false"
                 version="5"/>

In XProc 1.0, ports and their serializations are declared separately. When an explicit binding is required, it must be specified with a p:pipe element.

In XProc 3.0, we’ve made the serialization properties into a map on the p:output declaration and introduced a pipe attribute shortcut for explicit bindings.

The 3.0 way:

<p:output port="result" pipe="@process"
          sequence="true" primary="true" 
          serialization="map{'method'  : 'html',
                             'encoding': 'UTF-8',
                             'indent'  : false(),
                             'version' : '5'}"/>

It’s worth pointing out that in 3.0, the serialization parameters can be specified dynamically at runtime. It’s also possible for the processor to adapt to the data type, that is, to serialize XML as XML, JSON as JSON, etc. It’s now easier to have pipelines that produce different kinds of data.
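
As a sketch of the dynamic case (mine, not from the talk), here’s a pipeline using p:store whose serialization map is computed at runtime from a pipeline option:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:option name="format" select="'xml'"/>

  <!-- The serialization map is built at runtime from $format. -->
  <p:store href="result.out"
           serialization="map{'method': $format,
                              'indent': true()}"/>

</p:declare-step>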

Extensions

The 1.0 way:

<!-- Ideally, this pipeline wouldn't rely on XML Calabash
     extensions, but it's a lot more convenient this way.
-->

<p:declare-step type="pxp:set-base-uri">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:option name="uri" required="true"/>
</p:declare-step>

<p:declare-step type="cx:message">
  <p:input port="source" sequence="true"/>
  <p:output port="result" sequence="true"/>
  <p:option name="message" required="true"/>
</p:declare-step>

This pipeline wants to update the base URI of some documents, a feature only available via an extension step in XProc 1.0, and to produce progress messages.

The 3.0 way:

<!-- n/a -->

Both of those features are built in to XProc 3.0! (The underlying extension mechanisms are the same, we just don’t need them in this case.)

Conditional processing

The 1.0 way

<p:choose>
  <p:when test="$schema = ''">
    <p:output port="result"/>
    <p:identity/>
  </p:when>
  <p:otherwise>
    <p:output port="result"/>
    <p:load name="load-schema" xmlns:exf="http://exproc.org/standard/functions">
      <p:with-option name="href" select="resolve-uri($schema, exf:cwd())"/>
    </p:load>
    <p:validate-with-relax-ng>
      <p:input port="source">
        <p:pipe step="xinclude" port="result"/>
      </p:input>
      <p:input port="schema">
        <p:pipe step="load-schema" port="result"/>
      </p:input>
    </p:validate-with-relax-ng>
  </p:otherwise>
</p:choose>

XProc 1.0 requires the pipeline author to be explicit about every branch of a choice and to make sure that every branch produces exactly the same outputs.

In XProc 3.0, the constraint that every branch must produce exactly the same outputs has been relaxed. (It’s worth noting that every branch of a choice is still required to produce the same primary output, either on the same named port or on an implicit port in every case; this assures that implicit connections to the following step will always be consistent.) We’ve also defined the semantics of a “missing otherwise” as “perform the identity step”. This greatly simplifies many choices and allowed us to add a p:if step.

The 3.0 way:

<p:if test="$schema != ''">
  <p:validate-with-relax-ng>
    <p:with-input pipe="@xinclude"/>
    <p:with-input port="schema" href="{p:urify($schema)}"/>
  </p:validate-with-relax-ng>
</p:if>

Here we also see the simplifications provided by pipe and href attributes directly on the p:with-input element and the use of attribute value templates.

Per-document parameters

This pipeline has a feature that allows parameters to be defined in the input document. They’re extracted with XSLT and then they have to be combined with any other parameters that might have been passed in.

The 1.0 way:

<!-- combine them with the pipeline parameters -->
<p:parameters name="all-parameters">
  <p:input port="parameters">
    <p:pipe step="main" port="parameters"/>
    <p:pipe step="document-parameters" port="result"/>
  </p:input>
</p:parameters>

XProc 1.0 has a step that combines two of the special “parameter” input types into a single document.

In XProc 3.0, we’ve eliminated all that complexity. The extracted parameters are cast to JSON (a map) and then explicitly merged.

The 3.0 way:

<!-- combine them with the pipeline parameters -->
<p:cast-content-type content-type="application/json" />  

<p:variable name="all-parameters"
            select="map:merge(($parameters, .))"/>

Using the combined parameters

The p:xslt step that uses the parameters is (yet) another example of the complexity of parameter input ports.

The 1.0 way:

<p:xslt name="normalize">
  <p:input port="stylesheet">
    <p:document href="../preprocess/50-normalize.xsl"/>
  </p:input>
  <p:input port="parameters">
    <p:pipe step="all-parameters" port="result"/>
  </p:input>
</p:xslt>

In XProc 3.0, this is, once again, just a map!

The 3.0 way

<p:xslt name="normalize" version="2.0"
        parameters="$all-parameters">
  <p:with-input port="stylesheet"
     href="../preprocess/50-normalize.xsl"/>
</p:xslt>

In short: I hope that demonstrates that XProc 3.0 is simpler and easier to use!

Thank you

I managed to finish with time left for a few questions. I hope that’s not because I talked too fast.

There were several questions, but the one that stands out in my mind was about backwards compatibility: given all these changes, are XProc 3.0 processors expected to process XProc 1.0 pipelines?

Unfortunately, the answer is “no”. We expect to be able to provide stylesheets that will help “uptranslate” pipeline documents from 1.0 to 3.0, but we decided not to require every 3.0 implementation to support 1.0 pipelines. The changes are just too dramatic.

Comments

I have a question about the explanation of the XSLT 3 code saying "It has no primary input document. Note, in particular, that I couldn’t pass the map to the step as its primary input. Maps aren’t nodes, so you can’t do that.".

Is that a restriction that p:xslt imposes?

Or of your current implementation?

The XSLT 3.0 spec in https://www.w3.org/TR/xslt-30/#invoking-initial-mode allows "any sequence": "The initial match selection will often be a single document node, traditionally called the source document of the transformation; but in general, it can be any sequence.".

Any sequence in my understanding includes a map.

And Saxon with http://saxonica.com/html/documentation/javadoc/net/sf/saxon/s9api/Xslt30Transformer.html#applyTemplates-net.sf.saxon.s9api.XdmValue- for instance would allow an Xdm map as the initial match selection.

—Posted by Martin Honnen on 19 Feb 2020 @ 12:05 UTC #

As an addition to my previous comment, isn't the test case https://test-suite.xproc.org/tests/ab-xslt-016.html which does use a JSON map as the input to p:xslt a direct contradiction of the claim "Note, in particular, that I couldn’t pass the map to the step as its primary input. Maps aren’t nodes, so you can’t do that"?

—Posted by Martin Honnen on 19 Feb 2020 @ 12:32 UTC #

You know, I made that remark after I observed what my implementation did. Now I'm not sure where the disconnect is. Would you be willing to raise these questions as an issue so that we can get input from the other editors?

My, off the top of my head, thought is that XProc limits documents to XDM nodes which is the problem here. But it's possible that the problem is only in my implementation. Or somewhere else.
