🧩 XML 🧩
I love how the pieces fit together. In this case, for a possibly mad idea about publishing with Typst.
On Saturday afternoon, I stumbled across a reference to Typst. I think I might have seen it once before, but I hadn’t really paid much attention. I’m not interested in using an online platform for authoring, but the typst compiler is a cross-platform, open-source application that I can run locally.
The idea is simple enough: you type in a sort of elaborated Markdown and the typst compiler turns it into an accessible PDF (and maybe other things). The typst markdown language is, in fact, a complete programming language with variables and conditionals and loops and all. That’s an idea that’s as old as TeX, possibly older. But typst is a modern reinvention.
I took a look and I wondered if I could use it to generate PDF. By which I mean, can I generate its flavor of Markdown from XML? (Because of course that’s what I mean!)
The answer to this question is obviously “yes”, but it’d be tedious and perhaps difficult to write an XML-vocabulary-to-typst stylesheet. And when I was done, I’d have that, but nothing especially reusable. Nothing that would help me when I had some other flavor of XML (say, expense reports in a bespoke vocabulary) that I wanted to typeset.
Speaking of TeX, I’ve had exactly this idea before. My intuition is that rather than going from the source vocabulary to TeX directly, it would make more sense to define an XML vocabulary for the TeX data model, transform to that XML from the source vocabulary, and transform from that XML to TeX. Trouble is, the TeX data model is…large, complicated, and rather loose. Every time I imagine trying to write an XML vocabulary for it, I get lost in the weeds before I get to the first hurdle.
Typst is a lot simpler and has a much cleaner model. I had the wild thought that this might be an opportunity to try out the conversion idea and test something else I’ve wondered about. Peter Flynn and I have talked about the TeX transformation; Peter thinks that the place to do the conversion is in a custom serialization method. Might work, but I’d have to try it to know for sure.
My brain, having found about the third rabbit hole of the weekend, plunged in. I’ve no particular interest in recreating the whole typst programming environment and language in XML, but if I can model enough of it, I’ll have a playground for comparing conversion-with-XSLT and conversion-with-serializer. (And maybe a new publishing tool.)
So what do we need? We need an XML model for typst, we need a schema for that model so that we have some guard rails, we need a stylesheet to convert that vocabulary to Markdown (serializer experiment to come later), and we need a stylesheet to convert an arbitrary XML vocabulary (cough DocBook) into the typst model.
I started looking at the typst documentation and constructing a RELAX NG grammar to model it in XML. A half hour in, I’d got a few things sort of working, but it was feeling a bit daunting. There’s a lot of detail and copying it from the documentation into RELAX NG is…wait. What? Why am I doing that? What is wrong with me?
The typst documentation lays out the model in a clear, regular format. This, for example, is a paragraph:
par(
  leading: length,
  spacing: length,
  justify: bool,
  justification-limits: dictionary,
  linebreaks: auto str,
  first-line-indent: length dictionary,
  hanging-indent: length,
  content,
) -> content
Clear, regular formats can be parsed. I grabbed all the model descriptions and stuffed them in a text file. Then I wrote an iXML grammar to turn them into XML:
<model name="par">
  <keyword name="leading">
    <choice name="length"/>
  </keyword>
  <keyword name="spacing">
    <choice name="length"/>
  </keyword>
  <keyword name="justify">
    <choice name="bool"/>
  </keyword>
  <keyword name="justification-limits">
    <choice name="dictionary"/>
  </keyword>
  <keyword name="linebreaks">
    <choice name="auto"/>
    <choice name="str"/>
  </keyword>
  <keyword name="first-line-indent">
    <choice name="length"/>
    <choice name="dictionary"/>
  </keyword>
  <keyword name="hanging-indent">
    <choice name="length"/>
  </keyword>
  <content>
    <choice name="content"/>
  </content>
</model>
The markup isn’t really important. It would be a little nicer if it were
hanging-indent/length instead of
keyword[@name='hanging-indent']/choice[@name='length'] but that’s not
possible directly in iXML. (Another rabbit hole opens up nearby,
wait, I have an idea…) It’s a more-or-less 1:1 translation of the
function signatures into XML. Critically, it distinguishes between positional
and keyword arguments and enumerates what the choices are for each argument.
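For illustration only, here is a minimal Python sketch of that translation. It is not the iXML grammar, which isn't shown here. The function name `parse_signature` is hypothetical, and the sketch assumes that arguments never contain commas and that every positional argument becomes a `content` element, which is all the `par` example demonstrates.

```python
import re
import xml.etree.ElementTree as ET

SIGNATURE = """\
par(
  leading: length,
  spacing: length,
  justify: bool,
  justification-limits: dictionary,
  linebreaks: auto str,
  first-line-indent: length dictionary,
  hanging-indent: length,
  content,
) -> content
"""

def parse_signature(text):
    """Turn one documented signature into a <model> element (hypothetical helper)."""
    match = re.match(r"\s*([\w-]+)\s*\((.*)\)\s*->\s*[\w-]+\s*$", text, re.S)
    if match is None:
        raise ValueError("not a recognizable signature")
    model = ET.Element("model", name=match.group(1))
    for arg in match.group(2).split(","):
        arg = arg.strip()
        if not arg:
            continue  # skip the empty piece after the trailing comma
        if ":" in arg:
            # keyword argument: "name: type type ..."
            name, types = arg.split(":", 1)
            parent = ET.SubElement(model, "keyword", name=name.strip())
        else:
            # positional argument: assumed to map to <content>
            types = arg
            parent = ET.SubElement(model, "content")
        for type_name in types.split():
            ET.SubElement(parent, "choice", name=type_name)
    return model

if __name__ == "__main__":
    print(ET.tostring(parse_signature(SIGNATURE), encoding="unicode"))
```

A grammar-based approach has the advantage that the structure of the input is declared rather than buried in string splitting; the sketch only shows how little information there is to extract.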
One XSLT stylesheet later and we have a fully formed RELAX NG grammar for all of typst, of which this is the corresponding part:
<define name="par">
  <element name="par">
    <optional>
      <ref name="par-leading"/>
    </optional>
    <optional>
      <ref name="par-spacing"/>
    </optional>
    <optional>
      <ref name="par-justify"/>
    </optional>
    <optional>
      <ref name="par-justification-limits"/>
    </optional>
    <optional>
      <ref name="par-linebreaks"/>
    </optional>
    <optional>
      <ref name="par-first-line-indent"/>
    </optional>
    <optional>
      <ref name="par-hanging-indent"/>
    </optional>
    <ref name="content"/>
  </element>
</define>
The par-leading pattern defines the leading element, but by using distinct
pattern names, I don’t have to worry about some other model that uses leading
in some different way.
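The generation step can be sketched in a few lines of Python. This is a stand-in for the XSLT stylesheet, which isn't shown, so treat the details as assumptions: the helper names `rng`, `add_choices`, and `model_to_defines` are hypothetical, and the shape of the per-keyword patterns (an element containing a reference to a datatype pattern, wrapped in a choice when several types are allowed) is a guess based on the output shown later.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# A shortened model, in the format produced from the function signatures.
MODEL = """
<model name="par">
  <keyword name="leading"><choice name="length"/></keyword>
  <keyword name="linebreaks"><choice name="auto"/><choice name="str"/></keyword>
  <content><choice name="content"/></content>
</model>
"""

def rng(tag, parent=None, **attrs):
    """Create a RELAX NG element, optionally as a child of parent."""
    qname = f"{{{RNG_NS}}}{tag}"
    if parent is None:
        return ET.Element(qname, attrs)
    return ET.SubElement(parent, qname, attrs)

def add_choices(parent, choices):
    """Reference each allowed type; wrap in <choice> only if there are several."""
    target = rng("choice", parent) if len(choices) > 1 else parent
    for c in choices:
        rng("ref", target, name=c.get("name"))

def model_to_defines(model):
    """Return the <define> patterns for one <model>: the function, then its keywords."""
    name = model.get("name")
    main = rng("define", name=name)
    element = rng("element", main, name=name)
    defines = [main]
    for child in model:
        if child.tag == "keyword":
            kw = child.get("name")
            pattern = f"{name}-{kw}"  # distinct per-model pattern name
            rng("ref", rng("optional", element), name=pattern)
            kw_define = rng("define", name=pattern)
            add_choices(rng("element", kw_define, name=kw), child.findall("choice"))
            defines.append(kw_define)
        else:
            # positional content is required, so no <optional> wrapper
            add_choices(element, child.findall("choice"))
    return defines

if __name__ == "__main__":
    for d in model_to_defines(ET.fromstring(MODEL)):
        print(ET.tostring(d, encoding="unicode"))
```

Prefixing the pattern name with the model name is what keeps `par-leading` separate from any other function's `leading`, even though both define an element with the same name.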
And I lied. That’s not really a complete RELAX NG grammar because it doesn’t
have patterns for the basic data types (int, str, bool, …). There are few enough
of them that I just scratched out a datatypes.rnc file and generated an
include for it in the typst.rng file.
Given what I’ve described so far, we need to process the text file with
Invisible XML, convert the resulting XML into RELAX NG, and convert the
datatypes.rnc file to an rng file. Then we’ll have a grammar we can use to
validate our typst XML input.
One short XProc pipeline later and that’s sorted.
But does it work? Let’s find out! Here’s a test document:
<article xmlns="http://docbook.org/ns/docbook">
  <title>A test article</title>
  <para>This is generated from <emphasis>DocBook</emphasis>.</para>
</article>
Two more (very experimental and very incomplete) stylesheets and another XProc pipeline later and we can turn our test document into this:
<typst xmlns="http://nwalsh.com/ns/typst">
  <document>
    <title>
      <content>A test article</content>
    </title>
  </document>
  <content>
    <heading>
      <level>
        <int>1</int>
      </level>
      <content>A test article</content>
    </heading>
    <par>
      <content>This is generated from <emph>
          <content>DocBook</content>
        </emph>.</content>
    </par>
  </content>
</typst>
and then this:
#set document(title: [A test article])
#heading(level: 1)[A test article]
#par()[This is generated from #emph()[DocBook].]
and then compile it with typst. Out comes an entirely reasonable PDF file!
Win!
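The last step, from the typst XML to typst markup, is regular enough to sketch in Python. This is an illustration, not the stylesheet that produced the output above. It assumes that every non-`content` child of a function element is a keyword argument, that `document` is emitted as a `#set` rule, and that text needs no escaping (real input would need typst's special characters escaped and `str` values quoted). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TYPST_XML = """<typst xmlns="http://nwalsh.com/ns/typst">
  <document><title><content>A test article</content></title></document>
  <content>
    <heading><level><int>1</int></level><content>A test article</content></heading>
    <par><content>This is generated from <emph><content>DocBook</content></emph>.</content></par>
  </content>
</typst>"""

def local(el):
    """Element name without its namespace."""
    return el.tag.split("}", 1)[-1]

def content(el):
    """Serialize mixed content: text interleaved with nested function calls."""
    parts = [el.text or ""]
    for child in el:
        parts.append(call(child))
        parts.append(child.tail or "")
    return "".join(parts)

def value(el):
    """Serialize the single typed value inside a keyword element."""
    inner = el[0]
    if local(inner) == "content":
        return f"[{content(inner)}]"
    return (inner.text or "").strip()  # int, bool, ... copied through unquoted

def call(el, prefix="#"):
    """Serialize a function element as name(keyword: value, ...)[content]."""
    args, body = [], ""
    for child in el:
        if local(child) == "content":
            body = f"[{content(child)}]"
        else:
            args.append(f"{local(child)}: {value(child)}")
    return f"{prefix}{local(el)}({', '.join(args)}){body}"

def serialize(root):
    """One line per document-level setting and per top-level block."""
    lines = []
    for child in root:
        if local(child) == "document":
            lines.append(call(child, prefix="#set "))
        else:
            lines.extend(call(block) for block in child)
    return "\n".join(lines)

if __name__ == "__main__":
    print(serialize(ET.fromstring(TYPST_XML)))
```

Because the XML mirrors the function signatures so closely, the serializer needs no per-function knowledge; that uniformity is also what makes a custom serialization method look plausible as an alternative.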
Obviously, neither the typst XML format nor the resulting Markdown output is anything a user would want to type by hand. But that’s not important because no user is going to.
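The first conversion, from the source vocabulary into the typst XML, is where the per-vocabulary work lives. As a rough Python sketch (the real stylesheets are XSLT and aren't shown), the mapping for the tiny test document might look like this. The mapping tables, the helper names, and the choice to turn the article title into a level-1 heading are assumptions drawn from the sample output, and any DocBook element outside the tables is simply not handled.

```python
import xml.etree.ElementTree as ET

DB = "{http://docbook.org/ns/docbook}"
TYPST = "http://nwalsh.com/ns/typst"

DOCBOOK = """<article xmlns="http://docbook.org/ns/docbook">
<title>A test article</title>
<para>This is generated from <emphasis>DocBook</emphasis>.</para>
</article>"""

# Hypothetical mapping tables: DocBook element -> typst function element.
INLINE = {DB + "emphasis": "emph"}
BLOCK = {DB + "para": "par"}

def t(tag, parent=None):
    """Create an element in the typst namespace, optionally under parent."""
    qname = f"{{{TYPST}}}{tag}"
    return ET.Element(qname) if parent is None else ET.SubElement(parent, qname)

def copy_inline(src, dest):
    """Copy src's mixed content into a new <content> child of dest."""
    body = t("content", dest)
    body.text = src.text
    for child in src:
        mapped = t(INLINE[child.tag], body)  # KeyError for unmapped inlines
        copy_inline(child, mapped)
        mapped.tail = child.tail
    return body

def convert(article):
    """Build the typst XML for a DocBook article with a title and paragraphs."""
    root = t("typst")
    title = article.find(DB + "title")
    copy_inline(title, t("title", t("document", root)))
    body = t("content", root)
    heading = t("heading", body)
    t("int", t("level", heading)).text = "1"
    copy_inline(title, heading)
    for child in article:
        if child.tag in BLOCK:
            copy_inline(child, t(BLOCK[child.tag], body))
    return root

if __name__ == "__main__":
    ET.register_namespace("", TYPST)
    print(ET.tostring(convert(ET.fromstring(DOCBOOK)), encoding="unicode"))
```

Supporting a different source vocabulary would mean replacing only this step; everything downstream of the typst XML stays the same.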
There are still some aspects of the typst data model that I don’t understand.
I’m having trouble, for example, working out exactly what’s allowed in content
in various places. Behind the scenes, I have a small hack that distinguishes
between “inline content” where text is allowed and “block content” where it
isn’t. But I’m pretty confident that’s me holding the wrong end of some stick.
Is this going to go anywhere? I dunno. But it was fun and I absolutely love how quickly things came together with the XML stack: Invisible XML, XML, XSLT, RELAX NG, and XProc really make short work of things.