Syntax highlighting with Pygments (Again!)

Volume 3, Issue 21; 29 Sep 2019

I’m trying to learn how to use CSS for paged media. It was all mostly fine until I decided I wanted syntax highlighting.

Several of us at XML Summer School use XML to write our presentations. We use the “slides” customization of DocBook. I have stylesheets that produce an HTML document with appropriate CSS and JavaScript to give a “paginated” feel. I even manage synchronous display across two browsers so that presenters can see their notes while projecting the slides. This is neither novel nor especially interesting.

For the HTML version, I use Prism for syntax highlighting. Prism is a JavaScript library that does highlighting in the browser for many different languages. (The picture that follows is a little blurry because of the cheap and cheerful way that I scaled it to be the same size as the picture of the PDF that follows below; the slides are nice and sharp in practice.)

HTML Output

In addition to being able to attend the sessions, each of the XML Summer School delegates gets a USB key with all of the presentations from all of the courses. The most convenient thing to put on the USB key is a PDF of each presentation.

I have some rather rickety XSL FO stylesheets to do this, but I decided this was an ideal project for learning more about CSS paged media: the files are small, the layouts are simple, and I already have CSS that produces the presentation I want.

My plan:

  1. Leverage the existing HTML stylesheets. It’s entirely possible to style arbitrary XML with CSS, but from a practical perspective, some transformations are always going to be necessary. Using HTML as the intermediate format makes sense.
  2. Run a post-processing pass to clean up some things in the generated HTML to prepare it for print: tidy up headers and footers (produce one header and one footer with markup suitable for print styling and delete the rest), discard speaker notes, links in the head, scripts, and markup that exists only to be manipulated by JavaScript. I also ended up reorganizing a few class attribute values. Note to self: all of the CSS needs to be refactored and cleaned up.
  3. Add CSS for paged styling.
  4. Run the whole thing through AntennaHouse and be done in an evening.
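As a rough illustration of the clean-up pass in step 2, here is a minimal Python sketch. It is not the code actually used for the slides, which is not shown here; the class name `speaker-notes` and the function name `prepare_for_print` are hypothetical, and the input is assumed to be well-formed XHTML without a namespace.

```python
import xml.etree.ElementTree as ET


def prepare_for_print(xhtml: str) -> str:
    """Drop scripts, head links, and speaker notes from an XHTML string.

    Assumption: speaker notes carry a (hypothetical) "speaker-notes" class.
    """
    root = ET.fromstring(xhtml)
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter()):
        classes = elem.get("class", "").split()
        if elem.tag in ("script", "link") or "speaker-notes" in classes:
            parent = parents.get(elem)
            if parent is not None:
                # Note: remove() also discards any tail text after the element.
                parent.remove(elem)
    return ET.tostring(root, encoding="unicode")
```

Tidying the headers and footers and reorganizing class values would be further passes in the same style; an XSLT identity transform with a few empty templates would do the same job.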

(All of this is a distraction from my XProc 3.0 work, but I need to do the XML Summer School changes while they’re fresh in my mind and I’m way overdue to put in a little maintenance effort on DocBook.)

It went pretty much according to plan. There was a fair bit of trial and error, plenty of web searching, a bit of reading, and more binary searching through CSS files to find errant rules than I’d care to admit. It took a couple of evenings, but I got there.

Then I noticed that the program listings in the PDF version weren’t syntax highlighted. This wasn’t new; they’d never been syntax highlighted in the XSL FO versions either. Never mind, I said to myself, hopelessly, it’s fine. Except it wasn’t fine and I wanted to fix it.

Years ago, I did this with Pygments by way of Jython. Then Jython development seemed to stall and it was a bit of a pain anyway. I switched to using JavaScript instead, but that won’t work here for obvious reasons.

I thought I’d find a JVM-based syntax highlighter, plug it in, and be done. There are between “some” and “many” syntax highlighters written in JavaScript. I thought there would be at least “several” to choose from on the JVM, but no. There’s xslthl, which dates back to the XSLT 1.0 stylesheet days, but it crashed on one of my source code fragments and I didn’t feel like trying to learn its codebase well enough to fix it.

I’m tempted to write one: a nice library with an Earley parser that takes an input document and a grammar and returns an AST in XML. But that is way too big a distraction right now if I ever want to catch up to MorganaXProc!

In the end, I said “[expletive deleted]!” and decided to just write an XProc step that uses p:exec to run Pygments.

Except you can’t in XProc 1.0. The problem is that if you have a program listing that contains embedded markup, you only want to send the text portions of the listing to the syntax highlighter, not the markup. That means a p:viewport that matches on text(), and that is something you can’t do in XProc 1.0. (You can do it in XProc 3.0!)
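To make the text-only requirement concrete, here is a small Python sketch of the idea, not the XProc step itself. It walks a listing, leaves the embedded markup where it is, and hands only the text nodes to a highlighter. The names `text_nodes` and `highlight_listing`, and the `hl` class, are invented for this example, and the `highlight` argument is a stand-in for a real highlighter.

```python
from xml.dom import minidom


def text_nodes(node):
    """Yield every text node below `node`, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        elif child.nodeType == child.ELEMENT_NODE:
            yield from text_nodes(child)


def highlight_listing(xml, highlight):
    """Apply `highlight` to each text node; embedded elements are untouched."""
    doc = minidom.parseString(xml)
    # Collect the nodes first so the tree is not mutated while walking it.
    for text in list(text_nodes(doc.documentElement)):
        span = doc.createElement("span")
        span.setAttribute("class", "hl")  # hypothetical class name
        span.appendChild(doc.createTextNode(highlight(text.data)))
        text.parentNode.replaceChild(span, text)
    return doc.documentElement.toxml()


# Stand-in highlighter: upper-case the text so the effect is visible.
result = highlight_listing("<pre>let <em>x</em> = 1</pre>", str.upper)
```

In `result`, the `em` element is still in place and only the three text runs around and inside it have been wrapped and transformed.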

I said “[expletive deleted]!” much more loudly and just wrote the damned step: xmlcalabash1-pygments. It runs Pygments (if you have pygmentize on your path, naturally) and cleans up the markup a bit: it removes the div and pre elements that Pygments inserts because those are likely to be redundant in the context where I imagine this step being used.
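The core of what such a step does can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the source of xmlcalabash1-pygments: it assumes `pygmentize` is on the path, that its HTML formatter wraps the output in `<div class="highlight"><pre>…</pre></div>` (the default), and the function names are made up for the example.

```python
import re
import subprocess

# Matches the wrapper that the Pygments HTML formatter emits by default.
WRAPPER = re.compile(
    r'^\s*<div class="highlight"><pre>(.*)</pre></div>\s*$', re.DOTALL
)


def strip_wrapper(html: str) -> str:
    """Return the highlighted spans without the outer div and pre."""
    match = WRAPPER.match(html)
    return match.group(1) if match else html


def pygmentize(code: str, lexer: str) -> str:
    """Run the pygmentize command on `code`, reading it from stdin."""
    result = subprocess.run(
        ["pygmentize", "-l", lexer, "-f", "html"],
        input=code,
        capture_output=True,
        text=True,
        check=True,  # raise if pygmentize is missing a lexer or fails
    )
    return strip_wrapper(result.stdout)
```

Calling something like `pygmentize("x = 1", "python")` once per text fragment is what makes the approach simple, and also what makes it slow: every call starts a new process.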

PDF Output 

In this context, the output is actually better than what Prism provides because it’s easy to adjust the style according to the markup vocabulary. Unfortunately, spinning up an external process for every fragment of every program listing does introduce a non-trivial performance penalty.

But it works and I’m willing to pay an extra few (tens of) seconds to get the improved results.