“Chunking” DocBook

Volume 4, Issue 27; 27 Apr 2020

Published by Norman Walsh

Breaking large DocBook documents into pages for the web is just a little non-trivial. I’ve rewritten how it works in the DocBook XSLT 2.0 Stylesheets.

Styling DocBook for the web is mostly straightforward. (DocBook has some markup that’s definitely not straightforward: callouts, CALS tables, and several of the programming language synopsis elements come to mind. But they’re very much in the minority.) For each DocBook element that you use, you just work out how to map it to HTML5, probably decorated with CSS and possibly made interactive with JavaScript.

That’s totally fine for small or even small-to-medium sized documents. It’s a little less friendly for really big documents. There are basically two approaches you can take to this problem: you can author in something that’s naturally composed of small pieces, like DITA topics or DocBook “Website” pages, or you can break the large document into pieces when you format it.

There are good, perhaps even compelling, arguments to be made for authoring in smaller units, but let’s leave that to one side. Books exist, they are structurally large and monolithic, and it’s entirely reasonable to want to publish them on the web. The natural, linear structure of a book makes the task logically easy: make separate pages and link them together “like a book”.

Here’s an example of the structure of a DocBook book, DocBook 5.2: The Definitive Guide in this case:

[Figure: Book tree]

Suppose we want to publish it so that each one of those nodes is a separate web page, a separate “chunk” as it were. Note that chunks can be nested arbitrarily (book, part, and chapter, for example, are all chunks) and that the nesting isn’t fixed. In a book with short chapters, you might put the chapters in the part chunks; conversely, in a book with very long chapters, you might make each top-level chapter section a separate chunk.

When I first set out to add chunking to the DocBook stylesheets, I decided that chunking should be implemented entirely as a customization layer on top of the base stylesheets. That is, the templates for book and part and chapter should neither know nor care if they were formatting as chunks or not.

In order to achieve this, I relied on import precedence. Consider this stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

<xsl:import href="base-stylesheet.xsl"/>

<xsl:template match="*">
  <xsl:apply-imports/>
</xsl:template>

</xsl:stylesheet>

If you transform with this stylesheet, the match="*" template will fire for every single element in your document. It immediately defers processing to the imported stylesheet and has basically no effect. But it’s a place to hang the chunking logic. (The real thing is a good deal more complicated than this. Each chunk page needs navigation elements, simple ID/IDREF cross references may now cross page boundaries, etc. This is an over-simplified example just to reveal the technique.) Something like this:

<xsl:template match="*">
  <xsl:choose>
    <xsl:when test="f:this-is-a-chunk(.)">
      <xsl:result-document href="{f:compute-chunk-uri(.)}">
        <xsl:apply-imports/>
      </xsl:result-document>
    </xsl:when>
    <xsl:otherwise>
      <xsl:apply-imports/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
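
The template above leans on two helper functions that aren’t shown. Here is a minimal sketch of what they might look like; it is not the code from the DocBook stylesheets. The namespace URI bound to f, the list of elements treated as chunks, and the file naming scheme are all assumptions made for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:db="http://docbook.org/ns/docbook"
                xmlns:f="http://example.com/ns/chunk-functions"
                exclude-result-prefixes="xs db f"
                version="2.0">

<!-- Assumption: the f namespace URI above is a placeholder, not the
     one the real stylesheets use. -->

<!-- Hypothetical rule: books, parts, chapters, appendixes, and
     refentries are chunks; nothing else is. -->
<xsl:function name="f:this-is-a-chunk" as="xs:boolean">
  <xsl:param name="node" as="element()"/>
  <xsl:sequence select="exists($node[self::db:book or self::db:part
                                     or self::db:chapter or self::db:appendix
                                     or self::db:refentry])"/>
</xsl:function>

<!-- Hypothetical naming scheme: use the xml:id if there is one,
     otherwise fall back to a generated identifier. -->
<xsl:function name="f:compute-chunk-uri" as="xs:string">
  <xsl:param name="node" as="element()"/>
  <xsl:sequence select="concat(if ($node/@xml:id)
                               then string($node/@xml:id)
                               else generate-id($node),
                               '.html')"/>
</xsl:function>

</xsl:stylesheet>
```

Keeping the decision in a function is what makes the nesting adjustable: changing which elements count as chunks means changing one predicate, not the templates.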

Ok, that works for a lot of documents. But fast forward far enough and a real-world use case involves customization to both the underlying base stylesheet and the way chunking is performed.

The first bug I encountered was that reference pages (abbrev, abstract, … year) were being both chunked and included inline in the Reference. It turned out that in my customization, my template for refentry had higher precedence than the chunking match="*" template. When I’d fixed that, I discovered that the templates that customized page headers and footers now had a lower precedence than the default versions, so they weren’t getting used.
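
To make the precedence problem concrete, here is a sketch of a customization layer. The file name chunk.xsl is hypothetical; it stands for a stylesheet containing the chunking match="*" template, which in turn imports the base stylesheets. The refentry markup is invented for the example. It shows only the precedence relationship, not the exact symptom I saw.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:db="http://docbook.org/ns/docbook"
                exclude-result-prefixes="db"
                version="2.0">

<!-- Hypothetical file: the chunking layer with the match="*" template. -->
<xsl:import href="chunk.xsl"/>

<!-- This template lives in the importing stylesheet, so it has higher
     import precedence than anything in chunk.xsl, including match="*".
     For refentry elements the chunking wrapper is never selected, so
     its xsl:result-document is not evaluated for them. -->
<xsl:template match="db:refentry">
  <div class="my-refentry">
    <xsl:apply-templates/>
  </div>
</xsl:template>

</xsl:stylesheet>
```

Import precedence beats template priority, so giving the match="*" template a higher priority attribute would not help here.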

Eventually, I persuaded myself (it was late, I was tired, I’m not 100% certain I’m right) that what I really needed was for the chunking logic to stand sort of halfway between the base stylesheets and my customizations. To achieve that, I think I’d have to cut and paste a whole bunch of the underlying chunking code into my customization.

And I just didn’t want to do that. Having to do that, it seemed to me, was too high a price to pay “just” to keep the base stylesheets agnostic to whether or not chunking was being performed.

In version 2.6.0 of the DocBook XSLT 2.0 Stylesheets, I’ve rewritten chunking. (You can’t get that version from Maven because publishing to Maven has just stopped working for me. But that’s a different problem for a different day.) The new code is much simpler and more reliable. But it does mean that the template for every element that could potentially be a chunk has to know this fact and be written accordingly, like this one for refentry:

<xsl:template match="db:refentry">
  <xsl:param name="processing-chunk-root" select="false()"/>
  <xsl:if test="$processing-chunk-root or not(f:chunk(.))">
    <article>
      <!-- … transform it here … -->
    </article>
  </xsl:if>
</xsl:template>

Each chunk will be processed once with $processing-chunk-root set to true() when it’s being written out as a chunk. On every other occasion when it’s processed, for example when the body of the reference chunk is being processed, it’ll produce no output.
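
The other half of that arrangement is whatever writes the chunks out. Here is a minimal sketch of such a driver, assuming a much simpler setup than the real stylesheets: the f namespace URI, this stand-in definition of f:chunk, the root template, and the output file names are all hypothetical.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:db="http://docbook.org/ns/docbook"
                xmlns:f="http://example.com/ns/chunk-functions"
                exclude-result-prefixes="xs db f"
                version="2.0">

<!-- Hypothetical stand-in for f:chunk(): only refentry is a chunk here. -->
<xsl:function name="f:chunk" as="xs:boolean">
  <xsl:param name="node" as="element()"/>
  <xsl:sequence select="exists($node/self::db:refentry)"/>
</xsl:function>

<!-- Hypothetical driver: visit every chunk once and write it to its
     own result document, telling its template that it is the root. -->
<xsl:template match="/">
  <xsl:for-each select="//*[f:chunk(.)]">
    <xsl:result-document href="chunk-{generate-id(.)}.html">
      <xsl:apply-templates select=".">
        <xsl:with-param name="processing-chunk-root" select="true()"/>
      </xsl:apply-templates>
    </xsl:result-document>
  </xsl:for-each>
</xsl:template>

<!-- Same shape as the refentry template above: output is produced only
     when this element is the chunk root or is not a chunk at all. -->
<xsl:template match="db:refentry">
  <xsl:param name="processing-chunk-root" select="false()"/>
  <xsl:if test="$processing-chunk-root or not(f:chunk(.))">
    <article>
      <xsl:apply-templates/>
    </article>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>
```

Because the parameter defaults to false(), a chunk reached through an ordinary xsl:apply-templates from its parent stays silent; only the driver’s explicit call produces the page.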

I may try again to do better, but that’s it for now.