Volume 1, Issue 2; 01 Mar 2017

You did what now?

This post, like the hello world post, is a natural one to write. It will be of passing interest to the sorts of folks who care how website sausage is made. If that doesn’t include you, feel free to wander off and read something more interesting.

History · My previous weblog had three distinct implementations. It started out as a site of mostly static HTML pages shaken and stirred by a bunch of Perl and Python (and XSLT). The XSLT transformed DocBook sources into HTML, the Perl extracted some RDF, ran it through an inference engine, and then cobbled all the bits and pieces together.

The second iteration ran on MarkLogic, with all the RDF pulled out and replaced by indexed XML markup. (Because MarkLogic of that vintage didn’t have support for semantics.)

In the last iteration, I put most of the RDF back. I did it partly to get some practical experience with the semantics features of MarkLogic, but also because SPARQL queries are convenient.

For this reboot, I wanted to go in a different direction. I chose a new direction partly because this weblog is an excuse for me to play with new technologies, but mostly because it’s been almost 15 years since I started doing this and that’s a long time in web years.

The old SGMLers’ dream of arbitrary markup on the web has come and gone. If you’re going to publish on the web, you’re going to do it with HTML, CSS, and JavaScript. That’s HTML5, CSS3, and ES6, sort of.

These days, you’re likely to write in Markdown instead of HTML, use SASS or Less instead of CSS, and jQuery (or maybe TypeScript or CoffeeScript) instead of “plain” JavaScript. (In addition, everything seems to be buried in frameworks of such rococo complexity that they’d make Louis XV wince, but that’s a topic for another day.)

Today · So what am I doing today? Today, this weblog is authored in Markdown (specifically CommonMark), styled with CSS (I confess, I haven’t made the switch to SASS or the like), and tarted up here and there with a bit of jQuery.

But there’s no DocBook!? No, in fact, there isn’t. I still think DocBook is great, but I wanted to try something different. Using DocBook for a weblog is a bit like driving to the corner grocery store in a Ferrari. It’ll get you there, but you’re not exactly taking advantage of it, are you?

The heavy lifting is still all provided by MarkLogic, of course. In addition to the pages themselves, there’s extracted (and inferred) semantic metadata on each article, plus a taxonomy and a big bag of semantic data. These are combined with XQuery and SPARQL to provide formatted pages plus all of the various views. (I know it’s possible to build web sites with tools other than MarkLogic, but for the love of all things, why put yourself through that hell? You know MarkLogic is a free download, right?)

CommonMark · I chose CommonMark for a couple of reasons. First, it has a really good specification. It’s not terribly long; it’s clearly written, unambiguous, and filled with useful examples. Second, there’s a complete and conformant JavaScript implementation of a CommonMark interpreter that produces well-formed HTML.

I’ve poked about with a number of markup flavors. In daily use, I have an affinity for org-mode because…Emacs. I have also used AsciiDoc which has good support for round tripping to DocBook. But neither of them has a clear, concise specification and, while they may have JavaScript implementations, those implementations can’t be as complete and conformant as the CommonMark interpreter. They can’t be because there’s no proper specification against which to write tests.

That matters because all of these less-than-XML markup formats have something in common: they make easy things easy. The less than easy things are…less than easy. The formats tend to introduce increasingly arbitrary punctuation to accomplish anything even moderately complicated. So knowing that there’s a bulletproof specification is what gives you confidence that you’ll never be surprised and, in particular, that the interpretation of punctuation won’t drift over time. I want to have confidence that the characters I write today will have the same interpretation at the end of the Unix epoch as they do now.

Actually, another point in CommonMark’s favor is the ruthlessly simple way that the specification deals with this dilemma: if it’s not easy, just stick in literal HTML markup. End of story. Literal HTML is a bit incongruous when you find it jutting out in the middle of your otherwise mostly markup-free prose, but it’s damned simple to understand.

CommonMark may be the bee’s knees for authoring, but I need to turn it into actual markup to make use of it. I need HTML to display and I need structured markup from which to derive indexes and semantic data. This is where the JavaScript interpreter comes in. Yes, I could write an interpreter for any of these formats if I wanted to, or call an external process, but the fact that the reference implementation just drops into MarkLogic is awfully sweet. Here it is:

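// Load the CommonMark reference implementation as a server-side module.
var commonmark = require('commonmark.sjs');

var reader = new commonmark.Parser();
var writer = new commonmark.HtmlRenderer();
var parsed = reader.parse(mdtext);   // mdtext holds the Markdown source
var result = writer.render(parsed);  // result is the rendered HTML string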

Stick the source markup in mdtext, call that module, and I get good, structured HTML back almost instantly. Perfect. Almost perfect.

What about? · Yes, exactly! What about those things? What about bibliographic metadata? What about hierarchical document elements? What about syntactic shortcuts for my particular editorial needs?

One of the absolute advantages of XML (the feature that makes it superior to Markdown and to HTML and to JSON and and and…) is its extensibility. It is always possible to extend an XML format simply by adding new markup. And that extension is always both apparent to consumers and ignorable by consumers.

But I don’t have XML this time, so I cheat.

CommonMark++ · Within the overall design of this weblog, I have four requirements that are not directly satisfied by CommonMark without resorting to inline HTML. Since they occur in almost every posting, I decided I wanted to handle them specially:

  1. Arbitrary bibliographic metadata
  2. Abstracts
  3. Epigraphs
  4. Extensible inline markup

I address these by imposing additional constraints on the input. In particular, these posts are not formed from completely arbitrary Markdown. Each posting has (must have!) the following format:

# The post title

A “paragraph” of arbitrary bibliographic metadata (see below).

A paragraph that is taken to be the abstract for the posting.

> An optional
> epigraph.

The rest of the input is the body of the post, which is
ordinary Markdown except for the special interpretation
of a particular inline syntactic extension.
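
The layout above can be sketched as a small splitter. This is only an illustration, not the code behind this weblog: it works on the raw Markdown text rather than on the converted HTML, it assumes the parts are separated by blank lines, and the names takeBlock and splitPost are hypothetical.

```javascript
// Remove leading blank lines, then split off the first blank-line-delimited block.
function takeBlock(text) {
  const trimmed = text.replace(/^(?:[ \t]*\n)+/, "");
  const end = trimmed.search(/\n[ \t]*\n/);
  if (end === -1) {
    return { block: trimmed.trimEnd(), rest: "" };
  }
  return { block: trimmed.slice(0, end), rest: trimmed.slice(end) };
}

// Hypothetical helper: split a post into title, metadata, abstract,
// optional epigraph, and body, following the required layout.
function splitPost(source) {
  let next = takeBlock(source.replace(/\r\n/g, "\n"));
  if (!/^# /.test(next.block)) {
    throw new Error("A post must start with a level-one heading");
  }
  const title = next.block.replace(/^#\s+/, "").trim();

  next = takeBlock(next.rest);
  const metadata = next.block;

  next = takeBlock(next.rest);
  const abstract = next.block;
  if (metadata === "" || abstract === "") {
    throw new Error("A post needs a metadata paragraph and an abstract");
  }

  // The epigraph is optional: it is present only when the next block is a block quote.
  let epigraph = null;
  const peek = takeBlock(next.rest);
  if (peek.block.startsWith(">")) {
    epigraph = peek.block
      .split("\n")
      .map((line) => line.replace(/^>\s?/, ""))
      .join("\n");
    next = peek;
  }

  // Everything that remains is the body, kept verbatim apart from outer blank lines.
  const body = next.rest.replace(/^(?:[ \t]*\n)+/, "").trimEnd();
  return { title, metadata, abstract, epigraph, body };
}
```

One consequence of this reading of the convention is that a post whose body opens with a block quote has that quote taken as the epigraph.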

The bibliographic metadata is further encoded into keyword/value pairs like so:

:uri: /2017/03/01/how
:subject: SelfReference
:where: us-tx-austin
:anytoken: Any value

Without some sort of extension for metadata, I don’t see how to use Markdown in a publishing context without considerable inconvenience. Well, I suppose if you’re working in a system where the metadata can be tracked externally, you don’t need to put it in the documents.
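
Reading those keyword/value pairs back out takes only a few lines. The sketch below is an assumption about how it might be done, not the weblog's actual code: parseMetadata is a hypothetical name, and collecting repeated keys into arrays is a guess (a post could plausibly carry more than one :subject:).

```javascript
// Hypothetical helper: parse a paragraph of ":key: value" lines into a Map
// from each key to the array of values given for it, in order of appearance.
function parseMetadata(paragraph) {
  const metadata = new Map();
  for (const rawLine of paragraph.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") {
      continue; // tolerate stray blank lines
    }
    const match = /^:([^:\s]+):\s*(.*)$/.exec(line);
    if (!match) {
      throw new Error("Not a metadata line: " + line);
    }
    const key = match[1];
    const value = match[2].trim();
    if (!metadata.has(key)) {
      metadata.set(key, []);
    }
    metadata.get(key).push(value);
  }
  return metadata;
}
```

Rejecting lines that don't match, rather than silently skipping them, means a mistyped key shows up as an error at publishing time instead of as missing metadata later.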

My inline syntactic extension is really just laziness. The markup could absolutely be inserted as HTML. But typing

<a href="">Topic</a>

every time I want to refer to a Wikipedia page, or

<span class="person" data-person="Walsh,Norman">Norman Walsh</span>

every time I want to refer to a person, just seemed too tedious and intrusive. That’s a completely arbitrary value judgement and given the amount of markup that I’ve happily typed in my life, may even be a bit hypocritical. But the fact remains that that’s what I decided.

For the use cases I have in mind, it’s sufficient to encode a keyword, a token, and a string. After a few minutes skimming the CommonMark spec, I concluded that I could hijack the sequence “{:”. In particular, I could encode arbitrary inline metadata for my own purposes like so:

{:keyword:token “string”}

It makes the reference to a Wikipedia {:wiki:Topic} or personal name, like {:person:Walsh,Norman “Norman Walsh”}, easier to type and less intrusive to the flow of the paragraph (for the editor). And, naturally, once the mechanism existed, I found another half dozen uses for it.
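
A minimal sketch of expanding that shorthand follows. It rests on several assumptions: it rewrites the Markdown source before conversion (so the result passes through CommonMark as literal inline HTML), whereas the weblog post-processes the HTML; it accepts straight or curly double quotes around the string; and it falls back to the token as display text when no string is given. The names expandInline, renderSpan, and handlers are hypothetical. The default output is modeled on the person span shown earlier. The link target for wiki references is left to a caller-supplied handler.

```javascript
// Matches {:keyword:token} and {:keyword:token "string"}.
// Both straight and curly double quotes are accepted (an assumption).
const INLINE = /\{:([A-Za-z][\w-]*):([^\s{}"“”]+)(?:\s+["“]([^"“”]*)["”])?\}/g;

// Escape text for use in HTML content or in a double-quoted attribute.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Default rendering: the keyword becomes both the class name and the
// data attribute name, as in the person example.
function renderSpan(keyword, token, text) {
  return '<span class="' + keyword + '" data-' + keyword + '="' +
    escapeHtml(token) + '">' + escapeHtml(text) + "</span>";
}

// handlers is a hypothetical map from keyword to a rendering function
// with the same signature as renderSpan.
function expandInline(source, handlers = {}) {
  return source.replace(INLINE, (whole, keyword, token, string) => {
    const text = string === undefined ? token : string;
    const render = Object.prototype.hasOwnProperty.call(handlers, keyword)
      ? handlers[keyword]
      : renderSpan;
    return render(keyword, token, text);
  });
}
```

Because new keywords fall through to the default span, adding another use for the mechanism needs no code change until that keyword wants special rendering. A pass over the raw source like this would also rewrite matches inside code spans, which is one reason to prefer working on the converted HTML.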

Putting it all together · To write a post, I author it in Markdown according to my conventions. I usually do this in Emacs, but I can also do it in SimpleMDE. Regardless, the Markdown is eventually sent to the weblog via an HTTP POST.

The CommonMark JavaScript converts it to HTML. The HTML is post-processed according to my conventions. Semantic metadata is added, inference is performed (using ad hoc queries today, perhaps using MarkLogic inferencing in the future), and the result is stored in the database, ready to be served up.

(Well. Mostly ready. In fact, I do a little bit of additional processing in some cases. But it’s not especially interesting and the result can be cached so that responding to requests is nearly instantaneous.)

  • Yes, I know about web components. Yes. Maybe. But not yet and, frankly, I expect not ever.