tree-sitter iXML

Volume 7, Issue 39; 22 Aug 2023

An Emacs editing mode for Invisible XML using tree-sitter. “It works.” And it wasn’t even hard! And it works!

TL;DR, https://github.com/nineml/tree-sitter-ixml/.

Last summer, Syd asked about an Emacs mode for editing Invisible XML grammars. I fussed with it a bit. It’s certainly possible. Modes have been made for much more complicated things, after all. But it was complicated and tedious and I didn’t wanna. (Michael Sperberg-McQueen got further than I did.)

This past Sunday afternoon, I was tinkering with my Emacs configuration, mostly so that I could try out tree-sitter. Tree-sitter is (roughly speaking) an application-independent framework for syntax-aware editing. Historically, Emacs and lots of other editors used complex regular expressions to provide syntax highlighting: function names in red, quoted strings in blue, keywords in green, that sort of thing.

That has worked very well, except for the complicated and tedious part. It also doesn’t help (or help much) with anything else you might want to do in a directed way. Like renaming a function, for example.

The way tree-sitter works is that it parses the text that you’re editing and provides an abstract syntax tree (AST), a representation of the document with the structures identified logically, to the editor. Like so:

(source_file
 (rule
  (rulename ixml)
  (rulesep)
  (alt (nonterminal) (termsep) (nonterminal))
  (fullstop))
 (rule (mark) (rulename) (rulesep)
  (alt (nonterminal))
  (altsep)
  (alt (nonterminal) (termsep) (nonterminal))
  (fullstop))
 (rule (rulename) (rulesep)
  (alt (nonterminal) (termsep) (nonterminal) (termsep) (nonterminal) (termsep)
   (ERROR (tmark) ")
   (nonterminal) (termsep) (nonterminal) (termsep)
   (literal
    (quoted (tmark)
     (string " ")))
   (termsep) (nonterminal) (termsep) (nonterminal))
  (fullstop))
  …

This is very powerful. Suppose, for example, you want to rename a function from “foo” to “bar”. The editor doesn’t just find-and-replace all occurrences of “foo” with “bar”; it finds all the functions and function calls named “foo” and renames them “bar”, leaving the variable names, strings, comments, and other occurrences alone.

Somewhat less ambitiously, because the editor knows which things are functions, function calls, variable references, etc., it can highlight them appropriately.
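To make the renaming example concrete, here’s a minimal sketch. It doesn’t use tree-sitter at all: the leaves below are a hand-built stand-in for what a parser might report for a tiny program, and the node types (`call_name`, `string`, and so on) and the `renameFunction` helper are names I’ve made up for illustration.

```javascript
// Leaves of a hand-built toy syntax tree for:  foo(1); s = "foo"; foo(s);
// The node types are illustrative assumptions, not tree-sitter's output.
const leaves = [
  { type: "call_name", text: "foo" },
  { type: "punct", text: "(" },
  { type: "number", text: "1" },
  { type: "punct", text: "); " },
  { type: "variable", text: "s" },
  { type: "punct", text: " = " },
  { type: "string", text: '"foo"' },
  { type: "punct", text: "; " },
  { type: "call_name", text: "foo" },
  { type: "punct", text: "(" },
  { type: "variable", text: "s" },
  { type: "punct", text: ");" },
];

// Rename only nodes the "parser" has identified as function names.
function renameFunction(nodes, from, to) {
  return nodes.map((n) =>
    n.type === "call_name" && n.text === from ? { ...n, text: to } : n
  );
}

// Reassemble source text from the leaves.
const render = (nodes) => nodes.map((n) => n.text).join("");

const original = render(leaves);
const renamed = render(renameFunction(leaves, "foo", "bar"));
const naive = original.replace(/foo/g, "bar"); // plain find-and-replace

console.log(renamed); // the string literal "foo" is left alone
console.log(naive);   // the string literal is clobbered too
```

The only difference between the two results is the string literal: the tree-aware version leaves it alone because it was never tagged as a function name.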

Point is, tree-sitter is built into Emacs 29.1 and I wanted to play with it. (I’ve been wanting to play with it since I started running the Emacs 29 pretests, but it never got to the top of my list.)

It was fairly easy to set up and pretty soon I had it working for a bunch of languages. It was at this point that I wondered: how hard would it be to make an Invisible XML mode with tree-sitter?

Step 1: make a tree-sitter grammar for Invisible XML. This was kind of weird and meta, but I got there. Tree-sitter grammars are defined in JavaScript, even though they’re compiled to C/C++. (I don’t know, I’m not sure I want to know.)

Here’s the tree-sitter rule for an iXML rule:

rule: ($) => seq(optional($.mark), $.rulename, $.rulesep,
                 optional($._alts), $.fullstop)

Tree-sitter is more like most conventional grammars (and less like Invisible XML): it tokenizes (“lexes”) the input before parsing; it wants the grammar to be unambiguous (or mostly: there are mechanisms for dealing with some ambiguity); it uses only a single character or token of lookahead; it has rules for precedence; and so on. This is all completely reasonable because its goal is to be fast, fast enough to parse between every keystroke when you’re editing.

One tree-sitter constraint that bites kind of hard (and has disabused me of the idea that I might write a translator from iXML grammars to tree-sitter grammars) is that you may not have a nonterminal that matches the empty string. All nonterminals have to consume something.

Right away, that constraint means that the tree-sitter grammar for Invisible XML has to be different from the iXML grammar. In Invisible XML, an alt can be empty (which also means alts can be empty). I (believe I) worked around that by making alt contain at least one term but making alt and alts optional where they occur. (That’s why there’s an optional() around $._alts above.)
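Here’s a rough sketch of the idea behind that workaround. It is not the grammar from the repository and it doesn’t use tree-sitter’s real combinators: `tok`, `ref`, `seq`, `optional`, and `repeat` below are hypothetical stand-ins that just record the shape of a rule, and `matchesEmpty` is a small checker I’ve written for illustration. The two grammars are simplified too (single-character terms, no marks).

```javascript
// Hypothetical stand-ins for grammar combinators. These are NOT tree-sitter's
// implementation; they are plain objects describing the shape of a rule.
const tok = (text) => ({ kind: "token", text });
const ref = (name) => ({ kind: "ref", name });
const seq = (...members) => ({ kind: "seq", members });
const optional = (member) => ({ kind: "optional", member });
const repeat = (member) => ({ kind: "repeat", member }); // zero or more

// Can this rule match the empty string?
function matchesEmpty(rule, rules, seen = new Set()) {
  switch (rule.kind) {
    case "token":
      return rule.text.length === 0;
    case "optional":
    case "repeat":
      return true;
    case "seq":
      return rule.members.every((m) => matchesEmpty(m, rules, seen));
    case "ref":
      // A rule reached again on the same path cannot become empty via itself.
      if (seen.has(rule.name)) return false;
      return matchesEmpty(rules[rule.name], rules, new Set([...seen, rule.name]));
    default:
      throw new Error("unknown rule kind: " + rule.kind);
  }
}

// Names of all rules in a grammar that can match the empty string.
const emptyRules = (rules) =>
  Object.keys(rules).filter((name) => matchesEmpty(ref(name), rules));

// iXML-style: an alt may have zero terms, so alt (and alts) can be empty.
const ixmlStyle = {
  term: tok("x"),
  alt: repeat(ref("term")),
  alts: seq(ref("alt"), repeat(seq(tok(";"), ref("alt")))),
  rule: seq(tok("name"), tok(":"), ref("alts"), tok(".")),
};

// Workaround: alt needs at least one term; optionality moves to the use sites.
const workaround = {
  term: tok("x"),
  alt: seq(ref("term"), repeat(seq(tok(","), ref("term")))),
  alts: seq(ref("alt"), repeat(seq(tok(";"), optional(ref("alt"))))),
  rule: seq(tok("name"), tok(":"), optional(ref("alts")), tok(".")),
};

console.log(emptyRules(ixmlStyle));  // rules that would violate the constraint
console.log(emptyRules(workaround)); // none left
```

In the first grammar, `alt` and `alts` can both match nothing; in the second, no named rule can, even though `rule` still accepts an empty right-hand side.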

The next thing I realized is that the purpose of a tree-sitter grammar is very different from the more traditional purpose of a grammar. Most grammars want to abstract away “irrelevant” input tokens. An iXML range inclusion (["a"-"z"]) has a “from” value (“a”) and a “to” value (“z”): the square brackets and the hyphen aren’t usefully part of the inclusion. Except, if you want to syntax highlight them or identify them semantically for the purposes of editing, then you want to assign them to a nonterminal that isn’t suppressed!
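As a small illustration of that trade-off, here are two hand-built toy trees for the same inclusion. Neither comes from tree-sitter, and the node names (`open_bracket`, `range_sep`, and so on) and the `faces` table are assumptions for the sake of the sketch, not necessarily what tree-sitter-ixml uses.

```javascript
// A "traditional" tree for ["a"-"z"]: punctuation has been abstracted away.
const abstracted = {
  type: "inclusion",
  children: [
    {
      type: "range",
      children: [
        { type: "from", text: '"a"' },
        { type: "to", text: '"z"' },
      ],
    },
  ],
};

// An editor-oriented tree: brackets and hyphen get nodes of their own.
const editorFriendly = {
  type: "inclusion",
  children: [
    { type: "open_bracket", text: "[" },
    {
      type: "range",
      children: [
        { type: "from", text: '"a"' },
        { type: "range_sep", text: "-" },
        { type: "to", text: '"z"' },
      ],
    },
    { type: "close_bracket", text: "]" },
  ],
};

// Hypothetical mapping from node type to a highlighting face.
const faces = {
  from: "string",
  to: "string",
  open_bracket: "bracket",
  close_bracket: "bracket",
  range_sep: "operator",
};

// Collect [text, face] pairs for every leaf that has a face assigned.
function highlights(node, out = []) {
  if (node.text !== undefined && faces[node.type]) {
    out.push([node.text, faces[node.type]]);
  }
  for (const child of node.children || []) highlights(child, out);
  return out;
}

console.log(highlights(abstracted));     // only the two string values
console.log(highlights(editorFriendly)); // brackets and hyphen as well
```

With the first tree there is simply nothing for the highlighter to hang a face on for the brackets or the hyphen, which is why the suppressed tokens have to come back if you want to style them.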

Short story just a little bit longer: I made a bunch of choices about what to identify and what to suppress. If you try to do something more ambitious, you may wish I’d made different choices. Just open an issue or make a pull request.

Step 2: make an Emacs mode to syntax highlight iXML grammars with tree-sitter. Find someone to steal from, basically. I stole from Steve Purcell.

With boilerplate in hand, I tinkered a bit with the actual font mappings. Nothing about my visual design skill would be called, uh, “skill” by any actual craftsperson.

[Figure: Example iXML grammar in Emacs]

I mapped various iXML features to font lock faces. Improvements most welcome. It’s interactive and fast. You can see the cursor on line three where I’ve deleted the quotation mark from the end of a string introducing an error.

For the effort of an afternoon and a couple of evenings, I have a functional Emacs mode for syntax highlighting iXML. I had a lot more fun doing this than I would have had trying to make it all work with regular expressions. And I can totally imagine writing modes for other languages this way.

You have to be running Emacs 29 (or, I guess, some earlier version with tree-sitter patched into it), but on the other hand, at least the AST part should also work in a bunch of other editing tools that I guess you might want to use.