so

ixml with PEP?

Volume 5, Issue 28; 05 Nov 2021

Inspired by the discussions of ixml at Declarative Amsterdam, I did a little tinkering.

I’m a huge fan of invisible XML (ixml) and I want it now. Steven Pemberton gave a great tutorial at Declarative Amsterdam (2021). Michael Sperberg-McQueen followed up the next day with a report on the current status of his XQuery-based parser, Aparecium.

I want an XProc step, so I’m perfectly happy to implement it in Java or at least on the JVM.

I poked around a bit this evening and decided to see what it would look like to build an ixml step on top of PEP. I tried crafting a parser for this small ixml grammar from the tutorial:

date: s?, day, s, month, (s, year)? .
-s: -" "+ .
day: digit, digit? .
-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
month: "January"; "February"; "March"; "April";
       "May"; "June"; "July"; "August";
       "September"; "October"; "November"; "December".
year: (digit, digit)?, digit, digit .

This (crude hack in Scala) seems to work:

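  // Collection helpers used below. CollectionConverters supplies asJava on
  // Scala 2.13+; on 2.12 use scala.collection.JavaConverters._ instead.
  // The PEP classes (Grammar, Category, Rule, EarleyParser) must be imported too.
  import scala.collection.mutable
  import scala.collection.mutable.ListBuffer
  import scala.jdk.CollectionConverters._

  val grammar = new Grammar("date")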

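  // Nonterminals take just a name; passing true as the second argument
  // marks a terminal, as with the single-space token here.
  val date = new Category("date")
  val ts = new Category(" ", true)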
  val s = new Category("s")
  val day = new Category("day")
  val month = new Category("month")
  val year = new Category("year")
  val digit = new Category("digit")

  for (d <- 0 to 9) {
    grammar.addRule(new Rule(digit,
                      new Category(d.toString, true)))
  }

  // s: one or more spaces, written as a left-recursive pair of rules
  grammar.addRule(new Rule(s, ts))
  grammar.addRule(new Rule(s, s, ts))
  // date: s?, day, s, month, (s, year)? expanded into its four alternatives
  grammar.addRule(new Rule(date, s, day, s, month, s, year))
  grammar.addRule(new Rule(date, day, s, month, s, year))
  grammar.addRule(new Rule(date, s, day, s, month))
  grammar.addRule(new Rule(date, day, s, month))
  // day: one or two digits
  grammar.addRule(new Rule(day, digit))
  grammar.addRule(new Rule(day, digit, digit))

  val month_names = List("January", "February", "March",
    "April", "May", "June", "July", "August", "September",
    "October", "November", "December")
  val letters = mutable.HashMap.empty[Char, Category]
  for (name <- month_names) {
    for (letter <- name) {
      if (!letters.contains(letter)) {
        letters.put(letter, new Category(letter.toString, true))
      }
    }
  }

  for (name <- month_names) {
    val rhs = ListBuffer.empty[Category]
    for (letter <- name) {
      rhs += letters(letter)
    }
    grammar.addRule(new Rule(month, rhs.toArray: _*))
  }

  grammar.addRule(new Rule(year, digit, digit))
  grammar.addRule(new Rule(year, digit, digit, digit, digit))

  val parser = new EarleyParser(grammar)
  val tokens = ListBuffer.empty[String]
  for (ch <- "5 November 2021") {
    tokens += ch.toString
  }

  val parse = parser.parse(tokens.toList.asJava, date)
  println(parse)

I’m not enormously enthusiastic about having to expand optionality in the grammar this way, and make tokens of every character, but that may be unavoidable. (Also, I may be overlooking something obvious.)
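
If the expansion does turn out to be unavoidable, it could at least be mechanical. Here is a minimal sketch of that idea, independent of PEP: the names Item, Req, Opt, and ExpandOptional are made up for illustration, and an ixml optional group like (s, year)? is modelled as a plain list of symbol names. Each optional item doubles the number of alternatives.

```scala
// Hypothetical mini-model of a rule's right-hand side; not part of PEP or ixml.
sealed trait Item
final case class Req(name: String) extends Item         // a required symbol
final case class Opt(names: List[String]) extends Item  // an optional symbol or group

object ExpandOptional {
  /** Expand optional items into plain alternatives: each Opt contributes
    * one variant that includes its symbols and one that omits them. */
  def expand(rhs: List[Item]): List[List[String]] =
    rhs.foldRight(List(List.empty[String])) {
      case (Req(n), tails)  => tails.map(n :: _)
      case (Opt(ns), tails) => tails.map(ns ++ _) ++ tails
    }

  /** Split a literal such as "May" into one single-character token per letter. */
  def literal(text: String): List[String] = text.map(_.toString).toList
}
```

Given List(Opt(List("s")), Req("day"), Req("s"), Req("month"), Opt(List("s", "year"))), expand returns the same four alternatives as the hand-written date rules above. Each result could then be turned into a Rule by mapping names to Category objects. That part is left out because it depends on how the categories are stored.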

I haven’t examined what it’d be like to extract the tree from the parser result:

ACCEPT: date -> [5,  , N, o, v, e, m, b, e, r,  , 2, 0, 2, 1] (1)

but it doesn’t look intractable. The “toString()” value of the parse is:

[date[day[digit[5]]][s[ ]][month[N][o][v][e][m][b][e][r]]
[s[ ]][year[digit[2]][digit[0]][digit[2]][digit[1]]]]
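
To get a feel for how tractable it is, here is a rough sketch that works purely from the bracketed string above rather than from PEP's parse objects (a real step would presumably walk the tree directly). Node, BracketTree, and the unwrap/delete sets are hypothetical names. It assumes every nonterminal has at least one child and that no token is itself a bracket. The unwrap set stands in for ixml's "-" mark on digit, and delete drops s entirely, since both the nonterminal and its spaces are marked "-" in the grammar.

```scala
// Hypothetical helper types; this reads the printed form, not PEP's API.
final case class Node(label: String, children: List[Node])

object BracketTree {
  /** Parse "[label[child]...]" text; stray characters between children
    * (such as a line break) are skipped. */
  def parse(text: String): Node = {
    var pos = 0
    def node(): Node = {
      require(pos < text.length && text(pos) == '[', s"expected '[' at $pos")
      pos += 1
      val start = pos
      // The label runs up to the next bracket; it may be a single space.
      while (pos < text.length && text(pos) != '[' && text(pos) != ']') pos += 1
      val label = text.substring(start, pos)
      val kids = scala.collection.mutable.ListBuffer.empty[Node]
      while (pos < text.length && text(pos) != ']') {
        if (text(pos) == '[') kids += node() else pos += 1
      }
      require(pos < text.length, "unbalanced brackets")
      pos += 1 // consume the closing ']'
      Node(label, kids.toList)
    }
    node()
  }

  private def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

  /** Serialize as XML: leaves become text, names in `unwrap` contribute only
    * their content, and names in `delete` contribute nothing at all. */
  def toXml(n: Node, unwrap: Set[String], delete: Set[String]): String =
    if (n.children.isEmpty) escape(n.label)
    else if (delete(n.label)) ""
    else {
      val body = n.children.map(toXml(_, unwrap, delete)).mkString
      if (unwrap(n.label)) body else s"<${n.label}>$body</${n.label}>"
    }
}
```

Calling BracketTree.toXml(BracketTree.parse(text), Set("digit"), Set("s")) on the string above should give a date element containing day, month, and year children with the text 5, November, and 2021.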

Must. Not. Be. Distracted. From. Finishing. XML. Calabash 3.0.