
Thinking differently

Volume 5, Issue 1; 15 Jan 2021

You aren’t supposed to have noticed, but it’s all a bit different around here now.

As I mentioned before, I’ve been heads down for a couple of months transitioning my personal sites off of AWS and MarkLogic and onto a greener VM at The Positive Internet Company. On Monday, I shut down my EC2 instance.

It was unclear from the start how best to go about this. I flirted with Redis briefly before settling on what seemed like the obvious choice, some flavor of eXist-db.

Except that turned out to be a lot harder than I expected. Part of the problem is that after a decade working with MarkLogic, eXist was just different in ways that I found difficult to get my head around.

One interesting example is indexing. Over the years, I’ve had more than a few people tell me that they were frustrated by MarkLogic because of its widespread use of extension functions for indexing and querying. Even though XQuery is an open standard, an XQuery application written to perform well on MarkLogic isn’t easily portable to other XQuery implementations because it’s going to rely on all kinds of MarkLogic-specific functions.

I genuinely do not believe that “vendor lock-in” was ever a motivation for that design, but I did tend to feel like it was a legitimate grievance.

Indexes are important because if you want an application to scale, all of your queries have to be answered by consulting the indexes. If you have to touch the actual documents to answer the query, you’re screwed. Maybe not for 1,000 or even 10,000 documents, but there is some threshold beyond which your query will either time out or run out of memory. I know this is true. I have the scars to prove it.
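To make the difference concrete, here is a minimal Python sketch of the two ways to answer “which documents are tagged food?”. It is an illustration only, not how MarkLogic or eXist implement their indexes; the document collection and the function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical collection: document id -> set of tags on that document.
DOCUMENTS = {
    1: {"food", "travel"},
    2: {"planes"},
    3: {"food"},
}


def scan(documents, tag):
    """Answer the query by touching every document: work grows with the collection."""
    return sorted(doc_id for doc_id, tags in documents.items() if tag in tags)


def build_index(documents):
    """Build a tag -> set of document ids map once, up front."""
    index = defaultdict(set)
    for doc_id, tags in documents.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def lookup(index, tag):
    """Answer the query from the index alone; no document is read."""
    return sorted(index.get(tag, ()))


INDEX = build_index(DOCUMENTS)
print(scan(DOCUMENTS, "food"))
print(lookup(INDEX, "food"))
```

Both calls return the same ids, but only the scan gets slower with every document added. That is the threshold problem described above.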

So when I started porting my queries to eXist, I naïvely asked “how do I make sure that all my queries will be resolved exclusively against the indexes?” There wasn’t, as far as I was able to determine, a concise answer to that question. I’m not saying that it can’t be done; it just appears to rely on reviewing query evaluation plans, understanding how the optimizer works, and having a much deeper understanding of the internals than I have.

This gave me a wholly new perspective on the MarkLogic use of extension functions. Yes, it may contribute to the learning curve to become familiar with the vocabulary of functions, but it makes the simple question of “am I using the indexes correctly?” a whole lot easier to answer. (In my experience.)

And then I encountered a bug. (I know that I added a comment to a bug report about the issue but, alas, I cannot now find that bug. ☹) I spent an afternoon, perhaps the better part of a day, trying to figure out why some aspect of my application was not working. Eventually, I tracked it down to a function. I had a function, not unlike this one:

declare function my:foo(params) as xs:integer {
  some statements
};

I don’t remember precisely what the bug was: some aspect of my attempt to rewrite a query, perhaps, or just some sort of simple cut-and-paste error. But the bug I was chasing turned out to be caused by the fact that this function was returning a string. An xs:string. Like “Hello, World.”

Rather than raising a type error, the string value was blindly coming back from my:foo and wreaking havoc later on.
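For contrast, here is a small Python sketch of the behaviour I would have expected: the declared return type is checked when the function returns, so the mistake fails at its source. This is an analogy, not XQuery and not how any XQuery processor is implemented; the `returns` decorator and the `foo` function are hypothetical stand-ins for the declaration above.

```python
from functools import wraps


def returns(expected_type):
    """Decorator that enforces a declared return type at call time."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Strict check: the result must be exactly the declared type.
            if type(result) is not expected_type:
                raise TypeError(
                    f"{func.__name__} declared {expected_type.__name__} "
                    f"but returned {type(result).__name__}: {result!r}"
                )
            return result
        return wrapper
    return decorate


@returns(int)
def foo(take_buggy_path):
    # Hypothetical stand-in for my:foo; the buggy branch returns a string.
    if take_buggy_path:
        return "Hello, World."
    return 42


print(foo(False))
try:
    foo(True)
except TypeError as err:
    print("type error:", err)
```

With a check like this, the error is reported at the function that broke its promise rather than wherever the stray string eventually causes trouble.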

I gave up. I am perfectly happy working in an untyped language. I am perfectly happy (perhaps slightly more happy) working in a typed language. I am prepared to work within various static-vs-dynamic typing compromises. I am not sure how to work with a type system that lies to me.

It’s a bug and I’m sure it’ll be fixed. This posting is not intended as a rant about eXist, but about what happened when I cried into a virtual beer with a friend about this particular bug. He suggested that I ought to just do it all with Saxon-JS.

That’s an idea I hadn’t previously considered seriously. I was motivated, as it turns out, to learn more about Saxon-JS, and the honest truth is that none of the applications involved are markup heavy. I use markup in all of them, because of course I do, even if it’s just to render web pages, but I’m not doing critical analysis of multiple editions of a manuscript or managing the patient records of a substantial enterprise or rapidly deploying applications over a wide variety of document types that change frequently.

In fact, the application that I had picked to rewrite first, photos.nwalsh.com, I picked specifically because it has no real markup at all: everything about photographs is really atomic values.

So what are my actual requirements?

  1. Fast queries over a range of atomic values (all posts newer than a particular date, all photographs tagged “food”, all drinks that contain tequila, etc.)
  2. Geospatial queries (all photographs within 10km of LHR, all posts about Austin, etc.)
  3. Full text queries.
  4. The ability to transform markup (both XML and, it turns out, JSON) to produce complex web pages.
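The first two requirements can be sketched in a few lines of Python. This is an illustration only, not the code behind any of the sites: the `Photo` record, the sample data, and the helper names are all made up, and the LHR coordinates are approximate. It also scans every record in memory, which is exactly the work a database with suitable indexes is there to avoid.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius
LHR = (51.4700, -0.4543)   # approximate latitude/longitude of Heathrow


@dataclass
class Photo:
    title: str
    taken: date
    tags: list
    lat: float
    lon: float


# Made-up sample data.
PHOTOS = [
    Photo("Lunch near the airport", date(2020, 11, 3), ["food"], 51.48, -0.40),
    Photo("Breakfast tacos", date(2019, 3, 9), ["food", "austin"], 30.2672, -97.7431),
    Photo("Runway at dusk", date(2020, 12, 24), ["planes"], 51.465, -0.45),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def newer_than(photos, cutoff):
    """Range query over an atomic value: photos taken after the cutoff date."""
    return [p for p in photos if p.taken > cutoff]


def tagged(photos, tag):
    """Equality query over an atomic value: photos carrying the tag."""
    return [p for p in photos if tag in p.tags]


def within_km(photos, centre, radius_km):
    """Geospatial query: photos within radius_km of the centre point."""
    lat, lon = centre
    return [p for p in photos if haversine_km(lat, lon, p.lat, p.lon) <= radius_km]


print([p.title for p in within_km(PHOTOS, LHR, 10)])
```

Full text search and markup transformation are different kinds of problem and are not covered by this sketch.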

Saxon-JS is absolutely perfect for the last requirement and after poking about for a bit, I decided that PostgreSQL would address the other requirements. (I’m a total n00b when it comes to SQL. In fact, using “array columns” in PostgreSQL has allowed me to avoid even becoming especially proficient at “join” queries.) Using Node.js to glue it all together seemed like what the cool kids would do.

I’ve taken a few more lessons from Jim and tried to build a robust development environment with Gradle and Docker containers. (Lech suggested I write about that, and I will.)

If I had not already decided which ISP I was going to use, I could have gone even further and made the Docker containers the actual production deployment vehicle, but I didn’t.

I’m learning some new skills and I’m pleased with the results. At the same time, I realize I’ve now deployed the same technologies as everyone else, which means there will be regular security advisories and patches to apply. (Not that MarkLogic didn’t have the occasional security patch, but it doesn’t have the army of black hats poking at it from every angle.)