Paperless-ngx
Migrating my homebrew document management system has been painless and fruitful.
Shortly after I started at MarkLogic, in about 2008, I guess, I undertook to build “something real” so that I’d get some practical experience with the database. The thing I chose to build was a document management system. Not a content management system, not a CMS, but a document management system, some way of tracking all the documents that floated around my office: travel itineraries, business cards, bank statements, credit card statements…statements of all kinds, magazine clippings, web clippings. All of it.
The idea was simple: scan it if necessary and pour it into MarkLogic, which would happily index PDFs and other things. Add some metadata, put a simple web front end on it, and never have to deal with all that paper again.
It worked a treat. And I used it for years. When I left MarkLogic, I ported it over to new technologies: PostgreSQL, Node.js, SaxonJS, and a bit of duct tape and baling wire to get OCR working for new documents.
Mostly successful. But never very polished. And without a good sharing story. For example, there are lots of recipes in there, and there’s a tablet in the kitchen, but I don’t necessarily want to share all of my tax returns and financial records with anyone who’s doing a little cooking in our kitchen.
Enter Paperless-ngx.
Paperless-ngx is a community-supported open-source document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper.
It’s (almost) everything that my notes application was, and a whole lot more, with polish and a REST API. Very impressive. And their model of using an “archive serial number” for managing the actual pieces of paper that you do need to keep is very clever.
I have thousands and thousands of documents in my system, but the REST API and a bit of Python scripting made short work of migrating the data. There were a couple of stumbling blocks:
- My system was perfectly happy storing Markdown and HTML documents. Paperless really isn’t. It will, I think, store some image formats and text documents, but it’s happiest with PDF.
- My system supported editing documents. I could open up, say, a recipe stored in Markdown, and adjust the ingredients or change the oven temperatures from Fahrenheit to Celsius. Not so with Paperless; it’s an archiving system.
I decided that neither of these was a show-stopper given how much better most everything else was. I tweaked the migration script to generate PDF from Markdown and HTML files. I discovered some cases where that didn’t work quite right, so I wrote another script to “replace” a document with a new PDF. That is: create a new document with all the same metadata but the new PDF, then delete the old document.
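In case it’s useful to anyone else, that replacement boils down to a few calls against the REST API. Here’s a simplified sketch: the base URL and token are placeholders, the field names are the ones I found useful, and it glosses over the fact that consumption is asynchronous (so a real script probably wants to be more careful about when it deletes the original).

```python
import requests

BASE = "http://localhost:8000"                 # wherever Paperless-ngx is listening
HEADERS = {"Authorization": "Token <api-token>"}  # an API token for your user

def replace_document(old_id: int, new_pdf: str) -> None:
    """Upload a regenerated PDF carrying the old document's metadata, then delete the old one."""
    old = requests.get(f"{BASE}/api/documents/{old_id}/", headers=HEADERS).json()

    # Carry the metadata over; skip fields that aren't set.
    data = {k: old[k]
            for k in ("title", "created", "correspondent", "document_type", "tags")
            if old.get(k) not in (None, [])}

    # Consumption is asynchronous: this queues the new PDF and returns a task id.
    with open(new_pdf, "rb") as f:
        requests.post(f"{BASE}/api/documents/post_document/",
                      headers=HEADERS, data=data, files={"document": f}).raise_for_status()

    # Once the new copy has been consumed, the old record (and its file) can go.
    requests.delete(f"{BASE}/api/documents/{old_id}/", headers=HEADERS).raise_for_status()
```

The initial bulk migration was essentially the same upload call, just with the metadata pulled from my old database instead of from Paperless.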
Beyond owner, title, tags, and date, Paperless-ngx has the notion of a correspondent, a document type, and a storage path. That was sufficient for (almost) all of the metadata in my system.
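Mapping my old categories onto those objects was mostly a matter of making sure they existed before uploading the documents that refer to them. Roughly like this little helper (call it ensure; another sketch, reusing the BASE and HEADERS placeholders above, and assuming the list endpoints paginate the usual Django REST Framework way):

```python
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Token <api-token>"}

def ensure(endpoint: str, name: str) -> int:
    """Return the id of the named correspondent or document type, creating it if necessary."""
    url = f"{BASE}/api/{endpoint}/"
    while url:                                   # walk the paginated list
        page = requests.get(url, headers=HEADERS).json()
        for obj in page["results"]:
            if obj["name"].lower() == name.lower():
                return obj["id"]
        url = page["next"]
    # Not found: create it.
    created = requests.post(f"{BASE}/api/{endpoint}/", headers=HEADERS, json={"name": name})
    return created.json()["id"]

# Hypothetical names, for illustration:
bank = ensure("correspondents", "First National Bank")
statement = ensure("document_types", "Statement")
```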
I had some additional metadata that doesn’t fit anywhere: pointers to the original URIs for web clippings, postal addresses, and GPS coordinates geocoded from those addresses. Paperless-ngx doesn’t support custom metadata fields, so I have stored those in notes for the time being. Good enough for now. And for how often I actually care about those details.
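Adding those notes during the migration was one more REST call per document. A sketch, assuming the per-document notes endpoint works the way I think it does:

```python
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Token <api-token>"}

def add_note(doc_id: int, text: str) -> None:
    """Attach a free-text note to an existing document."""
    requests.post(f"{BASE}/api/documents/{doc_id}/notes/",
                  headers=HEADERS, json={"note": text}).raise_for_status()

# e.g., a web clipping's original URI and its geocoded coordinates (made-up values):
add_note(1234, "source: https://example.com/clipping\ngps: 30.2672,-97.7431")
```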
(Being able to search by GPS coordinates is useful enough that I might write my own little UI to do that. Again, the REST API makes it easy to imagine how to do that.)
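If I ever do, the plumbing would be straightforward, if a bit brute force: walk the documents, read their notes, and filter on a bounding box. Hypothetically, and assuming the “gps:” note convention from the previous sketch:

```python
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Token <api-token>"}

def documents_near(lat: float, lon: float, delta: float = 0.05):
    """Yield (id, title) for documents whose 'gps:' note falls inside a small bounding box."""
    url = f"{BASE}/api/documents/"
    while url:
        page = requests.get(url, headers=HEADERS).json()
        for doc in page["results"]:
            # One extra request per document; fine for an occasional search.
            notes = requests.get(f"{BASE}/api/documents/{doc['id']}/notes/",
                                 headers=HEADERS).json()
            for note in notes:
                for line in note["note"].splitlines():
                    if line.startswith("gps:"):
                        nlat, nlon = (float(x) for x in line[4:].split(","))
                        if abs(nlat - lat) < delta and abs(nlon - lon) < delta:
                            yield doc["id"], doc["title"]
        url = page["next"]
```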
Practically speaking, I have the whole thing running in the Docker containers they publish. It runs on my laptop and on a Linux box in my office. At the moment, I treat my laptop as the source of truth and back that up to the Linux box periodically. I expect I’ll want a more sophisticated mirroring strategy someday.
It’s easy to have multiple users and partition what they can and cannot see. All of the recipes (but only the recipes) are visible to the user named “kitchen”, for example. That means the tablet in the kitchen has access to all the recipes but nothing else.
There are more capabilities that I haven’t investigated: things you can set up on the dashboard and scripts you can run when documents are being ingested.
The ingestion process is fairly pedantic about the validity of the PDF files. I think that’s because they’re making PDF/A versions, but I don’t know for sure. I’ve had a few PDF files that I had to reconstruct in order to get them ingested. (Rather crudely, perhaps: I just used Ghostscript to make PNG images of each page and then ImageMagick to turn those pages back into a PDF.)
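For the curious, that crude reconstruction is just shelling out to the two tools, something like this (a sketch; it assumes gs and ImageMagick 7’s magick are on the PATH, and the resolution is a guess):

```python
import subprocess
import tempfile
from pathlib import Path

def rebuild_pdf(broken: str, fixed: str, dpi: int = 200) -> None:
    """Rasterize a problematic PDF with Ghostscript, then reassemble the pages with ImageMagick."""
    with tempfile.TemporaryDirectory() as tmp:
        # One PNG per page.
        subprocess.run(
            ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m", f"-r{dpi}",
             f"-sOutputFile={tmp}/page-%04d.png", broken],
            check=True,
        )
        pages = sorted(Path(tmp).glob("page-*.png"))
        # Stitch the page images back into a single PDF ("convert" on ImageMagick 6).
        subprocess.run(["magick", *map(str, pages), fixed], check=True)
```

The result is an image-only PDF, but since Paperless-ngx runs its own OCR at ingestion, the text layer comes back anyway.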
The system does a nice job of scanning for dates and will offer to provide the correct creation date. It’s really very accurate, although the ambiguity of a date like 07/08/2020 (July 8 in the US, 7 August in the UK) forces me to squint quite hard at its suggestions, at least for the backlog of US documents being ingested.
It also offers suggestions for the correspondent and document type that are very often correct.
All in all, very nice indeed.