
OCRing financial statements

Volume 4, Issue 40; 22 Sep 2020

Sometimes (always, AFAICT) OCRing a PDF is the most reliable way to get transaction details out of financial institutions.

This morning, I tweeted about using OCR on my bank statement to get details of financial transactions. Although some banks will export XML and other electronic formats, when I tried to use those, I found them to be an utter mess. (And even if I hadn’t, in the case of my US credit cards, I want the transaction amounts in both GBP and USD, which are conveniently on the PDF but not in the export formats.)

Years ago, when I gave up on the export formats, somewhat out of desperation, I tried OCRing the PDFs and discovered that it worked surprisingly well. Perhaps it isn’t all that surprising considering that the statements are generated by batch processes from the raw transaction data without ever being touched by human hands.

In case you want to give it a try, here’s the process that works for me:

I turn the PDF into a set of PNG pages with Ghostscript:

gs -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitMap=5000000000 \
   -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha \
   -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r300x300 \
   -sOutputFile="page%02d.png" -f statement.pdf

I process each of those pages with Tesseract OCR:

tesseract page01.png page01 hocr
tesseract page02.png page02 hocr
tesseract page03.png page03 hocr
…
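If you want to script the two steps above rather than run them by hand, here’s a minimal Python sketch. The function names are mine, not part of any tool; it just builds the same command lines so they can be handed to subprocess.run():

```python
# Sketch: build the Ghostscript and Tesseract command lines shown above.
# (Helper names are hypothetical; the flags mirror the commands in the post.)

def gs_command(pdf: str) -> list[str]:
    """Command line to render each PDF page as a 300dpi PNG."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dNOPROMPT",
        "-dMaxBitMap=5000000000", "-dAlignToPixels=0", "-dGridFitTT=2",
        "-sDEVICE=pngalpha", "-dTextAlphaBits=4", "-dGraphicsAlphaBits=4",
        "-r300x300", "-sOutputFile=page%02d.png", "-f", pdf,
    ]

def tesseract_commands(n_pages: int) -> list[list[str]]:
    """One Tesseract command per rendered page, writing hOCR output."""
    return [
        ["tesseract", f"page{p:02d}.png", f"page{p:02d}", "hocr"]
        for p in range(1, n_pages + 1)
    ]
```

Pass each returned list to subprocess.run() in turn; the Ghostscript step has to finish before the per-page OCR starts.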

Then I extract a plain text file from the HOCR (which is really just XHTML) with this trivial bit of XSLT:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:html="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="xs"
                version="2.0">

<xsl:output method="text" encoding="utf-8" indent="no"/>

<xsl:template match="html:*">
  <xsl:apply-templates select="*"/>
</xsl:template>

<xsl:template match="html:span[@class='ocr_line']">
  <xsl:apply-templates select="*"/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>

<xsl:template match="html:span[@class='ocrx_word']">
  <xsl:value-of select="."/>
  <xsl:text> </xsl:text>
</xsl:template>

<xsl:template match="element()">
  <xsl:copy>
    <xsl:apply-templates select="node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="attribute()|text()|comment()|processing-instruction()">
  <xsl:copy/>
</xsl:template>

</xsl:stylesheet>
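The stylesheet declares version 2.0 (and uses element()/attribute() patterns), so it wants an XSLT 2.0 processor such as Saxon. If you’d rather stay in one language, here’s a rough Python equivalent of the same extraction — my sketch, not the author’s script; it assumes well-formed hOCR XHTML and skips the entity handling a real hOCR file might need:

```python
# Rough Python equivalent of the XSLT above: emit one text line per
# ocr_line span, with the ocrx_word spans joined by spaces.
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

def hocr_to_text(hocr: str) -> str:
    root = ET.fromstring(hocr)
    lines = []
    for span in root.iter(XHTML + "span"):
        if span.get("class") == "ocr_line":
            words = [
                "".join(w.itertext()).strip()
                for w in span.iter(XHTML + "span")
                if w.get("class") == "ocrx_word"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)
```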

That produces a set of text files that contain the text from the PDF. Then you have to grovel over that using regular expressions and other text processing techniques to extract the transactions.

The bad news is that there’s no (useful) consistency between financial institutions with respect to how the data is formatted in the PDF. But the good news is that it’s very consistent for each specific financial institution. In the end, I have a slightly different “parse” script for each credit card and bank statement.
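To give a flavour of what one of those parse scripts does — with an entirely invented line format, since every institution’s layout is different — a regex that picks out the date, description, and the two currency amounts might look like this:

```python
import re

# Hypothetical statement line: "22 Sep AMAZON.CO.UK 12.34 15.99"
# (date, description, GBP amount, USD amount). Real statements differ
# per institution, so each one gets its own pattern.
TXN = re.compile(
    r"(?P<date>\d{2} [A-Z][a-z]{2})\s+"
    r"(?P<desc>.+?)\s+"
    r"(?P<gbp>\d+\.\d{2})\s+"
    r"(?P<usd>\d+\.\d{2})$"
)

def parse_line(line: str):
    """Return the matched fields as a dict, or None for non-transaction lines."""
    m = TXN.match(line)
    return m.groupdict() if m else None
```

Returning None for lines that don’t match is handy, because most lines on a statement page (headers, footers, marketing) aren’t transactions at all.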

What I actually want to generate from these PDFs is Ledger CLI records. I do that with a separate little file of “rules” that map from the descriptions on the statement to less cryptic descriptions and categories.

I stick “???” in the output where I can’t make a mechanical determination. Those I patch by hand when I review all the transactions and insert them into my global ledger.
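Stripped right down, that rules step looks something like the sketch below. The rule table and record layout are illustrative, not the author’s actual files; the point is the fallback to “???” when no rule matches:

```python
# Illustrative rules: raw statement description -> (payee, ledger account).
RULES = {
    "AMAZON.CO.UK": ("Amazon", "Expenses:Shopping"),
    "TFL TRAVEL": ("Transport for London", "Expenses:Travel"),
}

def ledger_entry(date: str, desc: str, amount: str) -> str:
    """Format one Ledger CLI transaction; unknown descriptions get '???'."""
    payee, account = RULES.get(desc, (desc, "???"))
    return (
        f"{date} {payee}\n"
        f"    {account}  £{amount}\n"
        f"    Liabilities:CreditCard\n"
    )
```

Grepping the generated file for “???” then gives a worklist of the transactions that still need a human decision.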
