Step 3: Applying Tesseract OCR and converting to PDF


The third and last stage of the overall procedure consists of two smaller steps. First, we need to run an optical character recognition (OCR) engine on our pictures, so that the text they contain can be copied and pasted into a word processor, or highlighted and searched in a PDF viewer. Second, we want to combine the scanned images and the recognized text into a single, portable document that retains the appearance of the images but includes an invisible layer of text underneath.

To recognize the text in the images, Homer employs another piece of free software, currently sponsored by Google, called Tesseract. This OCR engine was originally developed as proprietary software at Hewlett-Packard, and was released as open source a few years ago. It is compatible with all three major operating systems and, as of version 3.00, it supports as many as 35 languages. Did I say it’s free?

Now, Tesseract’s output is a set of HTML files that encode the recognized text in the hOCR format, but that’s not yet a PDF. To bind together the TIFF images and the HTML files, Homer uses PDFbeads, a small Ruby utility developed by Alexey Kryukov. Besides creating a searchable PDF, the program also compresses any b/w TIFFs with JBIG2, a very efficient image compression standard: for example, a 384-page book fits in a single PDF file of 11.8 MB (including front and back covers in colour, and several illustrations in ‘mixed’ mode).

Both Tesseract and PDFbeads are command-line utilities, so they are well suited to being run from a bash script. As we have already mentioned, option “4” in the Homer script runs Tesseract OCR on the “out” folder (the one containing the TIFF images processed by Scan Tailor), and finally merges those images and their OCR-ed text into a searchable PDF.
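To give an idea of what the two stages look like on the command line, here is a minimal sketch written as shell functions. It is not Homer’s actual code: the function names “ocr_folder” and “bind_pdf” are made up for illustration, and we assume that “tesseract” and “pdfbeads” are both on the PATH, that the images end in “.tif”, and that PDFbeads picks up the hOCR file sharing each image’s base name in the same folder.

```shell
# Sketch only, under the assumptions stated above (not Homer's real code).

# ocr_folder DIR LANG: run Tesseract on every TIFF in DIR.
ocr_folder() (
    for img in "$1"/*.tif; do
        [ -e "$img" ] || continue   # no TIFFs found: skip the literal pattern
        # Second argument is the output base name (no extension);
        # "hocr" is the Tesseract config that selects hOCR output.
        tesseract "$img" "${img%.tif}" -l "$2" hocr
    done
)

# bind_pdf DIR OUTPUT: merge the TIFFs in DIR (and their hOCR files)
# into one PDF. OUTPUT should be an absolute path, because we change
# into DIR first so that PDFbeads sees plain file names.
bind_pdf() (
    cd "$1" || exit 1
    pdfbeads *.tif > "$2"           # PDFbeads writes the PDF to stdout
)

# Possible usage (the paths are examples only):
#   ocr_folder "$HOME/scans/out" eng
#   bind_pdf   "$HOME/scans/out" "$HOME/Desktop/book.pdf"
```

Depending on the Tesseract version, the hOCR files end in “.html” or “.hocr”; either way they sit next to the images, which is why both functions work on the same folder.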

Option 4 is launched in the same way as the renaming-rotating task: drag the “out” folder onto Homer’s icon (or onto the Command Prompt window), then select option “4”. The script will prompt you to specify two variables before starting the process:

  1. the language in which the scanned pages are written, identified by its three-character label. To view the complete list of languages supported by Tesseract inside Homer’s window, leave the field blank and press ENTER (you can also read the list online). For each language there is a corresponding three-character label in Tesseract’s vocabulary: English is “eng”, Italian is “ita”, Portuguese is “por”, etc. We didn’t modify the labels, but used Tesseract’s default ones (don’t blame us if Spanish is “spa” and not “esp”, while French is “fra” and German is “deu”…).
  2. the name of the final PDF: you don’t need to append the “.pdf” extension, as it is added automatically; the filename may contain spaces; the file will be saved on the Desktop by default.
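The two prompts could be sketched in shell roughly as follows. Again, this is an illustration under our own assumptions and not Homer’s source: the names “pdf_path”, “ask_settings”, “OCR_LANG” and “PDF_FILE” are hypothetical, and we assume the Desktop lives at “$HOME/Desktop”. The blank-input shortcut that lists the supported languages is not reproduced here.

```shell
# Sketch only: hypothetical names, Desktop assumed at $HOME/Desktop.

# pdf_path NAME: full path of the final PDF, with ".pdf" appended.
pdf_path() {
    printf '%s/Desktop/%s.pdf\n' "$HOME" "$1"
}

# ask_settings: prompt for both values and store them in
# OCR_LANG (Tesseract label) and PDF_FILE (output path).
ask_settings() {
    printf 'Language label (e.g. eng, ita, por): '
    IFS= read -r OCR_LANG
    printf 'Name of the final PDF (no extension): '
    IFS= read -r PDF_NAME           # "IFS= read -r" keeps spaces intact
    PDF_FILE=$(pdf_path "$PDF_NAME")
}
```

Quoting “$PDF_FILE” wherever it is used later is what lets the filename contain spaces safely.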


Text recognition may take 1 or 2 seconds per page on average, and longer for colour or greyscale pages. You can watch the ongoing process in the Terminal or Command Prompt window, but it’s a boring sequence of “Processed image 0234.tif”, etc. When the process is finally completed, the window will come to the foreground and you will be able to “press any key” to show the results!
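As a back-of-the-envelope check, the 384-page book mentioned above would need somewhere between 6 and 13 minutes at 1 to 2 seconds per page. The small helper below (a hypothetical name, for illustration only) does the arithmetic for any page count:

```shell
# ocr_estimate PAGES SECONDS_PER_PAGE: rough OCR time as minutes and seconds.
ocr_estimate() {
    total=$(( $1 * $2 ))
    printf '%d min %d s\n' $(( total / 60 )) $(( total % 60 ))
}

ocr_estimate 384 1   # prints "6 min 24 s"
ocr_estimate 384 2   # prints "12 min 48 s"
```

These are only averages for b/w pages; colour and greyscale scans will push the total higher.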

Note that the quality of the text recognition varies with factors such as the quality of the original shots, their resolution, the typeface used in the printed book, and so on. If you are not satisfied with Tesseract and are willing to spend a few hundred pounds, we recommend trying Adobe Acrobat (either Standard or Professional), not only for its quite good OCR engine but, above all, for its state-of-the-art ClearScan technology, which turns pixelated images of text characters into smoothed “vector” curves. In other words, for each set of scanned images it creates a custom font matching the visual appearance of the printed characters and embeds it into the PDF. Not only does the result look better, it also produces smaller files that load and scroll faster in a PDF viewer than common “raster” graphics.

 
