Revision as of 14:52, 17 April 2017

Scan a PDF to a text file: first extract the page images and OCR them, then combine the results into one document.

pdfimages -tiff input.pdf plaatje
for i in plaatje-*.tif; do tesseract "$i" "tempje-$i"; done
cat tempje-plaatje-*.txt > docje.txt

Use tesseract to OCR a multi-page PDF file

First, convert the PDF to multiple TIFF files, because tesseract does not work with PDF.

Number the files with two digits at the end and remove the alpha channel:

# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff

Then, use tesseract to make it into text:

# tesseract inputfile.tiff outputfile

Give the output file name without an extension; tesseract appends .txt to it.
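
Since convert writes one TIFF per page, the tesseract step has to be repeated for every page. A minimal sketch of such a loop, assuming the pages are named outputfile_00.tiff, outputfile_01.tiff and so on as produced above; the function name ocr_pages_to_text and the result file name are only illustrative:

```shell
# Sketch: OCR every page image and join the text in page order.
# ocr_pages_to_text is a hypothetical helper name, not part of tesseract.
ocr_pages_to_text() {
    prefix=$1       # common prefix of the page images, e.g. outputfile
    result=$2       # combined text file, e.g. document.txt
    : > "$result"   # start with an empty result file
    for page in "${prefix}"_*.tiff; do
        [ -e "$page" ] || continue             # skip if no page images match
        base=${page%.tiff}                     # output name without extension
        tesseract "$page" "$base" || return 1  # writes "$base.txt"
        cat "$base.txt" >> "$result"
    done
}
```

It could be called as: ocr_pages_to_text outputfile document.txt. The shell expands the wildcard in alphabetical order, so the zero-padded two-digit numbering keeps the pages in sequence only up to 100 pages; a longer document would need a wider number format such as %03d.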

Newer versions of Tesseract (3.03 RC at the time of writing) can produce a searchable PDF directly. The main characteristics:

free, open source and cross-platform
PDF output is available starting from version 3.03
command-line (CLI) software
support for multiple languages
unfortunately, only single-image input: to make a complete document, you need a batch script that converts each page image to a searchable PDF, after which the PDF pages are combined into a single PDF with a tool such as pdftk

The command for a single page image is:

tesseract -l <lang> input.tif output pdf

Note that to use this approach, the input PDF has to be rasterized first, since tesseract does not accept PDF as input.

Finally, combine the per-page PDFs into a single file, for example with pdfunite:

# pdfunite output_*.pdf result.pdf
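
The batch script mentioned above could look roughly like the following. This is only a sketch: the function name pages_to_searchable_pdf and the file naming (page images called prefix_00.tiff, prefix_01.tiff, ...) are assumptions, and it uses pdfunite rather than pdftk for the final merge.

```shell
# Sketch: turn each page image into a searchable PDF, then merge them.
# pages_to_searchable_pdf is a hypothetical helper name.
pages_to_searchable_pdf() {
    lang=$1      # tesseract language code, e.g. eng or nld
    prefix=$2    # common prefix of the page images, e.g. outputfile
    result=$3    # merged PDF; its name must NOT match "${prefix}_*.pdf"
    for page in "${prefix}"_*.tiff; do
        [ -e "$page" ] || continue    # skip if no page images match
        # Writes "${page%.tiff}.pdf" next to the image.
        tesseract -l "$lang" "$page" "${page%.tiff}" pdf || return 1
    done
    # The wildcard expands alphabetically, so zero-padded page numbers
    # keep the pages in order.
    pdfunite "${prefix}"_*.pdf "$result"
}
```

It could be called as: pages_to_searchable_pdf eng outputfile result.pdf. Choose a result name that the wildcard does not match, otherwise a second run would feed the old result back into pdfunite as an input page.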