OCR

From www.ReeltoReel.nl Wiki

Turn a scanned PDF into a text file: first extract the page images and OCR them, then combine the results into one document.

pdfimages -tiff input.pdf plaatje
for i in *.tif; do tesseract "$i" "tempje-$i"; done
cat tempje-plaatje-0*.txt >> docje.txt


Use tesseract to OCR a multi-page PDF file

First, convert the PDF to multiple TIFF files, because tesseract does not accept PDF as input.

Number the files with two digits at the end and remove the alpha channel:

# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff

Then use tesseract to convert a TIFF file to text:

# tesseract inputfile.tiff outputfile

Do not give the output file an extension: tesseract appends .txt itself, so the result here is outputfile.txt.
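
The convert step produces one TIFF per page, so the tesseract command has to be run once for each of them. A minimal sketch of that loop follows, assuming the pages are named outputfile_00.tiff, outputfile_01.tiff and so on. The function name ocr_all_pages, the result name combined.txt and the OCR variable (for pointing at a different tesseract binary) are made up for this example and are not part of tesseract.

```shell
#!/bin/sh
# Sketch: OCR every page from the convert step and join the text in page order.
# Assumption: pages are named <prefix>_00.tiff, <prefix>_01.tiff, ...
# Hypothetical names: ocr_all_pages, OCR (override for the tesseract command).
ocr_all_pages() {
    prefix=$1       # e.g. outputfile
    result=$2       # e.g. combined.txt
    : > "$result"   # start with an empty result file
    for page in "${prefix}"_*.tiff; do
        [ -e "$page" ] || continue           # no matching files: nothing to do
        base=${page%.tiff}                   # strip the .tiff extension
        ${OCR:-tesseract} "$page" "$base"    # tesseract writes "$base.txt"
        cat "$base.txt" >> "$result"         # append this page to the result
    done
}
```

Call it as: ocr_all_pages outputfile combined.txt. The shell sorts the matched names alphabetically, which equals page order only as long as all page numbers have the same number of digits.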