OCR: Difference between revisions
Revision as of 14:52, 17 April 2017
Scan a PDF to a text file: first extract the page images and OCR them, then combine the results into one document.
pdfimages -tiff input.pdf plaatje
for i in *.tif; do tesseract "$i" "tempje-$i"; done
cat tempje-plaatje-0*.txt >> docje.txt
Use tesseract to OCR a multi-page PDF file
First, convert the PDF to multiple TIFF files, because tesseract does not accept PDF input.
Number the files with two digits at the end and remove the alpha channel:
# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff
Then, use tesseract to make it into text:
# tesseract inputfile.tiff outputfile
If you do not provide an extension for the output file, it will become .txt.
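Since the convert step writes one TIFF per page, the tesseract call has to be repeated for every page. A minimal sketch of such a loop follows. It assumes the page images are named outputfile_00.tiff, outputfile_01.tiff and so on, as in the convert command above; the function name ocr_pages and the OCR_CMD variable are hypothetical helpers for this example, not part of tesseract.

```shell
#!/bin/sh
# Sketch: OCR every page image and merge the text into one file.
# OCR_CMD is a hypothetical override hook for this example (defaults to tesseract).
OCR_CMD="${OCR_CMD:-tesseract}"

ocr_pages() {
    # $1 = prefix of the page images, $2 = name of the combined text file
    prefix="$1"
    combined="$2"
    : > "$combined"                        # start with an empty output file
    for page in "$prefix"_*.tiff; do
        [ -e "$page" ] || continue         # skip if the glob matched nothing
        base="${page%.tiff}"
        "$OCR_CMD" "$page" "$base" || return 1   # writes $base.txt
        cat "$base.txt" >> "$combined"
    done
}
```

Calling `ocr_pages outputfile document.txt` would then leave the text of all pages in document.txt. The two-digit numbering keeps the glob in page order for documents of up to 100 pages.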
Creating an overlay with the OCRed text
The newer version of Tesseract (3.03 RC at the time of writing this) can do this:
- free, opensource and cross-platform
- starting from version 3.03 PDF output is available
- CLI software
- multiple languages support
- unfortunately, it takes a single image as input, so to make a complete document you need a batch script that converts each page image to a searchable PDF. The resulting PDF pages must then be combined into a single PDF with a tool such as pdftk.
This is the command:
tesseract -l <lang> input.tif output pdf
Note that to use this approach, the input PDF has to be rasterized first, since tesseract does not accept PDF as input.
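The batch script mentioned above could look roughly like the following sketch. It assumes the pages were rasterized to outputfile_00.tiff, outputfile_01.tiff and so on; the function name pages_to_pdf, the OCR_CMD and OCR_LANG variables, and the default language eng are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: turn each page image into a one-page searchable PDF.
# OCR_CMD and OCR_LANG are hypothetical names used only in this example.
OCR_CMD="${OCR_CMD:-tesseract}"
OCR_LANG="${OCR_LANG:-eng}"

pages_to_pdf() {
    # $1 = prefix of the page images, e.g. "outputfile" for outputfile_00.tiff
    for page in "$1"_*.tiff; do
        [ -e "$page" ] || continue         # skip if the glob matched nothing
        # The trailing "pdf" argument makes tesseract write <base>.pdf
        "$OCR_CMD" -l "$OCR_LANG" "$page" "${page%.tiff}" pdf || return 1
    done
}
```

Running `pages_to_pdf outputfile` would produce outputfile_00.pdf, outputfile_01.pdf and so on, one searchable PDF per page, ready to be merged into a single document.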
To combine multiple PDF files into one
# pdfunite output_*.pdf result.pdf