OCR

From www.ReeltoReel.nl Wiki

Revision as of 14:35, 17 April 2017

Scanned PDF to text file: first extract the page images and OCR them, then combine the results into one document.

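# extract the page images as plaatje-000.tif, plaatje-001.tif, ...
pdfimages -tiff input.pdf plaatje
# OCR each page; tesseract appends .txt to the output name
for i in plaatje-*.tif; do tesseract "$i" "tempje-$i"; done
# join the per-page text files in page order
cat tempje-plaatje-*.txt > docje.txt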


Use tesseract to OCR a multi-page PDF file

First, convert the PDF to multiple TIFF files, because tesseract does not work with PDF.

Number the files with two digits at the end and remove the alpha channel:

# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff

Then, use tesseract to make it into text:

# tesseract inputfile.tiff outputfile

Give the output file name without an extension; tesseract appends .txt to it.
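The command above handles one TIFF at a time. A minimal sketch of looping over all pages written by the convert step and joining the text could look like this. The function name ocr_pages and the file names in the usage note are made up for illustration, and it assumes the pages are named PREFIX_NN.tiff as produced above, so that the shell's alphabetical glob order equals the page order (true for up to 100 pages with two digits).

```shell
#!/bin/sh
# OCR every page image named PREFIX_NN.tiff and join the text into one file.
# Usage: ocr_pages PREFIX RESULT.txt   (hypothetical helper, not part of tesseract)
ocr_pages() {
    prefix=$1
    result=$2
    : > "$result"                      # start with an empty result file
    for page in "$prefix"_*.tiff; do
        [ -e "$page" ] || continue     # skip if the glob matched nothing
        base=${page%.tiff}             # strip the extension; tesseract adds .txt
        tesseract "$page" "$base" || return 1
        cat "$base.txt" >> "$result"   # append this page's text
    done
}
```

For example, after the convert step, `ocr_pages outputfile document.txt` would leave one outputfile_NN.txt per page plus the combined document.txt.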


Creating an overlay with the OCRed text

The newer version of Tesseract (3.03 RC at the time of writing this) can do this:

  • free, open source and cross-platform
  • PDF output is available starting from version 3.03
  • command-line software
  • supports multiple languages
  • unfortunately, it takes a single image as input, so to make a complete document you need a batch script that converts each page image to a searchable PDF. After that, the PDF pages have to be combined into a single PDF with a tool like pdftk.

This is the command:

tesseract -l <lang> input.tif output pdf

Note that to use this approach, the input PDF has to be rasterized first, since tesseract does not accept PDF as input.
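The batch script mentioned in the list above could be sketched roughly as follows. This is only an illustration: the function name make_searchable_pdf is made up, the pages are assumed to be named PREFIX_NN.tiff as written by the convert command earlier on this page, and pdftk is assumed to be installed.

```shell
#!/bin/sh
# Convert every PREFIX_NN.tiff to a searchable PDF page, then merge the pages.
# Usage: make_searchable_pdf PREFIX LANG RESULT.pdf   (hypothetical helper)
make_searchable_pdf() {
    prefix=$1
    lang=$2
    result=$3
    set --                             # reuse "$@" as the list of page PDFs
    for page in "$prefix"_*.tiff; do
        [ -e "$page" ] || continue     # skip if the glob matched nothing
        base=${page%.tiff}             # tesseract appends .pdf to this name
        tesseract -l "$lang" "$page" "$base" pdf || return 1
        set -- "$@" "$base.pdf"
    done
    [ "$#" -gt 0 ] || return 1         # no pages found, nothing to merge
    pdftk "$@" cat output "$result"    # join the pages in glob (page) order
}
```

For example, `make_searchable_pdf outputfile eng document.pdf` would OCR the pages in English and write the combined document.pdf.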