OCR: Difference between revisions

From www.ReeltoReel.nl Wiki



Revision as of 09:28, 17 April 2017

Convert a scanned PDF to a text file: first extract the pages and OCR them, then combine the results into one document.

pdfimages -tiff input.pdf plaatje
for i in plaatje-*.tif; do tesseract "$i" "tempje-$i"; done
cat tempje-plaatje-0*.txt >> docje.txt


Use tesseract to OCR a multi-page PDF file

First, convert the PDF to multiple TIFF files, because tesseract cannot read PDF input directly.

Number the files with two digits at the end and remove the alpha channel:

# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff

Then use tesseract to convert a TIFF file to text:

# tesseract inputfile.tiff outputfile

Tesseract treats outputfile as a base name and appends .txt to it, so do not add an extension yourself; the result is outputfile.txt.
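
The tesseract command above handles one TIFF file at a time. For a multi-page PDF, the two steps can be combined in a loop, roughly as sketched below. The helper name ocr_pages and the result file name are assumptions chosen for illustration, not part of tesseract. The sketch also assumes fewer than 100 pages, so that the two-digit numbering sorts in page order.

```shell
#!/bin/sh
# Sketch: OCR every page image that starts with a given prefix and
# join the resulting text files in page order.
# ocr_pages is a hypothetical helper name, not a tesseract command.
ocr_pages() {
    prefix=$1       # e.g. outputfile_ (matches outputfile_00.tiff, outputfile_01.tiff, ...)
    result=$2       # e.g. docje.txt
    : > "$result"   # start with an empty result file
    for page in "$prefix"*.tiff; do
        [ -e "$page" ] || continue             # the glob matched nothing
        base=${page%.tiff}                     # strip the .tiff extension
        tesseract "$page" "$base" || return 1  # tesseract writes "$base.txt"
        cat "$base.txt" >> "$result"
    done
}

# Example call, left commented out so nothing runs when the file is sourced:
# ocr_pages outputfile_ docje.txt
```

The shell expands the glob in sorted order, which is why the zero-padded numbers from the convert step matter: outputfile_02.tiff is processed before outputfile_10.tiff.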