OCR: Difference between revisions

From www.ReeltoReel.nl Wiki
Pvdm (talk | contribs)

Revision as of 09:28, 17 April 2017

Convert a scanned PDF to a text file: first extract the pages as images and OCR them, then combine the results into one document.

pdfimages -tiff input.pdf plaatje
for i in *.tif; do tesseract "$i" "tempje-$i"; done
cat tempje-plaatje-*.txt > docje.txt


Use tesseract to OCR a multi-page PDF file

First, convert the PDF to multiple TIFF files, because tesseract cannot read PDF input.

Number the files with 2 digits at the end and remove the alpha channel:

# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff
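
To preview the file names that the %02d pattern produces, the same format string can be tried with the shell's printf. This is only an illustration: it assumes ImageMagick numbers the pages from 0, which is its usual default, and page_name is a hypothetical helper, not part of any tool.

```shell
# Hypothetical helper: print the file name expected for a given page index.
# Assumes page numbering starts at 0.
page_name() {
    printf 'outputfile_%02d.tiff\n' "$1"
}

page_name 0     # prints outputfile_00.tiff
page_name 12    # prints outputfile_12.tiff
```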

Then use tesseract to convert each TIFF file to text:

# tesseract inputfile.tiff outputfile

If you do not provide an extension for the output file, tesseract adds .txt.
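
Since the convert step produces one TIFF per page, the tesseract command has to be run once per file. A minimal sketch of that loop follows. The function name ocr_pages and the result file combined.txt are illustrative choices, not part of the original recipe, and the loop assumes the files are named as in the convert command above.

```shell
# Sketch: OCR every page image and join the text into one file.
# ocr_pages is a hypothetical helper; $1 is the file prefix, $2 the result file.
ocr_pages() {
    prefix=$1                       # e.g. outputfile (for outputfile_00.tiff, ...)
    combined=$2                     # e.g. combined.txt
    : > "$combined"                 # start with an empty result file
    for page in "${prefix}"_*.tiff; do
        [ -e "$page" ] || continue  # skip if the pattern matched no files
        base=${page%.tiff}          # outputfile_00.tiff -> outputfile_00
        tesseract "$page" "$base"   # writes outputfile_00.txt
        cat "$base.txt" >> "$combined"
    done
}
```

Called as `ocr_pages outputfile combined.txt`, this processes the pages in the order the shell sorts the file names, which matches the page order as long as the two-digit numbering is enough (fewer than 100 pages).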