OCR: Difference between revisions
Revision as of 09:28, 17 April 2017
Turn a scanned PDF into a text file: first extract the page images and OCR them, then combine the results into one document.
pdfimages -tiff input.pdf plaatje
for i in *.tif; do tesseract $i tempje-$i; done
cat tempje-plaatje-0*.txt >> docje.txt
Use tesseract to OCR a multi-page PDF file
First, convert the PDF to multiple TIFF files, because tesseract cannot read PDF files directly.
Number the files with 2 digits at the end and remove the alpha channel:
# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff
Then, use tesseract to make it into text:
# tesseract inputfile.tiff outputfile
Give the output file name without an extension; tesseract appends .txt itself.
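The tesseract command above handles one TIFF, while the convert step produces one TIFF per page. A minimal sketch of a loop over all pages could look like the following; the function name ocr_pages and its two arguments are made up for illustration, and it assumes the pages are named outputfile_00.tiff, outputfile_01.tiff and so on, as produced by the convert command above.

```shell
# Sketch: OCR every page image and merge the results into one text file.
# ocr_pages and its arguments are illustrative names, not part of tesseract.
ocr_pages() {
    prefix=$1      # common prefix of the page images, e.g. outputfile
    combined=$2    # text file that receives all pages in order

    : > "$combined"                    # start with an empty result file
    for page in "$prefix"_*.tiff; do
        [ -e "$page" ] || continue     # no matching images: skip the literal pattern
        base=${page%.tiff}             # outputfile_00.tiff -> outputfile_00
        tesseract "$page" "$base"      # writes outputfile_00.txt
        cat "$base.txt" >> "$combined" # append this page to the combined file
    done
}
```

For example, ocr_pages outputfile docje.txt would leave one .txt file per page plus the combined docje.txt. The pages are processed in the order the shell expands the pattern, so the 2-digit numbering keeps them sorted for up to 100 pages.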