OCR
Turn a scanned PDF into a text file: first extract the page images and OCR them, then join the results into one document.
pdfimages -tiff input.pdf plaatje
for i in *.tif; do tesseract "$i" "tempje-$i"; done
cat tempje-plaatje-0*.txt >> docje.txt
Use tesseract to OCR a multi-page PDF file
First, convert the PDF to TIFF files, one per page, because tesseract cannot read PDF input. The %02d in the output name numbers the files with two digits at the end, and -alpha off removes the alpha channel:
# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff
Then run tesseract on each TIFF file to turn it into text:
# tesseract inputfile.tiff outputfile
Do not add an extension to the output file name: tesseract treats it as a base name and appends .txt itself.
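Because tesseract handles one image per call, a multi-page document needs a loop over the TIFF files. The following is a minimal sketch, not a definitive script: the function name ocr_pages and its two arguments (a file name prefix and the combined output file) are made up for illustration, and it assumes the pages were written with a .tiff extension and zero-padded numbers, so that the shell lists them in page order.

```shell
#!/bin/sh
# Hypothetical helper: OCR every page image that starts with PREFIX
# and join the text in page order.
# Usage: ocr_pages PREFIX OUTPUT.txt
ocr_pages() {
    prefix=$1
    out=$2
    : > "$out"                      # start with an empty output file
    for page in "$prefix"*.tiff; do
        [ -e "$page" ] || continue  # skip if the pattern matched nothing
        base=${page%.tiff}          # strip the extension for the output base
        tesseract "$page" "$base"   # tesseract writes "$base.txt"
        cat "$base.txt" >> "$out"   # append this page to the combined file
    done
}
```

With the file names from the convert step, it could be called as `ocr_pages outputfile_ document.txt`. Note that two-digit numbering only sorts correctly up to 100 pages; a longer document would need %03d in the convert command.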