OCR
Convert a scanned PDF to a text file: first extract the page images and OCR them, then combine the results into one document.
pdfimages -tiff input.pdf plaatje
for i in *.tif; do tesseract $i tempje-$i; done
cat tempje-plaatje-0*.txt >> docje.txt
Use tesseract to OCR a multi-page PDF file
First, convert the PDF to multiple TIFF files, because tesseract does not work with PDF.
Number the files with 2 digits at the end and remove the alpha channel:
# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff
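The zero padding matters because shell globs expand in lexicographic order, so without it page 10 would sort before page 2. A minimal sketch of the naming scheme, using a hypothetical helper `page_name` that mimics the `%02d` pattern; numbering is assumed to start at 0, which is ImageMagick's default:

```shell
# Hypothetical helper: build the file name that the %02d pattern
# produces for a given page index.
page_name() {
    printf 'outputfile_%02d.tiff' "$1"
}

# Print the names for indices 0, 9 and 10 to show that they sort in page order.
for n in 0 9 10; do
    page_name "$n"
    echo
done
```

Two digits are enough for up to 100 pages; for longer documents a wider pattern such as `%03d` keeps the order intact.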
Then, use tesseract to make it into text:
# tesseract inputfile.tiff outputfile
Tesseract treats the output name as a base name and appends .txt to it, so this command writes outputfile.txt.
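For a multi-page document the command has to be run once per TIFF file, and the text files joined afterwards. A minimal sketch, assuming the pages were written as outputfile_NN.tiff by the convert step; the function name `ocr_pages_to_text` is made up for illustration:

```shell
# Hypothetical helper: OCR every page image and append the text, in page
# order, to a single file given as the first argument.
ocr_pages_to_text() {
    result="$1"
    : > "$result"                              # start with an empty result file
    for page in outputfile_*.tiff; do
        [ -e "$page" ] || continue             # the glob matched nothing
        base="${page%.tiff}"                   # strip the extension
        tesseract "$page" "$base" || return 1  # writes "$base.txt"
        cat "$base.txt" >> "$result"
    done
}
```

It would be called as `ocr_pages_to_text document.txt` from the directory that holds the TIFF files.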
Creating an overlay with the OCRed text
Newer versions of Tesseract (3.03 RC at the time of writing) can do this:
- free, open source and cross-platform
- starting from version 3.03 PDF output is available
- CLI software
- multiple languages support
- unfortunately, it takes a single image as input, so to make a complete document you need a batch script that converts each page image to a searchable PDF. The resulting PDF pages then have to be combined into a single PDF with a tool like pdftk.
This is the command:
tesseract -l <lang> input.tif output pdf
Note that for this approach the input PDF has to be rasterized first, since tesseract does not accept PDF as input.
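Because of the single-image limitation mentioned in the list above, the command has to be repeated for every page. A minimal sketch, assuming the rasterized pages are named page_*.tif in the current directory; the file pattern and the function name `pages_to_pdf` are hypothetical:

```shell
# Hypothetical helper: turn every page image into a one-page searchable PDF.
# $1 is a Tesseract language code such as "eng" or "nld".
pages_to_pdf() {
    lang="$1"
    for page in page_*.tif; do
        [ -e "$page" ] || continue             # the glob matched nothing
        # Pass the output base without extension; Tesseract adds ".pdf" itself.
        tesseract -l "$lang" "$page" "${page%.tif}" pdf || return 1
    done
}
```

Calling `pages_to_pdf eng` would leave one page_NNN.pdf next to each image, provided the language data for that code is installed.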
To combine multiple PDF files into one
# pdfunite output_*.pdf result.pdf
I have created a script
This script will do the work for you. Place the script in a directory together with the PDF to be processed, and run it.
#!/bin/bash
# Converts a PDF of scanned pages into a new PDF file in which the
# OCRed text is overlaid, so that it becomes searchable.
# Usage:
#   doit.sh nice.PDF
# requires: tesseract-ocr, convert (ImageMagick), pdfunite, pdfinfo
# 20170417 1.0 PvdM first version

bestand="$1"                                # input file
newbestand="${bestand%.*}_searchable.pdf"   # output file; strips only the last extension
teller=0                                    # number of pages processed so far

RESTORE='\033[0m'
RED='\033[00;31m'
GREEN='\033[00;32m'
YELLOW='\033[00;33m'
BLUE='\033[00;34m'
PURPLE='\033[00;35m'
CYAN='\033[00;36m'
LIGHTGRAY='\033[00;37m'

# Stop with a usage message when no input file was given.
function check_input {
    if [ -z "$bestand" ]; then
        echo "- Error. Usage:"
        echo "    ./doit.sh input.pdf"
        echo
        exit 1
    fi
}

# Stop when the previous command failed.
function check_error {
    if [ $? != 0 ]; then
        echo "== Error! There was a problem in the command."
        exit 1
    fi
}

clear
echo "Converting PDF to searchable (overlay) PDF."
echo -e "-------------------------------------------\n"
check_input

# Number of pages, read only after the input has been checked.
aantpaginas=$(pdfinfo "$bestand" | grep 'Pages:' | awk '{ print $2 }')
echo -e " $bestand contains $RED $aantpaginas $RESTORE pages.\n"

echo " - (1/3) Extracting scanned PDF to images........"
convert -density 300 "$bestand" -depth 8 -alpha off temp_%03d.tiff
check_error
echo -e " - Done.\n"
echo

echo " - (2/3) Doing OCR on the images..........."
for i in temp_*.tiff; do
    # Same 3-digit padding for every page, so the glob below keeps page order.
    teller2=$(printf "%03d" "$teller")
    # tesseract appends .pdf to the output base name itself.
    tesseract -l eng "$i" "temp_pdf_$teller2" pdf
    check_error
    ((teller++))
    echo -e " - (2/3) Doing OCR on the images. $RED Page $teller/$aantpaginas done.$RESTORE"
done
echo -e " - Done.\n"
echo

echo " - (3/3) Combining the result into 1 (searchable) PDF"
pdfunite temp_pdf_*.pdf "$newbestand"
check_error
echo -e " - Done. $RED'$newbestand'$RESTORE created.\n"

# Remove only the intermediate files created by this script.
rm temp_*.tiff temp_pdf_*.pdf
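The script lists its requirements only in a comment, so a missing tool shows up as a failure halfway through. A minimal sketch of a check that could be run beforehand; the helper name `missing_tools` is hypothetical and not part of the script:

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Report what is missing for the tools named in the script's "requires" comment.
missing=$(missing_tools tesseract convert pdfunite pdfinfo)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
fi
```

On most distributions pdfunite and pdfinfo come from the Poppler utilities package, and convert from ImageMagick.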