OCR: Difference between revisions

From www.ReeltoReel.nl Wiki

Revision as of 17:49, 7 October 2017

Scan a PDF to a text file: first extract the pages as images and OCR them, then combine the results into one document.

pdfimages -tiff input.pdf plaatje
for i in *.tif; do tesseract "$i" "tempje-$i"; done
cat tempje-plaatje-0*.txt >> docje.txt
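
The last step relies on the shell expanding the glob in page order, and `>>` appends to any existing docje.txt, so a second run duplicates the text. A minimal sketch of a replacement for that step, assuming a hypothetical helper named `join_pages` (not part of tesseract or poppler) that starts from an empty file and marks each page:

```shell
# Hypothetical helper: join per-page OCR text files into one document.
# Usage: join_pages OUTPUT PAGE_FILE...
join_pages() (
    out=$1
    shift
    : > "$out"                      # start from an empty file instead of appending
    n=0
    for page in "$@"; do
        n=$((n + 1))
        printf '%s\n' "----- page $n -----" >> "$out"
        cat "$page" >> "$out"
    done
)

# Example (after the tesseract loop above):
#   join_pages docje.txt tempje-plaatje-0*.txt
```

The page marker is only an illustration; drop the `printf` line to get the same result as the plain `cat`.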


Use tesseract to OCR a multi-page PDF file

First, convert the PDF to multiple TIFF files, because tesseract cannot read PDF files.

Number the files with two digits at the end and remove the alpha channel:

# convert -density 300 inputfile.pdf -depth 8 -alpha off outputfile_%02d.tiff

Then, use tesseract to make it into text:

# tesseract inputfile.tiff outputfile

Do not add an extension to the output name: tesseract appends .txt itself, so this command writes outputfile.txt.
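
The convert step produces one numbered TIFF per page, while the tesseract command handles a single file. A minimal sketch of a loop over all pages, assuming a hypothetical function `ocr_all_pages` and a hypothetical `OCR_CMD` variable that only exists so the loop can be tried without tesseract installed:

```shell
# Sketch: OCR every page image produced by the convert step above.
# OCR_CMD is a hypothetical override; by default the real tesseract is used.
OCR_CMD=${OCR_CMD:-tesseract}

# Usage: ocr_all_pages PAGE.tiff...
ocr_all_pages() (
    for page in "$@"; do
        base=${page%.tiff}                     # outputfile_00.tiff -> outputfile_00
        "$OCR_CMD" "$page" "$base" || exit 1   # tesseract writes "$base.txt"
    done
)

# Example:
#   ocr_all_pages outputfile_*.tiff
```

Each page then gets its own text file (outputfile_00.txt, outputfile_01.txt, and so on), which can be concatenated afterwards.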


Creating an overlay with the OCRed text

The newer version of Tesseract (3.03 RC at the time of writing this) can do this:

  • free, opensource and cross-platform
  • starting from version 3.03 PDF output is available
  • CLI software
  • multiple languages support
  • unfortunately, it takes a single image as input, so to make a complete document one must write a batch script that converts each page image to a searchable PDF. After that, the PDF pages are combined into a single PDF with a tool such as pdftk.

This is the command:

tesseract -l <lang> input.tif output pdf

Note that to use this approach the input PDF has to be rasterized first, since tesseract does not accept PDF as input.

To combine multiple PDF files into one:

# pdfunite output_*.pdf result.pdf
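
pdfunite joins the files in the order the shell expands the glob, which is lexical: output_10.pdf sorts before output_2.pdf. Zero-padded page numbers avoid this. A minimal sketch, assuming a hypothetical helper `page_name` (the name and its default width of 3 are illustrative only):

```shell
# Hypothetical helper: build a zero-padded page filename so that
# shell glob order equals page order.
# Usage: page_name PREFIX NUMBER [WIDTH]
page_name() {
    printf "%s_%0${3:-3}d.pdf\n" "$1" "$2"
}

# Example:
#   page_name output 7      prints output_007.pdf
#   page_name output 12 5   prints output_00012.pdf
```

Pass plain decimal numbers; a value with a leading zero such as 08 would be read by printf as octal and rejected.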

I have created a script

This script will do the work for you. Place the script in a directory together with the PDF to be processed, and run it.

#!/bin/bash

# converts a PDF containing scanned pages into a 
# new PDF file in which the OCRed text is overlaid,
# making the PDF searchable on text strings.

# use: 
#       doit.sh nice.PDF

# requires: tesseract-ocr, convert (ImageMagick), pdfunite, pdfinfo

# 20170417      1.0     PvdM    first version
# 20170418      1.1     PvdM    minor adjustment and improvements, mainly in the counter

bestand="$1"
newbestand="${bestand%.*}_searchable.pdf"
teller="0"
teller2="000"
aantpaginas=$(pdfinfo "$bestand" | grep 'Pages:' | awk '{ print $2 }')
RESTORE='\033[0m'
RED='\033[00;31m'
GREEN='\033[00;32m'
YELLOW='\033[00;33m'
BLUE='\033[00;34m'
PURPLE='\033[00;35m'
CYAN='\033[00;36m'
LIGHTGRAY='\033[00;37m'

function check_input {
if [ -z "$bestand" ]; then
        echo - Error. Usage:
        echo "         ./doit.sh input.pdf"; echo
        exit 1
fi
}

function check_error {
        if [ $? != 0 ]; then
                echo == Error! There was a problem in the command.
                exit 1
        fi
}

clear
echo "Converting PDF to searchable (overlay) PDF."
echo -e "-------------------------------------------\n"
check_input
echo -e " $bestand contains $RED $aantpaginas $RESTORE pages.\n"
echo " - (1/3) Extracting scanned PDF to images........"
convert -density 300 "$bestand" -depth 8 -alpha off temp_%03d.tiff
check_error
echo -e " - Done.\n"
echo

echo " - (2/3) Doing OCR on the images..........."
for i in temp_*.tiff; do
        tesseract -l eng "$i" "temp_pdf_$teller2" pdf
        check_error
        ((teller++))
        teller2=$(printf "%03d" $teller)
        echo -e " - (2/3) Doing OCR on the images. $RED Page $teller/$aantpaginas done.$RESTORE" 
done
echo -e " - Done.\n"
echo

echo " - (3/3) Combining the result into 1 (searchable) PDF"
pdfunite temp_pdf_*.pdf "$newbestand"
check_error
echo -e " - Done. $RED'$newbestand'$RESTORE created.\n"
rm -f temp_*.tiff temp_pdf_*.pdf
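
Two of the script's steps, deriving the output name and reading the page count, can be tried in isolation. A minimal sketch, assuming hypothetical function names `searchable_name` and `page_count` that are not part of the script above. The first strips only the last extension, so a name such as scan.v2.pdf keeps its inner dot; the second reads `pdfinfo` output from standard input:

```shell
# Hypothetical helpers, not part of the script above.

# Derive the output name by stripping only the last extension.
# Usage: searchable_name INPUT.pdf
searchable_name() {
    printf '%s_searchable.pdf\n' "${1%.*}"
}

# Read the page count from `pdfinfo` output on stdin.
# Usage: pdfinfo file.pdf | page_count
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Example:
#   searchable_name nice.PDF        prints nice_searchable.pdf
#   pdfinfo nice.PDF | page_count   prints the number of pages
```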

examples

how to extract images from pdf

pdfimages -all sm_td20a_very_detailed.pdf .
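
pdfimages treats its last argument as a filename prefix (image root), not as a directory, so with `.` the files are named `.-000` plus an extension and are hidden from a plain `ls`. A minimal sketch of deriving a visible prefix from the PDF name, assuming a hypothetical helper `image_root`:

```shell
# Hypothetical helper: derive an image-root prefix from the PDF name,
# e.g. /docs/manual.pdf -> manual
image_root() (
    name=${1##*/}                  # drop any directory part
    printf '%s\n' "${name%.*}"     # drop the last extension
)

# Example:
#   pdfimages -all manual.pdf "$(image_root manual.pdf)"
#   # writes manual-000.<ext>, manual-001.<ext>, ... in the current directory
```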