Use Tesseract to OCR a multi-page PDF
-------------------------------------
Last edited: $Date: 2018/05/30 18:39:38 $
## Tesseract to OCR
Tesseract is a well-known open source OCR engine, valued for the
quality of its text recognition.
## Ghostscript to convert PDF to TIFF
Tesseract works on image files, so first we have to convert the PDF
to an image. Ghostscript can render the PDF at a high resolution,
which gives us high quality images for OCR. When all pages are written
to one output file, the result is a single multi-page TIFF.
## Convert the PDF to TIFF with Ghostscript
gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
-sOutputFile=output_filename.tiff input_filename.pdf
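
When several PDFs have to be converted, the gs call above can be
wrapped in a small shell function. This is only a minimal sketch: the
function names tiff_name_for and pdf_to_tiff are made up for this
note, and the optional resolution argument is an assumption. The gs
options are the same as in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ghostscript): derive the TIFF
# filename from the PDF filename by swapping the extension.
tiff_name_for() {
    printf '%s.tiff\n' "${1%.pdf}"
}

# Hypothetical wrapper around the gs command shown above.
# $1 = input PDF
# $2 = resolution in dpi (assumed default: 600, as in the command above)
pdf_to_tiff() {
    input=$1
    dpi=${2:-600}
    output=$(tiff_name_for "$input")
    gs -dNOPAUSE -sDEVICE=tiffg4 -r"${dpi}x${dpi}" -dBATCH -sPAPERSIZE=a4 \
       -sOutputFile="$output" "$input"
}
```

Calling pdf_to_tiff input_filename.pdf would then write
input_filename.tiff next to the PDF. A second argument such as 300
lowers the resolution, which gives smaller files but may cost some
OCR accuracy.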
## Convert tiff to text with Tesseract
tesseract output_filename.tiff text_file -l eng
Tesseract writes the recognized text to text_file and appends a
".txt" extension to the filename, so the output ends up in
text_file.txt. The option -l eng selects English as the recognition
language.
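
Both steps can be chained in one shell function. Again this is only
a sketch under assumptions: the function name ocr_pdf is made up for
this note, and so is the choice to send the output of both tools to
stderr so that only the name of the text file is printed. The gs and
tesseract options are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: convert one PDF to TIFF and OCR the result.
# $1 = input PDF
# $2 = Tesseract language code (assumed default: eng, as above)
ocr_pdf() {
    pdf=$1
    lang=${2:-eng}
    base=${pdf%.pdf}

    # Step 1: render the PDF to a TIFF, same options as above.
    gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
       -sOutputFile="$base.tiff" "$pdf" >&2 || return 1

    # Step 2: Tesseract appends ".txt" itself, so we pass the
    # output name without an extension.
    tesseract "$base.tiff" "$base" -l "$lang" >&2 || return 1

    # Print the name of the resulting text file.
    printf '%s.txt\n' "$base"
}
```

Calling ocr_pdf input_filename.pdf would leave input_filename.tiff
and input_filename.txt next to the PDF. If either tool fails, the
function stops and returns a non-zero status.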
$Id: ocrwithtesseract.txt,v 1.3 2018/05/30 18:39:38 matto Exp $