PDF to Text

The bash script Pdf2txt.sh, located in the same directory as this file,
performs all of the following steps for PDFs of up to 99 pages. A copy is
also on test.sb.state.az.us (10.168.30.100) in /home/jimc.


Convert .pdf document to single page .tif format documents
>gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
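	-r is the resolution in dpi (300x300 here)
	-sDEVICE=tiffgray selects grayscale TIFF output
	-sOutputFile=outputfilename
	 	note %02d causes the zero-padded, two-digit page number
	 	to be inserted into the filename
	-dBATCH -dNOPAUSE make gs run without prompting and exit when done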
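
The %02d pattern behaves like a printf format, so the names gs produces can
be previewed in the shell. This is only an illustrative sketch: page_name is
a hypothetical helper, not part of Pdf2txt.sh. It also shows that page 100
would get a three-digit name, which no longer sorts in page order; that may
be the reason for the 99-page limit, but this is an assumption.

```shell
# Hypothetical helper: print the file name gs would use for page number $1
# when run with -sOutputFile=ocr_%02d.tif.
page_name() {
    printf 'ocr_%02d.tif\n' "$1"
}

page_name 1      # prints ocr_01.tif
page_name 42     # prints ocr_42.tif
page_name 100    # prints ocr_100.tif (sorts between ocr_10 and ocr_11)
```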

Next, use tesseract to convert each page to text
>tesseract  inputFile outputfile -l eng
	inputFile is one of the .tif files produced by gs
	outputfile will be given a .txt extension
	-l is the language of the input file (eng = English)
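
Because tesseract handles one page per call, the command has to be repeated
for every .tif file. A minimal sketch of such a loop follows. It assumes the
pages are named ocr_NN.tif as in the gs step and sit in the current
directory; ocr_pages is a hypothetical name and this is not necessarily how
Pdf2txt.sh does it.

```shell
# Hypothetical helper: run tesseract on every ocr_NN.tif in the current
# directory. Each page ocr_NN.tif yields ocr_NN.txt, because tesseract
# appends .txt to the output base name.
ocr_pages() {
    for tif in ocr_[0-9][0-9].tif; do
        [ -e "$tif" ] || continue          # pattern matched no files
        tesseract "$tif" "${tif%.tif}" -l eng || return 1
    done
}
```

Call ocr_pages from the directory that holds the gs output.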

Reassemble the OCR'ed .txt files into a single document
>cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt
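
Typing every file name can be avoided with a shell pattern, since zero-padded
names expand in page order. The sketch below assumes the text files are
named ocr_NN.txt (that is, tesseract was given ocr_NN as the output name);
join_pages is a hypothetical helper, not taken from Pdf2txt.sh.

```shell
# Hypothetical helper: concatenate ocr_01.txt .. ocr_99.txt from the current
# directory into the file named by $1. The shell expands the pattern in
# sorted order, which equals page order for two-digit, zero-padded names.
join_pages() {
    cat ocr_[0-9][0-9].txt > "$1"
}
```

For example, join_pages Input.txt writes the combined text to Input.txt.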


On the test server (test.sb.state.az.us, 10.168.30.100)
I installed tesseract with
	yum -y install tesseract tesseract-en
from the RPMforge repository, which was set up first with
  wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm --import  http://apt.sw.be/RPM-GPG-KEY.dag.txt
  rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm
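
After installing, it can help to confirm that both commands used above are
on the PATH. This is a small sketch only; missing_tools is a hypothetical
helper, and it checks that a command exists, not that it works.

```shell
# Hypothetical helper: print the name of each listed command that cannot be
# found on the PATH. Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running missing_tools gs tesseract prints nothing when both are installed.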