Pdf to Text

The bash script Pdf2txt.sh, located in the same directory as this file, performs all of the following steps for PDFs of up to 99 pages. It is also on test.sb.state.az.us (10.168.30.100) in /home/jimc.

Convert the .pdf document to single-page .tif documents:

>gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf

    -r             is the resolution (300x300 dpi)
    -sDEVICE       selects the output device; tiffgray produces grayscale TIFF output
    -sOutputFile   =outputfilename; note that %02d causes the two-digit page number to be inserted into the filename
    inputfile.pdf  is a placeholder for the PDF to convert

Next, use tesseract to convert each page to text:

>tesseract inputFile outputfile -l eng

    inputFile   is one of the .tif files output by gs
    outputfile  will be given a .txt extension
    -l          is the language of the input file (eng = English)

Reassemble the OCR'ed .txt files into a single document:

>cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt

On the test server <10.168.30.100 test.sb.state.az.us> I have installed tesseract using the following:

    yum -y install tesseract tesseract-en

using the RPMforge repository, which was set up with:

    wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
    rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
    rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
    rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm
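
The three conversion steps described in this file (gs, tesseract, cat) can be chained in one shell function. What follows is only a minimal sketch and is not the contents of Pdf2txt.sh. The function name pdf_to_text, its two arguments, and the use of a temporary working directory are assumptions made for illustration; the gs and tesseract options are the ones given in this file.

```shell
#!/bin/bash
# Sketch only: NOT the actual Pdf2txt.sh. The function name, arguments and
# temporary directory are illustrative assumptions.
# Usage: pdf_to_text inputfile.pdf Input.txt
pdf_to_text() {
    local pdf="$1" out="$2" workdir page
    workdir="$(mktemp -d)" || return 1

    # Step 1: render every PDF page to its own grayscale TIFF at 300 dpi.
    # %02d is expanded by gs to the two-digit page number.
    gs -r300x300 -sDEVICE=tiffgray -sOutputFile="$workdir/ocr_%02d.tif" \
        -dBATCH -dNOPAUSE "$pdf" >/dev/null || { rm -rf "$workdir"; return 1; }

    # Step 2: OCR each page. tesseract appends .txt to the output base name.
    for page in "$workdir"/ocr_*.tif; do
        tesseract "$page" "${page%.tif}" -l eng >/dev/null 2>&1 \
            || { rm -rf "$workdir"; return 1; }
    done

    # Step 3: join the per-page text files. The glob sorts by name, which
    # matches page order only while page numbers fit in two digits.
    cat "$workdir"/ocr_*.txt > "$out"
    rm -rf "$workdir"
}
```

After sourcing the function, it would be called as: pdf_to_text inputfile.pdf Input.txt. Relying on the sorted glob instead of listing each tess-outfile by hand is one reason a two-digit page counter limits this approach to 99 pages.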