What I did to set up for the conversion

Note: I'm doing this on a CentOS 5.x system.

1. Add RpmForge to the YUM repo file

wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm

2. Install tesseract

yum -y install tesseract tesseract-en

This makes it possible to do the following:

gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE Input.pdf

where the options are
  -r            is the resolution
  -sDEVICE      selects the output device; tiffgray writes grayscale TIFF files
  -sOutputFile  is the output filename; note that %02d causes the page number
                to be inserted into the filename

Followed by

tesseract inputFile outputfile -l eng

where the options are
  inputFile     is one of the output tif files from gs
  outputfile    will be given a .txt extension
  -l            is the language of the input file (eng for English)

And then put the pages back together by

cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt

There will be some failed conversions/bad guesses by the tesseract program, so check the final output for correctness.

Bash Script to do the conversion
< This got reformatted and I attempted to put it back the way I remembered it.>
< the tesseract step takes a while on each page>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#!/bin/bash
# # # # #
# Use this script to convert a PDF-formatted file to text.
# The input file will be split into single-page TIFF files,
# which will be run through tesseract to OCR the files into
# text files. The text files will be reassembled into a
# single text file.
#
# NOTE: There will still be some cleanup of the text files
# as the OCR is not perfect.
#
# # # # # #

# Get input file name and final output filename
InFile=${1:-"infile.pdf"}
InName=$(basename "$InFile")    # file name without any leading directory
TIFFile="${InName%.pdf}"
OutFile=${2:-"$TIFFile.txt"}
echo "Input from $InFile, OCR output to $OutFile"

if [ ! -e "$InFile" ] ; then
    echo "$InFile not found. exiting"
    exit 1
elif [ ! -r "$InFile" ] ; then
    echo " Read not allowed on $InFile. exiting"
    exit 1
fi

# Set up a temp working area
WrkDir="/tmp/$(date +%s)"
mkdir "$WrkDir"
echo " Working Dir = $WrkDir"
cp "$InFile" "$WrkDir/"
Hdir=$(pwd)
cd "$WrkDir" || exit 1
# pwd

# Split the PDF into one grayscale TIFF per page; gs reports each page on stdout
gs -r300x300 -sDEVICE=tiffgray -sOutputFile="$TIFFile%02d.tif" -dBATCH -dNOPAUSE "$InName" >files
TifCount=$(grep -c "Page " files)
rm files
# ls -l *.tif
echo "number of pages to process = $TifCount"

# OCR each page; tesseract appends .txt to the output name
for wtif in *.tif; do
    wtxt=${wtif%.tif}
    tesseract "$wtif" "$wtxt" -l eng
done
# ls -l *.txt

# Collect the page files before creating the output file,
# so the output file is not appended to itself
TxtFiles=( *.txt )
touch "$OutFile"
for Tf in "${TxtFiles[@]}"; do
    # echo "Working on $Tf, "
    cat "$Tf" >> "$OutFile"
done
ls -l
cp "$OutFile" "$Hdir/"
cd "$Hdir"

# once debugged enable the following
rm -fr "$WrkDir"
exit 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
James C
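
One possible refinement of the working-directory and reassembly steps in the script above, offered as a minimal sketch rather than as part of the original script. It assumes mktemp is installed. The helper names make_workdir and join_pages, the pdf2txt prefix, and the doc01.txt / doc02.txt stand-in files are all hypothetical and only there for illustration. A directory named after the current second can collide if two runs start at the same time, whereas mktemp -d picks a unique name, and the EXIT trap removes the directory even when a step fails part way through.

```shell
#!/bin/bash
# Sketch only: helper names and file names below are illustrative assumptions.

# make_workdir: create a uniquely named private directory and print its path
make_workdir() {
    mktemp -d "${TMPDIR:-/tmp}/pdf2txt.XXXXXX"
}

# join_pages OUTFILE PAGE...: write the pages to OUTFILE in the order given
join_pages() {
    local out=$1
    shift
    : > "$out"                  # start from an empty output file
    local page
    for page in "$@"; do
        cat "$page" >> "$out"
    done
}

WrkDir=$(make_workdir) || exit 1
trap 'rm -rf "$WrkDir"' EXIT    # clean up on any exit, normal or not

# Stand-in page files; in the real script tesseract would produce these.
printf 'page one\n' > "$WrkDir/doc01.txt"
printf 'page two\n' > "$WrkDir/doc02.txt"

# The glob matches only the numbered page files, not doc.txt itself,
# and expands in sorted order, so pages are joined in sequence.
join_pages "$WrkDir/doc.txt" "$WrkDir"/doc[0-9][0-9].txt
cat "$WrkDir/doc.txt"           # prints the two pages in order
```

Because the page files are passed as arguments, file names with spaces are handled without parsing the output of ls. Two-digit page numbers sort correctly only up to 99 pages; a longer document would need a wider field such as %03d in the gs output name and a matching glob.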