Fwd: RE: making PDFs workable

James Crawford jrefl5 at gmail.com
Fri Sep 14 21:00:19 MST 2012



What I did to set up for the conversion. Note that I'm doing this on a
CentOS 5.x system.
1. Add RpmForge to the YUM repo file
         wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
         rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
         rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
         rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm
2. Install tesseract and its English language data
         yum -y install tesseract tesseract-en
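
Before going further it may be worth confirming that both programs ended
up on the PATH. A minimal sketch, assuming gs (Ghostscript) was already
installed separately; missing_tools is a hypothetical helper name and is
not part of the script further down:

```shell
# Print the names of any commands that cannot be found on the PATH.
# missing_tools is a hypothetical helper used only for this check.
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space before printing.
    echo "${missing# }"
}

missing=$(missing_tools gs tesseract)
if [ -n "$missing" ]; then
    echo "Missing: $missing"
else
    echo "gs and tesseract are both installed"
fi
```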

This makes it possible to do the following:
         gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
     where the options are
             -r is the resolution in dpi
             -sDEVICE selects the output device; tiffgray gives
              grayscale (monochrome) TIFF output
             -sOutputFile=outputfilename; note that %02d causes the
              page number to be inserted into the filename
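
Ghostscript treats the %02d in the output name as a printf-style format.
The shell's printf accepts the same format, so the resulting file names
can be previewed without running gs. A small sketch; page_name is a
hypothetical helper used only for illustration:

```shell
# Preview the file name gs would produce for a given page number.
# page_name is a hypothetical helper, not part of gs.
page_name() {
    printf 'ocr_%02d.tif\n' "$1"
}

page_name 1     # prints ocr_01.tif
page_name 12    # prints ocr_12.tif
```
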
Followed by
         tesseract inputFile outputfile -l eng
     where the options are
             inputFile is one of the TIFF files produced by gs
             outputfile will be given a .txt extension
             -l is the language of the input file (eng for English)
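
Since tesseract adds the .txt extension itself, the output name should be
given without one. One way to derive it from the TIFF name is the shell's
suffix-stripping expansion, sketched here with a hypothetical helper
called out_base:

```shell
# Strip a trailing ".tif" so the result can be passed to tesseract
# as the output base name. out_base is a hypothetical helper.
out_base() {
    printf '%s\n' "${1%.tif}"
}

out_base ocr_01.tif    # prints ocr_01

# tesseract would then be invoked like this (not executed here):
#   tesseract ocr_01.tif "$(out_base ocr_01.tif)" -l eng
```
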
And then put the pages back together with
             cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt
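
Typing every page name gets tedious. Because the page numbers are
zero-padded, a shell glob lists the files in page order, so the cat step
can be sketched as below. join_pages is a hypothetical helper and the demo
files are made up for illustration:

```shell
# Concatenate <prefix>NN.txt files, in page order, into one file.
# join_pages is a hypothetical helper; $1 = prefix, $2 = output file.
join_pages() {
    cat "$1"[0-9][0-9].txt > "$2"
}

# Demo with two made-up page files in a scratch directory.
demo=$(mktemp -d)
printf 'page one\n' > "$demo/tess-outfile01.txt"
printf 'page two\n' > "$demo/tess-outfile02.txt"
join_pages "$demo/tess-outfile" "$demo/Input.txt"
cat "$demo/Input.txt"
rm -r "$demo"
```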

There will be some failed conversions and bad guesses by the tesseract
program, so check the final output for correctness.

Bash Script to do the conversion
< This got reformatted and I attempted to put it back the way I 
remembered it.>
< the tesseract step takes a while on each page>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#!/bin/bash
# Use this script to convert a PDF-formatted file to text.
# The input file will be split into single-page TIFF files,
# which will be run through tesseract to OCR the files into
# text files. The text files will be reassembled into a
# single text file.
# NOTE: There will still be some cleanup of the text files
# as the OCR is not perfect.
#
# Get input file name and final output filename
InFile=${1:-"infile.pdf"}
TIFFile="${InFile%.pdf}"
OutFile=${2:-"$TIFFile.txt"}
echo "Input from $InFile, OCR output to $OutFile"
if [ ! -e "$InFile" ] ; then
     echo "$InFile not found. exiting"
     exit 1
elif [ ! -r "$InFile" ] ; then
     echo " Read not allowed on $InFile. exiting"
     exit 1
fi
# set up a temp working area
WrkDir="/tmp/$(date +%s)"
mkdir "$WrkDir"
echo " Working Dir = $WrkDir"
cp "$InFile" "$WrkDir/"
Hdir=$(pwd)
# stop if the working directory cannot be entered
cd "$WrkDir" || exit 1
# pwd
gs -r300x300 -sDEVICE=tiffgray -sOutputFile="$TIFFile%02d.tif" -dBATCH -dNOPAUSE "$InFile" >files
TifCount=$(grep "Page " files | wc -l)
rm files
#
ls -l *.tif
echo "number of pages to process = $TifCount"
for wtif in *.tif; do
     wtxt=${wtif%.tif}
     tesseract "$wtif" "$wtxt" -l eng
done
#
ls -l *.txt
# list the page files before $OutFile exists so it is not included
TxtFiles=$( ls *.txt )
touch "$OutFile"
for Tf in $TxtFiles; do
#
     echo "Working on $Tf, "
     cat "$Tf" >> "$OutFile"
done
ls -l
cp "$OutFile" "$Hdir/"
cd "$Hdir"
# once debugged, enable the following
rm -fr "$WrkDir"
exit 0
 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
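
The working directory name in the script is built from the clock, so two
runs started in the same second would share it. A possible alternative,
assuming the mktemp command is available (the pdf2txt prefix is an
arbitrary choice), creates a unique directory and removes it automatically
when the script exits:

```shell
# Create a unique working directory; the "pdf2txt" prefix is arbitrary.
WrkDir=$(mktemp -d "${TMPDIR:-/tmp}/pdf2txt.XXXXXX") || exit 1
# Remove the directory when the script exits, whatever the exit status.
trap 'rm -rf "$WrkDir"' EXIT
echo " Working Dir = $WrkDir"
```

With the trap in place the explicit rm at the end of the script would no
longer be needed.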

James C
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Pdf2txt.sh
Type: application/octet-stream
Size: 1362 bytes
Desc: not available
URL: <http://lists.PLUG.phoenix.az.us/pipermail/plug-discuss/attachments/20120914/4a2685c1/attachment.obj>
-------------- next part --------------
Pdf to Text

The bash script Pdf2txt.sh, located in the same directory as this file,
will do all of the following steps for PDFs of up to 99 pages. It is also
on test.sb.state.az.us (10.168.30.100) in /home/jimc.
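
The 99-page limit presumably comes from the two-digit page number in the
file names: once a third digit appears, plain name sorting no longer
matches page order. The sketch below illustrates this with the shell's
printf and sort. sorted_names is a hypothetical helper, and widening the
format to %03d is a suggestion rather than something the script does:

```shell
# Print file names for the given page numbers, sorted by name.
# sorted_names is a hypothetical helper; $1 = printf format for the
# name, remaining arguments = page numbers. LC_ALL=C keeps the sort
# order the same on every system.
sorted_names() {
    local fmt=$1 n
    shift
    for n in "$@"; do
        printf "$fmt\n" "$n"
    done | LC_ALL=C sort
}

sorted_names 'ocr_%02d.tif' 9 10 11 100
# ocr_09.tif
# ocr_10.tif
# ocr_100.tif   <- page 100 lands before page 11
# ocr_11.tif

sorted_names 'ocr_%03d.tif' 9 10 11 100
# ocr_009.tif
# ocr_010.tif
# ocr_011.tif
# ocr_100.tif
```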


Convert .pdf document to single page .tif format documents
>gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
	-r is the resolution in dpi
	-sDEVICE selects the output device; tiffgray gives grayscale (monochrome) TIFF output
	-sOutputFile=outputfilename
	 	note %02d causes the page number to be inserted
	 	into the filename

Next, use tesseract to convert each page to text
>tesseract inputFile outputfile -l eng
	inputFile is one of the TIFF files produced by gs
	outputfile will be given a .txt extension
	-l is the language of the input file (eng for English)

Reassemble the OCR'ed .txt files into a single document
>cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt


On the test server <10.168.30.100 test.sb.state.az.us>
I have installed tesseract using the following
	yum -y install tesseract tesseract-en
using the RpmForge repository
  wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm --import  http://apt.sw.be/RPM-GPG-KEY.dag.txt
  rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm



