Fwd: RE: making PDFs workable
James Crawford
jrefl5 at gmail.com
Fri Sep 14 21:00:19 MST 2012
What I did to set up for the conversion. Note: I'm doing this on a CentOS
5.x system.
1. Add RpmForge to the YUM repo file
wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm
2. Install tesseract
yum -y install tesseract tesseract-en
This makes it possible to do the following:
gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
where the options are
-r is the resolution (here 300x300 dpi)
-sDEVICE for grayscale (monochrome) output
-sOutputFile=outputfilename; note %02d causes the page
number to be inserted into the filename
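To see what the %02d pattern does to the file names, the same format can be tried with the shell's printf. This is only an illustration of the naming, not part of the conversion, and the function name page_name is made up for the example.

```shell
#!/bin/bash
# Illustration only: show the file names gs would produce from
# -sOutputFile=ocr_%02d.tif. page_name is a hypothetical helper.
page_name() {
    # $1 = page number; %02d pads it to two digits with a leading zero
    printf 'ocr_%02d.tif\n' "$1"
}

page_name 1     # prints ocr_01.tif
page_name 12    # prints ocr_12.tif
```

The zero padding is what keeps the page files in the right order when they are listed or globbed later.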
Followed by
tesseract inputFile outputfile -l eng
where the options are
input is the output tif files from gs
outputfile will be given a .txt extension
-l language of input file <eng>lish
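Since tesseract adds the .txt extension itself, the output name is just the tif name with the suffix removed. A small dry-run sketch of that naming follows; ocr_commands is a hypothetical helper that only prints the commands and does not run tesseract.

```shell
#!/bin/bash
# Sketch: print the tesseract command for each page image without running it.
# ocr_commands is a hypothetical helper name.
ocr_commands() {
    local tif
    for tif in "$@"; do
        # strip the .tif suffix; tesseract appends .txt to the output base
        printf 'tesseract %s %s -l eng\n' "$tif" "${tif%.tif}"
    done
}

ocr_commands ocr_01.tif ocr_02.tif
# prints:
#   tesseract ocr_01.tif ocr_01 -l eng
#   tesseract ocr_02.tif ocr_02 -l eng
```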
And then put the pages back together by
>cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt
There will be some failed conversions/bad guesses by the tesseract program,
so check the final output for correctness.
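One cheap check while putting the pages back together is to flag any page whose text file came out empty, since that usually means the OCR got nothing usable from it. A minimal sketch, assuming the per-page files are passed in page order; join_pages and the file names in the usage line are made up for illustration.

```shell
#!/bin/bash
# Sketch: join per-page text files in the order given and warn about
# pages that came out empty. join_pages is a hypothetical helper name.
join_pages() {
    local out=$1; shift
    local page
    : > "$out"                      # start with an empty output file
    for page in "$@"; do
        if [ ! -s "$page" ]; then   # -s is true only for non-empty files
            echo "warning: $page is empty, check that page by hand" >&2
        fi
        cat "$page" >> "$out"
    done
}

# Usage (a glob expands in sorted order, so two-digit page numbers
# stay in sequence):
#   join_pages Input.txt ocr_*.txt
```

This only catches pages with no text at all; wrong guesses inside a page still need a read-through.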
Bash Script to do the conversion
< This got reformatted and I attempted to put it back the way I
remembered it.>
< the tesseract step takes a while on each page>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#!/bin/bash
#
# Use this script to convert a PDF formatted file to text.
# The input file will be split into single page tiff files
# which will be run through tesseract to OCR the files into
# text files. The text files will be reassembled into a
# single text file.
#
# NOTE: There will still be some cleanup of the text files
# as the OCR is not perfect.
#
# Get Input file name, and final output filename
InFile=${1:-"infile.pdf"}
# strip any directory and the .pdf suffix so the page files land in the working dir
TIFFile="$(basename "${InFile%.pdf}")"
OutFile=${2:-"$TIFFile.txt"}
echo "Input from $InFile, OCR output to $OutFile"
if [ ! -e "$InFile" ] ; then
    echo "$InFile not found. exiting"
    exit 1
elif [ ! -r "$InFile" ] ; then
    echo " Read not allowed on $InFile. exiting"
    exit 1
fi
# setup a temp working area
WrkDir="/tmp/$(date +%s)"
mkdir "$WrkDir" || exit 1
echo " Working Dir = $WrkDir"
cp "$InFile" "$WrkDir/"
Hdir=$(pwd)
cd "$WrkDir" || exit 1
# pwd
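# read the copy made in the working dir; the page list from gs goes to "files"
gs -r300x300 -sDEVICE=tiffgray -sOutputFile="$TIFFile%02d.tif" -dBATCH \
    -dNOPAUSE "$(basename "$InFile")" > files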
TifCount=$(grep "Page " files | wc -l)
rm files
#
ls -l *.tif
echo "number of pages to process = $TifCount"
for wtif in *.tif; do
    wtxt=${wtif%.tif}
    tesseract "$wtif" "$wtxt" -l eng
done
#
ls -l *.txt
# collect the page files before creating $OutFile so it is not picked up too
TxtFiles=( *.txt )
touch "$OutFile"
for Tf in "${TxtFiles[@]}"; do
    #
    echo "Working on $Tf, "
    cat "$Tf" >> "$OutFile"
done
ls -l
cp "$OutFile" "$Hdir/"
cd "$Hdir"
# once debugged enable the following
rm -fr "$WrkDir"
exit 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
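The script above names its working directory after the current time in seconds, so two runs started in the same second would share a directory, and an early exit leaves it behind. One possible variant, assuming mktemp with the -d option is available on the system, is sketched below. This is not what the attached script does; make_workdir and the pdf2txt prefix are made-up names.

```shell
#!/bin/bash
# Sketch of an alternative working-directory setup.
# make_workdir is a hypothetical helper name.
make_workdir() {
    # mktemp -d creates a uniquely named directory and prints its path
    mktemp -d "${TMPDIR:-/tmp}/pdf2txt.XXXXXX"
}

WrkDir=$(make_workdir) || exit 1
# remove the working directory whenever the script exits, even after an error
trap 'rm -rf "$WrkDir"' EXIT
echo " Working Dir = $WrkDir"
```

With the trap in place the final rm line is no longer needed, but while debugging it may be handy to comment the trap out so the page files can be inspected.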
James C
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Pdf2txt.sh
Type: application/octet-stream
Size: 1362 bytes
Desc: not available
URL: <http://lists.PLUG.phoenix.az.us/pipermail/plug-discuss/attachments/20120914/4a2685c1/attachment.obj>
-------------- next part --------------
Pdf to Text
The bash script Pdf2txt.sh, located in the same directory as this file,
will do all of the following steps for PDFs of up to 99 pages. It is also
on test.sb.state.az.us (10.168.30.100) in /home/jimc.
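The 99 page limit presumably comes from the two-digit %02d page number: from page 100 on the names get a third digit and no longer sort in page order. The sketch below only illustrates that sorting effect; page_order is a hypothetical helper, and switching to %03d is a suggestion, not something the script does.

```shell
#!/bin/bash
# Illustration: why two-digit page numbers stop sorting correctly after
# page 99. page_order is a hypothetical helper; it names pages with the
# given printf format and prints the names in byte-wise sorted order.
page_order() {
    local fmt=$1; shift
    local n
    for n in "$@"; do
        printf "$fmt\n" "$n"
    done | LC_ALL=C sort
}

page_order 'ocr_%02d.tif' 9 10 99 100
# prints ocr_09.tif, ocr_10.tif, ocr_100.tif, ocr_99.tif (page 100 before 99)

page_order 'ocr_%03d.tif' 9 10 99 100
# prints ocr_009.tif, ocr_010.tif, ocr_099.tif, ocr_100.tif
```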
Convert .pdf document to single page .tif format documents
>gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
-r is the resolution (here 300x300 dpi)
-sDEVICE for grayscale (monochrome) output
-sOutputFile=outputfilename
note %02d causes the page number to be inserted
into the filename
next use tesseract to convert each page to text
>tesseract inputFile outputfile -l eng
input is the output tif files from gs
outputfile will be given a .txt extension
-l language of input file <eng>lish
reassemble the ocr'ed .txt files into a single document
>cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt
On the test server <10.168.30.100 test.sb.state.az.us>
I have installed tesseract using the following
yum -y install tesseract tesseract-en
using the RPMforge repository
wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm