Fwd: RE: making PDFs workable

Attachments:
Message as email (text/plain) Pdf2txt.sh (application/octet-stream) PDF-OCR-Text.TXT (text/plain) (text/plain)

Author: James Crawford
Date:
To: plug-discuss
Subject: Fwd: RE: making PDFs workable

What I did to set up for the conversion. Note: I'm doing this on a CentOS
5.x system.
1. Add RpmForge to the YUM repo file
         wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
         rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
         rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
         rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm
2. install tesseract
         yum -y install tesseract tesseract-en
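
Before going further it may be worth confirming that both programs ended
up on the PATH. A minimal sketch; missing_tools is a made-up helper name,
not something from the original setup:

```shell
#!/bin/bash
# missing_tools (hypothetical helper): print each named command
# that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools gs tesseract)
if [ -n "$missing" ]; then
    # unquoted on purpose so several names end up on one line
    echo "Missing:" $missing
else
    echo "gs and tesseract found"
fi
```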

This makes it possible to do the following:
         gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>
     where the options are
             -r is the resolution (300x300 dpi here)
             -sDEVICE=tiffgray selects grayscale TIFF output
             -sOutputFile=outputfilename; note %02d causes the page
                 number to be inserted into the filename
             -dBATCH -dNOPAUSE make gs run through all pages without
                 prompting and exit when done
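
The %02d in the output name is a printf-style format, so the shell's own
printf can preview the file names gs will write. A small sketch;
page_name is a hypothetical helper used only for this illustration:

```shell
#!/bin/bash
# page_name (hypothetical helper): show the file name that
# -sOutputFile=ocr_%02d.tif produces for a given page number.
page_name() {
    printf 'ocr_%02d.tif\n' "$1"
}

for page in 1 2 10; do
    page_name "$page"
done
# prints:
#   ocr_01.tif
#   ocr_02.tif
#   ocr_10.tif
```
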
Followed by
         tesseract inputFile outputfile -l eng
     where the options are
             inputFile is one of the .tif files output by gs
             outputfile will be given a .txt extension
             -l is the language of the input file (eng = English)
And then put the pages back together with
         cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt
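
Because the page numbers are zero-padded to two digits, a shell glob
returns the page files in page order, so they do not have to be typed out
one by one. A sketch under that assumption; join_pages and its arguments
are hypothetical names:

```shell
#!/bin/bash
# join_pages PREFIX OUTFILE (hypothetical helper):
# concatenate PREFIX01.txt, PREFIX02.txt, ... into OUTFILE.
# The glob expands in sorted order, which is page order here
# because the numbers are zero-padded to two digits.
join_pages() {
    cat "$1"[0-9][0-9].txt > "$2"
}
```

For the file names used above this would be called as
         join_pages tess-outfile Input.txt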

There will be some failed conversions and bad guesses by the tesseract
program, so check the final output for correctness.
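
A quick first check, before proofreading, is to look for pages where
nothing at all was recognized. This is only a sketch: blank_pages is a
hypothetical helper, and it cannot spot wrong guesses, only page files
that contain no letters or digits.

```shell
#!/bin/bash
# blank_pages (hypothetical helper): print the name of every given
# text file that contains no alphanumeric characters at all.
blank_pages() {
    for f in "$@"; do
        grep -q '[[:alnum:]]' "$f" || echo "$f"
    done
}
```

It would be run on the per-page files, for example
         blank_pages tess-outfile*.txt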

Bash Script to do the conversion
< This got reformatted and I attempted to put it back the way I 
remembered it.>
< the tesseract step takes a while on each page>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#!/bin/bash
#
# Use this script to convert a PDF-formatted file to text.
# The input file will be split into single-page TIFF files,
# which will be run through tesseract to OCR them into
# text files. The text files will be reassembled into a
# single text file.
#
# NOTE: There will still be some cleanup of the text files
# as the OCR is not perfect.
#
# Get input file name, and final output filename
InFile=${1:-"infile.pdf"}
TIFFile="${InFile%.pdf}"
OutFile=${2:-"$TIFFile.txt"}
echo "Input from $InFile, OCR output to $OutFile"
if [ ! -e "$InFile" ] ; then
     echo "$InFile not found. exiting"
     exit 1
elif [ ! -r "$InFile" ] ; then
     echo " Read not allowed on $InFile. exiting"
     exit 1
fi
# set up a temp working area
WrkDir="/tmp/$(date +%s)"
mkdir "$WrkDir" || exit 1
echo " Working Dir = $WrkDir"
cp "$InFile" "$WrkDir/"
Hdir=$(pwd)
cd "$WrkDir" || exit 1
# pwd
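# gs prints one "Page N" line per page; save them so the pages can be counted
# (this assumes $InFile was given without a directory part)
gs -r300x300 -sDEVICE=tiffgray -sOutputFile="$TIFFile%02d.tif" -dBATCH -dNOPAUSE "$InFile" >files
TifCount=$(grep -c "Page " files)
rm files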
#
ls -l *.tif
echo "number of pages to process = $TifCount"
for wtif in *.tif; do
     wtxt=${wtif%.tif}
     tesseract "$wtif" "$wtxt" -l eng
done
#
ls -l *.txt
# list the page files before $OutFile exists, so it is not included
TxtFiles=$( ls *.txt )
touch "$OutFile"
for Tf in $TxtFiles; do
#
     echo "Working on $Tf, "
     cat "$Tf" >> "$OutFile"
done
ls -l
cp "$OutFile" "$Hdir/"
cd "$Hdir"
# once debugged, enable the following
rm -fr "$WrkDir"
exit 0

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
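
The three assignments at the top of the script decide every file name it
uses, so here is how they behave for different arguments. The variable
names are the script's own; derive_names is a hypothetical wrapper added
only to show the results:

```shell
#!/bin/bash
# derive_names (hypothetical wrapper): repeat the script's three
# assignments and print the resulting names.
derive_names() {
    InFile=${1:-"infile.pdf"}       # first argument, or a default
    TIFFile="${InFile%.pdf}"        # input name with .pdf stripped
    OutFile=${2:-"$TIFFile.txt"}    # second argument, or <name>.txt
    echo "$InFile $TIFFile $OutFile"
}

derive_names                      # infile.pdf infile infile.txt
derive_names report.pdf           # report.pdf report report.txt
derive_names report.pdf out.txt   # report.pdf report out.txt
```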

James C
Pdf to Text

The bash script Pdf2txt.sh, located in the same directory as this file,
will do all the following steps for PDFs up to 99 pages. It is also
on test.sb.state.az.us (10.168.30.100) in /home/jimc.
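
One plausible reason for the 99-page limit (an assumption; the note does
not say) is that %02d pads to only two digits, so the file for page 100
sorts between pages 10 and 11 and the text would be reassembled out of
order. A small sketch; sorted_pages is a hypothetical helper:

```shell
#!/bin/bash
# sorted_pages (hypothetical helper): sort file names byte by byte,
# the way two-digit page files would be ordered for reassembly.
sorted_pages() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

sorted_pages ocr_09.txt ocr_10.txt ocr_11.txt ocr_99.txt ocr_100.txt
# prints:
#   ocr_09.txt
#   ocr_10.txt
#   ocr_100.txt
#   ocr_11.txt
#   ocr_99.txt
```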

Convert .pdf document to single page .tif format documents
>gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE <Input.pdf>

    -r is the resolution (300x300 dpi here)
    -sDEVICE=tiffgray selects grayscale TIFF output
    -sOutputFile=outputfilename
         note %02d causes the page number to be inserted
         into the filename

next use tesseract to convert each page to text
>tesseract inputFile outputfile -l eng

    inputFile is one of the .tif files output by gs
    outputfile will be given a .txt extension
    -l is the language of the input file (eng = English)

reassemble the ocr'ed .txt files into a single document
>cat tess-outfile01.txt tess-outfile02.txt ... tess-outfilenn.txt > Input.txt

on test server <10.168.30.100 test.sb.state.az.us>
I have installed tesseract using the following
    yum -y install tesseract tesseract-en
using the RpmForge repositories
  wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm --import  http://apt.sw.be/RPM-GPG-KEY.dag.txt
  rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm
  rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm

