Document Management

Matt Graham danceswithcrows at usa.net
Mon Nov 2 08:31:56 MST 2009


From: Alex Dean <alex at crackpot.org>
> On Nov 1, 2009, at 9:24 PM, Ted Gould wrote:
>> I'd recommend gscan2pdf.  It works with SANE, but does nice things  
>> like handle double sided stuff easily.  It will also work with
>> GOCR to do OCR

That's not exactly a selling point.  GOCR is much worse than commercial
OCR engines, especially if the original image is skewed or damaged.

>> but leave the text embedded in the document (not shown) so it can
>> be searched
> Is there any way for an end-user to see the text version?

What they probably do here is "text-behind-image" in the PDF.  This
is reasonably easy to do; just have the PDF library draw the OCR text
invisibly at the same spot as the page image.  pdf2txt will probably
get the pure text version out.
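
For illustration, here is a rough sketch of what "draw text that is not
visible" can look like at the PDF content-stream level.  It only builds
the operator string; a real PDF library would do this for you.  The
function name and the "/F1" font resource name are made up for the
example, not taken from gscan2pdf or any particular library.

```python
def invisible_text_ops(text, x, y, font="/F1", size=12):
    """Return PDF content-stream operators that place `text` invisibly.

    `font` is a hypothetical resource name; the page's resource
    dictionary would have to define it.
    """
    # Escape the characters that are special inside a PDF literal string.
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return "\n".join([
        "BT",                 # begin text object
        f"{font} {size} Tf",  # select font and size
        "3 Tr",               # text rendering mode 3: neither fill nor stroke
        f"{x} {y} Td",        # move to the word's position on the page
        f"({escaped}) Tj",    # show the string (it stays searchable)
        "ET",                 # end text object
    ])

ops = invisible_text_ops("longword", 100, 700)
print(ops)
```

Because the text is still in the content stream, a text extractor can
pull it back out even though nothing is painted on the page.
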
We were doing something like this without PDF, where we had the page
TIFF combined with an XML file that had tons of elements like:

<word ulx="100" uly="100" lrx="230" lry="145">longword</word>

...so that individual words could be grepped for and highlighted on
the page TIFF if you wanted.
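
A minimal sketch of the "grep for a word, get its box" idea, assuming
the <word> elements sit under a single root element (the <page> wrapper
and the second sample word are assumptions for the example; the real
files may have been laid out differently):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file; only the first <word> line is from the post.
SAMPLE = """<page>
  <word ulx="100" uly="100" lrx="230" lry="145">longword</word>
  <word ulx="240" uly="100" lrx="300" lry="145">short</word>
</page>"""

def find_word_boxes(xml_text, query):
    """Return (ulx, uly, lrx, lry) for every word matching `query`."""
    root = ET.fromstring(xml_text)
    hits = []
    for word in root.iter("word"):
        # Case-insensitive exact match on the word's text.
        if (word.text or "").lower() == query.lower():
            box = tuple(int(word.get(k)) for k in ("ulx", "uly", "lrx", "lry"))
            hits.append(box)
    return hits

boxes = find_word_boxes(SAMPLE, "longword")
print(boxes)
```

Each returned box is the rectangle you would highlight on the page
TIFF.
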

> Someday I'm going to start digitizing and OCR-ing the 100 years of  
> local newspapers which are gathering mold in the library basement.  I  
> really have no firm plan as to how I'm going to do it, but doing it  
> with free software would be a big plus.

I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
Journal, Christian Science Monitor, and Boston Globe.  You will NOT
be able to get decent OCR with free software.  Newspapers require
a different approach than most OCR packages take; you have to split
each article up into multiple individual image files and OCR each
file separately, then stitch the results back together.  Editing the
results by hand is also a must, since newspaper text is so horrible
in quality.
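
The split/OCR/stitch flow can be sketched roughly like this.  Everything
here is illustrative: the page is a plain list of pixel rows, the boxes
use the same (ulx, uly, lrx, lry) convention as the XML above, and
`ocr` is whatever engine you plug in (the `fake_ocr` stand-in just
reports the size of the piece it was handed).

```python
def crop(page, box):
    """Cut box = (ulx, uly, lrx, lry) out of a page stored as pixel rows."""
    ulx, uly, lrx, lry = box
    return [row[ulx:lrx] for row in page[uly:lry]]

def ocr_article(page, boxes, ocr):
    """OCR each region separately, in the order given, and join the text."""
    pieces = [ocr(crop(page, box)) for box in boxes]
    # Drop regions that produced no text, then stitch the rest together.
    return "\n".join(p.strip() for p in pieces if p.strip())

def fake_ocr(image):
    """Stand-in for a real OCR engine: reports width x height only."""
    return f"{len(image[0])}x{len(image)} block"

page = [[0] * 8 for _ in range(6)]    # blank 8x6 "page"
boxes = [(0, 0, 4, 3), (4, 0, 8, 6)]  # two pieces of one article
text = ocr_article(page, boxes, fake_ocr)
print(text)
```

The boxes have to be listed in reading order, which for a newspaper
article that jumps columns or pages is the hard part, and the stitched
result still needs editing afterwards.
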

(I can talk about this for at least half an hour; contact me off-list
for more info.)

-- 
Matt G / Dances With Crows
The Crow202 Blog:  http://crow202.org/wordpress/
There is no Darkness in Eternity/But only Light too dim for us to see




More information about the PLUG-discuss mailing list