Document Management
Ted Gould
ted at gould.cx
Mon Nov 2 08:50:53 MST 2009
On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
> > From: Alex Dean <alex at crackpot.org>
> > On Nov 1, 2009, at 9:24 PM, Ted Gould wrote:
> >> I'd recommend gscan2pdf. It works with SANE, but does nice things
> >> like handle double sided stuff easily. It will also work with
> >> GOCR to do OCR.
>
> That's not exactly a great thing. GOCR is much worse than commercial
> OCR engines, especially if the original image is skewed/broken.
Oh, well, it's pluggable; GOCR is just all I have :)
In general, gscan2pdf does a bunch of cleanup to make the OCR reasonable.
I wouldn't say it's anywhere near perfect, but it seems to pull most of
the keywords out of things like credit card statements, so I'd call it
good enough for search.
> > Someday I'm going to start digitizing and OCR-ing the 100 years of
> > local newspapers which are gathering mold in the library basement. I
> > really have no firm plan as to how I'm going to do it, but doing it
> > with free software would be a big plus.
>
> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
> Journal, Christian Science Monitor, and Boston Globe. You will NOT
> be able to get decent OCR with free software. Newspapers require
> a different approach than most OCR packages take; you have to split
> each article up into multiple individual image files and OCR each
> file separately, then stitch the results back together. And editing
> the results is totally necessary since newspaper text is so horrible
> in quality.
>
> (I can talk about this for at least half an hour; contact offlist
> for more info.)
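For anyone following along, the split / OCR-each-piece / stitch approach
described in the quote above can be sketched roughly as below. This is
only a minimal illustration, not what was actually used on those
newspapers: the page is modelled as a plain grid of pixel values, the
article regions are assumed to be known bounding boxes (finding them is
the hard part and is not shown), and `ocr` is a hypothetical callable
standing in for whatever engine you plug in (GOCR or anything else).

```python
from typing import Callable, List, Sequence, Tuple

# Assumed representations, chosen only to keep the sketch self-contained.
Image = List[List[int]]          # grayscale page as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(page: Image, box: Box) -> Image:
    """Return the sub-image of `page` covered by `box`."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def ocr_article(page: Image,
                boxes: Sequence[Box],
                ocr: Callable[[Image], str]) -> str:
    """OCR each region on its own, then stitch the text back together.

    `boxes` must already be in reading order; `ocr` is a placeholder for
    a real engine and is the only part that touches OCR at all.
    """
    pieces = []
    for box in boxes:
        text = ocr(crop(page, box)).strip()
        if text:  # skip regions where the engine recognised nothing
            pieces.append(text)
    return "\n".join(pieces)


if __name__ == "__main__":
    # Stand-in "engine" that just reports the size of the piece it was given.
    def fake_ocr(img: Image) -> str:
        return f"{len(img)}x{len(img[0])}"

    page = [[0] * 6 for _ in range(4)]       # blank 6-wide, 4-tall page
    columns = [(0, 0, 3, 2), (3, 0, 6, 4)]   # two made-up article fragments
    print(ocr_article(page, columns, fake_ocr))
```

Keeping the engine behind a plain callable means the splitting and
stitching code does not change if you swap OCR engines later. The manual
correction pass mentioned in the quote would still have to happen on the
stitched output.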
+1, I wouldn't use it for archival things like that yet. But, you might
be able to use GOCR with the work Google is doing -- I'm not sure if
they're open sourcing all of it or not.
--Ted