Document Management

Tue Nov 3 14:36:45 MST 2009

On Nov 2, 2009, at 11:32 AM, Craig White wrote:

> On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
>> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
>> Journal, Christian Science Monitor, and Boston Globe.  You will NOT
>> be able to get decent OCR with free software.  Newspapers require
>> a different approach than most OCR packages take; you have to split
>> each article up into multiple individual image files and OCR each
>> file separately, then stitch the results back together.  And editing
>> the results is totally necessary since newspaper text is so horrible
>> in quality.
> ----
> I don't know anything about GOCR at all.
>
> A few years ago I set up tesseract and it worked as well as I have seen
> any OCR program work (in terms of accuracy) though clearly there are
> many limitations compared to something like Omnipage. In the end it was
> rather easy to install and get it working.
>
> http://code.google.com/p/tesseract-ocr/

Google uses tesseract in their ocropus project.  Ocropus seems promising,
but is still at a fairly early stage.
http://code.google.com/p/ocropus/
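
For what it's worth, the split-and-stitch approach Matt describes above could
be sketched roughly as below. This is only a minimal sketch, not anyone's
production workflow: it assumes the article has already been cut into separate
image files listed in reading order, and that the tesseract command-line tool
is installed and on the PATH. The function names are made up for illustration,
and the manual editing pass Matt mentions would still be needed afterwards.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def ocr_with_tesseract(image_path: Path) -> str:
    """OCR one image by shelling out to the tesseract CLI (assumed installed)."""
    with tempfile.TemporaryDirectory() as tmp:
        out_base = Path(tmp) / "out"
        # tesseract writes the recognized text to <out_base>.txt
        subprocess.run(["tesseract", str(image_path), str(out_base)], check=True)
        return (Path(tmp) / "out.txt").read_text(encoding="utf-8")


def stitch_article(
    pieces: Iterable[Path],
    ocr: Callable[[Path], str] = ocr_with_tesseract,
) -> str:
    """OCR each piece separately, in the order given, and join the results."""
    chunks = [ocr(piece).strip() for piece in pieces]
    # Drop pieces that produced no text, separate the rest with a blank line.
    return "\n\n".join(chunk for chunk in chunks if chunk)
```

Something like stitch_article(sorted(Path("article_0042").glob("*.png")))
would then give one text per article, provided the (hypothetical) file names
sort in reading order. The ocr parameter is there so a different engine can be
swapped in without touching the stitching step.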

alex