Document Management

Craig White craigwhite at azapple.com
Mon Nov 2 10:32:26 MST 2009


On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
> Journal, Christian Science Monitor, and Boston Globe.  You will NOT
> be able to get decent OCR with free software.  Newspapers require
> a different approach than most OCR packages take; you have to split
> each article up into multiple individual image files and OCR each
> file separately, then stitch the results back together.  And editing
> the results is totally necessary since newspaper text is so horrible
> in quality.
----
I don't know anything about GOCR at all.

A few years ago I set up tesseract, and its accuracy was as good as I
have seen from any OCR program, though it clearly has many limitations
compared to something like OmniPage. It was also rather easy to install
and get working.

http://code.google.com/p/tesseract-ocr/

At one time I considered gluing it into something like Alfresco for
document management, but that seemed difficult. At this point, I would
probably just write a wrapper program in Ruby.
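
A rough sketch of what such a wrapper might look like, following the
split/OCR/stitch approach Matt describes above. It assumes the article
has already been cut into separate image files and that the tesseract
binary is on the PATH, invoked in its basic "tesseract IMAGE OUTPUTBASE"
form (which writes OUTPUTBASE.txt). The method names and the injectable
runner are illustrative only, not part of any existing tool.

```ruby
require "tmpdir"

# Build the argument list for one run: tesseract IMAGE OUTPUTBASE.
# tesseract adds the ".txt" extension to OUTPUTBASE itself.
def tesseract_command(image_path, output_base)
  ["tesseract", image_path, output_base]
end

# OCR each image piece separately, then stitch the text back together
# in the order the pieces were given. The runner is a hypothetical hook
# so the logic can be tried without tesseract installed; by default it
# shells out with Kernel#system (no shell interpolation).
def ocr_pieces(image_paths, runner: ->(cmd) { system(*cmd) })
  Dir.mktmpdir("ocr") do |dir|
    texts = image_paths.each_with_index.map do |image, i|
      base = File.join(dir, "piece#{i}")
      ok = runner.call(tesseract_command(image, base))
      raise "tesseract failed on #{image}" unless ok
      File.read("#{base}.txt").strip
    end
    # Separate the pieces with a blank line so they are easy to hand-edit.
    texts.join("\n\n")
  end
end
```

Calling something like "puts ocr_pieces(ARGV)" from a script would then
print the combined text for the files named on the command line. The
output would still need editing by hand, as noted above.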

Depending upon what the OP is looking for in document management,
Alfresco might just be the ticket.

http://www.alfresco.com/

Craig

