From: Alex Dean
> On Nov 1, 2009, at 9:24 PM, Ted Gould wrote:
>> I'd recommend gscan2pdf. It works with SANE, but does nice things
>> like handle double sided stuff easily. It will also work with
>> GOCR to do OCR

That's not exactly a great thing. GOCR is much worse than commercial OCR engines, especially if the original image is skewed/broken.

>> but leave the text embedded in the document (not shown) so it can
>> be searched
> Is there any way for an end-user to see the text version?

What they probably do here is "text-behind-image" in the PDF. This is reasonably easy to do; just have the PDF library draw text that is not visible. pdf2txt will probably get the pure text version out. We were doing something like this without PDF, where we had the page TIFF combined with an XML file that had tons of elements like:

longword

...so that individual words could be grepped for and highlighted on the page TIFF if you wanted.

> Someday I'm going to start digitizing and OCR-ing the 100 years of
> local newspapers which are gathering mold in the library basement. I
> really have no firm plan as to how I'm going to do it, but doing it
> with free software would be a big plus.

I spent 3 or 4 years doing stuff like this on the NYT, Wall Street Journal, Christian Science Monitor, and Boston Globe. You will NOT be able to get decent OCR with free software. Newspapers require a different approach than most OCR packages take; you have to split each article up into multiple individual image files and OCR each file separately, then stitch the results back together. And editing the results is totally necessary since newspaper text is so horrible in quality. (I can talk about this for at least half an hour; contact offlist for more info.)
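For anyone who wants to try the page-TIFF-plus-XML idea, here is a rough sketch of the lookup side in Python. The XML layout is an assumption: the element name `word` and the attributes `x`, `y`, `w`, `h` and `image` are made up for illustration and are not the schema we actually used. The function returns the bounding boxes of every occurrence of a word, which is what you would hand to whatever draws the highlight on the page image.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-page word file. Element and attribute names are
# assumptions for illustration, not the schema from the original project.
SAMPLE_PAGE = """\
<page image="page0001.tif">
  <word x="120" y="340" w="96" h="22">longword</word>
  <word x="230" y="340" w="40" h="22">and</word>
  <word x="120" y="370" w="96" h="22">Longword</word>
</page>
"""


def find_word_boxes(xml_text, needle):
    """Return (x, y, w, h) for every word equal to needle, ignoring case."""
    root = ET.fromstring(xml_text)
    wanted = needle.casefold()
    boxes = []
    for word in root.iter("word"):
        text = (word.text or "").strip()
        if text.casefold() == wanted:
            # Pixel coordinates of the word on the page image.
            boxes.append(tuple(int(word.get(k)) for k in ("x", "y", "w", "h")))
    return boxes


if __name__ == "__main__":
    for box in find_word_boxes(SAMPLE_PAGE, "longword"):
        print(box)
```

Keeping the coordinates next to each word is the whole trick: the search runs on plain text, and the hit maps straight back to a rectangle on the scan.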
--
Matt G / Dances With Crows
The Crow202 Blog: http://crow202.org/wordpress/
There is no Darkness in Eternity/But only Light too dim for us to see
---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss