GCC C/C++ application to run on Windows

Joshua Zeidner jjzeidner at gmail.com
Sat Aug 16 15:00:38 MST 2008


On Sat, Aug 16, 2008 at 2:46 PM, Judd Pickell <pickell at gmail.com> wrote:

> Java is as good as any other language at parsing websites. The question
> isn't whether the tool is good, it is how you want the pages parsed. How
> much error control do you need to ensure a successful parse? And so on.
>
> These days, a good XML parser should suffice for your needs,
> unless you are doing something beyond just parsing.



   Most web sites do not conform to the XHTML standards and are not
well-formed XML, so you cannot rely on an XML parser to process web pages
'in the wild': a strict XML parser stops at the first unclosed tag or
unquoted attribute.  Modern tools are producing HTML that is more
standards-compliant, but the majority of pages remain broken as far as an
XML parser is concerned.  The parsers in Mozilla (Gecko) and IE are really
quite complex, largely because they have to recover from this kind of
malformed markup.  If you are doing web page parsing with Java, your best
bet is customizing Nutch.  jmz
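
To illustrate the point about XML parsers, here is a minimal sketch using the
JAXP DocumentBuilder that ships with the JDK. The class name StrictXmlCheck,
the method isWellFormedXml, and the two sample strings are made up for this
example; it is only meant to show a strict parser rejecting ordinary HTML, not
a recommended way to process real pages.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
class StrictXmlCheck {

    // Returns true only if the markup parses as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser gave up: the markup is not well-formed XML.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> is never closed, which is fine for a browser.
        System.out.println(isWellFormedXml("<p>Hello<br>world</p>"));   // false
        // The XHTML-style equivalent closes the element explicitly.
        System.out.println(isWellFormedXml("<p>Hello<br/>world</p>"));  // true
    }
}
```

A browser renders both strings the same way; the XML parser accepts only the
second.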





-- 
"Never take counsel of your fears." - Andrew Jackson

- http://www.joshuazeidner.com/