On Sat, Aug 16, 2008 at 2:46 PM, Judd Pickell <pickell@gmail.com> wrote:
> Java is as good as any other at parsing websites. It isn't a question of
> whether the tool is good or not; it is how you want the page parsed, how
> much error handling you need to ensure a successful parse, etc.
>
> Really, these days a good XML parser should suffice for your needs,
> unless you are doing something beyond just parsing it.
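For concreteness, the plain "good XML parser" route amounts to something
like the untested sketch below (standard javax.xml.parsers only; the URL is
just a placeholder):

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class StrictParse {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // parse() insists on well-formed XML/XHTML; anything less
            // and it throws a SAXParseException instead of a Document.
            Document doc = builder.parse(
                new URL("http://www.example.com/").openStream());
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }

That works right up until parse() meets markup that is not well-formed.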
Most web sites do not conform to XHTML standards and are not valid XML.
You cannot rely on an XML parser to process web pages 'in the wild'. Modern
tools are producing HTML that is more standards-compliant, but the majority
of pages remain broken as far as an XML parser is concerned. The parsers in
Mozilla (Gecko) and IE are really quite complex, largely because they have
to recover from exactly that breakage. If you are doing web page parsing
with Java, your best bet is customizing Nutch; a sketch of the
forgiving-parser route is below.
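If you do not need a whole crawler, a repairing parser such as TagSoup (one
well-known option; NekoHTML is another) hides the cleanup behind the
standard SAX interfaces. Another rough, untested sketch, assuming the
TagSoup jar is on the classpath and using a placeholder URL:

    import java.net.URL;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class LenientLinks {
        public static void main(String[] args) throws Exception {
            // TagSoup is an ordinary SAX XMLReader, but it repairs broken
            // markup into a well-formed event stream instead of rejecting it.
            XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
            reader.setContentHandler(new DefaultHandler() {
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // Print every link the repaired document contains.
                    if ("a".equalsIgnoreCase(localName)
                            && atts.getValue("href") != null) {
                        System.out.println(atts.getValue("href"));
                    }
                }
            });
            reader.parse(new InputSource(
                new URL("http://www.example.com/").openStream()));
        }
    }

Because it is just SAX, you can also feed it through a Transformer if the
rest of your code expects a DOM.

jmz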
--
"Never take counsel of your fears." - Andrew Jackson
-
http://www.joshuazeidner.com/