On Sat, Aug 16, 2008 at 2:46 PM, Judd Pickell
<pickell@gmail.com> wrote:
Java is as good as any other at parsing websites. It isn't a question of whether
the tool is good or not, it is how you want it parsed, how much
error control you need to ensure a successful parse, etc.
Really, these days a good XML parser should suffice for your needs,
unless you are doing something beyond just parsing it.

Most web sites do not conform to XHTML standards and are not valid XML, so you cannot rely on an XML parser to process web pages "in the wild". Modern tools are producing HTML that is more standards compliant, but the majority of pages remain broken as far as an XML parser is concerned. The parsers in Mozilla (Gecko) and IE are really quite complex. If you are doing web page parsing with Java, your best bet is customizing Nutch.

jmz
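
To make that concrete, here is a minimal Java sketch (not from the thread) of the difference between strict and lenient parsing. The HTML snippet and the class name are made up for illustration, and the second half assumes the TagSoup library (org.ccil.cowan.tagsoup.Parser, one lenient HTML parser among several) is on the classpath.

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class WildHtmlDemo {
    // Typical "in the wild" markup: unclosed tags and a bare ampersand,
    // neither of which is well-formed XML.
    private static final String HTML =
        "<html><body><p>Fish & chips<br></body></html>";

    public static void main(String[] args) throws Exception {
        // 1) Strict XML parsing: throws a SAXException because of the bare
        //    '&' and the unterminated <p> and <br> elements.
        DocumentBuilder db =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            db.parse(new InputSource(new StringReader(HTML)));
            System.out.println("strict parse succeeded (rare for real pages)");
        } catch (SAXException e) {
            System.out.println("strict parse failed: " + e.getMessage());
        }

        // 2) Lenient parsing with TagSoup (assumes the TagSoup jar is on the
        //    classpath): it repairs the tag soup and emits clean SAX events.
        XMLReader lenient = new org.ccil.cowan.tagsoup.Parser();
        lenient.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                System.out.println("element: " + localName);
            }
        });
        lenient.parse(new InputSource(new StringReader(HTML)));
    }
}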