Java web page keyword search protocol

Bryan O'Neal boneal at cornerstonehome.com
Sat Aug 1 19:30:35 MST 2009


Solr looks interesting, thank you.

Why Java?
1) I know Java the best, and I can make something work very easily. I
just want something I can run continuously with a low footprint on a
commodity VM.
2) The rest of the backend is in Java, and I like to have a single
development language for any given segment of a project, if possible, for
easy maintenance.
3) Mostly the first reason ;)


As for pseudocode of what I could do, I could do something like the
following; I am just not happy with how it looks...

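// Assumes urlVector and keywordVector are java.util.Vector<String>,
// somecount is a java.util.Vector<Integer> with one entry per keyword,
// and i is the index of the page being scraped.
java.io.BufferedReader in = new java.io.BufferedReader(
        new java.io.InputStreamReader(
                new java.net.URL(urlVector.get(i)).openStream()));
boolean endOfPage = false;
String line;
// readLine() returns null at end of stream, so stop there as well
while (!endOfPage && (line = in.readLine()) != null)
{
    for (int k = 0; k < keywordVector.size(); k++)
    {
        // indexOf returns -1 when absent; 0 is a valid match position
        if (line.indexOf(keywordVector.get(k)) >= 0)
        {
            somecount.set(k, somecount.get(k) + 1);
        }
    }
    if (line.indexOf("</body>") >= 0)
    {
        endOfPage = true;
    }
}
in.close();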

<Please note I am not using an IDE, this is just pseudocode, and no, I did
not bother putting in the try/catches or the complex evaluation logic;
I am just showing basically how I could scrape a page>
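
One possible way to avoid re-scanning the page text for every keyword is to
tokenize each page once into a word-frequency map and then do one hash lookup
per keyword. The sketch below is only an illustration under stated
assumptions: the class name KeywordCounter and the methods wordFrequencies and
countKeywords are hypothetical, keywords are assumed to be single words,
matching is case-insensitive, and the page text is assumed to be already
fetched (HTML tags are not stripped).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any library mentioned in this thread.
public class KeywordCounter {

    // Tokenize the page once and tally every word it contains.
    static Map<String, Integer> wordFrequencies(String pageText) {
        Map<String, Integer> freq = new HashMap<>();
        // Split on any run of characters that are not letters or digits.
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Look up each keyword in the tally: one hash lookup per keyword.
    // Assumes single-word keywords; phrases would need a different approach.
    static Map<String, Integer> countKeywords(String pageText, List<String> keywords) {
        Map<String, Integer> freq = wordFrequencies(pageText);
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String keyword : keywords) {
            result.put(keyword, freq.getOrDefault(keyword.toLowerCase(Locale.ROOT), 0));
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "Java and Solr; java again";
        // prints {java=2, solr=1, lucene=0}
        System.out.println(countKeywords(page, Arrays.asList("java", "solr", "lucene")));
    }
}
```

With many keyword sets per page, wordFrequencies could be computed once per
page and reused for every set, so the cost of each additional set is
proportional to the number of keywords in it rather than to the page length.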


On Sat, Aug 1, 2009 at 7:25 AM, Lisa Kachold <lisakachold at obnosis.com> wrote:

> Why java?
>
> Why not a simple javascript search script?
>
>
> http://stackoverflow.com/questions/141280/whats-the-best-way-to-count-keywords-in-javascript
>
> On 8/1/09, Bryan O'Neal <boneal at cornerstonehome.com> wrote:
> > Thought of that, the overhead is worse than scraping, parsing, and
> > searching.
> >
> > On Fri, Jul 31, 2009 at 7:51 AM, Lisa Kachold
> > <lisakachold at obnosis.com>wrote:
> >
> >> Try using google?
> >>
> >> On 7/31/09, Bryan O'Neal <boneal at cornerstonehome.com> wrote:
> >> > Ok, so I want to, with utmost efficiency, go through web pages and ask
> >> > how many of a set of keywords are in each web page. Does anyone know of
> >> > a good open source tool for this?
> >> > I have hundreds of web pages and a near-equal number of keyword sets,
> >> > so scraping each page, parsing it to create a vector of strings, and
> >> > running a set of nested for loops to compare each vector against the
> >> > words in the keyword vector is, well, FAR from efficient.
> >> > I heard of Apache Velocity, but that seems to be for creating pages on
> >> > the fly. I also heard of Apache Lucene, but it appears to be for
> >> > implementing your own query engine on your application server (to
> >> > index and query your pages).
> >> >
> >> > Also, if you know of a local ACTIVE Java forum I would love to know
> >> > about it. I have subscribed to a half dozen lists and there is nothing
> >> > but silence.
> >> >
> >> > Thanks a bunch :)
> >> >
> >>
> >>
> >> --
> >>
> >> (623)239-3392
> >> (503)754-4452 www.obnosis.com
> >> ---------------------------------------------------
> >> PLUG-discuss mailing list - PLUG-discuss at lists.plug.phoenix.az.us
> >> To subscribe, unsubscribe, or to change your mail settings:
> >> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
> >>
> >
>
>