My crusade for web content filtering

Brian Cluff brian@snaptek.com
Wed, 2 Aug 2000 14:31:40 -0700


> Brian, I would actually need an alternate page if I DO find any hits.  I'm
> sure that's what you meant.  ;-)  Anyway, the buffering is certainly a
> problem.  While I'm not a developer, I can't see any way around this.

The only way to get around the buffering would be to actually
grab the pages twice, but that puts a strain on your bandwidth.  You could
relieve some of the strain by putting the whole thing through a Squid box
in front of your filter.  That way the second connection would come out of
your local cache.

Reading the whole file before you let it pass will still put very
unreasonable delays between pages.  Imagine a page that is going to trickle
over to you at the speed of a 14.4 modem.  The user will have to wait for
your filter to finish reading the page before it gives the go-ahead.  Even
if the page is only 30k in size, you will still have to wait a long
time before the page even starts to show up in your browser.
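
To put a rough number on that wait, here is a back-of-the-envelope sketch
in Python.  It assumes "14.4" means 14,400 bits per second, "30k" means
30 * 1024 bytes, and it ignores protocol overhead and latency; the function
name is just for illustration.

```python
def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Time to move size_bytes over a link, ignoring overhead and latency."""
    return size_bytes * 8 / bits_per_second


page_bytes = 30 * 1024   # assumed meaning of "30k"
modem_bps = 14_400       # assumed meaning of "14.4 modem"

delay = transfer_seconds(page_bytes, modem_bps)
print(f"A fully buffering filter holds the page back for about {delay:.1f} s")
```

Under those assumptions the user stares at a blank window for roughly 17
seconds before the first byte of the page is released.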

> > I think that the best you can do with a content proxy is to put
> > something in the datastream like "Inappropriate content found,
> > connection severed".  Doing it that way would also be much easier to
> > write, and would take very, very little memory.
> Brian, I'm a little confused by this.  How does this avoid the buffering
> problem?  My original thought was to do exactly what you suggest.  I was
> thinking of scanning for keywords and, upon a hit, replacing all the
> characters in the page with a simple text message.  However, this has all
> the buffering and performance problems you mentioned in your post.

You avoid the buffering problems because you don't buffer at all.  You just
examine the words as they are downloaded, and as long as they don't violate
your filtering rules, you allow the datastream to continue to the user's
browser.  When the filter rules are triggered, you just send a string like
"<br><br>This page has triggered the filter<p> The Connection has been
terminated".
It's definitely not the most beautiful way to do filtering, but it gets the
job done, and doesn't affect your service much at all.
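
A minimal sketch of that idea in Python, not a real proxy: it assumes the
page arrives as an iterable of byte chunks and that the rules are a plain
list of banned byte strings.  The names filter_stream and BLOCK_NOTICE are
made up for this example.  One small departure from "no buffering at all":
it holds back a few bytes (one less than the longest banned word) so that a
word split across two chunks is still caught.

```python
from typing import Iterable, Iterator

# Hypothetical notice text, taken from the message above.
BLOCK_NOTICE = (b"<br><br>This page has triggered the filter<p> "
                b"The Connection has been terminated")


def filter_stream(chunks: Iterable[bytes], banned: Iterable[bytes],
                  notice: bytes = BLOCK_NOTICE) -> Iterator[bytes]:
    """Pass chunks through until a banned word appears, then cut the stream."""
    words = [w.lower() for w in banned if w]
    # Bytes to hold back so a word spanning two chunks is not missed.
    hold = max((len(w) for w in words), default=1) - 1
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        lowered = window.lower()
        hits = [pos for pos in (lowered.find(w) for w in words) if pos >= 0]
        if hits:
            before = window[:min(hits)]   # text ahead of the first hit
            if before:
                yield before
            yield notice
            return                        # stop reading: connection severed
        cut = max(len(window) - hold, 0)
        if cut:
            yield window[:cut]            # safe to release to the browser
        tail = window[cut:]
    if tail:
        yield tail                        # clean page: flush what was held


page = [b"hello ba", b"dword more text"]
print(b"".join(filter_stream(page, [b"badword"])))
```

Memory use stays at one chunk plus the held-back tail, no matter how large
the page is, which is the point of doing it this way.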

If it were me writing such a filter, I would have every page that triggers
the filter in that manner put onto a check list.  Then someone could look
over the list once a week or so and apply it to a list-based part of the
filter.  Then, if a page that was previously blocked midstream were
visited again, the list-based part of the filter would catch it right away,
and you could show your custom block screen.
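
The bookkeeping for that could be as simple as the following Python sketch.
The class and method names are hypothetical, and a real filter would keep
both lists on disk rather than in memory.

```python
from typing import Callable, List, Set


class TwoStageFilter:
    """Check list of midstream hits plus a reviewed, list-based blocklist."""

    def __init__(self) -> None:
        self.blocklist: Set[str] = set()   # reviewed URLs, blocked up front
        self.pending: List[str] = []       # check list awaiting review

    def is_blocked(self, url: str) -> bool:
        """List-based check, done before any data is fetched."""
        return url in self.blocklist

    def record_hit(self, url: str) -> None:
        """Called when the streaming filter cuts a page off midstream."""
        if url not in self.blocklist and url not in self.pending:
            self.pending.append(url)

    def review(self, approve: Callable[[str], bool]) -> None:
        """Weekly review: approved URLs move onto the blocklist."""
        for url in self.pending:
            if approve(url):
                self.blocklist.add(url)
        self.pending.clear()


f = TwoStageFilter()
f.record_hit("http://example.com/bad")    # blocked midstream the first time
f.review(lambda url: True)                # reviewer confirms every entry
print(f.is_blocked("http://example.com/bad"))
```

After the review, a second visit to the same URL is caught by is_blocked
before anything is downloaded, so the custom block screen can be served
instead of the midstream notice.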

Brian Cluff