My crusade for web content filtering

Brian Cluff brian@snaptek.com
Tue, 1 Aug 2000 17:03:52 -0700


> I am now considering writing my own proxy that will simply pipe the
> datastream through a perl script or something before delivering it to the
> browser.  Any page that matched one of the keywords would have the entire
> contents between <html> and </html> replaced with a text message.  Simple,
> but probably very slow.  Not to mention, I have NO idea how to code this.

There is one very large problem with the way you are suggesting doing your
block.  You would have to buffer the whole page and scan it with your list
of dirty words, then serve a completely different page if you find a match.
The first of the 2 big problems that pop into mind is that the user will
have to wait for your program to finish scanning the page before they ever
see it.  On slow connections, their connection with the proxy could time out
before they ever see the page.
The other problem is that you probably wouldn't want to scan the entire page
while holding it in memory, like you would have to.  If someone were to put
up a ten meg page, it would take at the very least 10 megs of server RAM to
scan it.  All it would take is a few people wanting to look at that page, or
several impatient reloads of it, and you have a crashed server.

I think the best you can do with a content proxy is to scan the stream as it
passes through and, when you hit a match, inject something like
"Inappropriate content found, connection severed" and drop the connection.
Doing it that way would also be much easier to write, and would take very,
very little memory.
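To make the streaming idea concrete, here is a rough sketch (in Python
rather than Perl; the keyword list, chunk size, and the read/write callbacks
are all made-up placeholders, not a real proxy).  The trick is that you only
ever hold back enough bytes to catch a keyword split across two chunks, so
memory use stays constant no matter how big the page is:

```python
# Hypothetical block list and chunk size -- purely for illustration.
KEYWORDS = [b"badword1", b"badword2"]
OVERLAP = max(len(k) for k in KEYWORDS) - 1

def filter_stream(read_chunk, write_chunk):
    """Pass data through until a keyword appears, then sever.

    read_chunk() returns the next bytes from the server (b"" at EOF);
    write_chunk(data) forwards bytes on to the browser.
    Returns True if the page passed clean, False if it was severed.
    """
    tail = b""
    while True:
        data = read_chunk()
        if not data:
            write_chunk(tail)          # flush the held-back bytes at EOF
            return True
        buf = tail + data
        if any(k in buf.lower() for k in KEYWORDS):
            write_chunk(b"Inappropriate content found, connection severed")
            return False
        # Keep just enough bytes to catch a keyword that straddles
        # the boundary between this chunk and the next one.
        tail = buf[len(buf) - OVERLAP:] if len(buf) > OVERLAP else buf
        write_chunk(buf[:len(buf) - len(tail)])
```

Something along these lines could be wired between the server and browser
sockets; the point is just that the scan happens as the bytes flow by,
rather than after buffering the whole page.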

Brian Cluff