automation of collecting web site size

Jason jkenner@mindspring.com
Mon, 16 Oct 2000 16:11:41 -0700


sinck@ugive.com wrote:
> 
> \_ : There must be a way to get this info, without either sucking the whole
> \_ : site down, or having access to the webserver?
> \_ :
> \_ : Anyone have any ideas, suggestions, etc.?
> \_
> \_ lynx -dump -head $URL | grep Content-Length | cut -d: -f 2
> 
> Hum, this could lead to amusing issues if it hits a cgi script (or
> other) that doesn't play nice with issuing Content-Length.
> Furthermore, just getting the head doesn't tell you what else might
> be linked to it down the tree.

True.

Use wget, and reject all of the common largish file formats (jpg, tif,
gif, mpg, mp3, bmp, etc.)... you will obviously HAVE to mirror all of
the HTML in order to find out what's on the site. Something like the
sketch below should do it.
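
Untested, and the host name and the reject list are just examples, so
adjust to taste:

  # recurse, stay under the start URL, skip the heavy binary formats
  wget -r -np -R jpg,jpeg,tif,tiff,gif,png,bmp,mpg,mpeg,mp3,zip,gz \
       http://www.example.com/

wget drops the mirror into a directory named after the host
(www.example.com here), which is what the next step works on.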

After that, one can collect the URLs of the images and downloads with
a combination of sed and grep that separates out all of the URLs
pointing to the file formats you told wget to reject, looking for
img src, a href, etc... (rough sketch below).
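
A rough sketch of that step, assuming GNU grep/sed and the
www.example.com mirror directory from above:

  find www.example.com -name '*.html' \
    | xargs grep -hio 'src="[^"]*"\|href="[^"]*"' \
    | sed 's/^[^"]*"//; s/"$//' \
    | grep -iE '\.(jpg|jpeg|tif|gif|bmp|mpg|mp3)$' \
    | sort -u > urls.txt

Then feed that list back through the HEAD trick above to add up the
sizes of everything you refused to download. This assumes the src/href
values are absolute URLs (relative ones need the base URL glued on
first), and anything served without a Content-Length just counts as
zero, as noted above:

  total=0
  while read url; do
    len=`lynx -dump -head "$url" | grep -i Content-Length | tr -dc '0-9'`
    total=`expr $total + ${len:-0}`
  done < urls.txt
  echo "$total bytes of rejected files"

Add a du -sk on the mirror directory for the HTML you did pull down
and you have a rough total for the whole site.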

-- 
jkenner @ mindspring . com__
I Support Linux:           _> _  _ |_  _  _     _|
Working Together To       <__(_||_)| )| `(_|(_)(_|
To Build A Better Future.       |                   <s>