Alright, you got the 20-second, the 2-minute, and now the 20-minute help
solution:
Open a terminal and do the following:
cd /tmp
mkdir images
cd images
wget --spider -r -l2 -A jpg \
  http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/ \
  2>&1 | grep '^--' | awk '{print $3}' > imagesList.txt
for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg "$url" ; done
That'll download all the images... and some crap. Look through /tmp/images,
delete the crap, and you're done.
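If you want a quick way to spot the crap before deleting anything, the file
utility will tell you which downloads aren't actually images (the *.jpg glob
just matches whatever timestamped names the loop produced):

cd /tmp/images
file *.jpg | grep -v 'JPEG image'    # anything listed here probably isn't a real photo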
Explanation for those who want to improve their Linux terminal fu:
First, make a temporary directory inside /tmp for cleanliness.
Now, the wget command is confusing. --spider and -r together mean "don't
download anything, just crawl through and report every URL you can find"
starting from the following location. -l2 means to only look two levels
deep, and -A jpg means to only accept files whose names end in "jpg".
From there, it's the big long URL, and then some fun little Unix-specific
stuff.
2>&1 is an output redirect. File descriptor 0 is standard input, 1 is
standard output, and 2 is standard error. wget is a weird program and
prints its log to standard error by default, so this moves everything from
standard error over to standard output. We want it on standard output so
we can pipe the output of wget (all our URLs) to other programs that filter
through it. You see, wget isn't giving us clean URLs; it's giving us some
crap output lines, and the useful lines look like:
--2015-01-27 14:38:33--  https://e572cad7-a-62cb3a1a-s-sites.googlegroups.com/site/thebookofgimp/home/chapter-2-photograph-retouching/2.100.jpg?attachauth=ANoY7cqIGpuUDdagEljYRFF7WMX2G3rAxez0XLIAOW9cXpAnjqilN4X2HyaRWIblk29ORjgMg28jrQuQmBisXSw0d3gYh912nr4DtRyT5Jqk0KVEfJRqC2u92vG7TlxK75odZ1uWVaUrpEvUw1A52TZbuU7Dju7DIPQzou3dskyDSRrh0VAPHrI-znqeKeJ7NuzJqEc8WcLl4MnUpO-dgUZB7i8Eq_z3FFstaXyhjQGcbht8xZ0cBPFvBgw2gWYhuDQ4lqDHJSru&attredirects=0
We don't care about the --<date>-- part and want to get rid of it, so we
have to filter it down. That's why we do the output redirect first: so we
can use some Linux filter programs, specifically grep and awk.
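If you want to see that redirect matter with your own eyes, try the pipe
both ways (example.com is just a stand-in URL here):

wget --spider http://example.com | grep '^--'        # grep finds nothing; the log spills to your terminal from stderr
wget --spider http://example.com 2>&1 | grep '^--'   # now the --<date>-- lines come through the pipe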
grep is a regular expression tool, which means it's a very powerful way to
find text. The regular expression I wanted to pass in was '^--', which
means: find all lines that start with the characters --. awk takes regular
expressions too, so I could change the command to look like:
...2>&1 | awk '$0 ~ /^--/ {print $3}' > imagesList.txt
and that would work too.
The coolest description of awk I ever got was "basically Excel with no
GUI". awk splits each line of text up into fields; the default dividing
character is whitespace, so $1 means the first field, i.e. everything up
to the first space. There's a space between the date and the time in that
wget output, so the actual URL lands in field 3, which is why I tell awk
to execute the command {print $3}. I could also say "get the last field in
the line", since that's the URL too, by using $NF.
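A quick way to convince yourself of the field splitting (the URL here is
made up):

echo '--2015-01-27 14:38:33--  http://example.com/pic.jpg' | awk '{print $1}'    # --2015-01-27
echo '--2015-01-27 14:38:33--  http://example.com/pic.jpg' | awk '{print $3}'    # http://example.com/pic.jpg
echo '--2015-01-27 14:38:33--  http://example.com/pic.jpg' | awk '{print $NF}'   # same URL, as the last field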
The last bit, the > imagesList.txt, says to create (or overwrite) the file
named imagesList.txt with whatever awk outputs (which is our filtered list
of URLs).
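If the > is new to you, the whole idea fits in two lines:

echo hello > demo.txt    # creates demo.txt, or wipes it if it already existed
cat demo.txt             # prints: hello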
The last line is:
for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg "$url" ; done
This is saying: take the text on each line of imagesList.txt, store it in
the variable $url, and execute the command group between do and done once
per line, until we've gone through every line in the file.
The command between do and done is our regular old download-a-file-with-wget,
with one small modification (-O sets the name of the output file):
wget -O `date +%s%N`.jpg
Anything in back-ticks (the ` character, right next to 1 on most keyboards)
is an encapsulated command: everything inside the back-ticks gets executed
as a command and replaced by its output. Well, the command date +%s%N
prints the current time as seconds since the Unix epoch (%s) followed by
nanoseconds (%N). So each time wget runs, it saves the downloaded file
under that timestamp plus .jpg, which is effectively guaranteed to be a
unique name, and then the for loop takes over and grabs the next one.
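Side note: if a URL ever had weird characters in it, a slightly sturdier
way to write the same loop (same behavior, just my preference) uses while
read and the modern $( ) form of back-ticks:

while read -r url ; do
    wget -O "$(date +%s%N).jpg" "$url"
done < imagesList.txt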
On Tue, Jan 27, 2015 at 2:17 PM, Stephen Partington <cryptworks@gmail.com> wrote:
> You can write a script to yank out the jpg links, or just use something
> like https://addons.mozilla.org/en-US/firefox/addon/downthemall/
>
> On Tue, Jan 27, 2015 at 12:12 PM, Michael Havens <bmike1@gmail.com> wrote:
>
>> How can I use wget to retrieve the photos here
>> <http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html>.
>> I tried:
>>
>> wget -r http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html
>>
>> but it didn't download the pictures. It downloaded a bunch of web pages.
>> :-)~MIKE~(-:
>>
>
> --
> A mouse trap, placed on top of your alarm clock, will prevent you from
> rolling over and going back to sleep after you hit the snooze button.
>
> Stephen
--
Todd Millecam
---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.phxlinux.org
To subscribe, unsubscribe, or to change your mail settings:
http://lists.phxlinux.org/mailman/listinfo/plug-discuss