After I got home I thought I could improve on the script.  The following script pulls down the urls and passes them through a while loop that reduces each url down to the name of the .jpg that sits in front of the query string.  There are a lot of things that could be refactored to clean it up, but it works:


#!/usr/bin/env bash

# Crawl the site, build the url list, and pass it into the variable url.
url=$(wget --spider --force-html -r -l2 "http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/" 2>&1 | grep '^--' | awk '{ print $3 }')

#Set how many characters in front of ".jpg" to start building the name string.
front_int=6

printf %s "$url" | while IFS= read -r raw_url
    do
    #Cut the string at the characters ".jpg".
    pos=${raw_url%%.jpg?*}

    #Determine the character position of the end of the cut string.
    pos_int=${#pos}

    #Reduce the number by the value of $front_int.
    (( pos_int -= front_int ))

    #Build a new string from that position, with the range provided by $front_int.
    temp_name=${raw_url:pos_int:front_int}

    #Clean up the image name.
    image_name=$(echo "$temp_name" | sed 's#.*/\(.*\)$#\1#g')

    #get the images.
    wget -O "${image_name}.jpg" "$raw_url"
done
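
If you want to skip the character counting, the same image name can be pulled straight out of the url with two parameter expansions (just a sketch, assuming the urls keep the same .../name.jpg?query shape as the one above):

    #Strip everything up to the last "/", then strip the query string.
    file=${raw_url##*/}        # e.g. "2.100.jpg?attachauth=..."
    image_name=${file%%\?*}    # e.g. "2.100.jpg"
    wget -O "$image_name" "$raw_url"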


On Tue, Jan 27, 2015 at 2:48 PM, Todd Millecam <tyggna@gmail.com> wrote:
alright, you got the 20 second, the 2 minute, and now the 20 minute help solution:

Open a terminal and do the following:

cd /tmp
mkdir images
cd images
wget --spider -r -l2 -A jpg http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/ 2>&1 | grep '^--' | awk '{print $3}' > imagesList.txt
 for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done

That'll download all the images. . .and some crap.  Look through it and delete the crap out of /tmp/images and you're done.


Explanation for those who want to improve their linux terminal fu:

make a temporary directory inside of tmp for cleanliness.

Now, the wget command is confusing.  --spider and -r mean "just grab urls, don't download anything, but look through and get every url you can find from the following location."  -l2 means to only look two directories deep, and -A jpg means to only accept (and so only print) urls that end in jpg.

From there, it's the big long URL, and then some fun little unix specific stuff.

2>&1 is an output redirect.  0 is standard in, 1 is standard out, and 2 is standard error.  wget is a weird program and prints everything to standard error by default, so this makes it move all the data from standard error to standard out.  We want it on standard out so we can send the output of wget (all our URLs) to a different program to filter through them.  You see, wget isn't giving us clean urls, it's giving us some crap output lines, and the useful output lines come in a string like:


--2015-01-27 14:38:33--  https://e572cad7-a-62cb3a1a-s-sites.googlegroups.com/site/thebookofgimp/home/chapter-2-photograph-retouching/2.100.jpg?attachauth=ANoY7cqIGpuUDdagEljYRFF7WMX2G3rAxez0XLIAOW9cXpAnjqilN4X2HyaRWIblk29ORjgMg28jrQuQmBisXSw0d3gYh912nr4DtRyT5Jqk0KVEfJRqC2u92vG7TlxK75odZ1uWVaUrpEvUw1A52TZbuU7Dju7DIPQzou3dskyDSRrh0VAPHrI-znqeKeJ7NuzJqEc8WcLl4MnUpO-dgUZB7i8Eq_z3FFstaXyhjQGcbht8xZ0cBPFvBgw2gWYhuDQ4lqDHJSru&attredirects=0


we don't care about the --<date>-- part and want to get rid of it, so we have to filter it down.  That's why we do the output redirect first, so we can use some Linux filter programs, specifically grep and awk.
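
A quick way to convince yourself of that is to run the same spider command with and without the redirect (<url> here just stands in for the long address above):

wget --spider -r -l2 -A jpg <url> | grep '^--'          # prints nothing: wget's log went to standard error
wget --spider -r -l2 -A jpg <url> 2>&1 | grep '^--'     # now the -- lines show up on standard out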

grep is a regular expression tool, which means it's a very powerful way to find text.  The regular expression I wanted to pass in was '^--'  which means:  find all lines that start with the characters --.  awk will take regular expressions too, so I could change the command to look like:
. . .2>&1 | awk '$0 ~/^--/ {print $3}' > imagesList.txt
and that would work too.
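
You can also try either filter on a captured line by hand (the line here is just the sample output from above, shortened):

line='--2015-01-27 14:38:33--  https://e572cad7-a-62cb3a1a-s-sites.googlegroups.com/site/thebookofgimp/.../2.100.jpg?attachauth=...'
echo "$line" | grep '^--'                      # grep keeps the whole line
echo "$line" | awk '$0 ~ /^--/ {print $3}'     # awk keeps it and prints just the third field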

The coolest description of awk I ever got was "basically excel with no gui"
awk splits all your text up into fields--the default dividing character is a space, so if I want the first thing in the line, I use $1 to say give me the first field (everything up to the first space).  There's a space between the date and the time here, so the actual URL ends up in field 3, which is why I tell awk to execute the command {print $3}.
I could also say "get the last field in the line," since that's the URL too, by using $NF.
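
For example, with a shortened line like the one above, the fields come out like this:

line='--2015-01-27 14:38:33--  https://example.com/2.100.jpg'
echo "$line" | awk '{print $1}'    # --2015-01-27   (the date, field 1)
echo "$line" | awk '{print $2}'    # 14:38:33--     (the time, field 2)
echo "$line" | awk '{print $3}'    # https://example.com/2.100.jpg  (field 3)
echo "$line" | awk '{print $NF}'   # the same URL again, via the last field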

The last bit, the > imagesList.txt says to make or overwrite the file named imagesList.txt with whatever awk outputs (which is our filtered urls).

The last line is:
for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done

this is saying: take the text on each line of imagesList.txt, store it in the variable $url, and execute the command group between do and done, repeating until we've gone through every line in the file.
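
One small caveat: the for/cat form splits on any whitespace, so if a url ever contained a space or a glob character it could misbehave.  A slightly more defensive version of the same loop (same idea, just reading the file line by line) would be:

while IFS= read -r url; do
    wget -O "$(date +%s%N).jpg" "$url"
done < imagesList.txt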

The command between do and done is our regular old download a file with wget, with one small modification:

wget -O `date +%s%N`.jpg

Anything in back-ticks (the ` character, right next to 1 on most keyboards) is an encapsulated command: everything inside the back-ticks is executed first and its output is dropped into the command line.  The command date +%s%N gives the current time in nanoseconds, so each time wget runs it names the downloaded file <current-time-in-nanoseconds>.jpg, and then the for loop takes over and grabs the next one.
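
You can watch the substitution happen on its own; the $( ) form does the same job as the back-ticks and is a little easier to nest:

date +%s%N                    # e.g. 1422399513123456789
echo "`date +%s%N`.jpg"       # back-tick form: date runs first, its output lands in the string
echo "$(date +%s%N).jpg"      # equivalent, with $( ) instead of back-ticks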





On Tue, Jan 27, 2015 at 2:17 PM, Stephen Partington <cryptworks@gmail.com> wrote:
You can write a script to yank out the jpg links, or just use something like https://addons.mozilla.org/en-US/firefox/addon/downthemall/

On Tue, Jan 27, 2015 at 12:12 PM, Michael Havens <bmike1@gmail.com> wrote:
How can I use wget to retrieve the photos here?  I tried:


but it didn't download the pictures. It downloaded a bunch of web pages.
:-)~MIKE~(-:



--
A mouse trap, placed on top of your alarm clock, will prevent you from rolling over and going back to sleep after you hit the snooze button.

Stephen




--
Todd Millecam



--
James
