wget

James Dugger james.dugger at gmail.com
Tue Jan 27 20:39:10 MST 2015


After I got home I thought I could improve on the script.  The following
script pulls down the URLs and passes them through a while loop that
reduces each URL to the .jpg file name that sits just in front of the
query string.  There are a lot of things that could be refactored to
clean it up, but it works:


#!/usr/bin/env bash

# Crawl the site, build the url list, and pass it into the variable url.
url=$(wget --spider --force-html -r -l2 \
    "http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/" \
    2>&1 | grep '^--' | awk '{ print $3 }')

# Set how many characters in front of ".jpg" to start building the name string.
front_int=6

printf '%s\n' "$url" | while IFS= read -r raw_url
    do
    # Cut the string at ".jpg", just before the query string.
    pos=${raw_url%%.jpg\?*}

    # Determine the character position of the cut string.
    pos_int=${#pos}

    # Reduce the number by the value of $front_int.
    (( pos_int -= front_int ))

    # Build a new string starting at position $pos_int with length $front_int.
    temp_name=${raw_url:pos_int:front_int}

    # Clean up the image name: strip any leading path from it.
    image_name=$(echo "$temp_name" | sed 's#.*/\(.*\)$#\1#g')

    # Get the image.
    wget -O "${image_name}.jpg" "$raw_url"
done
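
As an aside, the whole name calculation could probably be collapsed into two
parameter expansions instead of counting characters -- a rough sketch that
assumes every image URL looks like the ones above (".jpg" followed by a
"?query" string):

# Drop the query string, then drop the leading path: leaves e.g. "2.100.jpg",
# so the final wget could just be: wget -O "$name" "$raw_url"
name=${raw_url%%\?*}
name=${name##*/}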


On Tue, Jan 27, 2015 at 2:48 PM, Todd Millecam <tyggna at gmail.com> wrote:

> Alright, you got the 20-second, the 2-minute, and now the 20-minute help
> solution:
>
> Open a terminal and do the following:
>
> cd /tmp
> mkdir images
> cd images
> wget --spider -r -l2 -A jpg \
>   http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/ \
>   2>&1 | grep '^--' | awk '{print $3}' > imagesList.txt
>  for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>
> That'll download all the images. . .and some crap.  Look through it and
> delete the crap out of /tmp/images and you're done.
>
>
> Explanation for those who want to improve their linux terminal fu:
>
> make a temporary directory inside of tmp for cleanliness.
>
> Now, the wget command is confusing.  --spider and -r mean "just grab urls,
> don't download anything, but look through and get every url you can find
> from the following location."  -l2 means to just look two directories deep,
> and -A jpg means to only print out stuff that includes "jpg" in the url.
>
> From there, it's the big long URL, and then some fun little unix specific
> stuff.
>
> 2>&1 is an output redirect.  0 is standard in, 1 is standard out, and 2 is
> standard error.  wget is a weird program and prints everything to standard
> error by default, so this makes it move all the data from standard error to
> standard out.  We want it on standard out so we can send the output of wget
> (all our URLs) to a different program to filter through them.  You see,
> wget isn't giving us clean urls, it's giving us some crap output lines, and
> the useful output lines come in a string like:
>
>
> --2015-01-27 14:38:33--
> https://e572cad7-a-62cb3a1a-s-sites.googlegroups.com/site/thebookofgimp/home/chapter-2-photograph-retouching/2.100.jpg?attachauth=ANoY7cqIGpuUDdagEljYRFF7WMX2G3rAxez0XLIAOW9cXpAnjqilN4X2HyaRWIblk29ORjgMg28jrQuQmBisXSw0d3gYh912nr4DtRyT5Jqk0KVEfJRqC2u92vG7TlxK75odZ1uWVaUrpEvUw1A52TZbuU7Dju7DIPQzou3dskyDSRrh0VAPHrI-znqeKeJ7NuzJqEc8WcLl4MnUpO-dgUZB7i8Eq_z3FFstaXyhjQGcbht8xZ0cBPFvBgw2gWYhuDQ4lqDHJSru&attredirects=0
>
>
> we don't care about the --<date >-- and want to get rid of it, so we have
> to filter down.  That's why we do the output redirect first, so we can use
> some Linux filter programs, specifically grep and awk.
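>
> A quick way to see the difference for yourself (example.com standing in for
> the real site, so treat it as a rough sketch rather than the exact run):
>
> # nothing comes through the pipe: wget's log lines all go to stderr
> wget --spider http://example.com/ | grep '^--'
>
> # merging stderr onto stdout first lets grep see the log lines
> wget --spider http://example.com/ 2>&1 | grep '^--'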
>
> grep is a regular expression tool, which means it's a very powerful way to
> find text.  The regular expression I wanted to pass in was '^--'  which
> means:  find all lines that start with the characters --.  awk will take
> regular expressions too, so I could change the command to look like:
> . . .2>&1 | awk '$0 ~/^--/ {print $3}' > imagesList.txt
> and that would work too.
>
> The coolest description of awk I ever got was "basically excel with no gui."
> awk splits all your text up into fields--the default dividing character is
> whitespace, so if I want the first thing in the line, I use $1 to say "give
> me the first field."  There's a space in the date here, so the actual URL is
> in field 3, which is why I tell awk to execute the command {print $3}.
> I could also say "get the last field in the line," since that's the URL
> too, by using $NF.
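>
> For instance, echoing a trimmed-down version of one of those log lines into
> awk shows the splitting (the short URL is only a placeholder):
>
> echo '--2015-01-27 14:38:33--  https://example.com/2.100.jpg' | awk '{print $3}'
> # prints https://example.com/2.100.jpg
> echo '--2015-01-27 14:38:33--  https://example.com/2.100.jpg' | awk '{print $NF}'
> # prints the same URL, since it is also the last field on the line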
>
> The last bit, the > imagesList.txt says to make or overwrite the file
> named imagesList.txt with whatever awk outputs (which is our filtered urls).
>
> The last line is:
> for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>
> This says: take the text on each line of imagesList.txt, store it in the
> variable $url, and then execute the command group between do and done until
> we've gone through every line in the file.
>
> The command between do and done is our regular old download a file with
> wget, with one small modification:
>
> wget -O `date +%s%N`.jpg
>
> Anything in back-ticks (the ` character right next to 1 on most keyboards)
> is an encapsulated command: everything inside the back-ticks is executed
> first and its output is substituted in its place.  The command date +%s%N
> prints the current Unix time down to the nanosecond.  So, each time wget
> runs, it saves the downloaded file as <current time in nanoseconds>.jpg,
> and then the for loop takes over and grabs the next one.
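>
> You can watch that naming trick on its own (the exact number will be
> different every time you run it):
>
> echo `date +%s%N`.jpg
> # prints something like 1422394750123456789.jpg -- a fresh name per run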
>
>
>
>
>
> On Tue, Jan 27, 2015 at 2:17 PM, Stephen Partington <cryptworks at gmail.com>
> wrote:
>
>> you can write a script to yank out the jpg links. or just use something
>> like https://addons.mozilla.org/en-US/firefox/addon/downthemall/
>>
>> On Tue, Jan 27, 2015 at 12:12 PM, Michael Havens <bmike1 at gmail.com>
>> wrote:
>>
>>> How can I use wget to retrieve the photos here
>>> <http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html>.
>>> I tried:
>>>
>>> wget -r http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html
>>>
>>> but it didn't download the pictures. It downloaded a bunch of web pages.
>>> :-)~MIKE~(-:
>>>
>>>
>>
>>
>>
>> --
>> A mouse trap, placed on top of your alarm clock, will prevent you from
>> rolling over and going back to sleep after you hit the snooze button.
>>
>> Stephen
>>
>>
>>
>
>
>
> --
> Todd Millecam
>
>



-- 
James

Linkedin <http://www.linkedin.com/pub/james-h-dugger/15/64b/74a/>