
Michael Havens bmike1 at gmail.com
Tue Jan 27 21:34:43 MST 2015

you guys are GOOD! *wish I could do that.*


On Tue, Jan 27, 2015 at 8:39 PM, James Dugger <james.dugger at gmail.com> wrote:

> After I got home I thought I could improve on the script.  The following
> script pulls down the urls and passes them through a while loop that
> reduces the name of the url down to the name of the .jpg given in front of
> the query string.  There are a lot of things that could be refactored to
> clean it up but it works:
> #!/usr/bin/env bash
> # Crawl the site, build the url list, and pass it into the variable url.
> url=$(wget --spider --force-html -r -l2 \
>     "http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/" \
>     2>&1 | grep '^--' | awk '{ print $3 }')
> # Set how many characters in front of ".jpg" to start building the name string.
> front_int=6
> # End the list with a newline so that read does not drop the last URL.
> printf '%s\n' "$url" | while IFS= read -r raw_url
>     do
>     # Cut the string at the characters ".jpg".
>     pos=${raw_url%%.jpg?*}
>     # Determine the character position of the cut.
>     pos_int=${#pos}
>     # Reduce the position by the value of $front_int.
>     (( pos_int -= front_int ))
>     # Build a new string of front_int characters starting at pos_int.
>     temp_name=${raw_url:pos_int:front_int}
>     #Clean up the image name.
>     image_name=$(echo "$temp_name" | sed 's#.*/\(.*\)$#\1#g')
>     #get the images.
>     wget -O "${image_name}.jpg" "$raw_url"
> done
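
A minimal sketch of a shorter way to get the image name, assuming every URL looks like the sample quoted in this thread: a path that ends in the file name, followed by an optional query string. The function name image_name_from_url and the example.com URL are made up for illustration. It uses two parameter expansions instead of counting characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a local file name from an image URL.
image_name_from_url() {
    local url=$1
    local no_query=${url%%\?*}        # drop everything from the first "?" on
    printf '%s\n' "${no_query##*/}"   # drop everything up to the last "/"
}

# Made-up URL shaped like the ones wget reports, with a shortened query string.
image_name_from_url "https://example.com/site/thebookofgimp/home/chapter-2-photograph-retouching/2.100.jpg?attachauth=abc&attredirects=0"
```

Because nothing depends on a fixed character count, this should also cope with names longer than six characters, which the front_int approach would truncate.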
> On Tue, Jan 27, 2015 at 2:48 PM, Todd Millecam <tyggna at gmail.com> wrote:
>> alright, you got the 20 second, the 2 minute, and now the 20 minute help
>> solution:
>> Open a terminal and do the following:
>> cd /tmp
>> mkdir images
>> cd images
>> wget --spider -r -l2 -A jpg \
>>   http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/ \
>>   2>&1 | grep '^--' | awk '{print $3}' > imagesList.txt
>> for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>> That'll download all the images. . .and some crap.  Look through it and
>> delete the crap out of /tmp/images and you're done.
>> Explanation for those who want to improve their linux terminal fu:
>> make a temporary directory inside of tmp for cleanliness.
>> Now, the wget command is confusing.  --spider and -r together mean "just
>> grab urls, don't download anything, but look through and get every url you
>> can find from the following location."  -l2 limits the recursion to two
>> levels of links, and -A jpg tells wget to accept only files whose names
>> end in "jpg".
>> From there, it's the big long URL, and then some fun little unix specific
>> stuff.
>> 2>&1 is an output redirect.  0 is standard in, 1 is standard out, and 2
>> is standard error.  wget is a weird program and prints everything to
>> standard error by default, so this makes it move all the data from standard
>> error to standard out.  We want it on standard out so we can send the
>> output of wget (all our URLs) to a different program to filter through
>> them.  You see, wget isn't giving us clean urls, it's giving us some crap
>> output lines, and the useful output lines come in a string like:
>> --2015-01-27 14:38:33--
>> https://e572cad7-a-62cb3a1a-s-sites.googlegroups.com/site/thebookofgimp/home/chapter-2-photograph-retouching/2.100.jpg?attachauth=ANoY7cqIGpuUDdagEljYRFF7WMX2G3rAxez0XLIAOW9cXpAnjqilN4X2HyaRWIblk29ORjgMg28jrQuQmBisXSw0d3gYh912nr4DtRyT5Jqk0KVEfJRqC2u92vG7TlxK75odZ1uWVaUrpEvUw1A52TZbuU7Dju7DIPQzou3dskyDSRrh0VAPHrI-znqeKeJ7NuzJqEc8WcLl4MnUpO-dgUZB7i8Eq_z3FFstaXyhjQGcbht8xZ0cBPFvBgw2gWYhuDQ4lqDHJSru&attredirects=0
>> We don't care about the --<date>-- part and want to get rid of it, so we have
>> to filter down.  That's why we do the output redirect first, so we can use
>> some Linux filter programs, specifically grep and awk.
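
A small sketch of why the 2>&1 matters, using a made-up function named noisy as a stand-in for wget (it only imitates the fact that wget writes its log to standard error). The log line and URL are invented sample data.

```shell
#!/usr/bin/env bash
# Stand-in for wget: writes its log lines to standard error.
noisy() {
    echo "--2015-01-27 14:38:33--  http://example.com/a.jpg" >&2
    echo "Spider mode enabled." >&2
}

# A pipe only carries standard out, so grep sees nothing here.
# (2>/dev/null just keeps the unpiped messages off the terminal.)
without=$(noisy 2>/dev/null | grep -c '^--' || true)

# 2>&1 merges standard error into standard out before the pipe.
with=$(noisy 2>&1 | grep -c '^--' || true)

echo "matching lines without 2>&1: $without"
echo "matching lines with 2>&1:    $with"
```

The first count is 0 and the second is 1, which is the whole reason the redirect has to come before the pipe to grep.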
>> grep is a regular expression tool, which means it's a very powerful way
>> to find text.  The regular expression I wanted to pass in was '^--'  which
>> means:  find all lines that start with the characters --.  awk will take
>> regular expressions too, so I could change the command to look like:
>> . . .2>&1 | awk '$0 ~/^--/ {print $3}' > imagesList.txt
>> and that would work too.
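
To see that the two filters really give the same answer, here is a sketch that runs both over a few invented log lines. The function sample_log and its contents are assumptions standing in for real wget output, which has more lines than this.

```shell
#!/usr/bin/env bash
# Made-up sample of what "wget --spider" prints.
sample_log() {
    cat <<'EOF'
Spider mode enabled. Check if remote file exists.
--2015-01-27 14:38:33--  http://example.com/img/2.100.jpg
Connecting to example.com... connected.
--2015-01-27 14:38:34--  http://example.com/img/2.101.jpg
EOF
}

# Two steps: grep keeps lines starting with "--", awk prints the third field.
with_grep=$(sample_log | grep '^--' | awk '{print $3}')

# One step: awk does both the matching and the printing.
awk_only=$(sample_log | awk '$0 ~ /^--/ {print $3}')

printf '%s\n' "$with_grep"
[ "$with_grep" = "$awk_only" ] && echo "both filters agree"
```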
>> The coolest description of awk I ever got was "basically excel with no
>> gui"
>> awk splits all your text up into fields--the default divider is any run of
>> spaces or tabs, so if I want the first thing in the line, I use a $1 to say get
>> up to the first space.  There's a space in the date here, so the actual URL
>> is in field 3 which is why I tell awk to execute the command {print $3}
>> I could also say, "get the last field in the line" since that's the URL
>> too by using a $NF
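
A quick sketch of the field numbering on one invented log line (the URL is a placeholder, not one from the real site). NF without the dollar sign is the number of fields, and $NF is the contents of the last one.

```shell
#!/usr/bin/env bash
# One made-up line in the same shape as wget's output.
line='--2015-01-27 14:38:33--  http://example.com/img/2.100.jpg'

first=$(printf '%s\n' "$line" | awk '{print $1}')    # the date part
third=$(printf '%s\n' "$line" | awk '{print $3}')    # the URL
last=$(printf '%s\n' "$line" | awk '{print $NF}')    # also the URL
count=$(printf '%s\n' "$line" | awk '{print NF}')    # how many fields

echo "field 1:    $first"
echo "field 3:    $third"
echo "last field: $last"
echo "fields:     $count"
```

Note that the two spaces before the URL still count as a single divider, so the URL is field 3 and not field 4.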
>> The last bit, the > imagesList.txt says to make or overwrite the file
>> named imagesList.txt with whatever awk outputs (which is our filtered urls).
>> The last line is:
>> for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>> this is saying, give me each whitespace-separated word in imagesList.txt
>> (here, one URL per line), store it in the variable $url, and then execute
>> the command group between do and done until we've gone through every URL
>> in the file.
>> The command between do and done is our regular old download a file with
>> wget, with one small modification:
>> wget -O `date +%s%N`.jpg
>> Anything in back-ticks (the ` character right next to 1 on most
>> keyboards) is an encapsulated command, and everything inside the back-ticks
>> will be executed as a command.  Well, the command date +%s%N means give me
>> the current time in nanoseconds.  So, each time wget is run, it'll rename
>> the download file to the current time in nanoseconds.jpg and then the for
>> loop takes over and grabs the next one.
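
A hedged sketch of the same loop written with while read and quoted variables, so it can be tried without touching the network. The function fetch is a made-up stand-in for "wget -O FILE URL", download_all is a hypothetical name, and the URLs are placeholders. It assumes GNU date, since %N is not supported by every date implementation.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "wget -O FILE URL" so the sketch runs offline.
fetch() {
    echo "would save $2 as $1"
}

# Read a list file one line at a time and "download" each URL.
download_all() {
    while IFS= read -r url; do
        name="$(date +%s%N).jpg"   # $(...) does the same job as back-ticks
        fetch "$name" "$url"       # quotes keep "?" and "&" in the URL intact
    done < "$1"
}

list=$(mktemp)
printf '%s\n' \
    'http://example.com/a.jpg?x=1&y=2' \
    'http://example.com/b.jpg?x=3&y=4' > "$list"

download_all "$list"
rm -f "$list"
```

Swapping the body of fetch for the real wget call would turn this back into a downloader; reading line by line also avoids splitting a URL that happens to contain a space.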
>> On Tue, Jan 27, 2015 at 2:17 PM, Stephen Partington <cryptworks at gmail.com> wrote:
>>> you can write a script to yank out the jpg links. or just use something
>>> like https://addons.mozilla.org/en-US/firefox/addon/downthemall/
>>> On Tue, Jan 27, 2015 at 12:12 PM, Michael Havens <bmike1 at gmail.com>
>>> wrote:
>>>> How can I use wget to retrieve the photos here
>>>> <http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html>.
>>>> I tried:
>>>> wget -r
>>>> http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html
>>>> but it didn't download the pictures. It downloaded a bunch of web pages.
>>>> :-)~MIKE~(-:
>>>> ---------------------------------------------------
>>>> PLUG-discuss mailing list - PLUG-discuss at lists.phxlinux.org
>>>> To subscribe, unsubscribe, or to change your mail settings:
>>>> http://lists.phxlinux.org/mailman/listinfo/plug-discuss
>>> --
>>> A mouse trap, placed on top of your alarm clock, will prevent you from
>>> rolling over and going back to sleep after you hit the snooze button.
>>> Stephen
>> --
>> Todd Millecam
> --
> James
> *Linkedin <http://www.linkedin.com/pub/james-h-dugger/15/64b/74a/>*