Re: wget

Author: Michael Havens
Date:  
To: Main PLUG discussion list
Subject: Re: wget
you guys are GOOD! *wish I could do that.*

:-)~MIKE~(-:

On Tue, Jan 27, 2015 at 8:39 PM, James Dugger <>
wrote:

> After I got home I thought I could improve on the script. The following
> script pulls down the urls and passes them through a while loop that
> reduces the name of the url down to the name of the .jpg given in front of
> the query string. There are a lot of things that could be refactored to
> clean it up but it works:
>
>
> #!/usr/bin/env bash
>
> # Crawl the site, build the url list, and pass it into the variable url.
> url=$(wget --spider --force-html -r -l2 "http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/" 2>&1 | grep '^--' | awk '{ print $3 }')
>
> # Set how many characters in front of ".jpg" to start building the name string.
> front_int=6
>
> printf %s "$url" | while IFS= read -r raw_url
>     do
>     # Cut the string at the characters ".jpg".
>     pos=${raw_url%%.jpg?*}
>
>     # Determine the character position of the cut string.
>     pos_int=${#pos}
>
>     # Reduce the number by the value of $front_int.
>     (( pos_int -= front_int ))
>
>     # Build a new string starting at pos_int, with the length given by front_int.
>     temp_name=${raw_url:pos_int:front_int}
>
>     # Clean up the image name (keep only what follows the last "/").
>     image_name=$(echo "$temp_name" | sed 's#.*/\(.*\)$#\1#g')
>
>     # Get the images.
>     wget -O "${image_name}.jpg" "$raw_url"
> done
>
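> If you want to try the sketch above, here's one way to run it (the file name get_images.sh is just an example, not something fixed):
>
> # Save the script above as get_images.sh, then:
> chmod +x get_images.sh
> ./get_images.sh     # the renamed .jpg files land in the current directory
>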
>
> On Tue, Jan 27, 2015 at 2:48 PM, Todd Millecam <> wrote:
>
>> alright, you got the 20 second, the 2 minute, and now the 20 minute help
>> solution:
>>
>> Open a terminal and do the following:
>>
>> cd /tmp
>> mkdir images
>> cd images
>> wget --spider -r -l2 -A jpg http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/ 2>&1 | grep '^--' | awk '{print $3}' > imagesList.txt
>> for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>>
>> That'll download all the images. . .and some crap. Look through it and
>> delete the crap out of /tmp/images and you're done.
>>
>>
>> Explanation for those who want to improve their linux terminal fu:
>>
>> make a temporary directory inside of tmp for cleanliness.
>>
>> Now, the wget command is confusing. --spider and -r mean "just grab
>> urls, don't download anything, but look through and get every url you can
>> find from the following location." -l2 means to just look two directories
>> deep, and -A jpg means to only print out stuff that includes "jpg" in the
>> url.
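>>
>> To keep the flags straight, here's the crawl part of the command again with
>> each option annotated (example.com is just a stand-in for the real URL):
>>
>> # --spider : don't save anything, just report the URLs wget would fetch
>> # -r       : recurse, i.e. follow the links found on the pages
>> # -l2      : limit that recursion to two levels deep
>> # -A jpg   : only accept file names matching "jpg"
>> wget --spider -r -l2 -A jpg http://example.com/gallery/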
>>
>> From there, it's the big long URL, and then some fun little unix specific
>> stuff.
>>
>> 2>&1 is an output redirect. 0 is standard in, 1 is standard out, and 2
>> is standard error. wget is a weird program and prints everything to
>> standard error by default, so this makes it move all the data from standard
>> error to standard out. We want it on standard out so we can send the
>> output of wget (all our URLs) to a different program to filter through
>> them. You see, wget isn't giving us clean urls, it's giving us some crap
>> output lines, and the useful output lines come in a string like:
>>
>>
>> --2015-01-27 14:38:33--
>> https://e572cad7-a-62cb3a1a-s-sites.googlegroups.com/site/thebookofgimp/home/chapter-2-photograph-retouching/2.100.jpg?attachauth=ANoY7cqIGpuUDdagEljYRFF7WMX2G3rAxez0XLIAOW9cXpAnjqilN4X2HyaRWIblk29ORjgMg28jrQuQmBisXSw0d3gYh912nr4DtRyT5Jqk0KVEfJRqC2u92vG7TlxK75odZ1uWVaUrpEvUw1A52TZbuU7Dju7DIPQzou3dskyDSRrh0VAPHrI-znqeKeJ7NuzJqEc8WcLl4MnUpO-dgUZB7i8Eq_z3FFstaXyhjQGcbht8xZ0cBPFvBgw2gWYhuDQ4lqDHJSru&attredirects=0
>>
>>
>> we don't care about the --<date >-- and want to get rid of it, so we have
>> to filter down. That's why we do the output redirect first, so we can use
>> some Linux filter programs, specifically grep and awk.
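>>
>> A quick way to convince yourself of the stderr/stdout behavior (example.com
>> is a placeholder; any reachable URL works):
>>
>> wget --spider http://example.com/ > /dev/null         # the -- lines still appear: they're on standard error
>> wget --spider http://example.com/ 2>&1 | grep '^--'   # after the redirect, grep can actually see them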
>>
>> grep is a regular expression tool, which means it's a very powerful way
>> to find text. The regular expression I wanted to pass in was '^--' which
>> means: find all lines that start with the characters --. awk will take
>> regular expressions too, so I could change the command to look like:
>> . . .2>&1 | awk '$0 ~/^--/ {print $3}' > imagesList.txt
>> and that would work too.
>>
>> The coolest description of awk I ever got was "basically excel with no
>> gui"
>> awk splits all your text up into fields--the default dividing character
>> is a space, so if I want the first thing in the line, I use a $1 to say get
>> up to the first space. There's a space in the date here, so the actual URL
>> is in field 3 which is why I tell awk to execute the command {print $3}
>> I could also say, "get the last field in the line" since that's the URL
>> too by using a $NF
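>>
>> For example, feeding awk a line shaped like wget's output (the line below is
>> made up for illustration, not real output):
>>
>> echo '--2015-01-27 14:38:33-- http://example.com/a.jpg' | awk '{print $3}'    # -> http://example.com/a.jpg
>> echo '--2015-01-27 14:38:33-- http://example.com/a.jpg' | awk '{print $NF}'   # -> http://example.com/a.jpg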
>>
>> The last bit, the > imagesList.txt says to make or overwrite the file
>> named imagesList.txt with whatever awk outputs (which is our filtered urls).
>>
>> The last line is:
>> for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>>
>> this is saying, give me the text on each line in imagesList.txt, and
>> store them in the variable $url, and then execute the command group between
>> do and done until we've gone through every line in the file.
>>
>> The command between do and done is our regular old download a file with
>> wget, with one small modification:
>>
>> wget -O `date +%s%N`.jpg
>>
>> Anything in back-ticks (the ` character right next to 1 on most
>> keyboards) is an encapsulated command, and everything inside the back-ticks
>> will be executed as a command. Well, the command date +%s%N means give me
>> the current time in nanoseconds. So, each time wget is run, it'll rename
>> the download file to the current time in nanoseconds.jpg and then the for
>> loop takes over and grabs the next one.
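>>
>> You can see what that produces for yourself (the digits below are just an
>> illustration; yours will differ):
>>
>> date +%s%N              # prints something like 1422400713123456789
>> echo `date +%s%N`.jpg   # -> 1422400713123456789.jpg, so each download gets its own name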
>>
>>
>>
>>
>>
>> On Tue, Jan 27, 2015 at 2:17 PM, Stephen Partington <
>> > wrote:
>>
>>> you can write a script to yank out the jpg links. or just use something
>>> like https://addons.mozilla.org/en-US/firefox/addon/downthemall/
>>>
>>> On Tue, Jan 27, 2015 at 12:12 PM, Michael Havens <>
>>> wrote:
>>>
>>>> How can I use wget to retrieve the photos here
>>>> <http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html>.
>>>> I tried:
>>>>
>>>> wget -r
>>>> http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html
>>>>
>>>> but it didn't download the pictures. It downloaded a bunch of web pages.
>>>> :-)~MIKE~(-:
>>>>
>>>
>>>
>>>
>>> --
>>> A mouse trap, placed on top of your alarm clock, will prevent you from
>>> rolling over and going back to sleep after you hit the snooze button.
>>>
>>> Stephen
>>>
>>>
>>
>>
>>
>> --
>> Todd Millecam
>>
>
>
>
> --
> James
>
> *Linkedin <http://www.linkedin.com/pub/james-h-dugger/15/64b/74a/>*
>

---------------------------------------------------
PLUG-discuss mailing list -
To subscribe, unsubscribe, or to change your mail settings:
http://lists.phxlinux.org/mailman/listinfo/plug-discuss