Coding a program to do mass downloading help
Joseph Sinclair
plug-discussion at stcaz.net
Wed May 19 18:42:43 MST 2010
wget works great, with a couple of caveats:
1) When using wget, have something in the script sanity-check each download (file size usually works, and for MP3s, check that the first 3 bytes are the ASCII characters "ID3"); a rough version of that check is sketched below.
Someone I work with did a mass download like you're describing from a huge music company (75TB of MP3 files at 256kbit encoding, roughly 50% of the catalog), and found that about 10% of the downloads failed, leaving an HTML error page in the .mp3 file instead of the actual content.
2) Authentication can be very tricky, so test it by just fetching a single page first, and confirm that works before going crazy with the recursive options (see the examples below).
3) When using recursion, SET THE DEPTH LIMIT to something small, like 1. Failing to do so can let an error response turn the download job into a crawl-the-whole-web job (last example below).
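
For item 1, here is a rough sketch of the kind of post-download check I mean. The *.mp3 glob, the suspect.txt log, and the 10 KB size floor are just placeholders to adapt to the real files, and note that an MP3 without an ID3v2 tag won't start with "ID3", so a flagged file is only a candidate for re-download, not a confirmed failure:

    #!/bin/sh
    # Flag downloads that are suspiciously small or don't start with an
    # ID3v2 tag (ID3v2-tagged MP3s begin with the ASCII bytes "ID3").
    for f in *.mp3; do
        size=$(stat -c %s "$f")       # GNU stat; use "stat -f %z" on BSD
        magic=$(head -c 3 "$f")
        if [ "$size" -lt 10240 ] || [ "$magic" != "ID3" ]; then
            echo "suspect: $f (size=$size)" >> suspect.txt
        fi
    done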
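
For item 2, test the login against a single page before adding any recursive flags. Which form applies depends on how the site authenticates; the URL, the USER/PASS credentials, and the cookies.txt file below are all placeholders:

    # HTTP basic auth:
    wget --user=USER --password=PASS -O test.html 'https://example.com/members/'

    # Cookie-based login (export the browser session to a Netscape-format
    # cookies.txt first, then reuse it):
    wget --load-cookies cookies.txt -O test.html 'https://example.com/members/'

Open test.html and make sure it really is the members page, not a login form, before going any further.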
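
And for item 3, a depth-limited recursive fetch of one category page might look like this (again the URL is a placeholder; -l 1 caps the recursion depth, -np keeps wget from climbing to the parent directory, and -A mp3 discards everything except the MP3s):

    wget -r -l 1 -np -A mp3 --load-cookies cookies.txt \
         'https://example.com/members/category1/'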
AZ RUNE wrote:
> Yes, once logged in they are just links on a page.
>
> Thanks,
> Brian
>
> On Wed, May 19, 2010 at 2:24 PM, Dan Dubovik <dandubo at gmail.com> wrote:
>
>> wget?
>>
>> If the files are simply linked from the page, you can use the recursive
>> option:
>>
>> -r
>> --recursive
>> Turn on recursive retrieving.
>>
>>
>> If you have a list of the URLs for the files to get:
>> -i file
>> --input-file=file
>> Read URLs from file. If - is specified as file, URLs are read
>> from the standard input. (Use ./- to read from a file literally named -.)
>>
>> If this function is used, no URLs need be present on the command
>> line. If there are URLs both on the command line and in an input file,
>> those on the command lines will be the first ones to be retrieved. The file
>> need not be an HTML document (but no harm if it is)---it is enough if the
>> URLs are just listed sequentially.
>>
>> However, if you specify --force-html, the document will be
>> regarded as html. In that case you may have problems with relative links,
>> which you can solve either by adding "<base href="url">" to the documents
>> or by specifying --base=url on the command line.
>>
>> On Wed, May 19, 2010 at 1:44 PM, AZ RUNE <arizona.rune at gmail.com> wrote:
>>
>>> I have a friend who does DJ work and has a subscription to a closed music
>>> repository.
>>>
>>> In the repository there are 4 categories of music he wants to download,
>>> with 4,000+ songs per category.
>>>
>>> Is there a program that will do that automatically over HTTP if given the URL?
>>> Or would it have to be custom built?
>>>
>>> Any ideas?
>>>
>>> --
>>> Brian Fields
>>> arizona.rune at gmail.com
>>>
>>>