View Full Version : How to back up or download this?
Any ideas? I tried web-download software, but no luck :-(
03-30-2007, 06:24 AM
I just tried it with SurfOffline 2.0 beta, and got loads of "Forbidden" errors. Maybe they recognise bot activity and block it? If so, you might have to manually download the pages and, where the image tags use absolute URLs, modify them to point to local paths.
03-30-2007, 06:39 AM
You might want to try wget (http://aminet.net/comm/tcp/wget-1.8.2.lha), maybe with the option --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" to fake the user agent string. ;-)
03-30-2007, 06:41 AM
I think I have spotted the problem. In the source code of dos1.html there is a tag "<BASE HREF="http://www.nethkin.com/bmori/amiga/dos1.html">". When I look in the log file for SurfOffline, I see that it is trying to download, for example, http://www.nethkin.com/bmori/amiga/ados7.gif, when the file is actually located at http://web.archive.org/web/20010619122216/www.nethkin.com/bmori/amiga/ados7.gif. I think you would need web-spider software that ignores the BASE tag.
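If your spider won't ignore the BASE tag, you can delete it from the saved HTML instead, so relative links resolve against the local copy rather than nethkin.com. A small sketch, using a cut-down stand-in for dos1.html (the real page obviously has more in it):

```shell
# Cut-down stand-in for the saved page, containing the problematic tag.
cat > dos1.html <<'EOF'
<HTML><HEAD>
<BASE HREF="http://www.nethkin.com/bmori/amiga/dos1.html">
</HEAD><BODY><IMG SRC="ados7.gif"></BODY></HTML>
EOF

# Delete any BASE tag, case-insensitively, in place.
perl -pi -e 's/<BASE\s+HREF=[^>]*>//gi' dos1.html

cat dos1.html
```

With the BASE tag gone, `ados7.gif` is fetched relative to wherever dos1.html actually lives.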
03-30-2007, 07:26 AM
Just view the source code, and use screen capture to rip the images (if needed).
03-30-2007, 08:41 AM
Have you emailed the author? He's already sharing all the info for free; I don't see why he wouldn't agree to give you a way to back it up for personal use.
This will allow you to get at least some of the files:
wget --user-agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705)" \
     http://web.archive.org/web/20040415065133/www.nethkin.com/bmori/amiga/dos1.html \
     --output-document - \
  | perl -p -e 's/\/\/www.nethkin.com/\/\/web.archive.org\/web\/20040415065133\/www.nethkin.com/g' \
  | wget --input-file - --force-html \
         --user-agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705)" \
         --convert-links --force-directories --no-host-directories \
         --cut-dirs 3 --wait 20 --random-wait
The pages will appear in the bmori/amiga/ directory (and subdirectories).
Note that archive.org has a robots.txt file that, if followed, prohibits apps from recursively grabbing content. In this case I've added "--wait 20 --random-wait" to make the leeching less disruptive. Downloading takes longer, but shouldn't piss off archive.org admins.
I know this is far from a perfect solution, but at least it works somewhat (without the need to download everything by hand).
03-30-2007, 12:07 PM
Or you might want to try HTTrack (http://www.httrack.com/).
vBulletin® v3.8.4, Copyright ©2000-2013, Jelsoft Enterprises Ltd.