Re: web mirroring
Reinier Post (reinpost@win.tue.nl)
Thu, 1 Dec 1994 16:52:16 +0100 (MET)
Garth Kidd wrote:
> I've been grabbing stuff
>out of the cache at work, stripping the headers, and installing them
>on APANA's web-server as mirrors.
>
>This is a little time-consuming, to say the least.
>
Why don't you run a cache at the slow end? This is exactly the kind of
thing caches were developed to do.
>What I'd like is something which takes a list of http: URLs and
>traverses to all documents referenced by them _and "underneath" any of
>the specified URLs_. The restriction is intended to limit hits to just
>the document trees I'm interested in -- I don't want to set something
>loose which will traverse endlessly.
Oscar Nierstrasz's Perl scripts may be sufficient. They're available
somewhere at http://www.unige.ch/. Several other robots exist, but
I can't give you a list. MOMspider may be usable, although it was not
developed to do this.
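
For illustration, the "underneath" restriction you describe could be
sketched roughly as below. This is not taken from any of the robots
named above; the function names in_scope and links_to_follow and the
example.org URLs are made up for the example, and it assumes that a
root URL covers everything up to its last slash.

```python
from urllib.parse import urljoin, urlsplit


def in_scope(url, roots):
    """Return True if url lies at or underneath one of the root URLs."""
    u = urlsplit(url)
    for root in roots:
        r = urlsplit(root)
        # Treat the root as a directory: keep its path up to the last slash.
        # Note that a root of ".../docs" (no trailing slash) therefore
        # covers the whole server root, so write ".../docs/" instead.
        base = r.path[: r.path.rfind("/") + 1] or "/"
        same_host = (u.scheme, u.netloc) == (r.scheme, r.netloc)
        if same_host and (u.path or "/").startswith(base):
            return True
    return False


def links_to_follow(page_url, hrefs, roots):
    """Resolve hrefs against page_url and keep only the in-scope ones."""
    # Drop fragments so "a.html" and "a.html#top" count as one document.
    resolved = (urljoin(page_url, h).split("#", 1)[0] for h in hrefs)
    return sorted({u for u in resolved if in_scope(u, roots)})
```

A robot would call links_to_follow on every page it fetches and queue
only what comes back, which is what keeps it from wandering off
endlessly.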
>So long as the "something" can use a proxy server, I'm fine, as I can
>pull the results out of the cache. If the "something" grabs the files
>and stashes them somewhere useful, that's fine, too.
If the something you end up with does not support a proxy server, take a
look at http://www.win.tue.nl/lagoon/
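
If you end up writing the fetching part yourself, proxy support is
usually a small matter. A minimal sketch using Python's standard
urllib follows; the proxy address cache.example:8080 and the function
name make_proxied_opener are placeholders, not anything from your
setup.

```python
from urllib.request import ProxyHandler, build_opener


def make_proxied_opener(proxy_url):
    """Build an opener whose http: requests go through proxy_url."""
    return build_opener(ProxyHandler({"http": proxy_url}))


# Hypothetical cache address; replace it with your own.
opener = make_proxied_opener("http://cache.example:8080")
# opener.open("http://example.org/docs/") would then fetch via the cache.
```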
>My questions are:
>
>Is mirroring of this type ethical?
As long as you do it just to improve performance, why not?
>Should the mirroring software use the robot exclusion standard?
Yes, and it probably won't.
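
If the software doesn't honour it, adding the check is not hard. A
rough sketch with the robots.txt parser from Python's standard library
is below; the agent name "mirror-sketch", the helper make_checker and
the sample rules are invented for the example, and a real robot would
fetch /robots.txt from each server instead of using a fixed string.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt, agent):
    """Return a function that says whether agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(agent, url)


# Sample exclusion file, made up for this example.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

allowed = make_checker(SAMPLE, "mirror-sketch")
```

The robot would then skip any URL for which allowed(url) is false.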
>Should I notify the webmasters and/or document maintainers?
Not strictly necessary; they'll contact you when they see the logs. It
depends on how badly the robot will behave. Will you be running it every
night? Will it transfer very large numbers of documents? Will it insist on
always having the latest version? Etc. It would be polite to let them know.
>Should I wait for permission before proceeding?
No.
>Does free software already exist to perform http mirroring?
Not specifically, and it doesn't seem to be worthwhile, either. A robot
plus a cache is superior to a mirror program, except for very large sets
of documents.
>If so, where can it be obtained?
>If not, how could I apply libwww and libwww-compatible scripts to the task?
>Is this likely to be beyond the ability of a relative Perl novice?
No, but you certainly won't need to write your own robot.
>[Note: the answers to many of these seem obvious; I'm checking just in
>case common-sense doesn't pan out. Put another way, I'm erring on the
>side of completeness at risk of looking like a complete idiot :)]
Have you tried the robots mailing list? It's hosted at nexor.ac.uk; see
their HTTP server.
--
Reinier Post reinpost@win.tue.nl
a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>