Re: Web mirroring
Reinier Post (reinpost@win.tue.nl)
Mon, 5 Dec 1994 11:52:15 +0100 (MET)
You (libwww-perl-request@ics.UCI.EDU) write:
>Quoth Martijn Koster;
>
>> Hmm, that's open to debate. For example Oscar's script doesn't use
>> If-modified-since or even HEAD last time I looked, so someone
>> mirroring hundreds of documents daily means pulling them all across
>> every time, which adds an unnecessary load to the server, which IMHO
>> is unethical.
>
>Yecch. (No offense, Oscar, but...)
No offense possible; his script predates the If-Modified-Since header by
many months. This is another reason for running the script through a cache;
the cache will make the conditional requests for you.
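For reference, a conditional GET is an ordinary GET with one extra header;
the server answers 304 (Not Modified), with no body, if the document hasn't
changed since the given date. Below is a minimal sketch of building such a
request by hand. It is not taken from Oscar's script or from libwww-perl;
the names http_date and ims_request and the path /index.html are made up
for illustration.

```perl
#!/usr/local/bin/perl
# imsreq - build a conditional GET request by hand (illustrative sketch)
use strict;
use warnings;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as an HTTP date in GMT,
# in the form "Mon, 05 Dec 1994 10:52:15 GMT".
sub http_date {
    my ($time) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($time);
    return sprintf("%s, %02d %s %04d %02d:%02d:%02d GMT",
                   $DAY[$wday], $mday, $MON[$mon], $year + 1900,
                   $hour, $min, $sec);
}

# Build the request text. $path and $since (an epoch time, normally the
# timestamp of the local copy) are supplied by the caller.
sub ims_request {
    my ($path, $since) = @_;
    return "GET $path HTTP/1.0\r\n"
         . "If-Modified-Since: " . http_date($since) . "\r\n"
         . "\r\n";
}

# Hypothetical example: ask for /index.html only if it has changed
# during the last 24 hours.
print ims_request("/index.html", time - 24 * 60 * 60);
```

A robot would send this text over a socket to the server and keep its local
copy whenever the reply status is 304.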
>Well, the libwww-perl kit includes code for performing HEADs and
>if-modified-since GETs, also includes code which implements the
>robot-exclusion standard, and I can probably track down the code I saw
>for frobbing relative URLs. All I need as glue is some code to scan
>through a received HTML document and return a list of referenced URLs.
>
>Being a Perl beginner, I'm not confident of my ability to perform that
>kind of scan, especially with the number of malformed anchors (<a
>href="http://site/>) out there. If code exists, I'd love a copy.
This is a script I use to extract URLs from saved news and email messages.
It is just a hack. (MOMspider, a robot based on libwww-perl, does the same
job more thoroughly.)
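#!/usr/local/bin/perl
# geturls - extract URLs from text

@urls = (); # none yet

while (<>)
{
    # cheap pre-test: skip lines that contain no ":/" at all
    # (this also skips news: URLs on lines without a ":/")
    if (/:\// && (
        (/http:/   && /http:[^\n ",')>]+/)   ||
        (/gopher:/ && /gopher:[^\n ",')>]+/) ||
        (/file:/   && /file:[^\n ",')>]+/)   ||
        (/ftp:/    && /ftp:[^\n ",')>]+/)    ||
        (/news:/   && /news:[^\n ",')>]+/) ) )
    {
        # yummy, looks like there's a URL in this line
        # ($& is the text of the last successful match, so only one
        # URL per line is recorded)
        push(@urls,$&);
    }
}

# sort returns the sorted list; it does not sort in place
@urls = sort(@urls);

# print each URL once; duplicates are adjacent after sorting
$last = '';
foreach (@urls)
{
    if ($last ne $_)
    {
        print "$_\n";
        $last = $_;
    }
}
# end of script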
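The script above records at most one URL per input line. If that matters,
something along these lines may do. It is a rough sketch that keeps the same
guess about where a URL ends (whitespace, quote, comma, closing parenthesis
or ">"), so a malformed anchor with a missing closing quote still ends at the
">". The names extract_urls and unique_sorted and the sample text are
illustrative only, not part of geturls or libwww-perl.

```perl
#!/usr/local/bin/perl
# geturls2 - variant of geturls that finds every URL in a string (sketch)
use strict;
use warnings;

# Return all URLs found in $text, for the same five schemes as geturls.
# In list context a /g match returns every captured substring.
sub extract_urls {
    my ($text) = @_;
    my @found = $text =~ /\b((?:http|gopher|file|ftp|news):[^\s",')>]+)/g;
    return @found;
}

# Return the distinct items of the argument list, sorted as strings.
sub unique_sorted {
    my %seen;
    my @distinct = grep { !$seen{$_}++ } @_;
    my @sorted = sort @distinct;
    return @sorted;
}

# Made-up sample input with a repeated URL and two URLs on one line.
my $sample = 'See <a href="http://site/a.html">a</a> and ftp://host/pub/x,'
           . ' also http://site/a.html';

print "$_\n" for unique_sorted(extract_urls($sample));
```

Like the original, this is pattern matching on raw text rather than HTML
parsing, so it will not resolve relative URLs.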
>That caveat aside, there shouldn't be too much in the way of me
>putting together a gratuitously friendly robot to gently update local
>copies of document trees on a regular basis.
>
>Yes, I'll test it locally :).
>
>>> Not specifically, and it doesn't seem to be worthwhile, either.
>>> Robots + caches are superior to mirror programs, except for very
>>> large sets of documents.
>
>> I still think they'd be useful on occasion.
>
>Indeed.
If a mirror program is your word for a robot, then the question is whether
or not it must be run through a properly configured cache.
>Toggling CacheUnused, KeepExpired and CacheRefreshInterval settings
>whilst running traversals means playing nasty tricks with my
>configuration files.
>
>I'd much rather just set a suitably timed crontab entry that performs
>traversals and IMS gets for a particular document tree, and leave the
>cache settings alone unless they need tuning.
A crontab is used to run the robot; the reconfiguration can be done from the
crontab, too. You'll be configuring an existing program to suit your needs;
it will be easier than developing your own program.
--
Reinier Post reinpost@win.tue.nl
a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>
PS replies on this list are set to libwww-request@ics.UCI.EDU. This must be
a mistake; replies normally go to libwww@ics.UCI.EDU or to the original sender.