Re: Web mirroring

Reinier Post (reinpost@win.tue.nl)
Mon, 5 Dec 1994 11:52:15 +0100 (MET)


You (libwww-perl-request@ics.UCI.EDU) write:

>Quoth Martijn Koster;
>
>> Hmm, that's open to debate. For example Oscar's script doesn't use
>> If-modified-since or even HEAD last time I looked, so someone
>> mirroring hundreds of documents daily means pulling them all across
>> every time, which adds an unnecessary load to the server, which IMHO
>> is unethical.
>
>Yecch. (No offense, Oscar, but...)

No offense possible; his script predates the If-Modified-Since header by
many months.  This is another reason for running the script through a
cache; the cache will issue the conditional requests for you.
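
For reference, a conditional GET carries the modification time of the local
copy in an If-Modified-Since header, and the server answers 304 Not Modified
with no body when the document has not changed.  The following is a minimal
sketch that only formats such a request as text; it opens no connection, and
the names http_date and ims_request as well as the path /index.html are
made up for illustration (they are not part of libwww-perl).

```perl
#!/usr/local/bin/perl
# Sketch only: format a conditional GET request as text.
# http_date, ims_request and /index.html are illustrative names.

use strict;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format epoch seconds as an HTTP date in GMT.
sub http_date
{
  my ($epoch) = @_;
  my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
  return sprintf("%s, %02d %s %04d %02d:%02d:%02d GMT",
                 $DAY[$wday], $mday, $MON[$mon], $year + 1900,
                 $hour, $min, $sec);
}

# Build a GET that asks for the document only if it changed after $mtime.
sub ims_request
{
  my ($path, $mtime) = @_;
  return "GET $path HTTP/1.0\r\n"
       . "If-Modified-Since: " . http_date($mtime) . "\r\n"
       . "\r\n";
}

# Example with a fixed timestamp; a real robot would use the mtime of
# its local copy, e.g. (stat($file))[9].
print ims_request("/index.html", 784111777);
```

A robot would send this text to the server and keep its local copy whenever
the status line of the reply says 304.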

>Well, the libwww-perl kit includes code for performing HEADs and
>if-modified-since GETs, also includes code which implements the
>robot-exclusion standard, and I can probably track down the code I saw
>for frobbing relative URLs. All I need as glue is some code to scan
>through a received HTML document and return a list of referenced URLs.
>
>Being a Perl beginner, I'm not confident of my ability to perform that
>kind of scan, especially with the number of malformed anchors (<a
>href="http://site/>) out there. If code exists, I'd love a copy.

This is a script I use to extract URLs from saved news and email messages.
It is just a hack.  (MOMspider, a robot based on libwww-perl, does the same
job more thoroughly.)

#!/usr/local/bin/perl

# geturls - extract URLs from text

@urls = ();  # none yet

while (<>)
{
  # Collect every URL on the line, not only the first.  news: URLs
  # carry no ":/", so they are matched without requiring one.
  while (/((http|gopher|file|ftp):\/|news:)[^\n ",')>]+/g)
  {
    push(@urls,$&);
  }
}

@urls = sort(@urls);  # sort returns a new list; it does not sort in place

$last = '';
foreach (@urls)
{
  if ($last ne $_)
  {
    print "$_\n";
    $last = $_;
  }
}

# end of script
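
For HTML documents specifically, the scan can be restricted to the HREF
attributes of anchors.  The following is a rough sketch in the same spirit,
and it is not how MOMspider does it; the subroutine name extract_hrefs is
made up for illustration.  It tolerates a missing closing quote by ending
the URL at the first quote, whitespace or '>', and it does not resolve
relative URLs.

```perl
#!/usr/local/bin/perl
# hrefs - list HREF targets of anchors in the HTML files named as arguments
# Sketch only; extract_hrefs is an illustrative name.

use strict;

sub extract_hrefs
{
  my ($html) = @_;
  my (@found, %seen);

  # Match <a ... href=VALUE where VALUE may be quoted or bare.  The value
  # ends at the first quote, whitespace or '>', so an anchor such as
  # <a href="http://site/> still yields http://site/.
  while ($html =~ /<a\b[^>]*?\bhref\s*=\s*["']?([^"'\s>]+)/gi)
  {
    push(@found, $1) unless $seen{$1}++;   # keep first occurrence only
  }
  return @found;
}

# Read the named files as one string so that anchors split across
# lines are still found; do nothing when no files are given.
if (@ARGV)
{
  my $html = join('', <>);
  print "$_\n" foreach extract_hrefs($html);
}
```

Unlike the script above, this keeps the URLs in the order in which they
first appear instead of sorting them.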

>That caveat aside, there shouldn't be too much in the way of me
>putting together a gratuitously friendly robot to gently update local
>copies of document trees on a regular basis.
>
>Yes, I'll test it locally :).
>
>>> Not specifically, and it doesn't seem to be worthwhile, either.
>>> Robots + caches are superior to mirror programs, except for very
>>> large sets of documents.
>
>> I still think they'd be useful on occasion.
>
>Indeed.

If "mirror program" is your term for a robot, then the question is whether
or not it should be run through a properly configured cache.

>Toggling CacheUnused, KeepExpired and CacheRefreshInterval settings
>whilst running traversals means playing nasty tricks with my
>configuration files.
>
>I'd much rather just set a suitably timed crontab entry that performs
>traversals and IMS gets for a particular document tree, and leave the
>cache settings alone unless they need tuning.

The robot is run from a crontab entry anyway; the cache reconfiguration can
be done from the crontab, too.  You'll be configuring an existing program to
suit your needs, which will be easier than developing your own program.

-- 
Reinier Post						 reinpost@win.tue.nl
a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>

PS replies on this list are set to libwww-request@ics.UCI.EDU.  This must be
a mistake; replies normally go to libwww@ics.UCI.EDU or to the original sender.