RobotRules.pm
Martijn Koster (mak@victor.nexor.co.uk)
Wed, 09 Aug 1995 22:00:21 BST
Just when all the names had settled down... Where shall we put: "Robot
protocol as implemented in libwww-perl 0.40" from the TODO list?
Tim's module list suggested WWW::Robots, but that may be a bit
general. It's not specific to LWP, and we shouldn't really restrict
it to HTTP either. WWW::RobotRules may be appropriate.
Anyway, I came up with something that does:
$robotrules = new RobotRules('MOMspider/1.0');
$robotrules->parse($url, $content);
if($robotsrules->allowed($url)) {
...
}
Is that OK for an interface? .pm and 26 tests appended for feedback...
Unrelated comment: Wow the Tk:: stuff is cool :-) Anyone done an MH
reader in it yet?
-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html
=head1 NAME
WWW::RobotRules
=head1 SYNOPSIS
$robotrules = new WWW::RobotRules('MOMspider/1.0');
$robotrules->parse($url, $content);
if($robotrules->allowed($url)) {
...
}
=head1 DESCRIPTION
This module parses a "/robots.txt" file as specified in
"A Standard for Robot Exclusion", described in
http://web.nexor.co.uk/users/mak/doc/robots/norobots.html.
Webmasters can use this file to disallow conforming robots access to
parts of their WWW server.
The parsed file is kept as a Perl object that support methods to
check if a given URL is prohibited.
Note that the same RobotRules object can parse multiple files.
=cut
package RobotRules;
use URI::URL;
use strict;
=head2 new RobotRules('MOMspider/1.0')
The argument given to C<new()> is the name of the robot.
=cut
sub new {
my($class, $ua) = @_;
$ua =~ s!/?\s*\d+.\d+\s*$!! if (defined $ua); # lose version
bless {
'ua' => $ua,
'rules' => undef,
}, $class;
}
=head2 parse($url, $content)
Parse takes the URL that was used to retrieve the /robots.txt
file, and the contents of the file.
=cut
sub parse {
my($self, $url, $txt) = @_;
$url = new URI::URL($url) unless ref($url); # make it URL
my $hostport = $url->host . ':' . $url->port;
delete $self->{'rules'}{$hostport} if
exists $self->{'rules'}{$hostport};
$txt =~ s/\015\012/\n/mg; # fix weird line endings
my $isMe = 0; # 1 iff this record is for me
my $isAnon = 0; # 1 iff this record is for *
my @meDisallowed = (); # rules disallowed for me
my @anonDisallowed = (); # rules disallowed for *
for(split(/\n/, $txt)) {
s/\s*\#.*//;
if (/^\s*$/) { # blank line
if ($isMe) {
# That was our record. No need to read the rest.
last;
}
$isMe = $isAnon = 0;
@meDisallowed = ();
}
elsif (/^User-agent:\s*(.*)\s*$/i) {
$ua = $1;
if ($isMe) {
# This record already had a User-agent that
# we matched, so just continue.
}
elsif($self->isMe($ua)) {
$isMe = 1;
}
elsif ($ua eq '*') {
$isAnon = 1;
}
}
elsif (/^Disallow:\s*(.*)\s*$/i) {
warn "Disallow without preceding User-agent" unless
defined $ua;
my $full_path;
if ($1 eq '') {
$full_path = '';
}
else {
my $abs = new URI::URL($1, $url);
$full_path = $abs->full_path();
}
if ($isMe) {
push(@meDisallowed, $full_path);
}
elsif ($isAnon) {
push(@anonDisallowed, $full_path);
}
}
else {
warn "Unexpected line: $_\n";
}
}
if ($isMe) {
$self->{'rules'}{$hostport} = \@meDisallowed;
}
elsif (@anonDisallowed) {
$self->{'rules'}{$hostport} = \@anonDisallowed;
}
}
# isMe()
#
# Returns TRUE if the given name matches the
# name of this robot
#
sub isMe {
my($self, $ua) = @_;
my $me = $self->{'ua'};
return $ua =~ /$me/i;
}
=head1 allowed($url)
Returns TRUE if this robot is allowed to retrieve this URL.
=cut
sub allowed {
my($self, $url) = @_;
$url = new URI::URL($url) unless ref($url); # make it URL
my $hostport = $url->host . ':' . $url->port;
return 1 unless exists $self->{'rules'}{$hostport};
my $str = $url->full_path;
for $rule (@{ $self->{'rules'}{$hostport}}) {
return 1 if ($rule eq '');
return 0 if ($str =~ /^$rule/);
}
return 1;
}
1;
=head1 NAME
RobotRules.t
=head1 DESCRIPTION
Test a number of different A</robots.txt> files against a number
of different User-agents.
=cut
require WWW::RobotRules;
use Carp;
use strict;
print "1..26\n"; # for Test::Harness
# We test a number of different /robots.txt files,
#
my $content1 = <<EOM;
# http://foo/robots.txt
User-agent: *
Disallow: /private
Disallow: http://foo/also_private
User-agent: MOMspider
Disallow:
EOM
my $content2 = <<EOM;
# http://foo/robots.txt
User-agent: MOMspider
Disallow: /private
EOM
my $content3 = <<EOM;
# http://foo/robots.txt
EOM
my $content4 = <<EOM;
# http://foo/robots.txt
User-agent: *
Disallow: /private
User-agent: MOMspider
Disallow: /this
User-agent: Another
Disallow: /that
EOM
# and a number of different robots:
my @tests1 = (
[$content1, 'MOMspider' =>
1 => 'http://foo/private' => 1,
2 => 'http://foo/also_private' => 1,
],
[$content1, 'Wubble' =>
3 => 'http://foo/private' => 0,
4 => 'http://foo/also_private' => 0,
5 => 'http://foo/other' => 1,
],
[$content2, 'MOMspider' =>
6 => 'http://foo/private' => 0,
7 => 'http://foo/other' => 1,
],
[$content2, 'Wubble' =>
8 => 'http://foo/private' => 1,
9 => 'http://foo/also_private' => 1,
10 => 'http://foo/other' => 1,
],
[$content3, 'MOMspider' =>
11 => 'http://foo/private' => 1,
12 => 'http://foo/other' => 1,
],
[$content3, 'Wubble' =>
13 => 'http://foo/private' => 1,
14 => 'http://foo/other' => 1,
],
[$content4, 'MOMspider' =>
15 => 'http://foo/private' => 1,
16 => 'http://foo/this' => 0,
17 => 'http://foo/that' => 1,
],
[$content4, 'Another' =>
18 => 'http://foo/private' => 1,
19 => 'http://foo/this' => 1,
20 => 'http://foo/that' => 0,
],
[$content4, 'Wubble' =>
21 => 'http://foo/private' => 0,
22 => 'http://foo/this' => 1,
23 => 'http://foo/that' => 1,
],
[$content4, 'Another/1.0' =>
24 => 'http://foo/private' => 1,
25 => 'http://foo/this' => 1,
26 => 'http://foo/that' => 0,
],
# when adding tests, remember to increase
# the maximum at the top
);
my $t;
for $t (@tests1) {
my ($content, $ua) = splice(@$t, 0, 2);
my $robotsrules = new RobotRules($ua);
$robotsrules->parse('http://foo/robots.txt', $content);
my ($num, $path, $expected);
while(($num, $path, $expected) = splice(@$t, 0, 3)) {
if($robotsrules->allowed($path) != $expected) {
confess "Test Failed: $ua => $path";
}
print "ok $num\n";
}
}