RobotRules.pm

Martijn Koster (mak@victor.nexor.co.uk)
Wed, 09 Aug 1995 22:00:21 BST


Just when all the names had settled down... Where shall we put: "Robot
protocol as implemented in libwww-perl 0.40" from the TODO list?

Tim's module list suggested WWW::Robots, but that may be a bit
general.  It's not specific to LWP, and we shouldn't really restrict
it to HTTP either. WWW::RobotRules may be appropriate.

Anyway, I came up with something that does:

 $robotrules = new RobotRules('MOMspider/1.0');
 $robotrules->parse($url, $content);
 if($robotsrules->allowed($url)) {
     ...
 }

Is that OK for an interface? .pm and 26 tests appended for feedback...


Unrelated comment: Wow the Tk:: stuff is cool :-) Anyone done an MH
reader in it yet?

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html

=head1 NAME

WWW::RobotRules

=head1 SYNOPSIS

 $robotrules = new WWW::RobotRules('MOMspider/1.0');

 $robotrules->parse($url, $content);
    
 if($robotrules->allowed($url)) {
     ...
 }

=head1 DESCRIPTION

This module parses a "/robots.txt" file as specified in
"A Standard for Robot Exclusion", described in
http://web.nexor.co.uk/users/mak/doc/robots/norobots.html.

Webmasters can use this file to disallow conforming robots access to
parts of their WWW server.

The parsed file is kept as a Perl object that support methods to
check if a given URL is prohibited.

Note that the same RobotRules object can parse multiple files.

=cut

package RobotRules;

use URI::URL;
use strict;

=head2 new RobotRules('MOMspider/1.0')

The argument given to C<new()> is the name of the robot.

=cut

sub new {
    my($class, $ua) = @_;

    $ua =~ s!/?\s*\d+.\d+\s*$!! if (defined $ua);	# lose version

    bless {
	'ua' => $ua,
	'rules' => undef,
    }, $class;
}

=head2 parse($url, $content)

Parse takes the URL that was used to retrieve the /robots.txt
file, and the contents of the file.

=cut

sub parse {
    my($self, $url, $txt) = @_;

    $url = new URI::URL($url) unless ref($url);	# make it URL

    my $hostport = $url->host . ':' . $url->port;

    delete $self->{'rules'}{$hostport} if
	exists $self->{'rules'}{$hostport};

    $txt =~ s/\015\012/\n/mg;	# fix weird line endings

    my $isMe = 0;		# 1 iff this record is for me
    my $isAnon = 0;		# 1 iff this record is for *
    my @meDisallowed = ();	# rules disallowed for me
    my @anonDisallowed = ();	# rules disallowed for *

    for(split(/\n/, $txt)) {
	s/\s*\#.*//;

	if (/^\s*$/) {		# blank line
	    if ($isMe) {
		# That was our record. No need to read the rest.
		last;
	    }
	    $isMe = $isAnon = 0;
	    @meDisallowed = ();
	}
	elsif (/^User-agent:\s*(.*)\s*$/i) {
	    $ua = $1;
	    if ($isMe) {
		# This record already had a User-agent that
		# we matched, so just continue.
	    }
	    elsif($self->isMe($ua)) {
		$isMe = 1;
	    }
	    elsif ($ua eq '*') {
		$isAnon = 1;
	    }
	}
	elsif (/^Disallow:\s*(.*)\s*$/i) {
	    warn "Disallow without preceding User-agent" unless 
		defined $ua;

	    my $full_path;
	    if ($1 eq '') {
		$full_path = '';
	    }
	    else {
		my $abs = new URI::URL($1, $url);
		$full_path = $abs->full_path();
	    }

	    if ($isMe) {
		push(@meDisallowed, $full_path);
	    }
	    elsif ($isAnon) {
		push(@anonDisallowed, $full_path);
	    }
	}
	else {
	    warn "Unexpected line: $_\n";
	}
    }

    if ($isMe) {
	$self->{'rules'}{$hostport} = \@meDisallowed;
    }
    elsif (@anonDisallowed) {
	$self->{'rules'}{$hostport} = \@anonDisallowed;
    }
}

# isMe()
#
# Returns TRUE if the given name matches the
# name of this robot
#
sub isMe {
    my($self, $ua) = @_;

    my $me = $self->{'ua'};
    return $ua =~ /$me/i;
}

=head1 allowed($url)

Returns TRUE if this robot is allowed to retrieve this URL.

=cut

sub allowed {
    my($self, $url) = @_;

    $url = new URI::URL($url) unless ref($url);	# make it URL

    my $hostport = $url->host . ':' . $url->port;

    return 1 unless exists $self->{'rules'}{$hostport};

    my $str = $url->full_path;

    for $rule (@{ $self->{'rules'}{$hostport}}) {
	return 1 if ($rule eq '');
	return 0 if ($str =~ /^$rule/);
    }
    return 1;
}

1;

=head1 NAME

RobotRules.t

=head1 DESCRIPTION

Test a number of different A</robots.txt> files against a number
of different User-agents.

=cut

require WWW::RobotRules;
use Carp;
use strict;

print "1..26\n"; # for Test::Harness

# We test a number of different /robots.txt files,
# 

my $content1 = <<EOM;
# http://foo/robots.txt
User-agent: *
Disallow: /private
Disallow: http://foo/also_private

User-agent: MOMspider
Disallow: 
EOM

my $content2 = <<EOM;
# http://foo/robots.txt
User-agent: MOMspider
Disallow: /private
EOM

my $content3 = <<EOM;
# http://foo/robots.txt
EOM

my $content4 = <<EOM;
# http://foo/robots.txt
User-agent: *
Disallow: /private

User-agent: MOMspider
Disallow: /this

User-agent: Another
Disallow: /that
EOM

# and a number of different robots:

my @tests1 = (
	   [$content1, 'MOMspider' =>
	    1 => 'http://foo/private' => 1,
	    2 => 'http://foo/also_private' => 1,
	   ],

	   [$content1, 'Wubble' => 
	    3 => 'http://foo/private' => 0,
	    4 => 'http://foo/also_private' => 0,
	    5 => 'http://foo/other' => 1,
	   ],

	   [$content2, 'MOMspider' =>
	    6 => 'http://foo/private' => 0,
	    7 => 'http://foo/other' => 1,
	   ],

	   [$content2, 'Wubble' => 
	    8  => 'http://foo/private' => 1,
	    9  => 'http://foo/also_private' => 1,
	    10 => 'http://foo/other' => 1,
	   ],

	   [$content3, 'MOMspider' =>
	    11 => 'http://foo/private' => 1,
	    12 => 'http://foo/other' => 1,
	   ],

	   [$content3, 'Wubble' => 
	    13 => 'http://foo/private' => 1,
	    14 => 'http://foo/other' => 1,
	   ],

	   [$content4, 'MOMspider' =>
	    15 => 'http://foo/private' => 1,
	    16 => 'http://foo/this' => 0,
	    17 => 'http://foo/that' => 1,
	   ],

	   [$content4, 'Another' => 
	    18 => 'http://foo/private' => 1,
	    19 => 'http://foo/this' => 1,
	    20 => 'http://foo/that' => 0,
	   ],

	   [$content4, 'Wubble' => 
	    21 => 'http://foo/private' => 0,
	    22 => 'http://foo/this' => 1,
	    23 => 'http://foo/that' => 1,
	   ],

	   [$content4, 'Another/1.0' => 
	    24 => 'http://foo/private' => 1,
	    25 => 'http://foo/this' => 1,
	    26 => 'http://foo/that' => 0,
	   ],

           # when adding tests, remember to increase
           # the maximum at the top

	  );

my $t;

for $t (@tests1) {
    my ($content, $ua) = splice(@$t, 0, 2);

    my $robotsrules = new RobotRules($ua);
    $robotsrules->parse('http://foo/robots.txt', $content);

    my ($num, $path, $expected);
    while(($num, $path, $expected) = splice(@$t, 0, 3)) {
	
	if($robotsrules->allowed($path) != $expected) {
	    confess "Test Failed: $ua => $path";
	}
	print "ok $num\n";
    }
}