Problem with WWW::RobotRules

Paul J. Schinder (schinder@leprss.gsfc.nasa.gov)
Fri, 25 Jul 1997 09:47:48 -0400


This is another one of those \n != \012 problems, although in this case the
root cause might be that my Apache server is misconfigured.  I was running
the following script under MacPerl with various $url's:

#!perl
use LWP::RobotUA;
use HTTP::Request;
use HTTP::Response;
use LWP::Debug qw(+);

$url = 'http://mors.gsfc.nasa.gov/index.html';
my $ua = new LWP::RobotUA('My_RobotUA', 'schinder@leprss.gsfc.nasa.gov');

$ua->use_alarm(1);
my $request = new HTTP::Request('GET', $url);
my $response = $ua->request($request);
print $ua->as_string;
if ($response->is_success) {
    print $response->headers_as_string;
    # Convert every line ending to the Mac's \015 before printing.
    ($stuff = $response->content) =~ s/\015\012|\015|\012/\015/g;
    print $stuff;
} else {
    print $response->error_as_HTML;
}

and discovered that it was happily grabbing pages that were disallowed by
robots.txt.

The problem is in WWW::RobotRules, line 97:

    for(split(/\n/, $txt)) {

where it splits on the local newline, which under MacPerl is \015. In this
case, $txt contained a \012-delimited robots.txt, and so it wasn't being
split at all.

To deal with the major possibilities, I've changed this to

    for(split(/\015\012|\012|\015/, $txt)) {

which, if I read the Camel correctly, should deal with the newlines no
matter what they are.
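
As a quick check of the pattern above, here is a minimal sketch that runs it
against the same two-line robots.txt in all three newline conventions. The
helper name split_lines and the sample rules are made up for illustration;
they are not part of WWW::RobotRules.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::RobotRules): split text on any of
# the three common line terminators. \015\012 must come first in the
# alternation so that a CRLF pair is consumed as a single terminator.
sub split_lines {
    my ($txt) = @_;
    return split(/\015\012|\012|\015/, $txt);
}

# The same two-line robots.txt in Unix, classic Mac, and CRLF form.
my %sample = (
    LF   => "User-agent: *\012Disallow: /private/\012",
    CR   => "User-agent: *\015Disallow: /private/\015",
    CRLF => "User-agent: *\015\012Disallow: /private/\015\012",
);

for my $name (sort keys %sample) {
    my @lines = split_lines($sample{$name});
    print "$name: ", scalar(@lines), " lines\n";
}
```

Each sample should come back as two lines, since split drops the trailing
empty field left by the final terminator.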

(My Mac HTTP server, which I checked, returned robots.txt delimited with
\015\012, but of course, the original WWW::RobotRules would have left a
\012 at the beginning of every string after the first and would have
misparsed it anyway under MacPerl.)
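
Both failure modes described in this message can be reproduced on any
platform with a small sketch. It assumes that splitting on /\015/ explicitly
is a fair stand-in for /\n/ under MacPerl; mac_split is a hypothetical name
and the sample rules are made up.

```perl
#!perl
use strict;
use warnings;

# Assumption: /\015/ stands in for MacPerl's /\n/, so this behaves the
# same way on any platform. mac_split is a hypothetical name.
sub mac_split {
    my ($txt) = @_;
    return split(/\015/, $txt);
}

my @lf   = mac_split("User-agent: *\012Disallow: /private/\012");
my @crlf = mac_split("User-agent: *\015\012Disallow: /private/\015\012");

# \012-delimited input contains no \015, so it comes back as one string.
print "LF input: ", scalar(@lf), " element(s)\n";

# \015\012-delimited input does split, but every element after the first
# starts with a stray \012.
for my $line (@crlf) {
    (my $shown = $line) =~ s/\012/\\012/g;
    print "CRLF element: $shown\n";
}
```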

---
Paul J. Schinder
NASA Goddard Space Flight Center
Code 693, Greenbelt, MD 20771
schinder@leprss.gsfc.nasa.gov