Problem with WWW::RobotRules
Paul J. Schinder (schinder@leprss.gsfc.nasa.gov)
Fri, 25 Jul 1997 09:47:48 -0400
This is another one of those \n != \012 problems, although in this case the
root cause might be that my Apache server is misconfigured. I was running
the following script under MacPerl with various $url's:
#!perl
use LWP::RobotUA;
use HTTP::Request;
use HTTP::Response;
use LWP::Debug qw(+);
$url = 'http://mors.gsfc.nasa.gov/index.html';
my $ua = new LWP::RobotUA('My_RobotUA', 'schinder@leprss.gsfc.nasa.gov');
$ua->use_alarm(1);
my $request = new HTTP::Request('GET', $url);
my $response = $ua->request($request);
print $ua->as_string;
if ($response->is_success) {
    print $response->headers_as_string;
    ($stuff = $response->content) =~ s/\015\012|\015|\012/\015/g;
    print $stuff;
} else {
    print $response->error_as_HTML;
}
and discovered that it was happily grabbing pages that were disallowed by
robots.txt.
The problem is in WWW::RobotRules, line 97:
for(split(/\n/, $txt)) {
where it splits on the local newline, which under MacPerl is \015. In this
case, $txt contained a \012-delimited robots.txt, and so it wasn't being
split at all.
To deal with the major possibilities, I've changed this to
for(split(/\015\012|\012|\015/, $txt)) {
which, if I read the Camel correctly, should deal with the newlines no
matter what they are.
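
As a rough illustration of the difference, here is a minimal standalone
sketch. It does not touch WWW::RobotRules itself; split_lines is a
hypothetical helper name, and the explicit /\015/ stands in for what /\n/
means under MacPerl, so the script behaves the same on any platform.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::RobotRules): split text into lines
# whether they end in CRLF, LF, or CR.
sub split_lines {
    my ($txt) = @_;
    # \015\012 is listed first so a CRLF pair is consumed as one delimiter
    # rather than as two separate line breaks.
    return split /\015\012|\012|\015/, $txt;
}

# An LF-delimited robots.txt body, like the one the Apache server returned.
my $body = "User-agent: *\012Disallow: /private/\012";

my @naive = split /\015/, $body;    # stand-in for /\n/ under MacPerl
my @fixed = split_lines($body);

printf "naive: %d line(s), fixed: %d line(s)\n",
    scalar @naive, scalar @fixed;
```

The naive split finds no \015 and hands back the whole file as a single
"line", which is why no Disallow rule was ever seen.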
(My Mac HTTP server, which I checked, returned robots.txt delimited with
\015\012, but of course, the original WWW::RobotRules would have left a
\012 at the beginning of every string after the first and would have
misparsed it anyway under MacPerl.)
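
The CRLF case can be sketched the same way. split_on_cr is a hypothetical
stand-in for split(/\n/, $txt) as MacPerl evaluates it, and the anchored
/^Disallow:/ test is only an assumed example of a line-oriented match, not
a quote from the module.

```perl
#!perl
use strict;
use warnings;

# Hypothetical stand-in for split(/\n/, $txt) under MacPerl, where \n is \015.
sub split_on_cr {
    my ($txt) = @_;
    return split /\015/, $txt;
}

# A CRLF-delimited robots.txt body.
my $crlf  = "User-agent: *\015\012Disallow: /tmp/\015\012";
my @lines = split_on_cr($crlf);

for my $line (@lines) {
    (my $shown = $line) =~ s/\012/\\012/g;    # make the stray LF visible
    # Assumed line-anchored pattern, for illustration only.
    my $hit = $line =~ /^Disallow:/ ? "matches" : "no match";
    print "[$shown] $hit\n";
}
```

Every piece after the first starts with a leftover \012, so a pattern
anchored at the start of the string never sees the Disallow line.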
---
Paul J. Schinder
NASA Goddard Space Flight Center
Code 693, Greenbelt, MD 20771
schinder@leprss.gsfc.nasa.gov