Re: comments in robots.txt - bug in RobotRules.pm??

Andrew Daviel (andrew@andrew.triumf.ca)
Tue, 28 Jan 1997 14:44:35 -0800 (PST)


On 26 Jan 1997, Gisle Aas wrote:

> > Also, I think in LWP4 that getting robots.txt was a "freebie" (not counted
> > against waiting time), but in LWP5 it isn't free. Since $ua->host_wait
> > returns zero for an unvisited site, this is somewhat irritating and
> > I've changed it (if host_wait > 0, go somewhere else first).
> 
> I don't follow you.  Care to explain it once more?


In LWP4 wwwbot.pl it says:
"# (Note - when the program retrieves /robots.txt, the program
# is not penalized and can perform a immediate request.  Retrieving
# /robots.txt (if it exists) is a freebee..)"

In my code I check host_wait. If it's zero, I get a page. If not, I do
something else for a bit. However, when I used robot_ua in LWP5 to get a page
from a site I hadn't been to before (at least this session - I'm not using the
robot-visited database yet), it made me wait 60 seconds, even though the
value of host_wait on entry was zero, because it first did a GET for
robots.txt and "charged" me for it. If I do set up the database, it's not
so much of a problem (except it's another chunk of disk space on top of
the existing stuff for content, URLs etc.).
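
To make the difference concrete, here is a minimal sketch of the two
accounting policies, written as a Python simulation with explicit
timestamps rather than real requests. It is not LWP code: the class
HostClock, its fetch method, and the charge_robots_txt flag are
hypothetical names invented for illustration, and the 60-second delay is
just the figure mentioned above.

```python
class HostClock:
    """Hypothetical per-host politeness clock (not the LWP API)."""

    def __init__(self, delay=60.0, charge_robots_txt=True):
        self.delay = delay                          # seconds between requests to one host
        self.charge_robots_txt = charge_robots_txt  # False = robots.txt is a "freebie"
        self.next_allowed = {}                      # host -> earliest time of next request
        self.have_rules = set()                     # hosts whose robots.txt was fetched

    def host_wait(self, host, now):
        """Seconds until the host may be contacted again (0 if never visited)."""
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def fetch(self, host, now):
        """Simulate a page fetch; return the time the page request is actually sent."""
        if host not in self.have_rules:
            # First contact: robots.txt is retrieved before the page itself.
            self.have_rules.add(host)
            if self.charge_robots_txt:
                self.next_allowed[host] = now + self.delay
        start = now + self.host_wait(host, now)
        self.next_allowed[host] = start + self.delay
        return start


charged = HostClock(charge_robots_txt=True)
free = HostClock(charge_robots_txt=False)
print(charged.host_wait("example.org", 0.0))  # 0.0 on entry
print(charged.fetch("example.org", 0.0))      # page request is delayed
print(free.fetch("example.org", 0.0))         # page request goes out at once
```

With the charged policy, host_wait reports zero for an unvisited host
but the page request still goes out one full delay later, which is the
surprise described above. With the free policy the first page follows
robots.txt immediately.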

What are people's thoughts on the robot wait time, using HEAD vs. GET,
etc.?
It seems to me that the robot rules were written back in the days 
of single-threaded httpd like NCSA 1.1, and that agents now might not 
unreasonably send a small flurry of requests (like browsers do).
As I recall, I do HEADs to check timestamps and MIME type and GETs to 
update, and wait longer between GETs than HEADs.
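
One way to express "wait longer between GETs than HEADs" is a
per-method delay table. The sketch below is a Python simulation with
hypothetical names (MethodAwareClock, record, wait_for), and the 10 and
60 second values are assumed for illustration only. It also assumes the
wait depends on the method of the request just sent, which is one
possible reading of the scheme.

```python
# Assumed delays in seconds; the actual values used are not stated above.
DEFAULT_DELAYS = {"HEAD": 10.0, "GET": 60.0}


class MethodAwareClock:
    """Hypothetical per-host clock whose delay depends on the last method used."""

    def __init__(self, delays=None):
        self.delays = dict(delays or DEFAULT_DELAYS)
        self.next_allowed = {}  # host -> earliest time of next request

    def wait_for(self, host, now):
        """Seconds until the host may be contacted again."""
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record(self, host, method, now):
        """Note that a request with the given method was sent at time `now`."""
        self.next_allowed[host] = now + self.delays[method]


clock = MethodAwareClock()
clock.record("example.org", "HEAD", 0.0)   # cheap timestamp/MIME check
print(clock.wait_for("example.org", 5.0))  # short wait remains
clock.record("example.org", "GET", 10.0)   # full update
print(clock.wait_for("example.org", 20.0)) # longer wait remains
```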

Andrew