Re: comments in robots.txt - bug in RobotRules.pm??
Randy Fischer (fischer@ucet.ufl.edu)
Thu, 30 Jan 1997 12:03:20 -0500
David L. Sifry wrote:
> >I agree with you. I do a HEAD request before any GET to check MIME type
> >and timestamp,
>
Martijn Koster <m.koster@webcrawler.com> ponders:
> Ehr... why? If you do a GET with relevant If-modified-since and Accept
> headers you achieve the same with a single transaction.
>
I've found that enough web servers out there don't do the Right Thing
that the HEAD-request-first method is more reliable. I've sent Accept
headers for text/html and gotten back GIFs (I don't remember which
server).
You have to check everything.
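A minimal sketch of the HEAD-first check discussed above, in Python. The
function names (media_type, looks_like_html, fetch_html) and the GIF
signature test are illustrative assumptions, not code from any robot
mentioned in this thread. The idea is to trust neither the Accept header
nor the reported Content-Type alone, and to look at the first bytes of the
body as well.

```python
import urllib.request


def media_type(content_type_header):
    """Return the bare media type, lowercased, without parameters."""
    if not content_type_header:
        return ""
    return content_type_header.split(";", 1)[0].strip().lower()


def looks_like_html(content_type_header, body_prefix=b""):
    """Check the declared type and, if given, the first bytes of the body."""
    if media_type(content_type_header) != "text/html":
        return False
    # GIF files start with the ASCII signature GIF87a or GIF89a.
    if body_prefix.startswith((b"GIF87a", b"GIF89a")):
        return False
    return True


def fetch_html(url, timeout=10):
    """HEAD first, then GET; return the body, or None if it is not HTML."""
    headers = {"Accept": "text/html"}
    head = urllib.request.Request(url, method="HEAD", headers=headers)
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        if not looks_like_html(resp.headers.get("Content-Type")):
            return None
    get = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(get, timeout=timeout) as resp:
        body = resp.read()
        # Re-check on the GET: the server may answer HEAD and GET differently.
        if not looks_like_html(resp.headers.get("Content-Type"), body[:6]):
            return None
        return body
```

This costs two transactions per document, which is the trade-off being
debated; the second check on the GET response is there because a server
that ignores Accept may also misreport the type.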
As it is, so many servers out there use server-side includes or some
other dynamic method (usually to insert the next advert GIF) that
If-Modified-Since doesn't really help. Workarounds are usually required.
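One possible workaround, sketched in Python under stated assumptions: the
post does not say which workarounds were used, so the approach here
(fingerprinting the page after stripping <img> tags and collapsing
whitespace) and the names content_fingerprint and has_changed are
hypothetical. It only illustrates how a robot might detect real changes
when the server regenerates the page on every request.

```python
import hashlib
import re

# Matches any <img ...> tag; rotating advert images are assumed to be the
# only part of the page that changes between otherwise identical fetches.
IMG_TAG = re.compile(rb"<img\b[^>]*>", re.IGNORECASE)


def content_fingerprint(body):
    """Hash the page with <img> tags removed and whitespace collapsed."""
    stripped = IMG_TAG.sub(b"", body)
    normalized = b" ".join(stripped.split())
    return hashlib.sha256(normalized).hexdigest()


def has_changed(body, previous_fingerprint):
    """Return True if the page differs from the stored fingerprint."""
    return content_fingerprint(body) != previous_fingerprint
```

A robot would store the fingerprint from the previous visit and compare it
after each fetch, rather than relying on the Last-Modified timestamp.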
I have primarily written robots for retrieving recent material from
the large news sites.
Randy Fischer