Lessons learned: writing a linkcheck script w/ LWP
Phil Mitchell (philip_mitchell@harvard.edu)
Tue, 20 Feb 2001 17:19:29 -0500
I recently wrote a script to validate a list of about 10,000 URLs that are
embedded in the Harvard Library catalog. Although LWP out of the box does
fine on the vast majority of these, it fails on a few percent, which in my
case added up to hundreds of spurious bad-URL reports. Here is what I
learned while chasing down that few percent; I thought others might find it
useful:
1. As I posted to this list previously, there is some kind of interaction
between Solaris and certain HTTP servers in which the termination character
of the HTTP response is dropped. LWP keeps waiting for the termination
character until it times out, so you need some way to recover what is
already in the response buffer when that happens. What I did was use
Net::Telnet in these cases to re-send the GET, b/c Net::Telnet exposes the
input_log even when it times out.
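Once the raw bytes of a timed-out response have been captured (for example
from the Net::Telnet input log described in (1)), the status can still be
read from the first line even though the response never terminated
properly. A minimal sketch, assuming the captured text is already in a
string; status_from_partial is a hypothetical helper name, not part of
Net::Telnet or LWP, and not taken from the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: read the three-digit status code from a raw,
# possibly truncated HTTP response. Returns undef if the text does not
# start with an HTTP status line.
sub status_from_partial {
    my ($raw) = @_;
    return undef unless defined $raw;
    return $raw =~ m{\AHTTP/\d+\.\d+[ \t]+(\d{3})\b} ? $1 : undef;
}

# A response whose body was cut off still yields its status code.
my $partial = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>trunc";
my $code = status_from_partial($partial);
print defined $code ? "status $code\n" : "no status line\n";
```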
2. It took me a while to realize that when you create a GET request using
the HTTP module, the protocol does not default to HTTP/1.0; you have to ask
for it explicitly. A fair number of spurious errors result from not using
HTTP/1.0.
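One way to ask for HTTP/1.0 on a request object is the protocol accessor of
HTTP::Request. This is a sketch, assuming that is the module meant in (2);
the URL is a placeholder, and whether the transport layer actually puts this
version on the wire depends on the LWP release, so check it against your
installation:

```perl
use strict;
use warnings;
use HTTP::Request;

# Build a GET request and set the protocol version explicitly.
my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->protocol('HTTP/1.0');

# The request line now carries the version string.
print $req->as_string;
```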
3. I spent some effort determining the best combination of timeout and
retry parameters. My conclusion is that your agent timeout should be about
30 sec. Increasing it to, say, 60 sec doesn't really help, and it can add a
lot to the running time of your script. OTOH, setting it as low as 10 sec
will cause a lot of spurious errors. It is important to spread retries over
a fairly wide span of time, preferably more than 24 hours. The current
settings on my script are to recheck errors about twenty times, spread over
about 24 hours. This gives a very high degree of protection from spurious
error reports, at the risk of more errors going unreported; the tradeoff is
inevitable.
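The retry settings in (3) could be scheduled along these lines. This is only
a sketch: retry_schedule is a hypothetical helper, and the even spacing is
an assumption, since the post does not say how the twenty rechecks were
distributed. The 30-second timeout itself would be passed to the agent as
LWP::UserAgent->new(timeout => 30).

```perl
use strict;
use warnings;

# Hypothetical helper: return $retries offsets, in seconds after the
# first failure, spaced evenly so the last one falls at $span seconds.
sub retry_schedule {
    my ($retries, $span) = @_;
    my $step = $span / $retries;
    return map { int($_ * $step) } 1 .. $retries;
}

# Twenty rechecks spread over 24 hours: one every 72 minutes.
my @offsets = retry_schedule(20, 24 * 60 * 60);
printf "recheck %2d at +%5d sec\n", $_ + 1, $offsets[$_] for 0 .. $#offsets;
```

A URL would be reported as bad only if every recheck in the schedule fails.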
4. Although the standard LWP user agent request will follow redirects for
you, b/c of the problem mentioned in (1), I wound up using simple requests
and handling all redirection myself. This proved non-trivial. There are two
different types of redirect (HTTP header and meta refresh tag), and both
can lead to page cycles (i.e. closed loops of URLs). I identified five
cases:
a. simple redirect using HTTP header;
b. redirect to same page, using HTTP header, to set cookies;
c. simple redirect using <meta http-equiv="refresh"...> tag;
d. refresh to same page using <meta http-equiv="refresh"...> tag;
e. refresh to a series of pages using <meta http-equiv="refresh"...> tag;
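The cases above can be handled with one loop that follows both kinds of
redirect and counts visits per URL. This is a sketch, not the original
script: meta_refresh_target and resolve are hypothetical names, $fetch is a
caller-supplied code ref (in a real checker it would wrap
LWP::UserAgent's simple_request with a cookie jar), and it is assumed to
return absolute URLs.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the target URL from a
# <meta http-equiv="refresh" content="N; url=..."> tag, or undef.
sub meta_refresh_target {
    my ($html) = @_;
    my ($tag) = $html =~ /(<meta\b[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>)/i
        or return;
    return $tag =~ /content\s*=\s*["'][^"']*?url\s*=\s*([^"'>\s]+)/i ? $1 : undef;
}

# Follow header redirects and meta refreshes from $url.
# $fetch->($url) must return (status, location_header, body).
# Returns (state, last_url), state being ok, error, cycle or too_many_hops.
sub resolve {
    my ($url, $fetch, $max_hops) = @_;
    $max_hops = 10 unless defined $max_hops;
    my %seen;
    for my $hop (0 .. $max_hops) {
        # Allow one revisit so that a self-redirect that only sets a
        # cookie (case b) gets a second chance before counting as a cycle.
        return ('cycle', $url) if ++$seen{$url} > 2;
        my ($status, $location, $body) = $fetch->($url);
        my $next;
        if ($status =~ /^30[1237]$/ && defined $location) {
            $next = $location;                                  # cases a, b
        }
        elsif ($status == 200) {
            $next = meta_refresh_target(defined $body ? $body : '');  # c, d, e
        }
        return ($status == 200 ? 'ok' : 'error', $url) unless defined $next;
        $url = $next;
    }
    return ('too_many_hops', $url);
}

# Usage with a canned set of pages instead of real network requests.
my %pages = (
    'http://a/' => [302, 'http://b/', ''],
    'http://b/' => [200, undef, '<meta http-equiv="refresh" content="0; url=http://c/">'],
    'http://c/' => [200, undef, '<html>done</html>'],
);
my $fetch = sub { @{ $pages{ $_[0] } } };
my ($state, $final) = resolve('http://a/', $fetch);
print "$state $final\n";
```

How to classify a cycle is left to the caller; a page that only refreshes to
itself (case d) is arguably a live link rather than a bad one.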