Re: URI::URL->abs bug? (libwww-Perl5)

Boris Statnikov (boris@blaze.cs.jhu.edu)
Wed, 16 Jul 1997 12:23:55 -0400 (EDT)


As Mr. Schwartz pointed out, this error is not, strictly speaking, due to
libwww.  Once again, my problem would sometimes occur with a relative path
such as
../../../index.html for a base URL of
http://someserver.com/~user/directoryname/
which is transformed into 
http://someserver.com/../index.html 
by invoking the abs() method on the URL.  A request for this URL should
fail (as it makes no sense), but on NCSA servers it does not, as I found
on closer investigation.  This is true for every version between 1.3 and
1.5.2, on each of the four or five servers I have tried.
They treat this request as if it were
http://someserver.com/index.html

This can lead to infinite loops if your robot uses only the URL to decide
which pages it has already visited: every extra ../ gives a new URL string
for the same document, so the robot and the server are in implicit
disagreement over the real root of the search tree.

As I see it, there are three solutions, short of fixing (or breaking,
depending on your point of view) the laxity of NCSA:

1. Change the abs() method of URI::URL to return
http://someserver.com/index.html 
(making it consistent with the server).

2. Same as above, but in an extra method, keeping abs() as it is now.

3. Use some other scheme for determining which URLs have been visited,
such as a CRC-32 checksum of a document's link contents or something
similar.  This is not bulletproof, but as good as, and seems like a good
solution to me.
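
To make options 1 and 2 concrete, here is a minimal sketch of the kind of
clean-up I have in mind.  It is written in Python purely for illustration
and is not part of URI::URL; the function name clamp_to_root is
hypothetical.  It assumes that any ".." which would climb above the
server root should simply be discarded, which is what the NCSA servers
appear to do.

```python
from urllib.parse import urlsplit, urlunsplit


def clamp_to_root(url):
    """Resolve "." and ".." in the path, discarding any ".." above the root.

    Hypothetical helper: it mirrors the lax server behaviour described
    above. Empty path segments ("//") are collapsed as a side effect.
    """
    parts = urlsplit(url)
    resolved = []
    for segment in parts.path.split("/"):
        if segment == "..":
            # Step up one level, but never above the root.
            if resolved:
                resolved.pop()
        elif segment not in ("", "."):
            resolved.append(segment)
    path = "/" + "/".join(resolved)
    # Preserve the trailing slash of a directory-style path.
    if resolved and parts.path.endswith(("/", "/.", "/..")):
        path += "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(clamp_to_root("http://someserver.com/../index.html"))
```

A robot could apply such a function to every absolute URL before checking
it against its list of visited URLs, so that all the ../ variants map to
one entry.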

Considering that I need to do something now, I'll probably use the third
approach myself, but I would appreciate any suggestions or hints.  If I am
blatantly wrong about anything, please let me know.
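
A rough sketch of the checksum idea, again in Python for illustration
only.  The class name SeenDocuments is hypothetical, and I am assuming
here that the checksum is taken over the raw bytes of the fetched
document; checksumming only the extracted links would work the same way.
Two different documents can share a CRC-32 value, so this can
occasionally skip a page it has not really seen.

```python
import zlib


class SeenDocuments:
    """Remember documents by the CRC-32 of their content, not by URL."""

    def __init__(self):
        self._checksums = set()

    def first_visit(self, body):
        """Return True if this content is new, False if seen before.

        `body` is the document content as bytes (an assumption of this
        sketch; a robot might checksum only the extracted links).
        """
        checksum = zlib.crc32(body)
        if checksum in self._checksums:
            return False
        self._checksums.add(checksum)
        return True


seen = SeenDocuments()
page = b"<html><a href='../index.html'>up</a></html>"
print(seen.first_visit(page))  # content not seen before
print(seen.first_visit(page))  # same content fetched via another URL
```

The robot would follow the links of a page only when first_visit()
reports new content, which breaks the loop no matter how many ../ the
URL has accumulated.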

For a server which exhibits this behavior (if you can't repeat this on
your favorite NCSA server), try 'http://hopkins.med.jhu.edu/../top.html'.
Any number of ../ at the tail will work just as well (my robot went up to
40-50 :-).

Boris