Re: URI and spidering unique docs

Gisle Aas (gisle@activestate.com)
29 Mar 2001 09:59:06 -0800


Bill Moseley <moseley@hank.org> writes:

> Oh my, I'm writing yet another spider, for some reason.
> 
> I'd like to only spider documents one time.  So I'm using a hash of
> URI->canonical keys.
> 
> Although I realize these *could* be two different docs, they are not on our
> server:
>      http://localhost/path/to/my/file.html
>      http://localhost/path/to/../to/my/file.html
> 
> Any (URI?) tricks to seeing those as the same document?

Untested extra canonicalization, to apply on top of $uri->canonical
(which leaves "." and ".." path segments alone):

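my $p = $uri->path;
# Loop each substitution until nothing is left to collapse.
1 while $p =~ s,/\./,/,;            # "/a/./b"    -> "/a/b"
1 while $p =~ s,/[^/]+/\.\./,/,;    # "/a/b/../c" -> "/a/c"
# A trailing "/." or "/.." is not handled here.
$uri->path($p);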
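
The same idea can also be done segment by segment instead of with regexes,
which makes it easier to handle a trailing "." or "..". This is only a
sketch: collapse_dots is a made-up helper name, not part of URI.pm, and it
assumes the path is absolute (starts with "/").

```perl
use strict;
use warnings;

# Hypothetical helper (not part of URI.pm): resolve "." and ".." segments
# in an absolute URL path by walking the segments with a stack.
sub collapse_dots {
    my ($path) = @_;
    my @out;
    # Limit -1 keeps trailing empty fields, so a trailing slash survives.
    my @seg = split m{/}, $path, -1;
    shift @seg;    # drop the empty field before the leading "/"
    for my $i (0 .. $#seg) {
        my $s    = $seg[$i];
        my $last = ($i == $#seg);
        if ($s eq '.') {
            push @out, '' if $last;    # "/a/." becomes "/a/"
        }
        elsif ($s eq '..') {
            pop @out;                  # no-op when already at the root
            push @out, '' if $last;    # "/a/b/.." becomes "/a/"
        }
        else {
            push @out, $s;
        }
    }
    return '/' . join('/', @out);
}

print collapse_dots('/path/to/../to/my/file.html'), "\n";
# prints /path/to/my/file.html
```

With a URI object this would be used as $uri->path(collapse_dots($uri->path))
before taking $uri->canonical as the hash key. Empty segments ("//") are
left untouched, since a server may treat them as significant.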

Regards,
Gisle