Re: URI and spidering unique docs

Stephen R. Wilcoxon (wilcoxon@bridge.com)
Thu, 29 Mar 2001 08:16:32 -0600


On Wed 2001/03/28 23:58:45 PST, Bill Moseley <moseley@hank.org> writes:
> Oh my, I'm writing yet another spider, for some reason.
> 
> I'd like to only spider documents one time.  So I'm using a hash of
> URI->canonical keys.
> 
> Although I realize these *could* be two different docs, they are not on our
> server:
>      http://localhost/path/to/my/file.html
>      http://localhost/path/to/../to/my/file.html
> 
> Any (URI?) tricks to seeing those as the same document?

I was just looking at this problem for File::Spec::Unix::canonpath.  There 
is no perfect solution.  The simple solution, if you know there are no 
symlinks or the like, is to s|/\w[^/]*/\.\./|/|g (of course, you may have 
to run that substitution a few times if there is something like 
some/path/../../etc, since a single pass can leave nested or adjacent ../ 
segments behind).
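
For what it's worth, here is a minimal sketch of the "run it a few times" 
idea, looping a single substitution until nothing is left to collapse.  The 
helper name collapse_dotdot is made up for illustration, and the sketch 
assumes it is handed only the path portion of the URL (on a full URL the 
pattern could match the host name as if it were a directory) and that there 
are no symlinks on the server.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse "dir/../" pairs in a URL path.
# Assumes no symlinks, so "dir/.." always means "stay where you were".
sub collapse_dotdot {
    my ($path) = @_;
    # Apply the substitution one match at a time (no /g) and repeat
    # until it stops matching, so nested "../.." runs are handled.
    1 while $path =~ s|/\w[^/]*/\.\./|/|;
    return $path;
}

print collapse_dotdot('/path/to/../to/my/file.html'), "\n";
print collapse_dotdot('/some/path/../../etc'), "\n";
```

Note that, like the original pattern, this only collapses directory names 
that start with a word character, so something like /.hidden/../ is left 
alone.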