Re: character set of scheme names?

Gisle Aas (aas@bergen.sn.no)
16 Apr 1997 06:27:24 +0200


Thomas Richter <richter@chemie.fu-berlin.de> writes:

> I am confused about the correct character set for scheme names.
> The confusion started when I tried to use LinkExtor on a (foreign)
> document containing references like this one:
> 
> <a href="95_11_02_04:16:30.20">95_11_02_04:16:30.20 (5326 bytes)</a>

They should have used "./95_11_02_04:16:30.20" or encoded the ":" with
"%3A".

> This was obviously meant to be a relative URL.  But URI::URL uses
> [.+\-\w]+: in varying forms to recognize scheme names, so it sees a
> scheme of 95_11_02_04: and treats this as an absolute URL.
> 
> Ok, trying to learn something I searched the relevant RFCs (1738 and
> 1808) and the related draft.  And in fact I learned something.  They
> told me that the allowed character set for a scheme name is
> [.+\-A-Za-z0-9]+:  Note that \w differs from A-Za-z0-9 because of the
> inclusion of the underscore '_' and the locale treatment.

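Yes, this is a bug.  URI::URL should not use \w to recognize scheme
names, since \w also matches "_" and locale-dependent letters.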
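
To make the difference concrete, here is a small sketch in Python
(not the actual URI::URL code, which is Perl).  The two patterns are
the ones quoted above; the names LOOSE, STRICT and scheme_of are made
up for this illustration.

```python
import re

# Scheme pattern as quoted above for URI::URL; \w also admits "_"
# (and, in Python 3, non-ASCII letters and digits).
LOOSE = re.compile(r"[.+\-\w]+:")
# Scheme pattern using the character set quoted from the RFCs.
STRICT = re.compile(r"[.+\-A-Za-z0-9]+:")

def scheme_of(pattern, url):
    """Return the scheme the pattern sees at the start of url, or None."""
    m = pattern.match(url)  # match() anchors at the start of the string
    return m.group(0)[:-1] if m else None

for url in ("95_11_02_04:16:30.20",
            "./95_11_02_04:16:30.20",
            "http://www.example.org/"):
    print(url, scheme_of(LOOSE, url), scheme_of(STRICT, url))
```

Only the loose pattern reports a scheme (95_11_02_04) for the first
string.  Neither pattern sees a scheme in the "./" form, and both see
"http" in the last one.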

> I started to write a bug report, when I remembered that I had better
> look in my archive of the libwww mailing list.  Guess what I found?  A regular
> expression by Roy T. Fielding, author of RFC1808, to parse URLs, which
> on a second look also appears in the latest URL drafts:
> 
>   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
> 
> Now, [^:/?#]+: is obviously quite different from both other expressions,
> which makes my confusion complete.
> 
> Hoping for someone to enlighten me.

The new regexp is not meant to validate scheme names.  It is an
attempt to capture current practice and to simplify URL parsing, so it
accepts more than the RFC grammar allows.
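
For illustration, this is roughly how that regexp splits a reference,
again sketched in Python.  The regexp is the one quoted above; the
helper name split_ref and the dictionary keys are my own labels for
the capture groups, not something taken from the draft text quoted
here.

```python
import re

# The regexp quoted above, unchanged.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_ref(ref):
    """Split a URL reference into five parts using the quoted regexp."""
    m = URI_RE.match(ref)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

for ref in ("95_11_02_04:16:30.20",
            "./95_11_02_04:16:30.20",
            "95_11_02_04%3A16%3A30.20"):
    parts = split_ref(ref)
    print(ref, parts["scheme"], parts["path"])
```

The first reference comes out with scheme 95_11_02_04 and path
16:30.20, so this regexp also treats it as absolute.  The "./" and
"%3A" spellings come out with no scheme and the whole string as the
path.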

-- 
Gisle Aas <aas@sn.no>