character set of scheme names?
Thomas Richter (richter@chemie.fu-berlin.de)
Tue, 15 Apr 1997 22:11:12 +0200
I am confused about the correct character set for scheme names.
The confusion started when I tried to use LinkExtor on a (foreign)
document containing references like this one:
<a href="95_11_02_04:16:30.20">95_11_02_04:16:30.20 (5326 bytes)</a>
This was obviously meant to be a relative URL. But URI::URL uses
[.+\-\w]+: in varying forms to recognize scheme names, so it sees a
scheme of 95_11_02_04: and treats this as an absolute URL.
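
A minimal sketch of the effect, written in Python rather than Perl and
not taken from URI::URL itself. It assumes the pattern is applied at the
start of the string; the helper name loose_scheme is made up for this
illustration.

```python
import re

# Loose scheme pattern as quoted above. Anchoring at the start of the
# string (via match()) is an assumption about how it is applied.
LOOSE_SCHEME = re.compile(r"([.+\-\w]+):")

def loose_scheme(url):
    """Return the scheme the loose pattern detects, or None."""
    m = LOOSE_SCHEME.match(url)
    return m.group(1) if m else None

# The href from the document above: \w accepts '_', so the whole
# prefix up to the first ':' is taken as a scheme.
print(loose_scheme("95_11_02_04:16:30.20"))  # 95_11_02_04
print(loose_scheme("http://www.example.org/"))  # http
```

With this pattern, any relative reference whose first segment contains
a colon after word characters would be read as absolute.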
OK, trying to learn something, I searched the relevant RFCs (1738 and
1808) and the related draft. And in fact I learned something. They
told me that a scheme name would have to match [.+\-A-Za-z0-9]+:
that is, only letters, digits, '+', '.' and '-'. Note that \w differs
from A-Za-z0-9 because of the inclusion of the underscore '_' and the
locale treatment.
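
The difference can be sketched as follows, again in Python and again
assuming the pattern is anchored at the start of the string. The name
strict_scheme is hypothetical. Python's \w on text strings is
Unicode-aware by default, which here stands in for Perl's locale
treatment.

```python
import re

# Strict scheme pattern built from the character set described above.
# Anchoring at the start of the string is an assumption.
STRICT_SCHEME = re.compile(r"([.+\-A-Za-z0-9]+):")

def strict_scheme(url):
    """Return the scheme under the strict character set, or None."""
    m = STRICT_SCHEME.match(url)
    return m.group(1) if m else None

# The underscore is outside the strict set, so no scheme is found
# and the reference would be treated as relative.
print(strict_scheme("95_11_02_04:16:30.20"))  # None
print(strict_scheme("http://www.example.org/"))  # http

# \w accepts '_' and (in Unicode mode) non-ASCII letters; the
# explicit class accepts neither.
print(bool(re.fullmatch(r"\w", "_")))            # True
print(bool(re.fullmatch(r"\w", "\u00e9")))       # True
print(bool(re.fullmatch(r"[A-Za-z0-9]", "_")))   # False
```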
I started to write a bug report, when I remembered that I had better
look in my archive of the libwww mailing list. Guess what I found? A
regular expression by Roy T. Fielding, author of RFC1808, to parse
URLs, which on second look also appears in the latest URL drafts:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Now, [^:/?#]+: is obviously quite different from both other expressions,
which makes my confusion complete.
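
To see what that expression does with the reference in question, here
is a small sketch in Python. The regular expression is copied from
above; the function split_url and the dictionary keys are names chosen
for this illustration, and the mapping of groups to components follows
the parentheses in the expression.

```python
import re

# The expression quoted above, unchanged.
URL_PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_url(url):
    """Split a URL into components by group number.

    Every part of the expression is optional, so it matches any
    string; components that are absent come back as None.
    """
    m = URL_PARTS.match(url)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_url("95_11_02_04:16:30.20")
print(parts["scheme"])  # 95_11_02_04
print(parts["path"])    # 16:30.20
```

So under this expression, too, everything before the first colon comes
back as the first component, as long as no '/', '?' or '#' precedes it.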
Hoping for someone to enlighten me.
Thomas Richter
--
<richter@chemie.fu-berlin.de>