Further enhancements to the URL module

Tim Bunce (Tim.Bunce@ig.co.uk)
Tue, 21 Mar 1995 01:53:40 +0000


	=head1 HOW AND WHEN TO ESCAPE
	
	An edited extract from a URI specification: [...skipped...]
	
	Notes:
	
	    A URL string *must*, by definition, consist of escaped components.
	
	    The components of a URL string must be *individually* escaped.
	    Each component of a URL may have a separate requirements regarding
	    what must be escaped, and those requirements are also dependent on
	    the URL scheme.
	
	    Never escape an already escaped component string.
	
	This implementation expects an escaped URL string to be passed to
	C<new> and will return an escaped URL string from C<str>.
	
	Internally the URL is stored in it's component parts. These parts are
	*not* escaped. Individual components must be manipulated in unescaped
	form (this is most natural anyway).
	
	The escaping applied to a URL when it is constructed by C<str> (or
	C<full_path>) can be controlled by using the C<unsafe> method to
	specify which characters should be treated as unsafe.

Also:

	$url->str now caches the last url string produced. Setting a
	component value via $url->elem (etc) or altering the unsafe
	chars clears the cached value. This virtually eliminates the
	cost of escaping and formatting the URL after the first time.

	A base URL passed to new() will now only be parsed into a URL
	object if required. Otherwise it's just stored as a string.
	This reduces the cost of adding a fall-back base URL of just
	'http:' (for example) to many calls to new().

	All the tests have been reintegrated and tidied up via the
	new _check method. They now automatically give a useful
	message, a URL dump and a stack dump.


Can we freeze this soon ?


I've taken the liberty of calling it URI::URL :-)


Enjoy!

Regards,
Tim.
----------
X-Sun-Data-Type: default-app
X-Sun-Data-Description: default
X-Sun-Data-Name: URL.pm
X-Sun-Content-Lines: 1191

#!/usr/local/bin/perl -w

package URI::URL;

$rcsid = '$Id: URL.pm,v 1.10 1995/03/21 01:49:09 timbo Exp timbo $';
$rcsid = $rcsid; # shut up -w

#####################################################################

=head1 NAME

URI::URL - Uniform Resource Locators (absolute and relative)

=head1 SYNOPSIS

    use URI::URL;

    # Constructors

    $url1 = new URI::URL 'http://www.nr.no/%7Euser/gisle.gif';
    $url2 = $url1->clone;

    $url3 = new URI::URL 'gisle.gif';
    $url3->base('http://www.nr.no/%7Euser');

    $url4 = $url3->abs;	# get absolute url using base

    $url5 = $url3->abs('http://a.n.other.com/');

    $url6 = newlocal URI::URL;                # pwd
    $url7 = newlocal URI::URL '/tmp';         # dir
    $url8 = newlocal URI::URL '/etc/motd';    # file

    # Printing

    $urlstr = $url->str;	# complete escaped URL string
    $fp = $url->full_path;  # escaped path+query+params+frag

    # Retrieving Generic-RL components:

    $scheme   = $url->scheme;
    $netloc   = $url->netloc;  # see user,password,host,port below
    $path     = $url->path;
    $params   = $url->params;
    $query    = $url->query;
    $frag     = $url->frag;

    # Retrieving Network location (netloc) components:

    $user     = $url->user;
    $password = $url->password;
    $host     = $url->host;
    $port     = $url->port;	# returns default if not defined

    # Retrieving other attributes:

    $base     = $url->base;

    # Setting fields:
    # All methods above can set field values for example:
    $url->scheme('http');
    $url->host('www.w3.org');
    $url->path('/welcome.html');
    $url->query('protocol info');

    # Specify unsafe characters to be escaped for this url
    $url->unsafe('\x00-\x20"\$#%;<>?\x7E-\xFF');

    # General method to get/set field values:

    $value  = $url->elem($name [, $new_value ]);

    $defport= $url->default_port;  # default port for scheme


    # Escaping functions (See 'HOW AND WHEN TO ESCAPE' below)

    $escaped = uri_escape($component);
    $component = uri_unescape($escaped);


=head1 DESCRIPTION

URL objects represent URLs (RFC 1738). Both absolute and relative
URL's are supported.

URL objects are created by C<new>, which takes a string
representation of a URL or an existing URL object reference to be
cloned. Specific individual elements can then be accessed via the
C<scheme>, C<user>, C<password>, C<host>, C<port>, C<path>,
 C<params>, C<query> and C<frag> methods. These methods can be 
called with a value to set the element to that value, and always 
return the old value. The C<elem> method provides a general
interface to access any element by name but it should be used with
caution: the effect of using incorrect spelling and case is
undefined.

The C<abs> function attempts to return a new absolute URL object
for a given URL.  In order to convert a relative URL into an absolute
one a I<base> URL is required. You can associate a default base with
a URL either by passing a C<base> to the C<new> method when a URL is
created or using the C<base> method on the object later.
Alternatively you can specify a once-off base as a parameter to the
C<abs> method.

The C<new> method must be able to determine the scheme for the URL.
If a scheme is not specified in the URL it will use the scheme
specified by the parent URL. If no parent URL scheme is defined
then the C<new> will fail (die).


=head1 HOW AND WHEN TO ESCAPE

An edited extract from a URI specification:

    The printability requirement has been met by specifing a safe set of
    characters, and a general escaping scheme for encoding "unsafe"
    characters. This "safe" set is suitable, for example, for use in
    electronic mail.  This is the canonical form of a URI.

    There is a conflict between the need to be able to represent many
    characters including spaces within a URI directly, and the need to
    be able to use a URI in environments which have limited character
    sets or in which certain characters are prone to corruption. This
    conflict has been resolved by use of an hexadecimal escaping method
    which may be applied to any characters forbidden in a given context.
    When URLs are moved between contexts, the set of characters escaped
    may be enlarged or reduced unambiguously.  The canonical form for
    URIs has all white spaces encoded.

Notes:

    A URL string *must*, by definition, consist of escaped components.

    The components of a URL string must be *individually* escaped.
    Each component of a URL may have a separate requirements regarding
    what must be escaped, and those requirements are also dependent on
    the URL scheme.

    Never escape an already escaped component string.

This implementation expects an escaped URL string to be passed to
C<new> and will return an escaped URL string from C<str>.

Internally the URL is stored in it's component parts. These parts are
*not* escaped. Individual components must be manipulated in unescaped
form (this is most natural anyway).

The escaping applied to a URL when it is constructed by C<str> (or
C<full_path>) can be controlled by using the C<unsafe> method to
specify which characters should be treated as unsafe.


=head1 ADDING NEW URL SCHEMES

New URL schemes can be added as follows. Create a C<URL> directory
in the the same directory as C<URL.pm> (if it's not already there).
In this new directory, create a C<scheme.pm> file for your scheme
(e.g. C<foo.pm>). In this module, declare a new subclass of
URI::URL::generic like this:

    package URI::URL::foo;		
    @ISA = qw(URI::URL::generic);

    sub default_port { 9999 };

and override any additional methods (e.g., C<_parse()>) as required.


=head1 WHAT A URL IS NOT

URL objects do not, and should not, know how to 'get' or 'put' the
resources they specify locations for, anymore than a postal address
'knows' anything about the postal system. The actual access/transfer
should be achieved by some form of transport agent class. The agent
class can use the URL class, but should not be a subclass of URL.


=head1 OUTSTANDING ISSUES

Class naming. Inheritance (use and abuse).

Does the current escaping implementation (e.g., keeping the components
unescaped) cause problems?


=head1 AUTHORS / ACKNOWLEDGMENTS

This module is (distantly) based on the C<wwwurl.pl> code in the
libwww-perl distribution developed by Roy Fielding
<fielding@ics.uci.edu>, as part of the Arcadia project at the
University of California, Irvine, with contributions from Brooks
Cutter.

Gisle Aas <aas@nr.no>, Tim Bunce <Tim.Bunce@ig.co.uk>, and Martijn
Koster <m.koster@nexor.co.uk> have collaborated on the Perl 5 version.

If you have any suggestions, bug reports, fixes, or enhancements, send
them to the libwww-perl mailing list at <libwww-perl@ics.uci.edu>.

=head1 BUGS

None known

Running the module standalone will execute a substantial self test.

=cut

#####################################################################

# Perl resumes here

use Carp;
use Cwd;

require Exporter;
@ISA = qw(Exporter);
@EXPORT_OK = qw(uri_escape uri_unescape);

# Define default unsafe characters (should we include '~' or leave
# it to applications to add that if required?). Note that you cannot
# reliably change this at runtime because the substitutions which
# use it use the /o flag.
$DefaultUnsafe = '\x00-\x20"#%;<>?\x7F-\xFF';

$Debug = 0;
$UseCache = 1;		# see str method
my %KnownSchemes;	# schemes we have initialised

use strict qw(subs refs);

#####################################################################

# methods/functions

# new()
#
# object's constructor
#
sub new
{
    my($class, $init, $base) = @_;
    my $this;
    if (ref $init) {
	$this = $init->clone;
    } else {
	$init =~ s/^\s+//;  # remove leading space
	$init =~ s/\s.*//;  # remove anything after first word
	# We need a scheme to determine which class to use
	my($scheme) = $init =~ m/^([.+\-\w]+):/;
        if (!$scheme and $base){ # get scheme from base
	    if (ref $base){ # may be object or just a string
		$scheme = $base->scheme;
	    } else {
		$scheme = $1 if $base =~ m/^([.+\-\w]+):/;
	    }
        }
        croak "Unable to determine scheme for '$init'"
            unless $scheme;
        $scheme = _check_scheme($scheme); 
            # _check_scheme dies on failure (eg unsupportable)
	# then hand-off to scheme specific sub-class
	$this = "URI::URL::$scheme"->new($init, $base);
    }
    $this->base($base) if $base;
    return $this;
}


# clone()
#
# copy constructor
#
sub clone
{
    my $self = shift;
    bless { %$self }, ref $self;
}


# newlocal()
#
# return a URL object that denotes a path on the
# local filesystem (current directory by default).
#
sub newlocal
{
    my($self, $path) = @_;
    my $url = new URI::URL "file://localhost/";
    unless (defined $path){
	$path = fastcwd();
	$path =~ s:/?$:/:; # force trailing slash on dir
    }
    $url->path($path);
    $url;
}


# dump()
#
# print the contents of the object
#
sub dump
{
    my $self = shift;
    my($k, $v);
    print "Dump of $self...\n";
    foreach $k (sort keys %$self){
	$v = $self->{$k};
	$v = 'UNDEF' unless defined $v;
	print "  $k\t'$v'\n";
    }
}


# _check_scheme()
#
# Checks scheme and loads its implementation
#
sub _check_scheme
{
    my $scheme = lc(shift); # ensure lowercase and then validate
    no strict qw(refs);
    return $scheme if exists $KnownSchemes{$scheme};	# already setup
    # load support for unknown scheme, die on failure
    my $module = "URI::URL::${scheme}";
    unless (defined @{"${module}::ISA"}){
	require "URI/URL/${scheme}.pm"
    } else {
        # silently require anyway so external files can update
        # our code
	eval { local($^W)=0; require "URI/URL/${scheme}.pm" };
        undef $@;
    }
    # setup overloading - experimental
    %{"${module}::OVERLOAD"} = ( '""'=>'str', 'fallback'=>1);
    $KnownSchemes{$scheme} = 1;
    $scheme;
}


#####################################################################

# constants ?


#####################################################################

# Methods to handle URL's elements

# These methods always return the current value,
# so you can use $url->scheme to read the current value.
# If a new value is passed, e.g. $url->scheme('http'),
# it also sets the new value, and returns the previous value.
# Use $url->scheme(undef) to set the value to undefined.

# Generic-RL components:
sub scheme	{ shift->elem('scheme', @_);  }
sub netloc	{ shift->elem('netloc', @_);  }
sub path	{ shift->elem('path',   @_);  }
sub params	{ shift->elem('params', @_);  }
sub query	{ shift->elem('query',  @_);  }
sub frag	{ shift->elem('frag',   @_);  }

# Fields derived from generic netloc:
sub user	{ shift->elem('user',   @_);  }
sub password	{ shift->elem('password',@_); }
sub host        { shift->elem('host',   @_);  }
sub port        { shift->elem('port',   @_);  }

sub elem {
    my($self, $element, @val) = @_;
    my $old = $self->{$element};
    return $old unless @val;

    $self->{$element} = $val[0]; # general case
    $self->{'_'} = '';		 # void cached string
    
    # netloc includes user, password, host and port
    if ($element eq 'netloc') {
	$self->_read_netloc();	# update parts from whole
    }
    elsif ($element =~ m/^(user|password|host|port)$/) {
	$self->_write_netloc();	# update whole from parts
    }
    return $old;
}

# Other attributes of a URL object:
# (These may happen to use elem() but applications should not
# use elem() to access them).

sub base {
    my $self = shift;
    return $self->elem('_base', @_) if @_;	# set
    # The base attribute supports 'lazy' conversion from URL strings
    # to URL objects. Strings may be stored by if a string is fetched
    # it will automatically be converted to a URL object.
    # The main benefit is to make it much cheaper to say:
    #	new URI::URL $random_url_string 'http:'
    my $base = $self->elem('_base');		# get
    return undef unless defined $base;
    unless (ref $base){
	$base = new URI::URL $base;
	$self->elem('_base', $base); # set new object
    }
    $base;
}

sub unsafe {
    shift->elem('_unsafe', @_);
}


#####################################################################

# escape()
# unescape()
#
#  Generic escaping ('this has spaces' -> 'this%20has%20spaces')
#    and unescaping ('this%20has%20spaces' -> 'this has spaces')
#  Overridden by subclasses which need more control.
#  See notes on escaping at top of module.
#
sub escape
{
    my($self, $text, $patn) = @_;
    if ($patn){
	$text =~ s/([$patn])/sprintf("%%%02x",ord($1))/eg;
	return $text;
    }
    # let perl pre-compile this default for max speed
    $text =~ s/([$URI::URL::DefaultUnsafe])/sprintf("%%%02x",ord($1))/oeg;
    $text;
}
sub unescape
{
    my($self, $text) = @_;
    return undef unless defined $text;
    local($_) = $text;
    s/%([\dA-Fa-f][\dA-Fa-f])/pack('C',hex($1))/eg; # 'C' or 'c' ?
    $_;
}


#####################################################################

# Miscellaneous functions (non-methods)

# uri_escape()
#
# Apply URI character escaping rules to some text.
# Note that it is generally better to do something like this:
#	$url = new URI::URL 'http:';
#	$url->path($random_query);
# See the 'HOW AND WHEN TO ESCAPE' section in the pod text above.
#
sub uri_escape
{
    URI::URL->escape(@_);
}

# uri_unescape()
#
# Unescape some text destined to be a component of a URL.
# Note that it is generally better to do something like this:
#	$url->path(uri_unescape($pre_escaped_path));
# See the 'HOW AND WHEN TO ESCAPE' section in the pod text above.
#
sub uri_unescape
{
    URI::URL->unescape(@_);
}


#####################################################################
#
#	Internal pre-defined generic scheme support
#
# In this implementation all schemes are subclassed from
# URI::URL::generic. This turns out to have reasonable mileage.
# See also draft-ietf-uri-relative-url-05.txt

package URI::URL::generic;		# base support for generic-RL's

use Carp;
@ISA = qw(URI::URL);

sub new {				# inherited by subclasses
    my($class, $init, $base) = @_;
    my $url = bless {}, $class;		# create empty object
    $url->_parse($init);		# parse $init into components
    $url->dump if $URI::URL::Debug;
    $url;
}


# Generic-RL parser
# See draft-ietf-uri-relative-url-05.txt Section 2

sub _parse {
    my($self, $u) = @_;
    $self->{'_orig_url'} = $u if $URI::URL::Debug;	
    # draft-ietf-uri-relative-url-05.txt Section 2.4
    # 2.4.1
    $self->{'frag'}   = $self->unescape($1)  if $u =~ s/#(.*)$//;
    # 2.4.2
    $self->{'scheme'} = lc($1)   if $u =~ s/^\s*([\w\+\.\-]+)://;
    # 2.4.3
    $self->{'netloc'} = $self->unescape($1)  if $u =~ s!^//([^/]*)!!;
    # 2.4.4
    $self->{'query'}  = $self->unescape($1)  if $u =~ s/\?(.*)//;
    # 2.4.5
    $self->{'params'} = $self->unescape($1)  if $u =~ s/;(.*)//;
    # 2.4.6
    $self->{'path'}   = $self->unescape($u);
    # read netloc components: "<user>:<password>@<host>:<port>"
    $self->_read_netloc;
}

sub _read_netloc {	# netloc -> user, password, host, post
    my($self) = @_;
    my $nl = $self->{'netloc'} || '';
    $self->{'_'} = '';	# void cache
    if ($nl =~ s/^([^:@]*):?([^@]*)?@//){
	$self->{'user'}     = $self->unescape($1);
	$self->{'password'} = $self->unescape($2) if $2;
    }
    if ($nl =~ s/^([^:]*):?(\d*)?//){
	$self->{'host'} = $self->unescape($1);
	$self->{'port'} = $self->unescape($2) if $2;
    }
}

sub _write_netloc {	# user, password, host, post -> netloc
    my($self) = @_;
    my $tmp;
    my $nl = $self->{'user'} || '';
    $nl .= ":$self->{'password'}" if $nl and $self->{'password'};
    $nl .= "\@" if $nl;
    $nl .= ($tmp = $self->{'host'});
    $nl .= ":$tmp" if ($tmp && ($tmp=$self->{'port'})
			    && $tmp != $self->default_port);
    $self->{'netloc'} = $nl;
}



# Generic-RL stringify
#
sub str
{
    my $self = shift;
    return $self->{'_'} if $self->{'_'} and $URI::URL::UseCache;

    # use @ here to avoid undef warnings and allow $self->escape
    # to use optimised pattern if no override has been set.
    my @unsafe = shift || $self->unsafe || ();
    my($scheme, $netloc, $port) = @{$self}{qw(scheme netloc port)};

    # full_path() -> path+query+params+frag (escaped)
    my $path  = $self->full_path(@unsafe);

    if ($netloc){
	$path = "/$path" unless $path =~ m:^/:;
	$path = "//".$self->escape($netloc, @unsafe).$path;
    }
    my $urlstr = ($scheme) ? "$scheme:$path" : $path;
    $self->{'_'} = $urlstr; 	# set cache
    return $urlstr;
}


# Generic-RL stringify full path (path+query+params+frag)
#
sub full_path
{
    my $self = shift;
    # use @ here to avoid undef warnings and allow $self->escape
    # to use optimised pattern if no override has been set.
    my @unsafe = shift || $self->unsafe || ();
    my($path, $params, $query, $frag)
	= @{$self}{qw(path params query frag) };
    my $u = '';
    $u .=     $self->escape($path,   @unsafe) if $path;
    $u .= ";".$self->escape($params, @unsafe) if $params;
    $u .= "?".$self->escape($query,  @unsafe) if $query;
    $u .= "#".$self->escape($frag,   @unsafe) if $frag;
    return $u;
}


# Generic-RL: Resolving Relative URL into an Absolute URL
#
# Based on draft-ietf-uri-relative-url-05.txt Section 4
#
sub abs
{
    my($self, $base) = @_;
    my $embed = $self->clone;

    $base = $self->base unless $base;      # default to default base
    return $embed unless $base;			# we have no base (step1)

    $base = new URI::URL $base unless ref $base; # make obj if required

    my @u = @{$embed}{qw(scheme host port path params query frag) };

    # just use base if we are empty		(2a)
    return $base->clone if (scalar(grep($_, @u)) == 0);

    my($scheme, $host, $port, $path, $params, $query, $frag) = @u;

    # if we have a scheme we must already be absolute	(2b)
    return $embed if $scheme;

    $embed->{'scheme'} = $base->{'scheme'};     # (2c)

    return $embed if $embed->{'netloc'};	# (3)
    $embed->{'netloc'} = $base->{'netloc'};	# (3)
    $embed->_read_netloc();

    return $embed if $embed->{'path'} =~ m:^/:;	# (4)
    
    unless ($embed->{'path'}){			# (5)
	$embed->{'path'} = $base->{'path'};	# (5)

        return $embed if $embed->params;	# (5a)
	$embed->{'params'} = $base->{'params'};	# (5a)

        return $embed if $embed->query;		# (5b)
	$embed->{'query'} = $base->{'query'};	# (5b)
        return $embed;
    }

    # (Step 6)	# draft 6 suggests stack based approach

    my $basepath = $base->{'path'};
    my $relpath  = $embed->{'path'};

    $basepath =~ s!^/!!;
    $basepath =~ s!/$!/.!;              # prevent empty segment
    my @path = split('/', $basepath);   # base path into segments
    pop(@path);			        # remove last segment

    $relpath =~ s!/$!/.!;               # prevent empty segment

    push(@path, split('/', $relpath));  # append relative segments

    my @newpath = ();
    my $isdir = 0;
    foreach $segment (@path) {	# left to right
#	warn '> ', join('/', @newpath), ": $segment\n";
        if ($segment eq '.') {	# ignore "same" directory
	    $isdir = 1;
	}
        elsif ($segment eq '..') {
	    $isdir = 1;
	    my $last = pop(@newpath);
	    if (!defined $last) { # nothing to pop
		push(@newpath, $segment); # so must append
	    }
	    elsif ($last eq '..') { # '..' cannot match '..'
		# so put pack again, and append
		push(@newpath, $last, $segment);
	    }
	    else {
		# it was a component, 
		# keep popped
	    }
	} else {
	    $isdir = 0;
	    push(@newpath, $segment);
        }
    }

    $embed->{'path'} = join('/', @newpath) . ($isdir ? '/' : '');
    
    return $embed;
}


# default_port()
#
# subclasses will usually want to override this
#
sub default_port {
    0;
}


# _expect()
#
# Handy low-level object method tester. See test code at end.
#
sub _expect {
    my($self, $method, $expect, @args) = @_;
    my $result = $self->$method(@args);
    $expect = 'UNDEF' unless defined $expect;
    $result = 'UNDEF' unless defined $result;
    return 1 if $expect eq $result;
    warn "'$self'->$method(@args) = '$result' (expected '$expect')\n";
    $self->dump;
    confess "Test Failed";
}


####################################################################
#
#	Internal pre-defined basic scheme support

# Define the default ports for major net services 
# From RFC 1738 "Uniform Resource Locators (URL)"
# Note that other Well Known Port Numbers are defined in
# the "Assigned Numbers" RFC (1700).
# XXX the rfc1700 ones should arguably use getservbyname
# or be in a Etc/Services.pm or something


package URI::URL::file;		@ISA = qw(URI::URL::generic);



package URI::URL::ftp;		@ISA = qw(URI::URL::generic);

sub default_port { 21 };



package URI::URL::telnet;	@ISA = qw(URI::URL::generic);

sub default_port { 23 };



package URI::URL::whois;	@ISA = qw(URI::URL::generic);

sub default_port { 43 };



package URI::URL::gopher;	@ISA = qw(URI::URL::generic);

sub default_port { 70 };
sub gtype    { shift->elem('gtype', @_); }

sub _parse {
    my($self, $url) = @_;
    $self->{'scheme'} = lc($1) if $url =~ s/^\s*([\w\+\.\-]+)://;
    $self->{'netloc'} = $self->unescape($1) if $url =~ s!^//([^/]*)!!;
    $self->{'gtype'}  = $self->unescape($1) if $url =~ s!^/(.)!!;
    my @sss = split(/%09/, $url, 3);
    $self->{'selector'} = $self->unescape(shift @sss);
    $self->{'search'}   = $self->unescape(shift @sss);
    $self->{'string'}   = $self->unescape(shift @sss);
}



package URI::URL::finger;	@ISA = qw(URI::URL::generic);

sub default_port { 79 };



package URI::URL::http;		@ISA = qw(URI::URL::generic);

sub default_port { 80 };



package URI::URL::nntp;		@ISA = qw(URI::URL::generic);

sub default_port { 119 };

sub _parse {
    my($self, $init) = @_;
    $self->URI::URL::generic::_parse($init);
    my @parts = split(/\//, $self->{'path'});
    $self->{'group'} = $self->unescape($parts[1]);
    $self->{'digits'}= $self->unescape($parts[2]);
}



package URI::URL::news;		@ISA = qw(URI::URL::generic);

sub _parse {
    my($self, $init) = @_;
    $self->{'scheme'}  = lc($1) if ($init =~ s/^\s*([\w\+\.\-]+)://);
    my $tmp = $self->unescape($init);
    $self->{'grouppart'} = $tmp;
    $self->{ ($tmp =~ m/\@/) ? 'article' : 'group' } = $tmp;
}



package URI::URL::wais;		@ISA = qw(URI::URL::generic);

sub default_port { 210 };

sub _parse {
    my($self, $init) = @_;
    $self->URI::URL::generic::_parse($init);
    my @parts = split(/\//, $self->{'path'});
    $self->{'database'} = $self->unescape($parts[1]);
    $self->{'wtype'}    = $self->unescape($parts[2]);
    $self->{'wpath'}    = $self->unescape($parts[3]);
}



package URI::URL::webster;	@ISA = qw(URI::URL::generic);

sub default_port { 765 };



package URI::URL::prospero;	@ISA = qw(URI::URL::generic);

sub default_port { 1525 };	# says rfc1738, section 3.11



package URI::URL::mailto;	@ISA = qw(URI::URL::generic);

sub _parse {
    my($self, $init) = @_;
    $self->{'scheme'}  = lc($1) if ($init =~ s/^\s*([\w\+\.\-]+)://);
    $self->{'encoded822addr'} = $self->unescape($init);
}



package URI::URL::rlogin;	@ISA = qw(URI::URL::generic);



package URI::URL::tn3270;	@ISA = qw(URI::URL::generic);



#####################################################################
#
# If we're not use'd or require'd execute self-test.
# Handy for regression testing and as a quick reference :)
#
# Test is kept behind __END__ so it doesn't take uptime
# and memory  unless explicitly required. If you're working
# on the code you might find it easier to comment out the
# eval and __END__ so that error line numbers make more sense.

eval join('',<main::DATA>) || die $@ unless caller();

1;

package main;			# the __END__ is in main (why?)
__END__


package main;

use Carp;
import URI::URL qw(uri_escape uri_unescape);
$| = 1;


# Do basic tests first.
# Dies if an error has been detected, prints "ok" otherwise.

print "Self tests...\n";

    &scheme_parse_test;

    &parts_test;

    &escape_test;

    &newlocal_test;

    &absolute_test;

print "$URI::URL::rcsid ok\n";

exit 0;




#####################################################################
#
# scheme_parse_test()
#
# test parsing and retrieval methods

sub scheme_parse_test {

    print "scheme_parse_test:\n";

    $tests = {
	'hTTp://web1.net/a/b/c/welcome#intro'
	=> {    'scheme'=>'http', 'host'=>'web1.net', 'port'=>undef,
		'path'=>'/a/b/c/welcome', 'frag'=>'intro', 'query'=>undef },

	'http://web:1/a?query+text'
	=> {    'scheme'=>'http', 'host'=>'web', 'port'=>1,
		'path'=>'/a', 'frag'=>undef, 'query'=>'query+text' },

	'http://web.net'
	=> {    'scheme'=>'http', 'host'=>'web.net', 'port'=>undef,
		'path'=>'', 'frag'=>undef, 'query'=>undef, 'user'=>undef },

	'ftp://usr:pswd@web:1234/a/b;type=i'
	=> {    'host'=>'web', 'port'=>1234, 'path'=>'/a/b',
		'user'=>'usr', 'password'=>'pswd', 'params'=>'type=i' },

	'gopher://web/2a_selector'
	=> {    'gtype'=>'2', 'selector'=>'a_selector' },

	'mailto:libwww-perl@ics.uci.edu'
	=> {    'encoded822addr'=>'libwww-perl@ics.uci.edu' },

	'news:*'		 
	=> {	'grouppart'=>'*' },
	'news:comp.lang.perl' 
	=> {    'group'=>'comp.lang.perl' },
	'news:perl-faq/module-list-1-794455075@ig.co.uk'
	=> {    'article'=>'perl-faq/module-list-1-794455075@ig.co.uk' },

	'nntp://news.com/comp.lang.perl/42'
	=> {    'group'=>'comp.lang.perl', 'digits'=>42 },

	'telnet://usr:pswd@web:12345/'
	=> {    'user'=>'usr', 'password'=>'pswd' },

	'wais://web.net/db'       
	=> { 'database'=>'db' },
	'wais://web.net/db?query' 
	=> { 'database'=>'db', 'query'=>'query' },
	'wais://usr:pswd@web.net/db/wt/wp'
	=> {    'database'=>'db', 'wtype'=>'wt', 'wpath'=>'wp',
		'password'=>'pswd' },
    };

    foreach $url_str (sort keys %$tests ){
	warn "Testing '$url_str'\n";
	my($url) = new URI::URL $url_str;
	$url_tests = $tests->{$url_str};
	while( ($method, $exp) = each %$url_tests ){
	    $exp = 'UNDEF' unless defined $exp;
	    $url->_expect('elem', $exp, $method);
	}
    }
    print "scheme parse tests ok\n";
}






#####################################################################
#
# parts_test()		(calls netloc_test test)
#
# Test individual component part access functions
#
sub parts_test {
    print "parts_test:\n";

    # test storage part access/edit methods (netloc, user, password,
    # host and port are tested by &netloc_test)

    $url = new URI::URL 'http://web/orig/path';
    $url->scheme('gopher');
    $url->path('1info');
    $url->frag('this');
    $url->query('key+word');
    $url->_expect('str', 'gopher://web/1info?key+word#this');

    &netloc_test;

    $url->query(undef);
    $url->_expect('query', undef);
}

#
# netloc_test()
#
# Test automatic netloc synchronisation
#
sub netloc_test {
    my $url = new URI::URL 'http://anonymous:p%61ss@hst:12345';
    $url->_expect('user', 'anonymous');
    $url->_expect('password', 'pass');
    $url->_expect('host', 'hst');
    $url->_expect('port', 12345);
    $url->_expect('netloc', 'anonymous:pass@hst:12345');

    $url->user('nemo');
    $url->password('p2');
    $url->host('hst2');
    $url->port(2);
    $url->_expect('netloc', 'nemo:p2@hst2:2');

    $url->user(undef);
    $url->password(undef);
    $url->port(undef);
    $url->_expect('netloc', 'hst2');
}



#####################################################################
#
# escape_test()
#
# escaping functions

sub escape_test {
    print "escape_test:\n";

    # supply escaped URL
    $url = new URI::URL 'http://web/this%20has%20spaces';
    # check component is unescaped
    $url->_expect('path', '/this has spaces');

    # modify the unescaped form
    $url->path('this ALSO has spaces');
    # check whole url is escaped
    $url->_expect('str', 'http://web/this%20ALSO%20has%20spaces');

    # now make 'A' an unsafe character :-)
    $url->unsafe('A\x00-\x20"#%;<>?\x7F-\xFF');
    $url->_expect('str', 'http://web/this%20%41LSO%20has%20spaces');

    $url = new URI::URL uri_escape('http://web/this % needs escaping');
    $url->_expect('str', 'http://web/this%20%25%20needs%20escaping');

    my $all = pack('c*',0..255);
    my $esc = uri_escape($all);
    my $new = uri_unescape($esc);
    die "uri_escape->uri_unescape mismatch" unless $all eq $new;

    # need more tests here
}


#####################################################################
#
# newlocal_test()
#

sub newlocal_test {
    print "newlocal_test:\n";

    chdir('/tmp') or die $!;
    $url = newlocal URI::URL;
    $url->_expect('str', 'file://localhost/tmp/');

    chdir('/') or die $!;
    $url = newlocal URI::URL '/usr/';
    $url->_expect('str', 'file://localhost/usr/');
}


#####################################################################
#
# absolute_test()
#
sub absolute_test {

    print "Test relative/absolute URI::URL parsing:\n";

    # Tests from draft-ietf-uri-relative-url-06.txt
    # Copied verbatim from the draft, parsed below

    @URI::URL::g::ISA = qw(URI::URL::generic); # for these tests

    my $base = 'http://a/b/c/d;p?q#f';

    $absolute_tests = <<EOM;
5.1.  Normal Examples

      g:h        = <URL:g:h>
      g          = <URL:http://a/b/c/g>
      ./g        = <URL:http://a/b/c/g>
      g/         = <URL:http://a/b/c/g/>
      /g         = <URL:http://a/g>
      //g        = <URL:http://g>
      ?y         = <URL:http://a/b/c/d;p?y>
      g?y        = <URL:http://a/b/c/g?y>
      g?y/./x    = <URL:http://a/b/c/g?y/./x>
      #s         = <URL:http://a/b/c/d;p?q#s>
      g#s        = <URL:http://a/b/c/g#s>
      g#s/./x    = <URL:http://a/b/c/g#s/./x>
      g?y#s      = <URL:http://a/b/c/g?y#s>
      ;x         = <URL:http://a/b/c/d;x>
      g;x        = <URL:http://a/b/c/g;x>
      g;x?y#s    = <URL:http://a/b/c/g;x?y#s>
      .          = <URL:http://a/b/c/>
      ./         = <URL:http://a/b/c/>
      ..         = <URL:http://a/b/>
      ../        = <URL:http://a/b/>
      ../g       = <URL:http://a/b/g>
      ../..      = <URL:http://a/>
      ../../     = <URL:http://a/>
      ../../g    = <URL:http://a/g>

5.2.  Abnormal Examples

   Although the following abnormal examples are unlikely to occur
   in normal practice, all URL parsers should be capable of resolving
   them consistently.  Each example uses the same base as above.

   An empty reference resolves to the complete base URL:

      <>         = <URL:http://a/b/c/d;p?q#f>

   Parsers must be careful in handling the case where there are more
   relative path ".." segments than there are hierarchical levels in
   the base URL's path.  Note that the ".." syntax cannot be used to
   change the <net_loc> of a URL.

     ../../../g = <URL:http://a/../g>
     ../../../../g = <URL:http://a/../../g>

   Similarly, parsers must avoid treating "." and ".." as special when
   they are not complete components of a relative path.

      /./g       = <URL:http://a/./g>
      /../g      = <URL:http://a/../g>
      g.         = <URL:http://a/b/c/g.>
      .g         = <URL:http://a/b/c/.g>
      g..        = <URL:http://a/b/c/g..>
      ..g        = <URL:http://a/b/c/..g>

   Less likely are cases where the relative URL uses unnecessary or
   nonsensical forms of the "." and ".." complete path segments.

      ./../g     = <URL:http://a/b/g>
      ./g/.      = <URL:http://a/b/c/g/>
      g/./h      = <URL:http://a/b/c/g/h>
      g/../h     = <URL:http://a/b/c/h>

   Finally, some older parsers allow the scheme name to be present in
   a relative URL if it is the same as the base URL scheme.  This is
   considered to be a loophole in prior specifications of partial
   URLs [1] and should be avoided by future parsers.

      http:g     = <URL:http:g>
      http:      = <URL:http:>
EOM
    # convert text to list like
    # @absolute_tests = ( ['g:h' => 'g:h'], ...)

    for $line (split("\n", $absolute_tests)) {
	next unless $line =~ /^\s{6}/;
	if ($line =~ /^\s+(\S+)\s*=\s*<URL:([^>]*)>/) {
	    my($rel, $abs) = ($1, $2);
	    $rel = '' if $rel eq '<>';
	    push(@absolute_tests, [$rel, $abs]);
	}
	else {
	    warn "illegal line '$line'";
	}
    }

    # add some extra ones for good measure

    push(@absolute_tests, ['x/y//../z', => 'http://a/b/c/x/y/z']);

    print "  Relative    +  Base  =>  Expected Absolute URL\n";
    print "================================================\n";
    for $test (@absolute_tests) {
	my($rel, $abs) = @$test;
	my $abs_url = new URI::URL $abs;
	my $abs_str = $abs_url->str;

	printf("  %-10s  +  $base  =>  $abs\n", $rel);
	my $u   = new URI::URL $rel, $base;
	my $got = $u->abs;
        $got->_expect('str', $abs_str);
    }
    print "absolute test ok\n";
}


1;
exit 0;