Further enhancements to the URL module
Tim Bunce (Tim.Bunce@ig.co.uk)
Tue, 21 Mar 1995 01:53:40 +0000
=head1 HOW AND WHEN TO ESCAPE
An edited extract from a URI specification: [...skipped...]
Notes:
A URL string *must*, by definition, consist of escaped components.
The components of a URL string must be *individually* escaped.
Each component of a URL may have a separate requirements regarding
what must be escaped, and those requirements are also dependent on
the URL scheme.
Never escape an already escaped component string.
This implementation expects an escaped URL string to be passed to
C<new> and will return an escaped URL string from C<str>.
Internally the URL is stored in it's component parts. These parts are
*not* escaped. Individual components must be manipulated in unescaped
form (this is most natural anyway).
The escaping applied to a URL when it is constructed by C<str> (or
C<full_path>) can be controlled by using the C<unsafe> method to
specify which characters should be treated as unsafe.
Also:
$url->str now caches the last url string produced. Setting a
component value via $url->elem (etc) or altering the unsafe
chars clears the cached value. This virtually eliminates the
cost of escaping and formatting the URL after the first time.
A base URL passed to new() will now only be parsed into a URL
object if required. Otherwise it's just stored as a string.
This reduces the cost of adding a fall-back base URL of just
'http:' (for example) to many calls to new().
All the tests have been reintegrated and tidied up via the
new _check method. They now automatically give a useful
message, a URL dump and a stack dump.
Can we freeze this soon ?
I've taken the liberty of calling it URI::URL :-)
Enjoy!
Regards,
Tim.
----------
X-Sun-Data-Type: default-app
X-Sun-Data-Description: default
X-Sun-Data-Name: URL.pm
X-Sun-Content-Lines: 1191
#!/usr/local/bin/perl -w
package URI::URL;
$rcsid = '$Id: URL.pm,v 1.10 1995/03/21 01:49:09 timbo Exp timbo $';
$rcsid = $rcsid; # shut up -w
#####################################################################
=head1 NAME
URI::URL - Uniform Resource Locators (absolute and relative)
=head1 SYNOPSIS
use URI::URL;
# Constructors
$url1 = new URI::URL 'http://www.nr.no/%7Euser/gisle.gif';
$url2 = $url1->clone;
$url3 = new URI::URL 'gisle.gif';
$url3->base('http://www.nr.no/%7Euser');
$url4 = $url3->abs; # get absolute url using base
$url5 = $url3->abs('http://a.n.other.com/');
$url6 = newlocal URI::URL; # pwd
$url7 = newlocal URI::URL '/tmp'; # dir
$url8 = newlocal URI::URL '/etc/motd'; # file
# Printing
$urlstr = $url->str; # complete escaped URL string
$fp = $url->full_path; # escaped path+query+params+frag
# Retrieving Generic-RL components:
$scheme = $url->scheme;
$netloc = $url->netloc; # see user,password,host,port below
$path = $url->path;
$params = $url->params;
$query = $url->query;
$frag = $url->frag;
# Retrieving Network location (netloc) components:
$user = $url->user;
$password = $url->password;
$host = $url->host;
$port = $url->port; # returns default if not defined
# Retrieving other attributes:
$base = $url->base;
# Setting fields:
# All methods above can set field values for example:
$url->scheme('http');
$url->host('www.w3.org');
$url->path('/welcome.html');
$url->query('protocol info');
# Specify unsafe characters to be escaped for this url
$url->unsafe('\x00-\x20"\$#%;<>?\x7E-\xFF');
# General method to get/set field values:
$value = $url->elem($name [, $new_value ]);
$defport= $url->default_port; # default port for scheme
# Escaping functions (See 'HOW AND WHEN TO ESCAPE' below)
$escaped = uri_escape($component);
$component = uri_unescape($escaped);
=head1 DESCRIPTION
URL objects represent URLs (RFC 1738). Both absolute and relative
URL's are supported.
URL objects are created by C<new>, which takes a string
representation of a URL or an existing URL object reference to be
cloned. Specific individual elements can then be accessed via the
C<scheme>, C<user>, C<password>, C<host>, C<port>, C<path>,
C<params>, C<query> and C<frag> methods. These methods can be
called with a value to set the element to that value, and always
return the old value. The C<elem> method provides a general
interface to access any element by name but it should be used with
caution: the effect of using incorrect spelling and case is
undefined.
The C<abs> function attempts to return a new absolute URL object
for a given URL. In order to convert a relative URL into an absolute
one a I<base> URL is required. You can associate a default base with
a URL either by passing a C<base> to the C<new> method when a URL is
created or using the C<base> method on the object later.
Alternatively you can specify a once-off base as a parameter to the
C<abs> method.
The C<new> method must be able to determine the scheme for the URL.
If a scheme is not specified in the URL it will use the scheme
specified by the parent URL. If no parent URL scheme is defined
then the C<new> will fail (die).
=head1 HOW AND WHEN TO ESCAPE
An edited extract from a URI specification:
The printability requirement has been met by specifing a safe set of
characters, and a general escaping scheme for encoding "unsafe"
characters. This "safe" set is suitable, for example, for use in
electronic mail. This is the canonical form of a URI.
There is a conflict between the need to be able to represent many
characters including spaces within a URI directly, and the need to
be able to use a URI in environments which have limited character
sets or in which certain characters are prone to corruption. This
conflict has been resolved by use of an hexadecimal escaping method
which may be applied to any characters forbidden in a given context.
When URLs are moved between contexts, the set of characters escaped
may be enlarged or reduced unambiguously. The canonical form for
URIs has all white spaces encoded.
Notes:
A URL string *must*, by definition, consist of escaped components.
The components of a URL string must be *individually* escaped.
Each component of a URL may have a separate requirements regarding
what must be escaped, and those requirements are also dependent on
the URL scheme.
Never escape an already escaped component string.
This implementation expects an escaped URL string to be passed to
C<new> and will return an escaped URL string from C<str>.
Internally the URL is stored in it's component parts. These parts are
*not* escaped. Individual components must be manipulated in unescaped
form (this is most natural anyway).
The escaping applied to a URL when it is constructed by C<str> (or
C<full_path>) can be controlled by using the C<unsafe> method to
specify which characters should be treated as unsafe.
=head1 ADDING NEW URL SCHEMES
New URL schemes can be added as follows. Create a C<URL> directory
in the the same directory as C<URL.pm> (if it's not already there).
In this new directory, create a C<scheme.pm> file for your scheme
(e.g. C<foo.pm>). In this module, declare a new subclass of
URI::URL::generic like this:
package URI::URL::foo;
@ISA = qw(URI::URL::generic);
sub default_port { 9999 };
and override any additional methods (e.g., C<_parse()>) as required.
=head1 WHAT A URL IS NOT
URL objects do not, and should not, know how to 'get' or 'put' the
resources they specify locations for, anymore than a postal address
'knows' anything about the postal system. The actual access/transfer
should be achieved by some form of transport agent class. The agent
class can use the URL class, but should not be a subclass of URL.
=head1 OUTSTANDING ISSUES
Class naming. Inheritance (use and abuse).
Does the current escaping implementation (e.g., keeping the components
unescaped) cause problems?
=head1 AUTHORS / ACKNOWLEDGMENTS
This module is (distantly) based on the C<wwwurl.pl> code in the
libwww-perl distribution developed by Roy Fielding
<fielding@ics.uci.edu>, as part of the Arcadia project at the
University of California, Irvine, with contributions from Brooks
Cutter.
Gisle Aas <aas@nr.no>, Tim Bunce <Tim.Bunce@ig.co.uk>, and Martijn
Koster <m.koster@nexor.co.uk> have collaborated on the Perl 5 version.
If you have any suggestions, bug reports, fixes, or enhancements, send
them to the libwww-perl mailing list at <libwww-perl@ics.uci.edu>.
=head1 BUGS
None known
Running the module standalone will execute a substantial self test.
=cut
#####################################################################
# Perl resumes here
use Carp;
use Cwd;
require Exporter;
@ISA = qw(Exporter);
@EXPORT_OK = qw(uri_escape uri_unescape);
# Define default unsafe characters (should we include '~' or leave
# it to applications to add that if required?). Note that you cannot
# reliably change this at runtime because the substitutions which
# use it use the /o flag.
$DefaultUnsafe = '\x00-\x20"#%;<>?\x7F-\xFF';
$Debug = 0;
$UseCache = 1; # see str method
my %KnownSchemes; # schemes we have initialised
use strict qw(subs refs);
#####################################################################
# methods/functions
# new()
#
# object's constructor
#
sub new
{
my($class, $init, $base) = @_;
my $this;
if (ref $init) {
$this = $init->clone;
} else {
$init =~ s/^\s+//; # remove leading space
$init =~ s/\s.*//; # remove anything after first word
# We need a scheme to determine which class to use
my($scheme) = $init =~ m/^([.+\-\w]+):/;
if (!$scheme and $base){ # get scheme from base
if (ref $base){ # may be object or just a string
$scheme = $base->scheme;
} else {
$scheme = $1 if $base =~ m/^([.+\-\w]+):/;
}
}
croak "Unable to determine scheme for '$init'"
unless $scheme;
$scheme = _check_scheme($scheme);
# _check_scheme dies on failure (eg unsupportable)
# then hand-off to scheme specific sub-class
$this = "URI::URL::$scheme"->new($init, $base);
}
$this->base($base) if $base;
return $this;
}
# clone()
#
# copy constructor
#
sub clone
{
my $self = shift;
bless { %$self }, ref $self;
}
# newlocal()
#
# return a URL object that denotes a path on the
# local filesystem (current directory by default).
#
sub newlocal
{
my($self, $path) = @_;
my $url = new URI::URL "file://localhost/";
unless (defined $path){
$path = fastcwd();
$path =~ s:/?$:/:; # force trailing slash on dir
}
$url->path($path);
$url;
}
# dump()
#
# print the contents of the object
#
sub dump
{
my $self = shift;
my($k, $v);
print "Dump of $self...\n";
foreach $k (sort keys %$self){
$v = $self->{$k};
$v = 'UNDEF' unless defined $v;
print " $k\t'$v'\n";
}
}
# _check_scheme()
#
# Checks scheme and loads its implementation
#
sub _check_scheme
{
my $scheme = lc(shift); # ensure lowercase and then validate
no strict qw(refs);
return $scheme if exists $KnownSchemes{$scheme}; # already setup
# load support for unknown scheme, die on failure
my $module = "URI::URL::${scheme}";
unless (defined @{"${module}::ISA"}){
require "URI/URL/${scheme}.pm"
} else {
# silently require anyway so external files can update
# our code
eval { local($^W)=0; require "URI/URL/${scheme}.pm" };
undef $@;
}
# setup overloading - experimental
%{"${module}::OVERLOAD"} = ( '""'=>'str', 'fallback'=>1);
$KnownSchemes{$scheme} = 1;
$scheme;
}
#####################################################################
# constants ?
#####################################################################
# Methods to handle URL's elements
# These methods always return the current value,
# so you can use $url->scheme to read the current value.
# If a new value is passed, e.g. $url->scheme('http'),
# it also sets the new value, and returns the previous value.
# Use $url->scheme(undef) to set the value to undefined.
# Generic-RL components:
sub scheme { shift->elem('scheme', @_); }
sub netloc { shift->elem('netloc', @_); }
sub path { shift->elem('path', @_); }
sub params { shift->elem('params', @_); }
sub query { shift->elem('query', @_); }
sub frag { shift->elem('frag', @_); }
# Fields derived from generic netloc:
sub user { shift->elem('user', @_); }
sub password { shift->elem('password',@_); }
sub host { shift->elem('host', @_); }
sub port { shift->elem('port', @_); }
sub elem {
my($self, $element, @val) = @_;
my $old = $self->{$element};
return $old unless @val;
$self->{$element} = $val[0]; # general case
$self->{'_'} = ''; # void cached string
# netloc includes user, password, host and port
if ($element eq 'netloc') {
$self->_read_netloc(); # update parts from whole
}
elsif ($element =~ m/^(user|password|host|port)$/) {
$self->_write_netloc(); # update whole from parts
}
return $old;
}
# Other attributes of a URL object:
# (These may happen to use elem() but applications should not
# use elem() to access them).
sub base {
my $self = shift;
return $self->elem('_base', @_) if @_; # set
# The base attribute supports 'lazy' conversion from URL strings
# to URL objects. Strings may be stored by if a string is fetched
# it will automatically be converted to a URL object.
# The main benefit is to make it much cheaper to say:
# new URI::URL $random_url_string 'http:'
my $base = $self->elem('_base'); # get
return undef unless defined $base;
unless (ref $base){
$base = new URI::URL $base;
$self->elem('_base', $base); # set new object
}
$base;
}
sub unsafe {
shift->elem('_unsafe', @_);
}
#####################################################################
# escape()
# unescape()
#
# Generic escaping ('this has spaces' -> 'this%20has%20spaces')
# and unescaping ('this%20has%20spaces' -> 'this has spaces')
# Overridden by subclasses which need more control.
# See notes on escaping at top of module.
#
sub escape
{
my($self, $text, $patn) = @_;
if ($patn){
$text =~ s/([$patn])/sprintf("%%%02x",ord($1))/eg;
return $text;
}
# let perl pre-compile this default for max speed
$text =~ s/([$URI::URL::DefaultUnsafe])/sprintf("%%%02x",ord($1))/oeg;
$text;
}
sub unescape
{
my($self, $text) = @_;
return undef unless defined $text;
local($_) = $text;
s/%([\dA-Fa-f][\dA-Fa-f])/pack('C',hex($1))/eg; # 'C' or 'c' ?
$_;
}
#####################################################################
# Miscellaneous functions (non-methods)
# uri_escape()
#
# Apply URI character escaping rules to some text.
# Note that it is generally better to do something like this:
# $url = new URI::URL 'http:';
# $url->path($random_query);
# See the 'HOW AND WHEN TO ESCAPE' section in the pod text above.
#
sub uri_escape
{
URI::URL->escape(@_);
}
# uri_unescape()
#
# Unescape some text destined to be a component of a URL.
# Note that it is generally better to do something like this:
# $url->path(uri_unescape($pre_escaped_path));
# See the 'HOW AND WHEN TO ESCAPE' section in the pod text above.
#
sub uri_unescape
{
URI::URL->unescape(@_);
}
#####################################################################
#
# Internal pre-defined generic scheme support
#
# In this implementation all schemes are subclassed from
# URI::URL::generic. This turns out to have reasonable mileage.
# See also draft-ietf-uri-relative-url-05.txt
package URI::URL::generic; # base support for generic-RL's
use Carp;
@ISA = qw(URI::URL);
sub new { # inherited by subclasses
my($class, $init, $base) = @_;
my $url = bless {}, $class; # create empty object
$url->_parse($init); # parse $init into components
$url->dump if $URI::URL::Debug;
$url;
}
# Generic-RL parser
# See draft-ietf-uri-relative-url-05.txt Section 2
sub _parse {
my($self, $u) = @_;
$self->{'_orig_url'} = $u if $URI::URL::Debug;
# draft-ietf-uri-relative-url-05.txt Section 2.4
# 2.4.1
$self->{'frag'} = $self->unescape($1) if $u =~ s/#(.*)$//;
# 2.4.2
$self->{'scheme'} = lc($1) if $u =~ s/^\s*([\w\+\.\-]+)://;
# 2.4.3
$self->{'netloc'} = $self->unescape($1) if $u =~ s!^//([^/]*)!!;
# 2.4.4
$self->{'query'} = $self->unescape($1) if $u =~ s/\?(.*)//;
# 2.4.5
$self->{'params'} = $self->unescape($1) if $u =~ s/;(.*)//;
# 2.4.6
$self->{'path'} = $self->unescape($u);
# read netloc components: "<user>:<password>@<host>:<port>"
$self->_read_netloc;
}
sub _read_netloc { # netloc -> user, password, host, post
my($self) = @_;
my $nl = $self->{'netloc'} || '';
$self->{'_'} = ''; # void cache
if ($nl =~ s/^([^:@]*):?([^@]*)?@//){
$self->{'user'} = $self->unescape($1);
$self->{'password'} = $self->unescape($2) if $2;
}
if ($nl =~ s/^([^:]*):?(\d*)?//){
$self->{'host'} = $self->unescape($1);
$self->{'port'} = $self->unescape($2) if $2;
}
}
sub _write_netloc { # user, password, host, post -> netloc
my($self) = @_;
my $tmp;
my $nl = $self->{'user'} || '';
$nl .= ":$self->{'password'}" if $nl and $self->{'password'};
$nl .= "\@" if $nl;
$nl .= ($tmp = $self->{'host'});
$nl .= ":$tmp" if ($tmp && ($tmp=$self->{'port'})
&& $tmp != $self->default_port);
$self->{'netloc'} = $nl;
}
# Generic-RL stringify
#
sub str
{
my $self = shift;
return $self->{'_'} if $self->{'_'} and $URI::URL::UseCache;
# use @ here to avoid undef warnings and allow $self->escape
# to use optimised pattern if no override has been set.
my @unsafe = shift || $self->unsafe || ();
my($scheme, $netloc, $port) = @{$self}{qw(scheme netloc port)};
# full_path() -> path+query+params+frag (escaped)
my $path = $self->full_path(@unsafe);
if ($netloc){
$path = "/$path" unless $path =~ m:^/:;
$path = "//".$self->escape($netloc, @unsafe).$path;
}
my $urlstr = ($scheme) ? "$scheme:$path" : $path;
$self->{'_'} = $urlstr; # set cache
return $urlstr;
}
# Generic-RL stringify full path (path+query+params+frag)
#
sub full_path
{
my $self = shift;
# use @ here to avoid undef warnings and allow $self->escape
# to use optimised pattern if no override has been set.
my @unsafe = shift || $self->unsafe || ();
my($path, $params, $query, $frag)
= @{$self}{qw(path params query frag) };
my $u = '';
$u .= $self->escape($path, @unsafe) if $path;
$u .= ";".$self->escape($params, @unsafe) if $params;
$u .= "?".$self->escape($query, @unsafe) if $query;
$u .= "#".$self->escape($frag, @unsafe) if $frag;
return $u;
}
# Generic-RL: Resolving Relative URL into an Absolute URL
#
# Based on draft-ietf-uri-relative-url-05.txt Section 4
#
sub abs
{
my($self, $base) = @_;
my $embed = $self->clone;
$base = $self->base unless $base; # default to default base
return $embed unless $base; # we have no base (step1)
$base = new URI::URL $base unless ref $base; # make obj if required
my @u = @{$embed}{qw(scheme host port path params query frag) };
# just use base if we are empty (2a)
return $base->clone if (scalar(grep($_, @u)) == 0);
my($scheme, $host, $port, $path, $params, $query, $frag) = @u;
# if we have a scheme we must already be absolute (2b)
return $embed if $scheme;
$embed->{'scheme'} = $base->{'scheme'}; # (2c)
return $embed if $embed->{'netloc'}; # (3)
$embed->{'netloc'} = $base->{'netloc'}; # (3)
$embed->_read_netloc();
return $embed if $embed->{'path'} =~ m:^/:; # (4)
unless ($embed->{'path'}){ # (5)
$embed->{'path'} = $base->{'path'}; # (5)
return $embed if $embed->params; # (5a)
$embed->{'params'} = $base->{'params'}; # (5a)
return $embed if $embed->query; # (5b)
$embed->{'query'} = $base->{'query'}; # (5b)
return $embed;
}
# (Step 6) # draft 6 suggests stack based approach
my $basepath = $base->{'path'};
my $relpath = $embed->{'path'};
$basepath =~ s!^/!!;
$basepath =~ s!/$!/.!; # prevent empty segment
my @path = split('/', $basepath); # base path into segments
pop(@path); # remove last segment
$relpath =~ s!/$!/.!; # prevent empty segment
push(@path, split('/', $relpath)); # append relative segments
my @newpath = ();
my $isdir = 0;
foreach $segment (@path) { # left to right
# warn '> ', join('/', @newpath), ": $segment\n";
if ($segment eq '.') { # ignore "same" directory
$isdir = 1;
}
elsif ($segment eq '..') {
$isdir = 1;
my $last = pop(@newpath);
if (!defined $last) { # nothing to pop
push(@newpath, $segment); # so must append
}
elsif ($last eq '..') { # '..' cannot match '..'
# so put pack again, and append
push(@newpath, $last, $segment);
}
else {
# it was a component,
# keep popped
}
} else {
$isdir = 0;
push(@newpath, $segment);
}
}
$embed->{'path'} = join('/', @newpath) . ($isdir ? '/' : '');
return $embed;
}
# default_port()
#
# subclasses will usually want to override this
#
sub default_port {
0;
}
# _expect()
#
# Handy low-level object method tester. See test code at end.
#
sub _expect {
my($self, $method, $expect, @args) = @_;
my $result = $self->$method(@args);
$expect = 'UNDEF' unless defined $expect;
$result = 'UNDEF' unless defined $result;
return 1 if $expect eq $result;
warn "'$self'->$method(@args) = '$result' (expected '$expect')\n";
$self->dump;
confess "Test Failed";
}
####################################################################
#
# Internal pre-defined basic scheme support
# Define the default ports for major net services
# From RFC 1738 "Uniform Resource Locators (URL)"
# Note that other Well Known Port Numbers are defined in
# the "Assigned Numbers" RFC (1700).
# XXX the rfc1700 ones should arguably use getservbyname
# or be in a Etc/Services.pm or something
package URI::URL::file; @ISA = qw(URI::URL::generic);
package URI::URL::ftp; @ISA = qw(URI::URL::generic);
sub default_port { 21 };
package URI::URL::telnet; @ISA = qw(URI::URL::generic);
sub default_port { 23 };
package URI::URL::whois; @ISA = qw(URI::URL::generic);
sub default_port { 43 };
package URI::URL::gopher; @ISA = qw(URI::URL::generic);
sub default_port { 70 };
sub gtype { shift->elem('gtype', @_); }
sub _parse {
my($self, $url) = @_;
$self->{'scheme'} = lc($1) if $url =~ s/^\s*([\w\+\.\-]+)://;
$self->{'netloc'} = $self->unescape($1) if $url =~ s!^//([^/]*)!!;
$self->{'gtype'} = $self->unescape($1) if $url =~ s!^/(.)!!;
my @sss = split(/%09/, $url, 3);
$self->{'selector'} = $self->unescape(shift @sss);
$self->{'search'} = $self->unescape(shift @sss);
$self->{'string'} = $self->unescape(shift @sss);
}
package URI::URL::finger; @ISA = qw(URI::URL::generic);
sub default_port { 79 };
package URI::URL::http; @ISA = qw(URI::URL::generic);
sub default_port { 80 };
package URI::URL::nntp; @ISA = qw(URI::URL::generic);
sub default_port { 119 };
sub _parse {
my($self, $init) = @_;
$self->URI::URL::generic::_parse($init);
my @parts = split(/\//, $self->{'path'});
$self->{'group'} = $self->unescape($parts[1]);
$self->{'digits'}= $self->unescape($parts[2]);
}
package URI::URL::news; @ISA = qw(URI::URL::generic);
sub _parse {
my($self, $init) = @_;
$self->{'scheme'} = lc($1) if ($init =~ s/^\s*([\w\+\.\-]+)://);
my $tmp = $self->unescape($init);
$self->{'grouppart'} = $tmp;
$self->{ ($tmp =~ m/\@/) ? 'article' : 'group' } = $tmp;
}
package URI::URL::wais; @ISA = qw(URI::URL::generic);
sub default_port { 210 };
sub _parse {
my($self, $init) = @_;
$self->URI::URL::generic::_parse($init);
my @parts = split(/\//, $self->{'path'});
$self->{'database'} = $self->unescape($parts[1]);
$self->{'wtype'} = $self->unescape($parts[2]);
$self->{'wpath'} = $self->unescape($parts[3]);
}
package URI::URL::webster; @ISA = qw(URI::URL::generic);
sub default_port { 765 };
package URI::URL::prospero; @ISA = qw(URI::URL::generic);
sub default_port { 1525 }; # says rfc1738, section 3.11
package URI::URL::mailto; @ISA = qw(URI::URL::generic);
sub _parse {
my($self, $init) = @_;
$self->{'scheme'} = lc($1) if ($init =~ s/^\s*([\w\+\.\-]+)://);
$self->{'encoded822addr'} = $self->unescape($init);
}
package URI::URL::rlogin; @ISA = qw(URI::URL::generic);
package URI::URL::tn3270; @ISA = qw(URI::URL::generic);
#####################################################################
#
# If we're not use'd or require'd execute self-test.
# Handy for regression testing and as a quick reference :)
#
# Test is kept behind __END__ so it doesn't take uptime
# and memory unless explicitly required. If you're working
# on the code you might find it easier to comment out the
# eval and __END__ so that error line numbers make more sense.
eval join('',<main::DATA>) || die $@ unless caller();
1;
package main; # the __END__ is in main (why?)
__END__
package main;
use Carp;
import URI::URL qw(uri_escape uri_unescape);
$| = 1;
# Do basic tests first.
# Dies if an error has been detected, prints "ok" otherwise.
print "Self tests...\n";
&scheme_parse_test;
&parts_test;
&escape_test;
&newlocal_test;
&absolute_test;
print "$URI::URL::rcsid ok\n";
exit 0;
#####################################################################
#
# scheme_parse_test()
#
# test parsing and retrieval methods
sub scheme_parse_test {
print "scheme_parse_test:\n";
$tests = {
'hTTp://web1.net/a/b/c/welcome#intro'
=> { 'scheme'=>'http', 'host'=>'web1.net', 'port'=>undef,
'path'=>'/a/b/c/welcome', 'frag'=>'intro', 'query'=>undef },
'http://web:1/a?query+text'
=> { 'scheme'=>'http', 'host'=>'web', 'port'=>1,
'path'=>'/a', 'frag'=>undef, 'query'=>'query+text' },
'http://web.net'
=> { 'scheme'=>'http', 'host'=>'web.net', 'port'=>undef,
'path'=>'', 'frag'=>undef, 'query'=>undef, 'user'=>undef },
'ftp://usr:pswd@web:1234/a/b;type=i'
=> { 'host'=>'web', 'port'=>1234, 'path'=>'/a/b',
'user'=>'usr', 'password'=>'pswd', 'params'=>'type=i' },
'gopher://web/2a_selector'
=> { 'gtype'=>'2', 'selector'=>'a_selector' },
'mailto:libwww-perl@ics.uci.edu'
=> { 'encoded822addr'=>'libwww-perl@ics.uci.edu' },
'news:*'
=> { 'grouppart'=>'*' },
'news:comp.lang.perl'
=> { 'group'=>'comp.lang.perl' },
'news:perl-faq/module-list-1-794455075@ig.co.uk'
=> { 'article'=>'perl-faq/module-list-1-794455075@ig.co.uk' },
'nntp://news.com/comp.lang.perl/42'
=> { 'group'=>'comp.lang.perl', 'digits'=>42 },
'telnet://usr:pswd@web:12345/'
=> { 'user'=>'usr', 'password'=>'pswd' },
'wais://web.net/db'
=> { 'database'=>'db' },
'wais://web.net/db?query'
=> { 'database'=>'db', 'query'=>'query' },
'wais://usr:pswd@web.net/db/wt/wp'
=> { 'database'=>'db', 'wtype'=>'wt', 'wpath'=>'wp',
'password'=>'pswd' },
};
foreach $url_str (sort keys %$tests ){
warn "Testing '$url_str'\n";
my($url) = new URI::URL $url_str;
$url_tests = $tests->{$url_str};
while( ($method, $exp) = each %$url_tests ){
$exp = 'UNDEF' unless defined $exp;
$url->_expect('elem', $exp, $method);
}
}
print "scheme parse tests ok\n";
}
#####################################################################
#
# parts_test() (calls netloc_test test)
#
# Test individual component part access functions
#
sub parts_test {
print "parts_test:\n";
# test storage part access/edit methods (netloc, user, password,
# host and port are tested by &netloc_test)
$url = new URI::URL 'http://web/orig/path';
$url->scheme('gopher');
$url->path('1info');
$url->frag('this');
$url->query('key+word');
$url->_expect('str', 'gopher://web/1info?key+word#this');
&netloc_test;
$url->query(undef);
$url->_expect('query', undef);
}
#
# netloc_test()
#
# Test automatic netloc synchronisation
#
sub netloc_test {
my $url = new URI::URL 'http://anonymous:p%61ss@hst:12345';
$url->_expect('user', 'anonymous');
$url->_expect('password', 'pass');
$url->_expect('host', 'hst');
$url->_expect('port', 12345);
$url->_expect('netloc', 'anonymous:pass@hst:12345');
$url->user('nemo');
$url->password('p2');
$url->host('hst2');
$url->port(2);
$url->_expect('netloc', 'nemo:p2@hst2:2');
$url->user(undef);
$url->password(undef);
$url->port(undef);
$url->_expect('netloc', 'hst2');
}
#####################################################################
#
# escape_test()
#
# escaping functions
sub escape_test {
print "escape_test:\n";
# supply escaped URL
$url = new URI::URL 'http://web/this%20has%20spaces';
# check component is unescaped
$url->_expect('path', '/this has spaces');
# modify the unescaped form
$url->path('this ALSO has spaces');
# check whole url is escaped
$url->_expect('str', 'http://web/this%20ALSO%20has%20spaces');
# now make 'A' an unsafe character :-)
$url->unsafe('A\x00-\x20"#%;<>?\x7F-\xFF');
$url->_expect('str', 'http://web/this%20%41LSO%20has%20spaces');
$url = new URI::URL uri_escape('http://web/this % needs escaping');
$url->_expect('str', 'http://web/this%20%25%20needs%20escaping');
my $all = pack('c*',0..255);
my $esc = uri_escape($all);
my $new = uri_unescape($esc);
die "uri_escape->uri_unescape mismatch" unless $all eq $new;
# need more tests here
}
#####################################################################
#
# newlocal_test()
#
sub newlocal_test {
print "newlocal_test:\n";
chdir('/tmp') or die $!;
$url = newlocal URI::URL;
$url->_expect('str', 'file://localhost/tmp/');
chdir('/') or die $!;
$url = newlocal URI::URL '/usr/';
$url->_expect('str', 'file://localhost/usr/');
}
#####################################################################
#
# absolute_test()
#
sub absolute_test {
print "Test relative/absolute URI::URL parsing:\n";
# Tests from draft-ietf-uri-relative-url-06.txt
# Copied verbatim from the draft, parsed below
@URI::URL::g::ISA = qw(URI::URL::generic); # for these tests
my $base = 'http://a/b/c/d;p?q#f';
$absolute_tests = <<EOM;
5.1. Normal Examples
g:h = <URL:g:h>
g = <URL:http://a/b/c/g>
./g = <URL:http://a/b/c/g>
g/ = <URL:http://a/b/c/g/>
/g = <URL:http://a/g>
//g = <URL:http://g>
?y = <URL:http://a/b/c/d;p?y>
g?y = <URL:http://a/b/c/g?y>
g?y/./x = <URL:http://a/b/c/g?y/./x>
#s = <URL:http://a/b/c/d;p?q#s>
g#s = <URL:http://a/b/c/g#s>
g#s/./x = <URL:http://a/b/c/g#s/./x>
g?y#s = <URL:http://a/b/c/g?y#s>
;x = <URL:http://a/b/c/d;x>
g;x = <URL:http://a/b/c/g;x>
g;x?y#s = <URL:http://a/b/c/g;x?y#s>
. = <URL:http://a/b/c/>
./ = <URL:http://a/b/c/>
.. = <URL:http://a/b/>
../ = <URL:http://a/b/>
../g = <URL:http://a/b/g>
../.. = <URL:http://a/>
../../ = <URL:http://a/>
../../g = <URL:http://a/g>
5.2. Abnormal Examples
Although the following abnormal examples are unlikely to occur
in normal practice, all URL parsers should be capable of resolving
them consistently. Each example uses the same base as above.
An empty reference resolves to the complete base URL:
<> = <URL:http://a/b/c/d;p?q#f>
Parsers must be careful in handling the case where there are more
relative path ".." segments than there are hierarchical levels in
the base URL's path. Note that the ".." syntax cannot be used to
change the <net_loc> of a URL.
../../../g = <URL:http://a/../g>
../../../../g = <URL:http://a/../../g>
Similarly, parsers must avoid treating "." and ".." as special when
they are not complete components of a relative path.
/./g = <URL:http://a/./g>
/../g = <URL:http://a/../g>
g. = <URL:http://a/b/c/g.>
.g = <URL:http://a/b/c/.g>
g.. = <URL:http://a/b/c/g..>
..g = <URL:http://a/b/c/..g>
Less likely are cases where the relative URL uses unnecessary or
nonsensical forms of the "." and ".." complete path segments.
./../g = <URL:http://a/b/g>
./g/. = <URL:http://a/b/c/g/>
g/./h = <URL:http://a/b/c/g/h>
g/../h = <URL:http://a/b/c/h>
Finally, some older parsers allow the scheme name to be present in
a relative URL if it is the same as the base URL scheme. This is
considered to be a loophole in prior specifications of partial
URLs [1] and should be avoided by future parsers.
http:g = <URL:http:g>
http: = <URL:http:>
EOM
# convert text to list like
# @absolute_tests = ( ['g:h' => 'g:h'], ...)
for $line (split("\n", $absolute_tests)) {
next unless $line =~ /^\s{6}/;
if ($line =~ /^\s+(\S+)\s*=\s*<URL:([^>]*)>/) {
my($rel, $abs) = ($1, $2);
$rel = '' if $rel eq '<>';
push(@absolute_tests, [$rel, $abs]);
}
else {
warn "illegal line '$line'";
}
}
# add some extra ones for good measure
push(@absolute_tests, ['x/y//../z', => 'http://a/b/c/x/y/z']);
print " Relative + Base => Expected Absolute URL\n";
print "================================================\n";
for $test (@absolute_tests) {
my($rel, $abs) = @$test;
my $abs_url = new URI::URL $abs;
my $abs_str = $abs_url->str;
printf(" %-10s + $base => $abs\n", $rel);
my $u = new URI::URL $rel, $base;
my $got = $u->abs;
$got->_expect('str', $abs_str);
}
print "absolute test ok\n";
}
1;
exit 0;