*** ../libwww-perl-0.12/Changes.txt Fri Jul 8 01:16:17 1994 --- Changes.txt Thu Jul 21 01:59:06 1994 *************** *** 1,12 **** Changes to libwww-perl ====================== ! # $Id: Changes.txt,v 0.12 1994/07/08 08:08:14 fielding Exp $ See the files README.html and Artistic.txt for licensing and distribution info. See the file INSTALL.txt for installation information. If you have any suggestions, bug reports, fixes, or enhancements, ! send them to Roy Fielding at . Known problems --- 1,13 ---- Changes to libwww-perl ====================== ! # $Id: Changes.txt,v 0.15 1994/07/21 08:59:01 fielding Exp $ See the files README.html and Artistic.txt for licensing and distribution info. See the file INSTALL.txt for installation information. If you have any suggestions, bug reports, fixes, or enhancements, ! send them to Roy Fielding at or join the libwww-perl ! mailing list at . Known problems *************** *** 14,21 **** --- 15,55 ---- although the code itself is fairly easy to read and understand. Things that need to be done (let us know if you are working on something good) + A safe configuration program to automate the install process. Interfaces to FTP, Gopher, WAIS, ... A real HTML (or SGML) parser. + + NOTE: Version numbers increment according to the significance of the new + changes. The major number is incremented only for large overhauls + of the code or changes in the basic architecture/interface which + makes if incompatible with prior releases. The first minor number + reflects a change in the interface (such as a new library or a new + method for making requests) which is still compatible with the old. + The last number reflects minor bug fixes and documentation updates. + + + Version 0.20 July 20, 1994 + Added the wwwbot.pl library and testbot program (by Brooks Cutter) + for implementing the robot exclusion protocol. + Added the testlinks program for yet another example of how useful + programs can be easily implemented on top of libwww-perl -- it also + tests just about every aspect of the request libraries. + Added &www'set_def_header() and check_defaults() so that protocol + header defaults (such as the HTTP From: header) can be set within + the library and other default request headers can be set + once by the client and effect all requests (e.g. User-Agent). + Fixed the source of an annoying warning from "perl -w" in wwwhttp.pl. + Moved some existing code in wwwmime.pl into a separate function + set_content() which can set the "content-type" header for any given + file extension. + Added &wwwurl'get_site() for extracting the site name (server:port) + from a given URL. + Updated the get program to make use of the new interface changes. + Changed the eval of &wwwscheme'request to a simpler &$routine call + after a suggestion from Brooks. + Fixed a bug in &wwwhtml'extract_links() which was causing a segmentation + fault when a completely -free file (i.e. a text file) was + mistakenly extracted. Version 0.12 July 8, 1994 *** ../libwww-perl-0.12/INSTALL.txt Fri Jul 8 01:17:01 1994 --- INSTALL.txt Wed Jul 20 20:55:41 1994 *************** *** 1,6 **** libwww-perl Installation Information ==================================== ! # $Id: INSTALL.txt,v 0.12 1994/07/08 08:08:14 fielding Exp $ See the files README.html and Artistic.txt for licensing and distribution info. See the file Changes.txt for a complete list of changes and version history. --- 1,6 ---- libwww-perl Installation Information ==================================== ! # $Id: INSTALL.txt,v 0.13 1994/07/21 03:55:33 fielding Exp $ See the files README.html and Artistic.txt for licensing and distribution info. See the file Changes.txt for a complete list of changes and version history. *************** *** 21,38 **** it will be in the form of a compressed unix tar file. If it has not already been decompressed by your WWW client, then do one of: ! % uncompress libwww-perl-0.12.tar.Z ! % gunzip libwww-perl-0.12.tar.gz depending on which compressed version you downloaded. ! 2. Move the resulting libwww-perl-0.12.tar file to the directory above where you want to install libwww-perl, cd to that directory, and do ! % tar xvf libwww-perl-0.12.tar ! to create the directory ./libwww-perl-0.12 containing the following: Artistic.txt -- the Artistic License governing redistribution of the libwww-perl package. --- 21,38 ---- it will be in the form of a compressed unix tar file. If it has not already been decompressed by your WWW client, then do one of: ! % uncompress libwww-perl-0.20.tar.Z ! % gunzip libwww-perl-0.20.tar.gz depending on which compressed version you downloaded. ! 2. Move the resulting libwww-perl-0.20.tar file to the directory above where you want to install libwww-perl, cd to that directory, and do ! % tar xvf libwww-perl-0.20.tar ! to create the directory ./libwww-perl-0.20 containing the following: Artistic.txt -- the Artistic License governing redistribution of the libwww-perl package. *************** *** 48,59 **** --- 48,64 ---- mime.types -- the standard MIME content-types and default filename extensions in the same format as that used by NCSA httpd_1.3 and many WWW clients. + testbot -- a simple program for testing the wwwbot.pl package. testdates -- a simple program for testing the wwwdates.pl package. testescapes -- a program for testing wwwurl'escape and unescape. + testlinks -- a simple program for testing HTML link extraction and + combinations of GET and HEAD requests. www.pl -- the primary entry point for WWW requests -- give it any absolute URL and a request method and it will try to perform the method using the URL's protocol scheme (or a proxy). + wwwbot.pl -- a package for implementing the robot exclusion + protocol. wwwdates.pl -- a package of library utilities for reading, manipulating, and writing dates as they are formatted by most World-Wide Web software and protocols. *************** *** 77,83 **** 3. You may need to change the following (with any text editor). ! The first line of each program (get, testdates, and testescapes) should point to your perl executable: #!/usr/local/bin/perl --- 82,88 ---- 3. You may need to change the following (with any text editor). ! The first line of each program (get and all test*) should point to your perl executable: #!/usr/local/bin/perl *************** *** 86,92 **** 4. The LIBWWW_PERL environment variable must be set to point to the libwww-perl directory, e.g. ! % setenv LIBWWW_PERL /usr/local/lib/libwww-perl-0.12 This allows clients like "get" to place the libwww-perl on their @INC path and also allows wwwmime.pl to find the standard mime.types file. --- 91,97 ---- 4. The LIBWWW_PERL environment variable must be set to point to the libwww-perl directory, e.g. ! % setenv LIBWWW_PERL /usr/local/lib/libwww-perl-0.20 This allows clients like "get" to place the libwww-perl on their @INC path and also allows wwwmime.pl to find the standard mime.types file. *************** *** 94,100 **** 5. Make sure the programs are executable: ! % chmod 755 get testdates testescapes % ln -s get HEAD HEAD is in uppercase only because there already exists a unix head command. --- 99,105 ---- 5. Make sure the programs are executable: ! % chmod 755 get test* % ln -s get HEAD HEAD is in uppercase only because there already exists a unix head command. *************** *** 107,114 **** ========================================================================== Usage: ! See the "get" program for an example of how to interface with libwww-perl. ! More documentation will be available later this summer (Northern Hemisphere ;-) ========================================================================== --- 112,120 ---- ========================================================================== Usage: ! See the "get" and test* programs for examples of how to interface with ! libwww-perl. More documentation will be available later this summer ! (Northern Hemisphere ;-) ========================================================================== *************** *** 128,135 **** 2. How do I contribute my changes to the standard distribution? ! Just send them (preferably as a context diff suitable for reading by ! the patch program) to Roy Fielding at . If you have RCS (or CVS), you can use the included RCS repository to keep track of your changes and merge them with later distributions. You are also free to send changes to others by mail or news (or even disk), --- 134,146 ---- 2. How do I contribute my changes to the standard distribution? ! First, you should join the libwww-perl mailing list ! by sending a subscribe request, including your name and preferred e-mail ! address, to . You will be sent a welcome ! message when you are placed on the list. ! ! After that, send a mail message describing your changes or suggestions ! to and we can all talk about them. If you have RCS (or CVS), you can use the included RCS repository to keep track of your changes and merge them with later distributions. You are also free to send changes to others by mail or news (or even disk), *** ../libwww-perl-0.12/README.html Fri Jul 8 01:18:45 1994 --- README.html Wed Jul 20 20:55:17 1994 *************** *** 1,5 **** ! libwww-perl: Distribution Information --- 1,5 ---- ! libwww-perl: Distribution Information *************** *** 28,33 **** --- 28,45 ---- distribution with full citation to the developers. See below for the list of past and current contributors.

+ A mailing list has been + established for technical discussion about libwww-perl, + including problem reports, interim fixes, suggestions for features, + and contributions. The mailing list address is +

+      libwww-perl@ics.uci.edu
+ 
+ and administrivia (including subscribe requests) should be sent to +
+      libwww-perl-request@ics.uci.edu
+ 
+ Support for the initial development and distribution of libwww-perl has been provided by the Arcadia Project at UCI, part of the larger *************** *** 48,55 **** --- 60,77 ----
FILE
Support for GET and HEAD requests on file://localhost URLs is provided, with results translated to HTTP responses as if they were handled by an HTTP gateway. + +

Tools based on libwww-perl

+ + The following tools use libwww-perl and are available at their + own distribution sites: +
+
MOMspider +
The Multi-Owner Maintenance spider created by Roy Fielding.
+ +

Contributors

*************** *** 63,71 **** The following developers have contributed (either directly or indirectly) to the libwww-perl distribution:
!
Brooks Cutter, AT&T Paradyne Corporation, USA !
Contributed code for escaping and unescaping URLs, www'stat(), ! and several bug fixes.
Roy Fielding, University of California, Irvine, USA
Architect and primary developer of the library. --- 85,94 ---- The following developers have contributed (either directly or indirectly) to the libwww-perl distribution:
!
Brooks ! Cutter, STUFF.com, USA !
Contributed wwwbot.pl, testbot, code for escaping and unescaping URLs, ! www'stat(), and several bug fixes.
Roy Fielding, University of California, Irvine, USA
Architect and primary developer of the library. *************** *** 91,99 ****

The Distribution

For easy distribution, libwww-perl is available as a ! gzip'd tar file or as a ! compress'd tar file. It is also available via anonymous ftp from liege.ics.uci.edu in the directory --- 114,122 ----

The Distribution

For easy distribution, libwww-perl is available as a ! gzip'd tar file or as a ! compress'd tar file. It is also available via anonymous ftp from liege.ics.uci.edu in the directory *************** *** 126,131 **** --- 149,157 ---- and default filename extensions in the same format as that used by NCSA httpd_1.3 and many WWW clients. +
testbot +
a simple program for testing the wwwbot.pl package. +
testdates
a simple program for testing the wwwdates.pl package. *************** *** 132,142 **** --- 158,177 ----
testescapes
a simple program for testing the wwwurl'escape and unescape routines. +
testlinks +
a simple program for testing HTML link extraction and + combinations of GET and HEAD requests. +
www.pl
the primary entry point for WWW requests -- give it any absolute URL and a request method and it will try to perform the method using the URL's protocol scheme (or a proxy). +
wwwbot.pl +
a package for implementing the + + robot exclusion protocol. +
wwwdates.pl
a package of library utilities for reading, manipulating, and writing dates as they are formatted by most World-Wide Web software *************** *** 173,188 ****

Version History

If you have any suggestions, bug reports, fixes, or enhancements, send ! them to Roy ! Fielding at (fielding@ics.uci.edu). Also, we would like to ask ! anyone who uses libwww-perl on a regular basis, or intends to ! include it as part of their own tool, to please send ! us an e-mail message which indicates how and where it is being used. ! This is, of course, only voluntary and we don't want anyone to divulge ! private information, but please understand that such acknowledgements ! allow free-software authors like us to justify the time and effort ! needed to build quality tools and development environments.


Please see the file Artistic.txt for complete --- 208,232 ----

Version History

+ Current version is 0.20. + + If you have any suggestions, bug reports, fixes, or enhancements, send ! them to ! Roy Fielding at (fielding@ics.uci.edu) or, preferably, join the ! libwww-perl mailing list as described above. ! Also, we would like to ask anyone who uses libwww-perl on a ! regular basis, or intends to include it as part of their own tool, to ! please send us an e-mail message which indicates how ! and where it is being used. This is, of course, only voluntary and we ! don't want anyone to divulge private information, but please ! understand that such acknowledgements allow free-software authors like ! us to justify the time and effort needed to build quality tools and ! development environments.


Please see the file Artistic.txt for complete *************** *** 207,211 **** Department of Information and Computer Science,
University of California, Irvine, CA 92717-3425 ! Last modified: Fri Jul 8 07:48:21 GMT 1994 --- 251,255 ---- Department of Information and Computer Science,
University of California, Irvine, CA 92717-3425 ! Last modified: Wed Jul 20 20:51:23 1994 *** ../libwww-perl-0.12/get Fri Jul 8 01:14:27 1994 --- get Wed Jul 20 11:10:47 1994 *************** *** 1,5 **** #!/usr/public/bin/perl ! # $Id: get,v 0.12 1994/07/08 08:08:14 fielding Exp $ #----------------------------------------------------------------- # Perform a WWW request on a (set of) absolute or relative URL(s). # The URL(s) may be on the command line or passed via a pipe. --- 1,5 ---- #!/usr/public/bin/perl ! # $Id: get,v 0.14 1994/07/20 18:10:37 fielding Exp $ #----------------------------------------------------------------- # Perform a WWW request on a (set of) absolute or relative URL(s). # The URL(s) may be on the command line or passed via a pipe. *************** *** 16,21 **** --- 16,23 ---- # 14 Jun 1994 (RTF): Changed env variable to LIBWWW_PERL # Fixed the method name to remove any path garbage # 06 Jul 1994 (RTF): Added extra fallback code from Martijn Koster + # 20 Jul 1994 (RTF): The default From header is now set by www.pl + # and &www'set_def_header() is called to set User-Agent # # Created by Roy Fielding to test the libwww-perl system #----------------------------------------------------------------- *************** *** 29,46 **** $method =~ s#^.*/([^/]+)$#$1#; # lose the path $method =~ tr/a-z/A-Z/; # uppercase it ! $UserAgent = "$method/0.3 $www'Library"; # Set up User-Agent: header - $user = ( $ENV{'USER'} || $ENV{'LOGNAME'} || 'unknown' ); - - chop($host = `hostname`); # Get default address - if (index($host,'.') == -1) - { - $host = join('.', $host, `domainname`); - chop($host); - } - $ReplyTo = join('@', $user, $host); # Set up From: header - $pwd = ( $ENV{'PWD'} || $ENV{'cwd'} || '' ); $base = "file://localhost$pwd/"; # Set up initial Base URL --- 31,39 ---- $method =~ s#^.*/([^/]+)$#$1#; # lose the path $method =~ tr/a-z/A-Z/; # uppercase it ! &www'set_def_header('http', 'User-Agent', "$method/0.3"); ! # Set up User-Agent: header $pwd = ( $ENV{'PWD'} || $ENV{'cwd'} || '' ); $base = "file://localhost$pwd/"; # Set up initial Base URL *************** *** 77,94 **** local(%headers) = (); local($content) = ''; - $headers{'User-Agent'} = $UserAgent; # Define the HTTP request headers - $headers{'From'} = $ReplyTo; - print "$method $url HTTP/1.0\n"; # Show user what it looks like ! foreach $hd (keys(%headers)) ! { ! print "$hd: $headers{$hd}\n"; ! } ! print "\n"; # Now do the request ! $response = &www'request($method, $url, *headers, *content, 30); # And print out the result print "HTTP/1.0 $response $wwwerror'RespMessage{$response}\n"; --- 70,85 ---- local(%headers) = (); local($content) = ''; print "$method $url HTTP/1.0\n"; # Show user what it looks like ! # and then do the request $response = &www'request($method, $url, *headers, *content, 30); + foreach $hd (keys(%headers)) # This is cheating, but it shows + { # the default headers generated + next if ($hd =~ m#^[a-z]#); # by the www.pl request library. + print "$hd: $headers{$hd}\n"; + } + print "\n"; # And print out the result print "HTTP/1.0 $response $wwwerror'RespMessage{$response}\n"; *** ../libwww-perl-0.12/testbot Wed Jul 20 21:05:52 1994 --- testbot Wed Jul 20 19:47:34 1994 *************** *** 0 **** --- 1,80 ---- + #!/usr/public/bin/perl + # $Id: testbot,v 1.1 1994/07/21 02:46:45 fielding Exp $ + #----------------------------------------------------------------- + # This program tests the wwwbot.pl library and is based on the one + # written by Brooks Cutter. + # + # The program starts with the BASE URL equal to the current file directory. + # To change it, enter a URL prefixed with "base=", e.g, + # + # Enter a URL (^D to exit): base=http://www.ics.uci.edu/ + # + # The program starts with the User-Agent equal to program name. + # To change it, enter a name prefixed with "ua=", e.g, + # + # Enter a URL (^D to exit): ua=MOMspider/0.20 + # + # 20 Jul 1994 (RTF): Initial Version + # + #----------------------------------------------------------------- + if ($libloc = $ENV{'LIBWWW_PERL'}) { unshift(@INC, $libloc); } + + require "www.pl"; + require "wwwurl.pl"; + require "wwwbot.pl"; + require "dumpvar.pl"; + + $ua = $0; # Method = program name + $ua =~ s#^.*/([^/]+)$#$1#; # lose the path + + &www'set_def_header('http', 'User-Agent', "$ua/0.1"); + # Set up User-Agent: header + $pwd = ( $ENV{'PWD'} || $ENV{'cwd'} || '' ); + + $base = "file://localhost$pwd/"; # Set up initial Base URL + + #----------------------------------------------------------------- + + if ($#ARGV == 0) { # Quickie, one-line version + $url = &wwwurl'absolute($base, $ARGV[0]); + &test($url); + } + else { # Interactive version + print "Enter a URL (^D to exit): "; + while (<>) { + chop; + if (/^base=(.*)$/) { $base = $1; next; } + if (/^ua=(.*)$/) + { + $ua = $1; + &www'set_def_header('http', 'User-Agent', $ua); + next; + } + $url = &wwwurl'absolute($base, $_); + &test($url); + } + continue + { + print "===========================================================\n"; + print "Enter a URL (^D to exit): "; + } + print "\n"; + } + exit(0); + + #----------------------------------------------------------------- + sub test + { + local($url) = @_; + + $delay = &wwwbot'visitable($url); + print "delay of $delay for $url\n"; + if ($delay) { sleep($delay); } + $ok = &wwwbot'allowed($url, $ua); + print "$ok: $url\n"; + &wwwbot'visited($url); + + &dumpvar('wwwbot'); + } + + 1; *** ../libwww-perl-0.12/testlinks Wed Jul 20 21:05:52 1994 --- testlinks Wed Jul 20 12:15:52 1994 *************** *** 0 **** --- 1,153 ---- + #!/usr/public/bin/perl + # $Id: testlinks,v 1.1 1994/07/20 19:14:56 fielding Exp $ + # --------------------------------------------------------------------------- + # GET and extract the links from the URLs passed as arguments, test them + # using HEAD requests, and output an HTML index fragment describing the + # results. Relative links are resolved relative to the URL $base. + # + # Note that this is a non-recursive, completely inefficient version + # of MOMspider's index without the visual cues for problem links. See + # for more information. + # + # 21 Apr 1994 (RTF): Initial Version + # 12 Jul 1994 (RTF): Rewritten to work with libwww-perl + # 20 Jul 1994 (RTF): The default From header is now set by www.pl + # and &www'set_def_header() is called to set User-Agent. + # Added to libwww-perl distribution. + # + # Created by Roy Fielding to test MOMspider and the libwww-perl system + #----------------------------------------------------------------- + if ($libloc = $ENV{'LIBWWW_PERL'}) { unshift(@INC, $libloc); } + + require "www.pl"; + require "wwwurl.pl"; + require "wwwhtml.pl"; + require "wwwerror.pl"; + require "wwwdates.pl"; + + $pname = $0; # Method = program name + $pname =~ s#^.*/([^/]+)$#$1#; # lose the path + + &www'set_def_header('http', 'User-Agent', "$pname/0.3"); + # Set up User-Agent: header + $pwd = ( $ENV{'PWD'} || $ENV{'cwd'} || '' ); + + $base = "file://localhost$pwd/"; # Set up initial Base URL + + $vidx = 'tl0001'; + #----------------------------------------------------------------- + + while ($ARGV[0]) + { + $rel = shift; + $url = &wwwurl'absolute($base, $rel); + + $content = ''; + %headers = (); + + $response = &www'request('GET', $url, *headers, *content, 30); + + @TestLinks = (); + @TestAbs = (); + @TestOrig = (); + @TestType = (); + + &wwwhtml'extract_links($url, *headers, *content, + *TestLinks, *TestAbs, *TestOrig, *TestType); + + # Now print out the index entry for this URL + + $nextbit = ($headers{title} || $url); + print "

$nextbit

\n"; + $vidx++; + print "$response $wwwerror'RespMessage{$response}\n", + "\n$url"; + + if ($nextbit = ($headers{'uri'} || $headers{'location'})) + { + print "
\nURI: $nextbit"; + } + + if ($nextbit = $headers{'last-modified'}) + { + print "
\nLast-modified: $nextbit"; + } + + if ($nextbit = $headers{'expires'}) + { + print "
\nExpires: $nextbit"; + } + + if ($nextbit = $headers{'reply-to'}) + { + print "
\nReply-to: $nextbit"; + } + print "\n"; + + undef $content; + undef %headers; + + if ($TestLinks[0]) + { + print "
    \n"; + foreach $idx (0 .. $#TestLinks) + { + $nextbit = (($TestType[$idx] eq 'L') && 'Link') || + (($TestType[$idx] eq 'I') && 'Image') || + (($TestType[$idx] eq 'Q') && 'Query'); + + print "
  • $nextbit\n"; + $vidx++; + + &test_child($url, $TestLinks[$idx], $TestAbs[$idx], + $TestOrig[$idx]); + } + print "
\n"; + } + print "\n"; + + undef @TestLinks; + undef @TestAbs; + undef @TestOrig; + undef @TestType; + } + + exit(0); + + sub test_child + { + local($parent, $link, $labs, $lorig) = @_; + local($response, $nextbit) = 0; + + local($content) = ''; + local(%headers) = (); + + if ($parent) { $headers{'Referer'} = $parent; } + if ($link =~ /^http/) { sleep(20); } + + $response = &www'request('HEAD', $link, *headers, *content, 30); + + print " $response $wwwerror'RespMessage{$response}\n", + " \n $lorig"; + + if ($nextbit = ($headers{'uri'} || $headers{'location'})) + { + print "
\n URI: $nextbit"; + } + + if ($nextbit = $headers{'last-modified'}) + { + print "
\n Last-modified: $nextbit"; + } + + if ($nextbit = $headers{'expires'}) + { + print "
\n Expires: $nextbit"; + } + + if ($nextbit = $headers{'reply-to'}) + { + print "
\n Reply-To: $nextbit"; + } + print "\n"; + } *** ../libwww-perl-0.12/www.pl Fri Jul 8 01:09:21 1994 --- www.pl Thu Jul 21 01:49:47 1994 *************** *** 1,4 **** ! # $Id: www.pl,v 0.12 1994/07/08 08:08:14 fielding Exp $ # --------------------------------------------------------------------------- # www.pl: A package for handling requests of any World-Wide Web URL, # including requests that should be redirected to a proxy server. --- 1,4 ---- ! # $Id: www.pl,v 0.14 1994/07/21 08:49:43 fielding Exp $ # --------------------------------------------------------------------------- # www.pl: A package for handling requests of any World-Wide Web URL, # including requests that should be redirected to a proxy server. *************** *** 18,23 **** --- 18,27 ---- # 13 Jun 1994 (RTF): Initial version # 07 Jul 1994 (RTF): Added stat() code from Brooks Cutter. # Updated error messages. + # 20 Jul 1994 (RTF): Added set_def_header() and check_defaults() along with + # the DefaultHeaders arrays so that defaults can be set + # once by the client and effect all requests. + # Changed the request eval to version suggested by Brooks. # # If you have any suggestions, bug reports, fixes, or enhancements, # send them to Roy Fielding at . *************** *** 32,43 **** # and a "request" subroutine. package www; ! $Library = 'libwww-perl/0.12'; # To be appended onto client's User-Agent ! # (this must be done within the client code). if ($NoProxy = $ENV{'no_proxy'}) { @DontProxy = split(/,/, $NoProxy); } # =========================================================================== # request(): perform a WWW request using the passed method, absolute URL, # and request headers, and return the resulting response code. # The response codes for all protocols mirror those of HTTP. --- 36,67 ---- # and a "request" subroutine. package www; ! $Library = 'libwww-perl/0.20'; # To be appended onto client's User-Agent + # ========================================================================== + # Get the default From address for HTTP requests and add it to defaults. + + @DefaultHeaders = (); + @DefHeaderSchemes = (); + @DefHeaderValues = (); + + chop($host = `hostname`); + if (index($host,'.') == -1) + { + $host = join('.', $host, `domainname`); + chop($host); + } + $user = ( $ENV{'USER'} || $ENV{'LOGNAME'} || 'unknown' ); + + &set_def_header('http', 'From', join('@', $user, $host)); + + # ========================================================================== + if ($NoProxy = $ENV{'no_proxy'}) { @DontProxy = split(/,/, $NoProxy); } + # =========================================================================== + # =========================================================================== # request(): perform a WWW request using the passed method, absolute URL, # and request headers, and return the resulting response code. # The response codes for all protocols mirror those of HTTP. *************** *** 92,101 **** $object = $url; } ! $routine = '&www' . $scheme . q/'request/; $allowed = '$www' . $scheme . q/'AllowedMethods{$method}/; ! if (!((eval "defined($routine);") && (eval "$allowed;")) ) { return &wwwerror'onrequest($wwwerror'RC_not_implemented_client, $method, $scheme, $host, $port, $object, *headers, *content, --- 116,125 ---- $object = $url; } ! $routine = 'www' . $scheme . q/'request/; $allowed = '$www' . $scheme . q/'AllowedMethods{$method}/; ! if (!((eval "defined(\&$routine);") && (eval "$allowed;")) ) { return &wwwerror'onrequest($wwwerror'RC_not_implemented_client, $method, $scheme, $host, $port, $object, *headers, *content, *************** *** 102,112 **** "Request method not supported by client library"); } if (!$port) { $port = $wwwurl'DefPort{$scheme} }; ! return (eval ! "$routine(\$method,\$host,\$port,\$object,\*headers,\*content,\$timeout);" ! ); } --- 126,193 ---- "Request method not supported by client library"); } + &check_defaults($scheme, *headers); + if (!$port) { $port = $wwwurl'DefPort{$scheme} }; ! return &$routine($method,$host,$port,$object,*headers,*content,$timeout); ! } ! ! ! # =========================================================================== ! # set_def_header(): Allow the client to set a default header for a particular ! # protocol scheme. These headers can be overridden by a header of the ! # same name appearing in the request. This routine should ONLY be used ! # to set headers which will not change throughout the life of the process. ! # ! # Examples: ! # ! # &set_def_header('http', 'From', 'fielding@ics.uci.edu'); ! # &set_def_header('http', 'User-Agent', 'MOMspider/1.0'); ! # ! # Note that if a header called User-Agent is set, this routine will ! # automatically append the current library version to the name given ! # if it has not already been appended. ! # ! sub set_def_header ! { ! local($scheme, $name, $value) = @_; ! local($pos); ! ! # First, see if one has already been set ! ! undef $pos; ! for ($[ .. $#DefaultHeaders) ! { ! $pos = $_, last if (($name eq $DefaultHeaders[$_]) && ! ($scheme eq $DefHeaderSchemes[$_])); ! } ! if (!defined($pos)) { $pos = $#DefaultHeaders + 1; } ! ! if (($name =~ /^User-Agent$/io) && ($value !~ /$Library/o)) ! { ! $value .= " $Library"; ! } ! $DefaultHeaders[$pos] = $name; ! $DefHeaderValues[$pos] = $value; ! $DefHeaderSchemes[$pos] = $scheme; ! } ! ! ! # =========================================================================== ! # check_defaults(): Check the header defaults and, if a corresponding value ! # was not set in the request, add the default header to the array. ! # ! sub check_defaults ! { ! local($scheme, *headers) = @_; ! ! foreach $idx ($[ .. $#DefaultHeaders) ! { ! next unless ($scheme eq $DefHeaderSchemes[$idx]); ! next if ($headers{$DefaultHeaders[$idx]}); ! $headers{$DefaultHeaders[$idx]} = $DefHeaderValues[$idx]; ! } } *** ../libwww-perl-0.12/wwwbot.pl Wed Jul 20 21:05:52 1994 --- wwwbot.pl Wed Jul 20 19:47:13 1994 *************** *** 0 **** --- 1,442 ---- + # $Id: wwwbot.pl,v 1.1 1994/07/21 02:46:45 fielding Exp $ + # --------------------------------------------------------------------------- + # wwwbot.pl: This library implements the Robot Exclusion protocol + # (draft 6/30/94) as documented on + # + # + # wwwbot features (or the difference between this library and the two + # others I've seen): + # + # - There is one routine to call (&wwwbot'allowed) with a URL and your + # UserAgent line. You don't worry about retrieving or parsing /robots.txt + # - The retrieved /robots.txt is cached in memory (by default) so multiple + # calls to the same host:port retrieve /robots.txt once.. + # - The routines &visited() tracks the last time your robot visited a host, + # and &visitable() returns the number of seconds (if any) your robot + # should wait before visiting the site again. + # + # This package has been developed by Brooks Cutter . + # It is distributed under the Artistic License (included with your Perl + # distribution files and with the standard distribution of this package). + # + # 07 Jul 1994 (BBC): Initial version + # 08 Jul 1994 (BBC): Added routines &visit(), &visitable(), &set_host_delay() + # Improved /robots.txt parsing (more flexible) + # Wrote documentation and examples for wwwbot routines + # 20 Jul 1994 (RTF): Reformatted a bit for inclusion in standard libwww-perl. + # + # If you have any suggestions, bug reports, fixes, or enhancements, + # send them to Roy Fielding at or to the libwww-perl + # mailing list at . + # --------------------------------------------------------------------------- + # + # The following are the ten commandments for writing a Web robot + # (Sorted in the modern Letterman order.) + # + # 10. You will visit the list of known robots before writing a new one. + # Point your Web browser at the URL: + # http://web.nexor.co.uk/mak/doc/robots/active.html + # Also take a look at the URL + # http://web.nexor.co.uk/mak/doc/robots/automated-tools.html + # Look for one you can use or modify if necessary before writing a + # new one, as existing robots are probably kinder on remote servers + # than yours will (at least initially) be. + # + # 9. You will join the robots mailing list by sending mail to + # robots-request@nexor.co.uk and read the newsgroup(s) + # comp.infosystems.www.providers, comp.infosystems.www.misc and + # comp.infosystems.www.users (in the order with the greatest chance + # of robot traffic). Look for messages about about robots that + # meet your needs. + # + # 8. Post a message to comp.infosystems.www.providers and send mail to + # robots@nexor.co.uk announcing your intentions to write a robot. + # Include a brief description of the problem you intend to solve + # with this robot, and who it will benefit. (A robot that benefits the + # entire net will be tolerated a longer than one that benefits a small + # group..) Someone may already be working on a similar robot or + # one may exist and it isn't listed. (Sent the author the URL in #10). + # + # 7. You will read the Guidelines for Robot Writers at the URL + # http://web.nexor.co.uk/mak/doc/robots/guidelines.html + # These guidelines describe what your robot should and shouldn't do, + # and following them should ensure that your robot will be welcome + # (or at least not refused) at web sites all over.. + # + # 6. Make you set informative headers like 'User-Agent' with + # the name and version of your robot and 'From' with your email + # address. For more information on these headers check out the URL + # http://info.cern.ch/hypertext/WWW/Protocols/HTTP/HTTP2.html + # + # 5. Buttafuoco! (Sorry...) + # + # 4. Make the remote Web admin's glad your robot is walking their web. + # Code your robot so if it comes across a dead link it sends + # mail to either the address defined in the mailto link or the + # postmaster of that web server. + # You should also pass the HTTP/1.0 header 'Referer' when you + # access a document, with the URL of the document it was hyperlinked + # from. + # + # 3. [Shameless plug] Use wwwbot.pl in Roy Fielding's excellent libwww-perl + # package, because it implements the latest Robot exclusion protcol + # and provides all support routines necessary to keep your robot + # welcome at web sites around the world (and makes you popular + # at parties). + # + # 2. Read the other documents listed at the URL + # http://web.nexor.co.uk/mak/doc/robots/robots.html + # Home of the "World Wide Web Robots, Wanderers and Spiders" page + # + # 1. Save the results of your work and publish the results on + # the net. Publish a list of sites your robot found or provide the + # digested data that your robot creates so others don't have to write + # robots that burden remote servers, they can use your data. + # + # Of course, we have No doubts that you will joyfully provide this + # information to everyone on the net for free, since most of the + # software your using was made available to you because of others that + # feel the same way. + # + # This list was created with the aid of pages off the "World Wide Web Robots, + # Wanderers and Spiders" page written Martijn Koster. That URL again: + # + # + # This list was inspired while working on a robot perl library for + # libwww-perl, which was originally the backend of MOMspider. + # You'll find more info on libwww-perl and MOMspider at + # + # + # + # This list was written by Brooks Cutter, while I was working + # on wwwbot.pl which I wrote while working on the w3new package. + # You'll find more info on w3new at + # + # Brooks Cutter's (bcutter@stuff.com) home page is at + # + # + # =========================================================================== + require "www.pl"; + require "wwwurl.pl"; + + package wwwbot; + + # =========================================================================== + # Package / Global variables: + # ( Don't modify these directly, as there are routines to change the value ) + # ( of these variables. The subroutine API won't change, but the variable ) + # ( names or the way the data is stored may change in the future. ) + + # Flag that determines if the results of /robots.txt should be cached + # Used in &do_cache, &dont_cache, &allowed and &load_robots + + $cache_flag = 1; + + # These next two flags are related to the time a robot should wait + # before accessing a host more than one time.. + # (Note - when the program retrieves /robots.txt, the program + # is not penalized and can perform a immediate request. Retrieving + # /robots.txt (if it exists) is a freebee..) + + # Number of seconds to wait between robot visits to same site + + $bot_host_delay = 60; # wait for 1 minute between requests + + # The user can turn up the delay between host queries, but it can't go + # below $bot_host_delay_min + + $bot_host_delay_min = $bot_host_delay; + + # =========================================================================== + # =========================================================================== + # allowed(): call this routine with the URL you wish to visit and the name + # of your robot. Unless the data has been cached, it will + # call &load_robots. It returns 1 if you are allowed to retrieve + # the URL. + # + # $ok = &wwwbot'allowed($url,$user_agent); + # + # WHERE, + # + # $ok: 1 if your robot is allowed to retrieve $url, otherwise 0 + # + # $url: the WWW URL in absolute form you wish to retrieve. + # + # $user_agent: The name of your program and optionally the version number. + # Use the form "program_name/v.er" (like "roundabot/1.0") + # + # Example: + # + # @urls = ( + # 'http://www.ics.uci.edu/WebSoft/libwww-perl/', + # 'http://www.stuff.com/', + # 'http://web.nexor.co.uk/mak/doc/robots/robots.html', + # ); + # $user_agent = "roundabot/1.0"; + # foreach $url (@urls) { + # unless(&wwwbot'allowed($url,$user_agent)) { + # warn "refused: $url\n"; + # next; + # } + # &www'request('GET',$url,*headers,*content,30); + # ... + # } + # + sub allowed + { + local($url, $user_agent) = @_; + local($scheme, $address, $port, $path, $query, $frag); + local($ret, $n, $ua); + + ($scheme,$address,$port,$path,$query,$frag) = &wwwurl'parse($url); + unless($port) { $port = 80; } + + $ua = $user_agent; + $ua =~ s!/.*!!; + $ua =~ tr/A-Z/a-z/; + unless ($botcache{'_visited',$address,$port}) + { + $ret = &load_robots($address,$port,$ua); + if ($ret == -1) { return(1); } # Unable to connect/retrieve /robots.txt + } + for ($ua,'*') + { + $n = 0; + while ($botcache{$_,++$n}) + { + if (($botcache{$_,$n} eq '*') || + ($botcache{$_,$n} eq substr($path,0,length($botcache{$_,$n})))) + { return(0); } + } + } + unless ($cache_flag) { undef %botcache; } + return(1); + } + + # =========================================================================== + # load_robots(): Retrieves /robots.txt from the host and parses it. + # ( Internal routine that modifies internal variables. ) + # + # $ok = &wwwbot'load_robots($host,$port,$user_agent); + # + # WHERE, + # + # $ok: 1 if your robot is allowed to retrieve $url, otherwise 0 + # + # $host: the name of the http host you wish to retrieve /robots.txt from + # + # $port: the port of the http host you wish to retrieve /robots.txt from + # + # $user_agent: The name of your program and optionally the version number. + # Use the form "program_name/v.er" (like "roundabot/1.0") + # + sub load_robots + { + local($host, $port, $user_agent) = @_; + local(%headers, $content, $response, $url, $n, $ua, $dis); + + local($timeout) = 30; + + $port = 80 unless($port); + $botcache{'_visited',$host,$port} = 1; + $url = "http://$host:$port/robots.txt"; + + %headers = (); + $content = ''; + + $response = &www'request('GET', $url, *headers, *content, $timeout); + return(-1) unless ($response =~ /^2/); + + for (split(/\n/,$content)) + { + next if (/^\s*#/); + s/\s*#.*$//; + if (/^\s*$/) + { + if (@user_agent && @disallow) + { + for $ua (@user_agent) + { + $n = 0; + for $dis (@disallow) + { + $botcache{$ua,++$n} = $dis; + } + } + } + @user_agent = @disallow = (); + } + elsif (/^\s*(User.?Agent|Robot)s?\s*:\s+(\S+.*\S?)\s*$/i) + { + + # The above regex is very forgiving. Among others, it will recognize + # User-Agent: (correct form) + # User-Agents: + # User_Agent: + # UserAgent: + # Robot: + # Robots: + # ...etc... and any lower/upper case version of the string .. + + $ua = $2; + $ua =~ s!/.*!!; + $ua =~ tr/A-Z/a-z/; + push(@user_agent,$ua); + } + elsif (/^\s*Disallow\s*:\s*(\S+.*\S?)\s*$/i) + { + push(@disallow,$1); + } # If I don't recognize it, + } # ignore it (for future compatibility) + + if (@user_agent && @disallow) + { + for $ua (@user_agent) + { + $n = 0; + for $dis (@disallow) + { + $botcache{$ua,++$n} = $dis; + } + } + } + return(1); + } + + + # =========================================================================== + # By default, the library will cache the /robots.txt file from each + # server. The cache is not restricted by size or expired by time, so + # if you want to disable/re-enable the cache, you can use the routines + # + # &wwwbot'dont_cache; + # or + # &wwwbot'do_cache; + # + # You can flush the cache by calling both + # &wwwbot'dont_cache and &wwwbot'do_cache; + # + + sub do_cache { $cache_flag = 1; } + sub dont_cache { $cache_flag = 0; undef %botcache; } + + + # =========================================================================== + # set_host_delay(): Set the number of seconds between connections to the + # same host. the minimum value is 60 seconds. This value + # is used by &wwwbot'visited and &wwwbot'visitable + # + # $ok = &wwwbot'set_host_delay($delay); + # + # WHERE, + # + # $ok: 1 if successful, 0 if not ($delay less than minimum) + # + # $delay: number of seconds that your robot should wait between queries. + # + # Example: + # + # unless(&wwwbot'set_host_delay(90)) { + # die "Unable to set host delay to 90 seconds.\n"; + # } else { + # print "Number of seconds between robot queries set to 90 seconds.\n"; + # } + # + sub set_host_delay + { + local($delay) = @_; + + if ($delay >= $bot_host_delay_min) + { + $bot_host_delay = $delay; + return(1); + } + else + { + warn + "wwwbot'set_host_delay: Minimum host delay is $bot_host_delay_min\n"; + return(0); + } + } + + # =========================================================================== + # get_host_delay(): Get the current host delay. This is initially set + # to the minimum value, so if you call it before calling + # set_host_delay() then you can get the value for the + # minimum host delay. + # + # $delay = &get_host_delay; + # + # WHERE, + # + # $delay: Current delay in seconds between requests from same hosts + # + sub get_host_delay + { + return($bot_host_delay); + } + + # =========================================================================== + # visited(): Should be called immediately after each &www'request call. + # Routine tracks the time of your visit to the host, and is used + # in the &visitable() routine. See &visitable() for more info. + # + # + # &wwwbot'visited($url); + # + # WHERE, + # + # $url: the WWW URL in absolute form that the robot just visited. + # + # Note: See &visitable() for a example including &visited() + # + sub visited + { + local($url) = @_; + + local($address) = (&wwwurl'parse($url))[1]; + + $botvisit{$address} = time; + } + + # =========================================================================== + # visitable(): Should be called immediately before each &www'request call. + # Routine returns 0 if enough time has elapsed since your + # robot's last visit, otherwise returns the number of seconds + # your browser should wait before visiting the host again. + # + # $delay = &wwwbot'visitable($url); + # + # WHERE, + # + # $delay: If your browser has waited long enough since the last access, + # then $delay will be 0. otherwise $delay will be set to the + # number of seconds your robot should wait before trying this + # URL again. + # + # $url: the WWW URL in absolute form that the robot is about to visit. + # + # Example: + # + # $url = 'http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html'; + # $delay = &wwwbot'visitable($url); + # if ($delay) { sleep($delay); } + # # Of course you don't have to sleep - meanwhile, you can check other URLs.. + # $resp = &www'request('HEAD',$url,*headers,*content,30); + # &wwwbot'visited($url); + # + sub visitable + { + local($url) = @_; + + local($address) = (&wwwurl'parse($url))[1]; + local($last_access) = (time - $botvisit{$address}); + + if ($last_access >= $bot_host_delay) { return(0); } + + return($bot_host_delay - $last_access); + } + + # =========================================================================== + + 1; *** ../libwww-perl-0.12/wwwhtml.pl Fri Jul 8 01:12:04 1994 --- wwwhtml.pl Thu Jul 21 01:50:37 1994 *************** *** 1,4 **** ! # $Id: wwwhtml.pl,v 0.12 1994/07/08 08:08:14 fielding Exp $ # --------------------------------------------------------------------------- # wwwhtml: A package for parsing pages of HyperText Markup Language (HTML) # for a World-Wide Web spider. --- 1,4 ---- ! # $Id: wwwhtml.pl,v 0.13 1994/07/21 08:50:11 fielding Exp $ # --------------------------------------------------------------------------- # wwwhtml: A package for parsing pages of HyperText Markup Language (HTML) # for a World-Wide Web spider. *************** *** 14,19 **** --- 14,21 ---- # 13 Jun 1994 (RTF): Initial version # 07 Jul 1994 (RTF): Removed buggy attempt to extract comments. # Updated META parsing to reflect HTML 2.0 proposal. + # 20 Jul 1994 (RTF): Fix segmentation fault if we are fooled into trying + # to extract links from a non-html document. # # If you have any suggestions, bug reports, fixes, or enhancements, # send them to Roy Fielding at . *************** *** 63,72 **** $content =~ s#]*>([^<]+)]*>##i; # Extract the title if ($1) { $headers{'title'} = $1; } ! $content =~ s/^[^<]+[^<]*[^<]+$/>/; # Remove everything after last element # Isolate all META elements as text $content =~ s/]*http-equiv\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi; $content =~ s/]*name\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi; --- 65,77 ---- $content =~ s#]*>([^<]+)]*>##i; # Extract the title if ($1) { $headers{'title'} = $1; } ! $content =~ s/^[^<]+//; # Remove everything before first element ! $content =~ s/>[^<]*/>/g; # Remove everything between elements (text) ! $content =~ s/<[^>]*[^>]+$/>/; # Remove everything after last element + return unless ($content); # Return if we removed everything + # Isolate all META elements as text $content =~ s/]*http-equiv\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi; $content =~ s/]*name\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi; *************** *** 125,131 **** $link =~ tr/A-Z/a-z/; $headers{$link} = $orig; } ! else { warn "A mistake was made in link extraction"; } } } --- 130,136 ---- $link =~ tr/A-Z/a-z/; $headers{$link} = $orig; } ! else { warn "A mistake was made in link extraction from $base"; } } } *** ../libwww-perl-0.12/wwwhttp.pl Fri Jul 8 01:12:43 1994 --- wwwhttp.pl Wed Jul 20 09:15:06 1994 *************** *** 1,4 **** ! # $Id: wwwhttp.pl,v 0.12 1994/07/08 08:08:14 fielding Exp $ # --------------------------------------------------------------------------- # wwwhttp: A package for sending HTTP requests and handling responses for the # World-Wide Web. This package is designed for use by www.pl --- 1,4 ---- ! # $Id: wwwhttp.pl,v 0.13 1994/07/20 16:14:56 fielding Exp $ # --------------------------------------------------------------------------- # wwwhttp: A package for sending HTTP requests and handling responses for the # World-Wide Web. This package is designed for use by www.pl *************** *** 16,21 **** --- 16,22 ---- # 07 Jul 1994 (RTF): Moved require of sys/socket.ph outside of package due to # a bug in perl4 found by Martijn Koster # Fixed error handling in case of problems in eval. + # 19 Jul 1994 (RTF): Fixed nagging warning from perl -w that made no sense. # # If you have any suggestions, bug reports, fixes, or enhancements, # send them to Roy Fielding at . *************** *** 169,175 **** # =========================================================================== ! # parseRFC822head(): Breaks out the headers passed in $head into an % array # indexed by a lower-cased keyword. Returns nothing. # # This routine is (mostly) from Gene Spafford's --- 170,176 ---- # =========================================================================== ! # parseRFC822head(): Breaks out the headers passed in $head into a %headers # indexed by a lower-cased keyword. Returns nothing. # # This routine is (mostly) from Gene Spafford's *************** *** 177,188 **** # sub parseRFC822head { ! local($head, *array) = @_; return if (!$head); ! local($save1) = $*; ! local($keyw, $val); $* = 1; $_ = $head; s/\r//g; --- 178,189 ---- # sub parseRFC822head { ! local($head, *headers) = @_; return if (!$head); ! local($save1, $keyw, $val, @array); + $save1 = ($* || 0); $* = 1; $_ = $head; s/\r//g; *************** *** 193,208 **** { ($keyw, $val) = m/^([^:]+):\s*(.*\S)\s*$/g; $keyw =~ tr/A-Z/a-z/; ! if (defined($array{$keyw})) { ! $array{$keyw} .= ", $val"; } else { ! $array{$keyw} = $val; } } ! $* = $save1; } --- 194,209 ---- { ($keyw, $val) = m/^([^:]+):\s*(.*\S)\s*$/g; $keyw =~ tr/A-Z/a-z/; ! if (defined($headers{$keyw})) { ! $headers{$keyw} .= ", $val"; } else { ! $headers{$keyw} = $val; } } ! $* = $save1; } *** ../libwww-perl-0.12/wwwmime.pl Fri Jul 8 01:13:20 1994 --- wwwmime.pl Fri Jul 15 06:54:42 1994 *************** *** 1,4 **** ! # $Id: wwwmime.pl,v 0.12 1994/07/08 08:08:14 fielding Exp $ # --------------------------------------------------------------------------- # wwwmime.pl: A package for handling MIME-specific operations for # a World-Wide Web client. --- 1,4 ---- ! # $Id: wwwmime.pl,v 0.13 1994/07/15 13:54:34 fielding Exp $ # --------------------------------------------------------------------------- # wwwmime.pl: A package for handling MIME-specific operations for # a World-Wide Web client. *************** *** 11,16 **** --- 11,17 ---- # 13 Jun 1994 (RTF): Initial version # 14 Jun 1994 (RTF): Changed environment variable to LIBWWW_PERL # 07 Jul 1994 (RTF): Made calls to load_mimetypes more tolerant + # 15 Jul 1994 (RTF): Moved some code into new function set_content() # # If you have any suggestions, bug reports, fixes, or enhancements, # send them to Roy Fielding at . *************** *** 90,97 **** sub fakehead { local($suffix, $contentlen, $lastmod, *headers) = @_; - local($[) = 0; - local(@suf, $conenc); $headers{'date'} = &wwwdates'wtime(time,'GMT'); $headers{'mime-version'} = '1.0'; --- 91,96 ---- *************** *** 101,106 **** --- 100,119 ---- { $headers{'last-modified'} = &wwwdates'wtime($lastmod,'GMT'); } + + &set_content($suffix, *headers); + } + + + # =========================================================================== + # set_content(): Set the Content-type and Content-encoding headers based on + # the filename extension(s) passed in $suffix. + # + sub set_content + { + local($suffix, *headers) = @_; + local($[) = 0; + local(@suf, $conenc); $suffix =~ tr/A-Z/a-z/; @suf = split(/\./,$suffix); *** ../libwww-perl-0.12/wwwurl.pl Fri Jul 8 01:14:01 1994 --- wwwurl.pl Sat Jul 16 04:12:22 1994 *************** *** 1,4 **** ! # $Id: wwwurl.pl,v 0.12 1994/07/08 08:08:14 fielding Exp $ # --------------------------------------------------------------------------- # wwwurl: A package for parsing and manipulating World-Wide Web # Uniform Resource Locators (URL). --- 1,4 ---- ! # $Id: wwwurl.pl,v 0.13 1994/07/16 11:12:16 fielding Exp $ # --------------------------------------------------------------------------- # wwwurl: A package for parsing and manipulating World-Wide Web # Uniform Resource Locators (URL). *************** *** 15,20 **** --- 15,21 ---- # Replaced complicated unescape loop with a simple # substitute (from Steven E. Brenner via Brooks Cutter). # Added escape() routine (w/mods) from Brooks Cutter. + # 16 Jul 1994 (RTF): Added get_site() routine. # # If you have any suggestions, bug reports, fixes, or enhancements, # send them to Roy Fielding at . *************** *** 245,249 **** --- 246,272 ---- return &compose($scheme, $addr, $port, $path, $query, $frag); } + + # =========================================================================== + # get_site(): Return the site part of the passed-in absolute URL + # (i.e. the hostname:port) replacing any missing port with + # the default. If the URL scheme does not allow hostport, + # then we return ''; + # + sub get_site + { + local($scheme, $site, $port, $path, $query, $frag) = &parse($_[0]); + + return '' unless (defined($DefPort{$scheme})); + + if (!$port) { $port = $DefPort{$scheme}; } + + if ($port) { $site .= ":$port"; } + + return $site; + } + + + # =========================================================================== 1;