another patch for wwwhtml.pl

Roy T. Fielding (fielding@simplon.ICS.UCI.EDU)
Fri, 22 Jul 1994 03:13:00 -0700


*sigh*

Here is another patch for wwwhtml.pl -- the problem was found when
MOMspider fell into a black hole due to some ugly HTML like:

    You can read about <A HREF="
    http://hoohoo.ncsa.uiuc.edu:80/cgi/"> the Common Gateway Interface
    at the URL
    http://hoohoo.ncsa.uiuc.edu:80/cgi/.</A> In addition to documenting

(note how the newline occurs after the doublequote).

To make matters worse, this happens to be from Thomas Boutell's WWW-FAQ.html
and thus there are hundreds of copies of this lying around the web.


....Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                   (fielding@ics.uci.edu)
    <A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>

p.s. I'll be out of touch from now until Sunday evening due to a friend's
     wedding in lovely Atwater, CA (only a 7 hour drive to the middle of
     nowhere).  Don't break anything while I'm gone  ;-)

------------------------------cut here-----------------------------------
*** ../libwww-perl-0.20/wwwhtml.pl	Thu Jul 21 01:50:37 1994
--- wwwhtml.pl	Fri Jul 22 02:58:20 1994
***************
*** 1,4 ****
! # $Id: wwwhtml.pl,v 0.13 1994/07/21 08:50:11 fielding Exp $
  # ---------------------------------------------------------------------------
  # wwwhtml: A package for parsing pages of HyperText Markup Language (HTML)
  #          for a World-Wide Web spider.
--- 1,4 ----
! # $Id: wwwhtml.pl,v 0.14 1994/07/22 09:57:52 fielding Exp $
  # ---------------------------------------------------------------------------
  # wwwhtml: A package for parsing pages of HyperText Markup Language (HTML)
  #          for a World-Wide Web spider.
***************
*** 16,21 ****
--- 16,26 ----
  #                    Updated META parsing to reflect HTML 2.0 proposal.
  # 20 Jul 1994 (RTF): Fix segmentation fault if we are fooled into trying
  #                    to extract links from a non-html document.
+ # 22 Jul 1994 (RTF): Fixed parsing of href's that had a new-line after
+ #                    the quote mark, causing an extra space to precede the
+ #                    extracted URL, which in turn created a black hole.
+ #                    Also added code to extract and change the base URL
+ #                    if there exists a <BASE href="..."> element.
  #
  # If you have any suggestions, bug reports, fixes, or enhancements,
  # send them to Roy Fielding at <fielding@ics.uci.edu>.
***************
*** 62,67 ****
--- 67,75 ----
  
      $content =~ s/\s+/ /g;           # Remove all extra whitespace and newlines
  
+     $content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i; # Base?
+     if ($1) { $base = $1; }
+     
      $content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i;  # Extract the title
      if ($1) { $headers{'title'} = $1; }
      
***************
*** 73,84 ****
      return unless ($content);       # Return if we removed everything
  
                                      # Isolate all META elements as text
!     $content =~ s/<meta\s[^>]*http-equiv\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
!     $content =~ s/<meta\s[^>]*name\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
                                      # Isolate all A element HREFs as text
!     $content =~ s/<a\s[^>]*href\s*=\s*"?([^">]+)[^>]*>/A $1\n/gi;
                                      # Isolate all IMG element SRCs as text
!     $content =~ s/<img\s[^>]*src\s*=\s*"?([^">]+)[^>]*>/I $1\n/gi;
  
      $content =~ s/<[^>]*>//g;       # Remove all remaining elements
      $content =~ s/\n+/\n/g;         # Remove all blank lines
--- 81,92 ----
      return unless ($content);       # Return if we removed everything
  
                                      # Isolate all META elements as text
!     $content =~ s/<meta\s[^>]*http-equiv\s*=\s*"?\s*([^">\s]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
!     $content =~ s/<meta\s[^>]*name\s*=\s*"?\s*([^">\s]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
                                      # Isolate all A element HREFs as text
!     $content =~ s/<a\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>/A $1\n/gi;
                                      # Isolate all IMG element SRCs as text
!     $content =~ s/<img\s[^>]*src\s*=\s*"?\s*([^">\s]+)[^>]*>/I $1\n/gi;
  
      $content =~ s/<[^>]*>//g;       # Remove all remaining elements
      $content =~ s/\n+/\n/g;         # Remove all blank lines
***************
*** 89,95 ****
  
      foreach $elem (split(/\n/,$content))
      {
!         if ($elem =~ /^A (.*)$/)
          {
              $orig = $1;
              push(@lorig, $orig);
--- 97,103 ----
  
      foreach $elem (split(/\n/,$content))
      {
!         if ($elem =~ /^A\s+(\S*)$/)
          {
              $orig = $1;
              push(@lorig, $orig);
***************
*** 106,112 ****
              }
              push(@links, &wwwurl'compose($scheme,$host,$port,$path,'',''));
          }
!         elsif ($elem =~ /^I (.*)$/)
          {
              $orig = $1;
              push(@lorig, $orig);
--- 114,120 ----
              }
              push(@links, &wwwurl'compose($scheme,$host,$port,$path,'',''));
          }
!         elsif ($elem =~ /^I\s+(\S*)$/)
          {
              $orig = $1;
              push(@lorig, $orig);