another patch for wwwhtml.pl
Roy T. Fielding (fielding@simplon.ICS.UCI.EDU)
Fri, 22 Jul 1994 03:13:00 -0700
*sigh*
Here is another patch for wwwhtml.pl -- the problem was found when
MOMspider fell into a black hole due to some ugly HTML like:
You can read about <A HREF="
http://hoohoo.ncsa.uiuc.edu:80/cgi/"> the Common Gateway Interface
at the URL
http://hoohoo.ncsa.uiuc.edu:80/cgi/.</A> In addition to documenting
(note how the newline occurs after the doublequote).
To make matters worse, this happens to be from Thomas Boutell's WWW-FAQ.html
and thus there are hundreds of copies of this lying around the web.
....Roy Fielding ICS Grad Student, University of California, Irvine USA
(fielding@ics.uci.edu)
<A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>
p.s. I'll be out of touch from now until Sunday evening due to a friend's
wedding in lovely Atwater, CA (only a 7 hour drive to the middle of
nowhere). Don't break anything while I'm gone ;-)
------------------------------cut here-----------------------------------
*** ../libwww-perl-0.20/wwwhtml.pl Thu Jul 21 01:50:37 1994
--- wwwhtml.pl Fri Jul 22 02:58:20 1994
***************
*** 1,4 ****
! # $Id: wwwhtml.pl,v 0.13 1994/07/21 08:50:11 fielding Exp $
# ---------------------------------------------------------------------------
# wwwhtml: A package for parsing pages of HyperText Markup Language (HTML)
# for a World-Wide Web spider.
--- 1,4 ----
! # $Id: wwwhtml.pl,v 0.14 1994/07/22 09:57:52 fielding Exp $
# ---------------------------------------------------------------------------
# wwwhtml: A package for parsing pages of HyperText Markup Language (HTML)
# for a World-Wide Web spider.
***************
*** 16,21 ****
--- 16,26 ----
# Updated META parsing to reflect HTML 2.0 proposal.
# 20 Jul 1994 (RTF): Fix segmentation fault if we are fooled into trying
# to extract links from a non-html document.
+ # 22 Jul 1994 (RTF): Fixed parsing of href's that had a new-line after
+ # the quote mark, causing an extra space to precede the
+ # extracted URL, which in turn created a black hole.
+ # Also added code to extract and change the base URL
+ # if there exists a <BASE href="..."> element.
#
# If you have any suggestions, bug reports, fixes, or enhancements,
# send them to Roy Fielding at <fielding@ics.uci.edu>.
***************
*** 62,67 ****
--- 67,75 ----
$content =~ s/\s+/ /g; # Remove all extra whitespace and newlines
+ $content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i; # Base?
+ if ($1) { $base = $1; }
+
$content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i; # Extract the title
if ($1) { $headers{'title'} = $1; }
***************
*** 73,84 ****
return unless ($content); # Return if we removed everything
# Isolate all META elements as text
! $content =~ s/<meta\s[^>]*http-equiv\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
! $content =~ s/<meta\s[^>]*name\s*=\s*"?([^">]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
# Isolate all A element HREFs as text
! $content =~ s/<a\s[^>]*href\s*=\s*"?([^">]+)[^>]*>/A $1\n/gi;
# Isolate all IMG element SRCs as text
! $content =~ s/<img\s[^>]*src\s*=\s*"?([^">]+)[^>]*>/I $1\n/gi;
$content =~ s/<[^>]*>//g; # Remove all remaining elements
$content =~ s/\n+/\n/g; # Remove all blank lines
--- 81,92 ----
return unless ($content); # Return if we removed everything
# Isolate all META elements as text
! $content =~ s/<meta\s[^>]*http-equiv\s*=\s*"?\s*([^">\s]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
! $content =~ s/<meta\s[^>]*name\s*=\s*"?\s*([^">\s]+)[^>]*content\s*=\s*"?([^">]+)[^>]*>/M $1 $2\n/gi;
# Isolate all A element HREFs as text
! $content =~ s/<a\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>/A $1\n/gi;
# Isolate all IMG element SRCs as text
! $content =~ s/<img\s[^>]*src\s*=\s*"?\s*([^">\s]+)[^>]*>/I $1\n/gi;
$content =~ s/<[^>]*>//g; # Remove all remaining elements
$content =~ s/\n+/\n/g; # Remove all blank lines
***************
*** 89,95 ****
foreach $elem (split(/\n/,$content))
{
! if ($elem =~ /^A (.*)$/)
{
$orig = $1;
push(@lorig, $orig);
--- 97,103 ----
foreach $elem (split(/\n/,$content))
{
! if ($elem =~ /^A\s+(\S*)$/)
{
$orig = $1;
push(@lorig, $orig);
***************
*** 106,112 ****
}
push(@links, &wwwurl'compose($scheme,$host,$port,$path,'',''));
}
! elsif ($elem =~ /^I (.*)$/)
{
$orig = $1;
push(@lorig, $orig);
--- 114,120 ----
}
push(@links, &wwwurl'compose($scheme,$host,$port,$path,'',''));
}
! elsif ($elem =~ /^I\s+(\S*)$/)
{
$orig = $1;
push(@lorig, $orig);