html parsing (more problems)
Anthony Thyssen (anthony@cit.gu.edu.au)
Wed, 23 Nov 1994 22:20:18 +1000
Greetings, yes it is me again.
While checking links (using HEAD requests on an http URL),
a couple of documents were downloaded that had no <TITLE> HTML element.
At that point the HTML parsing routine set $headers{'title'}
to whatever $1 happened to contain (left over from outside the routine!).
Here is the relevant code ....
#=======8<------CUT HERE--------axes/crowbars permitted---------------
$content =~ s/\s+/ /g; # Remove all extra whitespace and newlines
$content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i; # Base?
if ($1) { $base = $1; }
$content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i; # Extract the title
if ($1) { $headers{'title'} = $1; }
#=======8<------CUT HERE--------axes/crowbars permitted---------------
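You will notice that the base test has the same problem: when there is
no <BASE> element, $base is set from whatever stale value $1 holds.
Inside the routine, a successful base match also leaves its $1 behind
for the title test to pick up when the document has no title.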
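To make the failure concrete, here is a minimal sketch of it. The URL, the
sample document and the %fixed hash are made up for illustration; only the
title substitution is taken from wwwhtml.pl. It relies on the fact that a
failed match leaves $1 as it was after the last successful match.

```perl
use strict;
use warnings;

# Some earlier, unrelated match that succeeds and sets $1 to 'http'.
my $url = 'http://example.org/page.html';    # illustrative URL
$url =~ m#^(\w+)://#;

# A document with no <TITLE> element (illustrative).
my $content = '<html><body>No title here</body></html>';

# Unpatched logic: the substitution fails, so $1 still holds 'http'.
my %headers;
$content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i;
if ($1) { $headers{'title'} = $1; }

# Testing the substitution itself avoids reading the stale value.
my %fixed;
if ($content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i) {
    $fixed{'title'} = $1;
}

print "unpatched title: $headers{'title'}\n";
print "patched title:   ", (exists $fixed{'title'} ? $fixed{'title'} : '(none)'), "\n";
```

The unpatched branch ends up with a "title" that is really a fragment of
the URL, while the second form leaves the title unset.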
The following patch will fix this BUG.
#=======8<------CUT HERE--------axes/crowbars permitted---------------
*** wwwhtml.pl.orig Sun Nov 6 15:39:38 1994
--- wwwhtml.pl Wed Nov 23 22:04:34 1994
***************
*** 67,77 ****
$content =~ s/\s+/ /g; # Remove all extra whitespace and newlines
! $content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i; # Base?
! if ($1) { $base = $1; }
! $content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i; # Extract the title
! if ($1) { $headers{'title'} = $1; }
$content =~ s/^[^<]+//; # Remove everything before first element
$content =~ s/>[^<]*/>/g; # Remove everything between elements (text)
--- 67,81 ----
$content =~ s/\s+/ /g; # Remove all extra whitespace and newlines
! # Base Given
! if( $content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i ) {
! $base = $1;
! }
! # Extract the title
! if( $content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i ) {
! $headers{'title'} = $1;
! }
$content =~ s/^[^<]+//; # Remove everything before first element
$content =~ s/>[^<]*/>/g; # Remove everything between elements (text)
#=======8<------CUT HERE--------axes/crowbars permitted---------------
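For anyone who wants to try the patched logic outside wwwhtml.pl, the
following is a rough stand-alone sketch. The sub name
extract_base_and_title and the sample documents are my own inventions
for testing and are not part of the library; the two substitutions are the
ones from the patch above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the patched logic. Returns ($base, $title);
# either one is undef when the corresponding element is absent.
sub extract_base_and_title {
    my ($content) = @_;
    my ($base, $title);

    $content =~ s/\s+/ /g;    # Remove all extra whitespace and newlines

    # Base given
    if ($content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i) {
        $base = $1;
    }
    # Extract the title
    if ($content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i) {
        $title = $1;
    }
    return ($base, $title);
}

# Deliberately leave a stale value in $1 before calling the routine.
"stale value" =~ /(\w+)/;

# Document with a base but no title: the title must stay undefined.
my ($base1, $title1) = extract_base_and_title(
    '<head><base href="http://www.example.org/docs/"></head><body>x</body>');

# Document with a title but no base: the base must stay undefined.
my ($base2, $title2) = extract_base_and_title(
    '<head><title>Hello World</title></head><body>x</body>');

print "doc 1: base=", $base1 // '(none)', " title=", $title1 // '(none)', "\n";
print "doc 2: base=", $base2 // '(none)', " title=", $title2 // '(none)', "\n";
```

The first document is the interesting case: with the old code its title
would have been set to the base URL, because the base match had just
filled $1.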
I'll probably also be sending further patches for the <PLAINTEXT>
and <LISTING>...</LISTING> HTML elements I was talking about earlier.
Anthony Thyssen ( System Programmer ) http://www.cit.gu.edu.au/~anthony/
-------------------------------------------------------------------------------
"Got orders to let no-one through," (a guard)
"We're not anyone," said Bane, "and that's an order."
--- Terry Pratchett ``The Carpet People''
-------------------------------------------------------------------------------