html parsing (more problems)

Anthony Thyssen (anthony@cit.gu.edu.au)
Wed, 23 Nov 1994 22:20:18 +1000


Greetings, yes it is me again.
While checking links (using HEAD requests on an http URL),
a couple of documents were downloaded with no <TITLE> html elements.

At this point the HTML parsing routine sets $headers{'title'}
to whatever $1 happens to contain (from outside the routine!).
A failed substitution leaves $1 untouched, so the old value leaks in.

Here is the relevant code ....

#=======8<------CUT HERE--------axes/crowbars permitted---------------
    $content =~ s/\s+/ /g;           # Remove all extra whitespace and newlines

    $content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i; # Base?
    if ($1) { $base = $1; }
    
    $content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i;  # Extract the title
    if ($1) { $headers{'title'} = $1; }
    
#=======8<------CUT HERE--------axes/crowbars permitted---------------

You will notice that the base test would actually have the same
problem internally.
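
To make the failure concrete, here is a minimal sketch of how a stale $1
can end up as the title. The helper name title_unchecked, the example.org
URL and the sample document are made up for illustration; they are not
taken from wwwhtml.pl. Only the substitution and the "if ($1)" test are
copied from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper that mimics the unpatched title extraction.
sub title_unchecked {
    my ($content) = @_;
    my %headers;

    # No <title> in the document means this substitution fails,
    # and a failed match does not reset $1.
    $content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i;
    if ($1) { $headers{'title'} = $1; }    # $1 may be left over from elsewhere

    return $headers{'title'};
}

# A successful match in the caller leaves "http" in $1 ...
my $ok = ("http://example.org/" =~ m#^(\w+):#);

# ... and the title-less document picks it up as its "title".
my $title = title_unchecked("<html><body>No title here</body></html>");
print defined $title ? "title = $title\n" : "no title\n";
```

As far as I can tell this reports the URL scheme as the title, which is
the same kind of nonsense value I was seeing while checking links.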

The following patch will fix this BUG.

#=======8<------CUT HERE--------axes/crowbars permitted---------------
*** wwwhtml.pl.orig     Sun Nov  6 15:39:38 1994
--- wwwhtml.pl  Wed Nov 23 22:04:34 1994
***************
*** 67,77 ****
  
      $content =~ s/\s+/ /g;           # Remove all extra whitespace and newlines
  
!     $content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i; # Base?
!     if ($1) { $base = $1; }
      
!     $content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i;  # Extract the title
!     if ($1) { $headers{'title'} = $1; }
      
      $content =~ s/^[^<]+//;         # Remove everything before first element
      $content =~ s/>[^<]*/>/g;       # Remove everything between elements (text)
--- 67,81 ----
  
      $content =~ s/\s+/ /g;           # Remove all extra whitespace and newlines
  
!     # Base Given
!     if( $content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i ) {
!        $base = $1;
!     }
      
!     # Extract the title
!     if( $content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i ) {
!       $headers{'title'} = $1;
!     }
      
      $content =~ s/^[^<]+//;         # Remove everything before first element
      $content =~ s/>[^<]*/>/g;       # Remove everything between elements (text)
#=======8<------CUT HERE--------axes/crowbars permitted---------------
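
For anyone who wants to try the patched logic outside of wwwhtml.pl, the
following is a rough standalone sketch. The wrapper name
extract_base_and_title and the sample documents are assumptions made for
this example only; the two substitutions are the ones from the patch.
The point is that $1 is read only when the substitution itself reports
a match.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two patched substitutions.
sub extract_base_and_title {
    my ($content) = @_;
    my ($base, %headers);

    $content =~ s/\s+/ /g;    # Remove all extra whitespace and newlines

    # Base given: only trust $1 if the substitution succeeded.
    if ($content =~ s#<base\s[^>]*href\s*=\s*"?\s*([^">\s]+)[^>]*>##i) {
        $base = $1;
    }

    # Extract the title, again only on a successful match.
    if ($content =~ s#<title[^>]*>([^<]+)</title[^>]*>##i) {
        $headers{'title'} = $1;
    }

    return ($base, $headers{'title'});
}

# Leave a value in $1 before calling, as in the failing case.
my $ok = ("http://example.org/" =~ m#^(\w+):#);

my ($base1, $title1) =
    extract_base_and_title("<html><body>No title here</body></html>");
my ($base2, $title2) = extract_base_and_title(
    qq{<head><base href="http://example.org/dir/"><title>Test page</title></head>});

print "doc 1 title: ", (defined $title1 ? $title1 : "(none)"), "\n";
print "doc 2 title: ", (defined $title2 ? $title2 : "(none)"), "\n";
```

With the test moved onto the substitution, the first document should come
back with neither a base nor a title, whatever $1 held beforehand.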

I'll probably also be sending further patches for the <PLAINTEXT>
and <LISTING>...</LISTING> html elements I was talking about earlier.


  Anthony Thyssen ( System Programmer )   http://www.cit.gu.edu.au/~anthony/
-------------------------------------------------------------------------------
  "Got orders to let no-one through," (a guard)
  "We're not anyone," said Bane, "and that's an order."
                               --- Terry Pratchett   ``The Carpet People''
-------------------------------------------------------------------------------