Re: Libwww suggestions

bcutter@pdn.paradyne.com
Fri, 26 Aug 94 13:08:28 EDT


> Also, does anyone know of perl code for converting HTML entities to
> their 8bit text equivalents in iso-8859-1?  Or maybe their appropriate
> if crude ASCII renditions?  And possibly back the other way (at least
> with the 8bit stuff)?  It seems that would be a useful addition to wwwhtml.

Included below is some code from an HTML/SGML parsing and display library
I'm working on... (I hope to post more info on this shortly)
[ Btw - does anyone have a URL on SGML Style sheets? ]

Here is a routine that unescapes &ampersand escape codes into the
characters they stand for.

While this doesn't unescape the &umlt; codes, etc. - all you need to do is
put the mapping in the %html_ampersand array below. Most of the international
characters are described like "o with grave accent" - in which case I think
it's acceptable (for ASCII display) to map it to "o"...

(Btw - this is an intelligent unescape routine - it will unescape
"&amp;" (proper) and "&amp" (improper - ?) - and won't try to
unescape bogus tags like "AT&T")

-Brooks
bcutter@paradyne.com


%html_ampersand = (
'lt',      '<',
'gt',      '>',
'amp',     '&',
'quot',    '"',
'nbsp',    ' ',
);
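
As a rough sketch of the extension mentioned above, a few accented letters
could be added right after the table like this. The hash name
%html_ampersand_extra and the choice of entries are only illustrative; the
entity names themselves (ograve, oacute, ouml, eacute, uuml) are the usual
Latin-1 ones.

```perl
# Hypothetical extra entries: crude ASCII stand-ins for accented letters.
# Keys must be lowercase, because unescape lowercases the name it reads;
# as a result 'Ograve' and 'ograve' cannot be told apart.
%html_ampersand_extra = (
'ograve',  'o',
'oacute',  'o',
'ouml',    'o',
'eacute',  'e',
'uuml',    'u',
);

# Merge the extra entries into the main table
for (keys %html_ampersand_extra) {
  $html_ampersand{$_} = $html_ampersand_extra{$_};
}

# Force the maximum key length to be recomputed on the next call
$html_ampersand_len = 0;
```

Resetting $html_ampersand_len matters only if unescape has already been
called once, since the routine caches the longest key length on first use.
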

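# Replace known &name; escapes in a string with their literal characters.
sub unescape {
  local($html) = @_;
  local($ndx,$ctr,$amp,$c,$len,$_,$replace);
  # Nothing to do if the string contains no ampersand
  return($html) if (($ndx = index($html,'&')) == -1);
  unless($html_ampersand_len) {
    # Set the max len of the keys - so I can abort early if no match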
    for (keys %html_ampersand) {
      $len = length($_);
      if ($len > $html_ampersand_len) {
        $html_ampersand_len = $len;
      }
    }
  }
  $len = length($html);
  do {
    $ctr = 1;
    $amp = '';
    $replace = 1;
 
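    # Collect the entity name one character at a time until it matches a key
    while (($replace) && (!$html_ampersand{$amp})) {
      $c = substr($html,$ndx+$ctr,1);
      if (($c eq '&') || ($c eq ';')) { last; }
      if ($ctr > $html_ampersand_len) { $replace = 0; next; }
      $c =~ tr/A-Z/a-z/;
      $amp .= $c;
      $ctr++;
      # Give up only once we have run past the end of the string, so that
      # a name ending exactly at the end (e.g. a trailing "&lt") still matches
      if ($ndx+$ctr > $len) { $replace = 0; next; }
    }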
 
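    if ($replace && $html_ampersand{$amp}) {
      # Swallow the trailing ';' too, if there is one
      if (substr($html,$ndx+$ctr,1) eq ';') {
        substr($html,$ndx,$ctr+1) = $html_ampersand{$amp};
      } else {
        substr($html,$ndx,$ctr) = $html_ampersand{$amp};
      }
      # The string just got shorter
      $len = length($html);
    }
    # Resume after this '&' so a freshly produced '&' is not unescaped again.
    # (No test against the old $ctr here: it could end the loop too early.)
  } while (($ndx = index($html,'&',$ndx+1)) != -1);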
  return($html);
}
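
For comparison, the rules described above (the ";" is optional, unknown
names such as the "&T" in "AT&T" are left alone) can also be sketched as a
single Perl 5 substitution. This is not the routine posted above, just an
illustration that assumes Perl 5; the names %entity and unescape_re are made
up for the example.

```perl
# Not the posted routine: a compact Perl 5 sketch of the same rules.
my %entity = ('lt', '<', 'gt', '>', 'amp', '&', 'quot', '"', 'nbsp', ' ');

sub unescape_re {
  my ($html) = @_;
  # Match "&name" with an optional ";". Known names are replaced;
  # anything else is put back exactly as it was found.
  $html =~ s/&([A-Za-z]+)(;?)/
    exists $entity{lc($1)} ? $entity{lc($1)} : "&$1$2"/ge;
  return $html;
}

print unescape_re('AT&T says 1 &lt; 2 &amp 3 &gt; 2'), "\n";
```

One difference worth noting: this sketch takes the whole run of letters after
the "&" as the name, whereas the posted routine stops at the first prefix that
matches a key.
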