Re: Libwww suggestions
bcutter@pdn.paradyne.com
Fri, 26 Aug 94 13:08:28 EDT
> Also, does anyone know of perl code for converting HTML entities to
> their 8bit text equivalents in iso-8859-1? Or maybe their appropriate
> if crude ASCII renditions? And possibly back the other way (at least
> with the 8bit stuff)? It seems that would be a useful addition to wwwhtml.
Included below is some code from an HTML/SGML parsing and display library
I'm working on... (I hope to post more info on this shortly)
[ Btw - does anyone have a URL on SGML Style sheets? ]
Here is a routine that unescapes &ampersand escape codes into their
proper characters.
While this doesn't unescape the &uml; codes, etc - all you need to do is
put the mapping in the %html_ampersand array below. Most of the international
mappings are described like "o with grave accent" - in which case I think
it's acceptable (for ascii display) to map it to "o"...
(Btw - this is an intelligent unescape routine - it will unescape
"&amp;" (proper) and "&amp" (improper - ?) - and won't try to
unescape bogus tags like "AT&T")
-Brooks
bcutter@paradyne.com
# Entity name (lowercase, without '&' and ';') => replacement text
%html_ampersand = (
    'lt', '<',
    'gt', '>',
    'amp', '&',
    'quot', '"',
    'nbsp', ' ',
);
sub unescape {
    local($html) = @_;
    local($ndx,$ctr,$amp,$c,$len,$_,$replace);

    # Nothing to do if the string has no '&' at all
    return($html) if (($ndx = index($html,'&')) == -1);

    unless($html_ampersand_len) {
        # Set the max len of the keys - so I can abort early if no match
        for (keys %html_ampersand) {
            $len = length($_);
            if ($len > $html_ampersand_len) {
                $html_ampersand_len = $len;
            }
        }
    }

    $len = length($html);
    do {
        $ctr = 1;       # offset from the '&' of the next char to read
        $amp = '';      # candidate name collected so far (lowercased)
        $replace = 1;
        # Collect chars until $amp is a known name or we give up
        while (($replace) && (!$html_ampersand{$amp})) {
            $c = substr($html,$ndx+$ctr,1);
            if (($c eq '&') || ($c eq ';')) { last; }
            if ($ctr > $html_ampersand_len) { $replace = 0; next; }
            $c =~ tr/A-Z/a-z/;
            $amp .= $c;
            $ctr++;
            # Ran off the end; '>' (not '>=') so a trailing "&lt" still matches
            if ($ndx+$ctr > $len) { $replace = 0; next; }
        }
        if ($replace && $html_ampersand{$amp}) {
            if (substr($html,$ndx+$ctr,1) eq ';') {
                substr($html,$ndx,$ctr+1) = $html_ampersand{$amp};
            } else {
                substr($html,$ndx,$ctr) = $html_ampersand{$amp};
            }
            $len = length($html);   # the string just got shorter
        }
        # Resume after the current '&' so an unescaped "&" is not rescanned.
        # (No length test here: one based on the old $ctr could stop early.)
    } while (($ndx = index($html,'&',$ndx+1)) != -1);
    return($html);
}
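
For comparison, here is a rough regex-based sketch of the same idea, with a few of the crude ASCII fallbacks for accented letters added to the table. This is not the routine above: it needs Perl 5, and the names %entity_map and unescape_re are made up for illustration. Like the routine above, it matches names case-insensitively, treats the trailing ';' as optional, and leaves things like "AT&T" alone.

```perl
# Sketch only: %entity_map and unescape_re are illustrative names,
# not part of the library described above. Needs Perl 5.
use strict;
use warnings;

my %entity_map = (
    'lt'     => '<',
    'gt'     => '>',
    'amp'    => '&',
    'quot'   => '"',
    'nbsp'   => ' ',
    # Crude ASCII renditions of a few accented letters
    'ograve' => 'o',
    'ouml'   => 'o',
    'eacute' => 'e',
);

# Try longer names first, so a longer name is never cut short by a
# shorter one that happens to be its prefix.
my $names = join '|',
    sort { length($b) <=> length($a) } keys %entity_map;
my $entity_re = qr/&($names);?/i;

sub unescape_re {
    my ($html) = @_;
    # Each match is replaced once; the replacement text is not rescanned,
    # so "&amp;lt;" becomes "&lt;" and stops there.
    $html =~ s/$entity_re/$entity_map{lc $1}/g;
    return $html;
}

print unescape_re('AT&T &lt;b&gt; c&ouml;te &amp;lt;'), "\n";
# prints: AT&T <b> cote &lt;
```

Because the name lookup is lowercased, an uppercase-initial name such as "&Ouml;" would also come out as a lowercase "o" here; add separate handling if case matters for your display.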