Perl HTML parsing library

Brooks Cutter (bcutter@twilight.paradyne.com)
Tue, 6 Sep 1994 00:01:40 -0400


----------
X-Sun-Data-Type: text
X-Sun-Data-Description: text
X-Sun-Data-Name: text
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 316


Included below is documentation on a HTML parsing library I've been
working on, and the libraries themselves.  I've seen what exists out
there and I haven't found anything that did what I neded...

The library is still under development and has a few bugs left, but I
hope to have the basic functionality completed soon..

I'd appreciate any comments on the methods, design, implementation, etc...

Brooks Cutter
bcutter@paradyne.com
--------------------------------------------

I've been working on a modular HTML parsing library that can be used
for a wide range of programs that need to deal with HTML.  The library
is designed so you can register a subroutine to handle the output of
one or more SGML/HTML <tags> and concatenate the combined output of
these routines.  This allows routines to be used interchangeably and
does not require the routines to know what other routines exist.

I have been able to use this library for a variety of purposes, including:

- A full featured WWW browser that translates the HTML
  into a (tk) widget specific format.  

  This program translates HTML into something used by a internal widget.

- A www robot (w3new) that extracts URL's from a page

  This program indicates that it's only interested in <img> and <a>
  tags, and discards everything else (other robots may use more than the URL's)

- a HTML filter (for the browser) that looks at the href in each
  hyperlink.  If the URL extension looks like a image (yes, the script
  should use HEAD) it will insert a <img> tag in the hyperlinked text,
  the src taken from the href
  think of a icon list - supplementing hyperlinked images with inline img's ]

  This program replaces some HTML, and preserves the rest of the document.

In the near future I plan on working with a friend to write routines for
converting HTML to postscript.

This library (and the interface) is more complex than anything I've seen
yet, but it is much more flexible than anything I have seen either.
The library is still under development and has a few bugs left (interword 
spacing) and some optimizing work needs to be done to improve performance.

The only other package I'm aware of that does HTML parsing in perl is
the one by Jim Davis at http://dri.cornell.edu/pub/davis/html-parser.html
I haven't spent much time with this package, but from what little I looked
at, it differs with this package in the following ways:

- global variables are used to set the behavior of the package.
	In this package, all interaction is done through arguments and
	well defined interfaces (which should make version transitioning easier)
- Jim's library processes HTML with perl regular expressions.
	In my library, the HTML is first split up into a array with each
	element being either a HTML <tag> or text in between tags.
	Because of this, this program has no (or less) problems with
	multiple tags on the same line or excessive whitespace in
	tags.

I'm sure there are other differences I'm not aware of..

One of the odd requirements of my browser is that when HTML text
is processed, it needs to know what context it is in.  For example,
bold text rendering was implemented by creating a label and setting
the font to bold.  A hyperlink (anchor) was implemented by creating a button
with the hyperlinked text.  But when handling a hyperlink in bold text,
the output of the hyperlink couldn't be placed on a label - instead
the hyperlink needed to know the context in which it was called - with
a bold emphasis.

First, allow me to define the two classes HTML tags (as used in this library).

There are two types of HTML tags.  "Block" Tags that stand alone, like:
<img> <p> <hr> <br> <link> <meta> <isindex> ...
and "Single" tags that enclose text, like
<a> <h1..6> <ul> <ol> <dl> <address> <em> <b> <i> ...

When the library processes a HTML block tag, it gathers all the text
until the end of that block, and then calls the routine with the text.
So for the HTML bit:
<a href="/some/url">this is the text</a>
calls the routine that handles anchors with the $text argument
of "this is the text"

However this means that if there is HTML tags in the tag, it has
to handle them.  (There is a easy way to do this - described later)
<a href="/some/url"><b>this is the text</b></a>
calls the routine that handles anchors with the $text argument
of "<b>this is the text</b>"

This also means that blocks like <ul>..</ul> contans large amount of
embedded tags.

By parsing the HTML this way and giving routines the option of handling
the embedded tags or let the parser do the work - this allows the routines
not to assume anything about the output of sub-elements.

== Communicating with the package and routines ==

There are a series of subroutines that allow you to communicate with
the package.  These include:

&reset_tags($class)
	where $class is either 'none','text','html', or 'postscript'

	'none' - reset the lists of known tags so it will ignore/preserve everything
	'text' - reset the lists to routines that convert HTML to ascii text
	'html' - reset the lists to routines that convert HTML to my browser format
                 (included presently as examples - more generic routines todo later)
  'postscript ' - not currently supported, but in the future will reset
                  to the tags that convert HTML to postscript

&set_tags(*block,*single,$clear)
	*block is a pointer to a tags/routine associative array for handling blocks
	*single is a pointer to a tags/routine associative array for handling single
	$clear is a boolean flag, if set it does equiv of &reset_tags('none') first

a "tags/routine" associative array is one where the key is the HTML tag
processed (in lowercase) and the value is the package'name of the 
routine that will process the tag specified by the key.

Allow me to illustrate with this example:
----------------------------------------------------------------------
&w3bbc_html'reset_tags('none');
%html_block = (
	'a',"'collect_url",
);
%html_single = (
	'img',"'collect_url",
);
&w3bbc_html'set_tags(*html_block,*html_single);
----------------------------------------------------------------------

In this case, the routine &main'collect_url will be called to handle
either <a> hyperlinks or <img> collect_url tags.  On the call to set_tags,
it doesn't pass $clear, so this defaults to 0 - don't clear the tags.
(At the top of the example it explicitly clears them with the reset_tags call)

&parse_html(*links,@html)
	This routine will break up arguments 2-n into the array pointer passed
	as the first arguments (@links)  This routine must be called before calling
	&process_html which interprets/displays/whatever the HTML
&process_html(*opts,*links)
	*opts is a pointer to a associative array for communicating with the
	package and tag processing routines.
	*links is a pointer to @links which is the first argument to &parse_html

== A simple ascii to HTML converter ==

Now that you know about opts, here is a simple example of a 
HTML to ascii converter..

------------------------------/example/------------------------------
&w3bbc_html'reset_tags('text');
&w3bbc_html'parse_html(*links,<STDIN>);
%opts = (
'_first',1,
);
$processed = &w3bbc_html'process_html(*opts,*links);
print STDERR "$processed\n";
------------------------------/example/------------------------------

The above program does the following:

1. Call reset_tags and setup HTML -> text conversion
2. Call parse_html which converts STDIN to a array of tags in @links
3. Call process html with %opts and @links as argument, and it will
   return the processed ascii text.
4. print it.


== The Interface to the tag processing routines ==

Block  tags are called with the arguments: (*opts,*open,*close,$text)
Single tags are called with the arguments: (*opts,*open)

*opts passes %opts
The %opts array is passed to the intial call to &process_html, and then
is passed to *opts tags.  All communication between routines
(for example, like the current emphasis or font type/size)

You can also control the behavior of the parsing/formatting by
passing certain key/value pairs through %opts.  keys prefaced with
a _underscore_ are reserved for the package..


_max_len
	This is a integer representing the maximum width for which text should
	be formatted.  If it's 0, then it won't (shouldn't) wrap at all.
_first
	This is a temporary hack, which I hope to lose eventually.
	This is set to one when &process_html is first called, so when
	the first tag found is a <h1> block - or somethign similar - it doesn't
	put a newline into the stream (starting the page off with a newline)
_filter
	normally text that is between HTML tags is inserted between the output
	of the tags around it.  However, if this text should be translated
	into a different format then you can supply the name of a filter routine
	here.  (For example - in my browser the underlying font of the widget
	is a preformatted text font, so text outside html tags needs to be
	rendered as a label with a variable widget font)
_unknown
	either 'ignore' (default) or 'preserve'
	This key controls whether or not the parser ignores or preserves unknown
	tags.  Preserving HTML tags is useful for programs that improve or rewrite
	parts of a HTML stream.
_debug
	prints out various debugging things (surprise!)
	Currently, if unknown tags are ignored and this is enabled, unknown
	tags will be printed out on STDERR

These are used internally by the browser:

_line_len
	The current length of the line.  Used to determine when to wordwrap
_spaces
	to be used for interword spacing (not currently in use)
_newlines
	the number of newlines currently in the processed stream.  normally
	the maximum newlines that can occur concurrently is 2.
	This is for multiple <p><p>'s or so <p><h1>..</h1> isn't excessively spaced

*open  passes $open  and %open

$open is the name of the opening tag,  in the case of <a href> $open eq "a"

%open is the key/value pairs in the tag, with the key composed of the
$open tag and the interior tag.  For example, given
<a alt="This is a hyperlink" href="/some/url.html">Some text</a>
then
%open = (
'a_alt',  'This is a hyperlink',
'a_href', '/some/url.html',
)

*close passes $close and %close

$close is the name of the closing tag, in the case of <a href> $open eq "a"

%close same as listed above for %open.  (few tags have close key/value pairs)

=== Internal Routines used by tag-processing routines ===

When writing routines that process tags, you have to manage information
passed in %opts like the length of the current line, how many
newlines are in the stream (reset when length > 0)..

Several routines exist to make it easier to manage this information..

&has_tag($text)
	returns 1 if $text contains HTML/SGML <tags>

&is_tag($text)
	returns 1 if $text is a HTML/SGML <tags>

&newlines(*opts,$min,$max)
	This routine inserts up to $min newlines in the routine, providing that
	the number of newlines does not exceed $max.  If $max isn't passed, it
	defaults to 2.

	This is used to avoid problems like multiple <p>'s generating
	more than 2 newlines at a time.  For example, the first <p> will
	call &newlines(*opts,2) and depending if there is text before the <p>,
	it probably returns 2 newlines.  On the second call - there is already
	2 newlines in the processed stream, so it returns 0.
	This way the routines only indicate their desire for certain behavior
	and don't need to look backwards in the processed stream to determine
	if the behavior is required/desired.

&display_text(*opts,$text)
	This routine is called every time text is displayed.  $text should
	have no newlines in it - and the length of $text will be added
	to $opts{'_line_len'} and $opts{'_newlines'} will be rest to 0

=== Writing a block tag routine ===

Below is a example of a routine that converts a HTML header
into preformatted ascii text.  While this routine does recognize
and handle embedded tags within the text, it doesn't handle
wordwrapping.

------------------------------/example/------------------------------
sub text_tag_h {
  local(*opts,*open,*close,$html) = @_;
  $html =~ s/\s+/ /g; # Convert multiple space/newlines into single space
  local($ret,$text,@parsed);
  $ret .= &newlines(*opts,2); # 2 newlines before..
  if (&has_tag($html)) {
    @parsed = ();
    &parse_html(*parsed,$html);
    $html = &process_html(*opts,*parsed);
  }
  local($start) = $open;
  $start =~ s/^h//i;
  $text = "Header level #$start: $html";
  $ret .= $text; # put the text in the stream
  &display_text(*opts,$text); # register the displayed text length
  $ret .= &newlines(*opts,2); # 2 newlines after..
  return($ret);
}

------------------------------/example/------------------------------

=== Future directions ===

- fix existing bugs like:
  - interword spacing 
  - I believe there actually is still a '<pre>' reference in the code -
    the tag should be hidden from view..
	- parsing by SGML dtd
	- display by SGML style sheets (anyone got any leads on this?!?!?)
----------
X-Sun-Data-Type: default
X-Sun-Data-Description: default
X-Sun-Data-Name: w3bbc_html.pl
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 664

require 'text_tags.pl';

# Notes
#
# The hardest part of this library was getting the spacing right.
#
#
# - Some tags have a "whitespace block" and some tags don't.  For example,
# <h#>..</h#>, <pre>..</pre>, <[oud]l>..</[oud]l>, <p>, <hr>
# has a blank line before and after the tags
# <br> is replaced with a single blank line..
# <b>..</b> <i>..</i> <a>..</a> <img> have 0 or more spaces on either side..
# 
# Rules for HTML spacing...
# 0. The following returns do not apply for anything inside a <pre> block.
# Excessive returns are allowed in <pre> and other html tags are
# supported in <pre>...
# 1. At most, the largest "whitespace block" will be two returns (\n\n)
# 2. Multiple whitespace characters will be collapsed into a single space,
# unless it is the beginning of a line

require 'shellwords.pl' unless(defined &shellwords);
# I saw a nasty HTML deviation that shellwords won't parse.  it was:
# <a alt="some text"href="/file.html"> ... no space between " and href

package w3bbc_html;

%html_tag_ignore = (
'head',1,
'html',1,
'body',1,
);

%html_tags_block = (
'a',       "'html_tag_a",
'b',       "'html_tag_em",
'i',       "'html_tag_em",
'em',      "'html_tag_em", # map to italic
'cite',    "'html_tag_em", # map to italic
'strong',  "'html_tag_em", # map to bold
'dfn',     "'html_tag_em", # map to bold
'h1',      "'html_tag_h",
'h2',      "'html_tag_h",
'h3',      "'html_tag_h",
'h4',      "'html_tag_h",
'h5',      "'html_tag_h",
'h6',      "'html_tag_h",
'pre',     "'html_tag_pre",
'title',   "'html_tag_title",
);

%html_tags_single = (
'img',     "'html_tag_img",
'isindex', "'html_tag_isindex",
);

%text_tag_ignore = (
'head',1,
'html',1,
'body',1,
);

%text_tags_block = (
'a',       'text_tag_a',
'b',       'text_tag_em',
'i',       'text_tag_em',
'em',      'text_tag_em', # map to italic
'cite',    'text_tag_em', # map to italic
'strong',  'text_tag_em', # map to bold
'dfn',     'text_tag_em', # map to bold
'ul',      'text_tag_list',
'ol',      'text_tag_list',
'dl',      'text_tag_list',
'menu',    'text_tag_list',
'dir',     'text_tag_list',
'address', 'text_tag_address',
'h1',      'text_tag_h',
'h2',      'text_tag_h',
'h3',      'text_tag_h',
'h4',      'text_tag_h',
'h5',      'text_tag_h',
'h6',      'text_tag_h',
'pre',     'text_tag_pre',
);

%text_tags_single = (
'img',     'text_tag_img',
'p',       'text_tag_p',
'br',      'text_tag_br',
'hr',      'text_tag_hr',
# These should be with <ul>/<ol> or <dl>
'li',      'text_tag_li',
'dt',      'text_tag_dt',
);

# This is a hack, so it will parse deviations from syntactically correct HTML
# this was used to stop a unterminated <a> from making the whole doc linked
%tags_not_nested = (
'a',    1,
);

%html_ampersand = (
'lt',      '<',
'gt',      '>',
'amp',     '&',
'quot',    '"',
'nbsp',    ' ',
);


# Interface
# reset_tags($class)
# Where class is:
# text, html, postscript, none
# text - reset to text bindings
# html - reset to html bindings
# postscript - reset to poscript (not yet though)
# none - erase all current bindings
sub reset_tags {
	local($class) = @_;
	if (!$class || $class eq 'text') {
		%tags_block = %text_tags_block;
		%tags_single = %text_tags_single;
		%tags_ignore = %text_tags_ignore;
	} elsif ($class eq 'html') {
		%tags_block = %html_tags_block;
		%tags_single = %html_tags_single;
		%tags_ignore = %html_tags_ignore;
	} elsif ($class eq 'postscript') {
		print STDERR "class $class not yet supported\n";
	} elsif ($class eq 'none') {
		undef %tags_block;
		undef %tags_single;
		undef %tags_ignore;
	} else {
		print STDERR "class $class unsupported\n";
	}
#&'dumpvar('w3bbc_html','tags_single','tags_block');
}

sub set_tags_class {
	local($class) = @_;
#&'dumpvar('w3bbc_html','tags_single','tags_block');
	if ($class eq 'text') {
		for (keys %text_tags_block)  { $tags_block{$_} = $text_tags_block{$_}; }
		for (keys %text_tags_single) { $tags_single{$_} = $text_tags_single{$_}; }
		for (keys %text_tags_ignore) { $tags_ignore{$_} = $text_tags_ignore{$_}; }
	} elsif ($class eq 'html') {
		for (keys %html_tags_block)  { $tags_block{$_} = $html_tags_block{$_}; }
		for (keys %html_tags_single) { $tags_single{$_} = $html_tags_single{$_}; }
		for (keys %html_tags_ignore) { $tags_ignore{$_} = $html_tags_ignore{$_}; }
	}
#&'dumpvar('w3bbc_html','tags_single','tags_block');
}
sub set_tags {
	local(*block,*single,$clear) = @_;
	local($_);
	if ($clear) { undef %tags_block; undef %tags_single; }
	for (keys %block)  { $tags_block{$_} = $block{$_}; }
	for (keys %single) { $tags_single{$_} = $single{$_}; }
	for (keys %ignore) { $tags_ignore{$_} = $ignore{$_}; }
}

sub newlines {
	local(*opts,$num,$max) = @_;
	$max = 2 unless($max);
#print STDERR "newlines($num)\n";
#print STDERR "caller(0)=(",join(',',caller(0)),")\n";
#print STDERR "caller(1)=(",join(',',caller(1)),")\n";
	return if ($opts{'_newlines'} >= $max);
#print STDERR "caller(2)=(",join(',',caller(2)),")\n";
#print STDERR "opts{'_first'}='$opts{'_first'}'\n";
#print STDERR "opts{'_line_len'}='$opts{'_line_len'}'\n";
#print STDERR "opts{'_newlines'}='$opts{'_newlines'}'\n";
	local($n,$nl);
	if ($opts{'_first'} && !$opts{'_line_len'} && !$opts{'_newlines'}) {
		# Ignore newlines at the beginning if top of file and no leading text
	} elsif($opts{'_newlines'} < $max) {
		if (($opts{'_newlines'}+$num) > $max) {
			$nl = "\n" x ($max - $opts{'_newlines'});
			$opts{'_newlines'} = $max;
		} else {
			$nl = "\n" x $num;
			$opts{'_newlines'} += $num;
		}
	}
	$opts{'_first'} = $opts{'_line_len'} = $opts{'_spaces'} = 0;
#print STDERR "newlines($num,$max) = ",length($nl),"\n";
#print STDERR "opts{'_first'}='$opts{'_first'}'\n";
#print STDERR "opts{'_line_len'}='$opts{'_line_len'}'\n";
#print STDERR "opts{'_newlines'}='$opts{'_newlines'}'\n";
	return($nl);
}

sub display_text {
	local(*opts,$text) = @_;
#print STDERR "display_text(opts,$text)\n";
#print STDERR "caller(0)=(",join(',',caller(0)),")\n";
#print STDERR "caller(1)=(",join(',',caller(1)),")\n";
#print STDERR "caller(2)=(",join(',',caller(2)),")\n";
	#return unless (length($text));
	$opts{'_spaces'} = $opts{'_newlines'} = 0;
#print STDERR "display_text: opts{'_line_len'}='$opts{'_line_len'}'\n";
	$opts{'_line_len'} += length($text);
#print STDERR "display_text: opts{'_line_len'}='$opts{'_line_len'}'\n";
	$opts{'_spaces'} = 1 if ($text =~ / $/);
}

sub has_tag {
	#return(($_[0] =~ /<\S.*>/) ? 1 : 0);
	return(($_[0] =~ /<\S+\s*.*>/) ? 1 : 0);
}

sub is_tag {
	#return(($_[0] =~ /^<\S.*>$/) ? 1 : 0);
	return(($_[0] =~ /^<\S+\s*.*>$/) ? 1 : 0);
}

# Takes a HTML tag like:
# <img src="/some/path/to/a/pic.gif" ismap>
# and returns
# ('a','src','/home/path/to/a/pic.gif','ismap','1')
# which should be read like
# ($tag,%tags) = &split_tag($html);
#
sub split_tag {
	local($html_tag) = shift;
	return($html_tag) unless($html_tag =~ /^<.*>$/);
	local($tag,@tags,$_,$key,$val,%ret);

	$tag = $html_tag;
	$tag =~ s/^<//;
	$tag =~ s/\s*>$//;
	$tag =~ s/[\r\n]+/ /g; # to handle <a\nhref...>
	@tags = &'shellwords($tag);
	$tag = shift(@tags);
	$tag =~ tr/A-Z/a-z/;

	for (@tags) {
	# This will skip value's like
	# <img src="file" alt="">
		if (/=/) {
			local($val_i);
			($key,$val) = split(/=/,$_,2);
			$key =~ tr/A-Z/a-z/;
			$val =~ s/^"//;
			$val =~ s/"$//;
			next unless(length $val);

			# For <img src="foo" ismap="ismap"> 
			#$val = 1 if ($key =~ /^$val$/i);
			$val_i = $val; $val_i =~ tr/A-Z/a-z/;
			$val = 1 if ($key eq $val_i);
			$ret{"${tag}_$key"} = $val;
		} else {
			# For <img src="foo" ismap>
			$ret{"${tag}_$_"} = 1;
		}
	}
	return($tag,%ret) if (%ret);
	return($tag);
}

sub process_html_text {
	local(*opts,$text,$nowrap) = @_;
#print STDERR "process_html_text(opts,$text)\n";
#print STDERR "caller(0)=(",join(',',caller(0)),")\n";
#print STDERR "caller(1)=(",join(',',caller(1)),")\n";
#print STDERR "caller(2)=(",join(',',caller(2)),")\n";
	local($ret,$tag,$_,@t,$t,$routine);
	return unless(($text =~ /\S/) || ($opts{'_tag'} eq 'pre'));

	if ($opts{'_tag'}) {
		$tag = $opts{'_tag'};
		local($open,$close) = ($tag,"/$tag");
		local(%open,%close);
		for (keys %opts) { next unless(/^${tag}_/o); $open{$_} = $opts{$_}; }
		if ($tags_single{$tag}) {
			$ret .= &call_tags_single(*opts,*open);
		} elsif ($tags_block{$tag}) {
			$ret .= &call_tags_block(*opts,*open,*close_tag,$text) if ($text);
		} elsif ($tags_ignore{$tag}) {
			return undef;
		}
		return($ret);
	}

	$text =~ s/\s+/ /g; 
	if ($opts{'_newlines'}) {
		$text =~ s/^\s+//;
	}
	$routine = $opts{'_filter'};
	if (!$nowrap && ($opts{'_line_len'}+length($text) > $opts{'_max_len'})) {

		@t = &wordwrap($opts{'_line_len'},$opts{'_max_len'},$text);
		for $t (0..$#t) {
			if ($opts{'_filter'}) {
				$ret .= &$routine(*opts,$t[$t]);
			} else {
				$ret .= $t[$t];
				&display_text(*opts,$t[$t]);
			}
			$ret .= &w3bbc_html'newlines(*opts,1) unless($t == $#t);
		}
	} else {
		if ($opts{'_filter'}) {
			$ret .= &$routine(*opts,$text);
		} else {
			$ret .= &unescape($text);
			&display_text(*opts,$text);
		}
	}
	$text = '';
	return($ret);
}

sub call_tags_single {
	local(*opts,*tag) = @_;
	local($routine) = $tags_single{$tag};
	return &$routine(*opts,*tag);
}

sub call_tags_block {
	local(*opts,*tag,*close_tag,$text) = @_;
	local($routine) = $tags_block{$tag};
	return &$routine(*opts,*tag,*close_tag,$text);
}

sub process_html {
	local(*opts,*html_in) = @_;
	local($ret,$_,$l,$open,%open,$num_tags,$text,$routine,$close_tag,%close_tag);
	local($last_tag,$open_ndx);

	#unless($opts{'_textspace'}) { $opts{'_textspace'} = "    "; } # 4 spaces
	unless($opts{'_textspace'}) { $opts{'_textspace'} = "  "; } # 2 spaces
	unless($opts{'_max_len'}) { $opts{'_max_len'} = 75; }
	$open_ndx = $l = -1;
	while (defined $html_in[++$l]) {
#print STDERR "html_in[$l]='$html_in[$l]'\n";
		#next if ($html_in[$l] =~ /^<\!.*>$/);
		next if (substr($html_in[$l],0,2) eq '<!');
		if (&is_tag($html_in[$l])) {
			$html_in[$l] =~ s/[\r\n]+/ /g; # Convert newlines into spaces
#print STDERR "html_in[$l] istag\n";
			#next if ($html_in[$l] =~ /^<!/); # HTML comment 
			# (should be <!-- --> but everyone doesn't use SGML comment form)
			$last_tag = $open;
			$html_in[$l] =~ s/\s+=/=/g; # kludge to fix <a href ="url">
			$html_in[$l] =~ s/=\s+/=/g; # kludge to fix <a href= "url">
			# The above should be rewritten to take a argument, check for /=/, 
			# if not take the next, unless (/=\s*\S/) and the next..
			($open,%open) = &split_tag($html_in[$l]);
			#if ($text =~ /\S/) { $ret .= &process_html_text(*opts,$text); }
			if (($text =~ /\S/)  || ($opts{'_tag'} eq 'pre'))
				{ $ret .= &process_html_text(*opts,$text); }
			$text = '';
			if ($tags_ignore{$open}) {
				next;
			} elsif ($tags_single{$open}) {
				$ret .= &call_tags_single(*opts,*open);
			} elsif ($tags_block{$open}) {
				$num_tags = 1;
				while (defined $html_in[++$l]) {
#print STDERR "html_in[$l]='$html_in[$l]'\n";
					if (($num_tags) && ($html_in[$l] =~ m!^<$open(\s.*)*\s*>$!i) 
					&& ($tags_not_nested{$open})) {
					# Un-nested block tags
						$num_tags--;
						($close_tag,%close_tag) = ("/$open");
						$l--;
						last;
					} elsif ($html_in[$l] =~ m!^<$open(\s.*)*\s*>$!i) { 
					# nested block tags
						$open_ndx = $l+1;
						$num_tags++;
						$text .= $html_in[$l];
						($close_tag,%close_tag) = ();
					} elsif ($html_in[$l] =~ m!^</$open(\s.*)*\s*>$!i) { 
						$open_ndx = -1;
						$num_tags--;
						unless ($num_tags) {
							($close_tag,%close_tag) = &split_tag($html_in[$l]);
							last;
						} else {
							$text .= $html_in[$l];
						}
					} else {
						$text .= $html_in[$l];
					}
				}
				if (!$close && $open_ndx != -1) {
					#print STDERR "$open wasn't closed, resetting..\n";
					#$l = $open_ndx; 
					#$text = '';
					#next;
					$close = "/$open";
				}
				$ret .= 
					&call_tags_block(*opts,*open,*close_tag,$text)
					if (($text =~ /\S/)  || ($opts{'_tag'} eq 'pre'));
				$text = '';
			} else {
				# Otherwise it's a invalid tag, ignore it
				if ($opts{'_unknown'} eq 'preserve') {
					$text .= $html_in[$l];
				} else {
					print "process_html: Ignoring tag '$open'\n" if ($opts{'_debug'});
				}
			}
		} else {
			$text .= $html_in[$l];
		}
	}

	if (($text =~ /\S/)  || ($opts{'_tag'} eq 'pre'))
		{ $ret .= &process_html_text(*opts,$text); }
	$text = '';

	return($ret);
}


# subroutine: parse_html
# Argument 1: (*data)  - array pointer to return data
# Argument 2-n: variables with HTML data
# Returns in *data HTML split up - first is non-HTML tag, 2nd HTML tag...
sub parse_html {
	local(*data) = shift(@_);
	local($save,$lt,$gt,$data,$_);
	NEXTLINE: for (@_) {
		$save .= $_;
		if ((($lt = index($save,'<')) == -1) || (index($save,'>',$lt) == -1))
			{ next; }
		$lt = $gt = 0;
		while (($lt = index($save, '<', $gt)) >= $[) {
			# This is the data *BEFORE* the '<'
			if ($lt) { # do If isn't /^</
				if ($gt) {
					$data = substr($save, ($gt+1), ($lt-$gt-1));
				} else {
					$data = substr($save, ($gt), ($lt-$gt));
				}
				push(@data, $data);
			}
			$gt = index($save, '>', $lt);
			if ($gt == -1) {
				$save = substr($save, $lt);
				next NEXTLINE;
			}
			# This is the data *INSIDE* the <>
			$data = substr($save, $lt, ($gt-$lt+1));
			push(@data, $data);
		}
		$save = substr($save, ($gt+1));
	}
	push(@data, $save);
	return(@data);
}

sub parse_html_list {
	local($type) = shift(@_);
	local(@links,$n,$_,@ret,$text,$oldn,@type,$key,$subtype,@subtype,$tmp);
	$n = 1;
	&w3bbc_html'parse_html(*links, @_);
	$subtype = ' ';
	for (@links) {	
		if (m!^<(dl|ol|ul|menu|dir)\s*>$!i) {
			$tmp = $1;
			$tmp =~ tr/A-Z/a-z/;
			$key = "$type/$subtype/$n";
			push(@type,$type);
			push(@subtype,$subtype);
			$type = $tmp; 
			push(@ret,$key,$text) if ($text =~ /\S/); # At least one non-space char
			$text = '';
			$subtype = ' ';
			$oldn = $n;
			$n++;
		} elsif (m!^</(dl|ol|ul|menu|dir)\s*>$!i) {
			$key = "$type/$subtype/$n";
			push(@ret,$key,$text) if ($text =~ /\S/); # At least one non-space char
			#$type = pop(@type);
			$type = ' '; # Continuation of previous type
			#$subtype = pop(@subtype);
			$subtype = ' ';
			$text = '';
			$oldn = $n;
			$n--;
		} elsif (m!^<li\s*>$!i) {
			if ($type eq ' ') { $type = pop(@type); }
			$key = "$type/$subtype/$n";
			push(@ret,"$key",$text) if ($text =~ /\S/); # At least one non-space char
			$subtype = ' ';
			$text = '';
		} elsif (m!^<dt\s*>$!i) { # Description term
			$key = "$type/$subtype/$n";
			push(@ret,$key, $text) if ($text =~ /\S/);
			$subtype = 'dt';
			$text = '';
		} elsif (m!^<dd\s*>$!i) { # Description definition
			$key = "$type/$subtype/$n";
			push(@ret,$key, $text) if ($text =~ /\S/);
			$subtype = 'dd';
			$text = '';
		} else {
			$text .= $_;
		}
	}
	push(@ret,"$type/$subtype/$n",$text) if ($text =~ /\S/); 
	return(@ret);
}


sub wordwrap {
  local($offset,$length,$text) = @_;
#print STDERR "wordwrap($offset,$length,$text)\n";
#print STDERR "return($text) if (($offset+length($text)) <= $length);\n";
	return($text) if (($offset+length($text)) <= $length);
#print STDERR "caller(0)=(",join(',',caller(0)),")\n";
#print STDERR "caller(1)=(",join(',',caller(1)),")\n";
#print STDERR "caller(2)=(",join(',',caller(2)),")\n";
	# Die hard - otherwise it will loop forever..
	die "wordwrap called without valid length argument" unless($length);
  local(@r,$space,$l,$r);

	if ($offset) { $r = ($length - $offset) } else { $r = $length; }

  while (1) {
    last unless($text);
		$l = length($text);
    if ($l > $r) {
      $space = rindex($text, ' ', $r);
      if ($space == -1) {
				unless($offset) {
        	push(@r, substr($text, 0, $r));
        	$text = substr($text, $r, $length);
				} else {
					push(@r,''); # skip to next line
				}
        #$text = substr($text, $length);
      } else {
        push(@r, substr($text, 0, $space));
        $text = substr($text, ($space+1)); # Skip the space
      }
    } else {
      push(@r, $text);
      $text = '';
    }
		$r = $length;
		$offset = 0;
  }
  return(@r);
}

# This routine will take a argument of a url which contains
# HTML data.  It will extract the <img src=> and <a href=> links,
# and return a array with two entries for each link.  
#
# For <a href=> links, the first entry will be the description between
# <a> and </a> and the second will be the URL for that hyperlink
#
# For <img src=> links, the first will be the entire HTML tag, and the
# second will be the URL for the image..
#
sub extract_links_desc {
	local($in_url) = shift(@_);
	local($links,@links,%links);
	local(@ret,$url,$_,$desc,@headers,%headers,$content,$response);

  $response = 
		&www'lrequest('GET', *in_url, *headers, *content, $request_timeout);

	&parse_html(*links,$content);
	$links = -1;
	while (defined $links[++$links]) {
		$_ = $links[$links];
		if ((/^<a\s.*href\s*=\s*"(\S+)".*>$/i) 
		|| (/^<a\s.*href\s*=\s*(\S+).*>$/i)) {
			$url = $1; $desc = '';
			while (defined $links[++$links]) {
				last if ($links[$links] =~ m!<\s*/\s*a\s*>!);
				$desc .= $links[$links];
			}
			push(@ret,$url,$desc);
		} elsif ((/^\s*<img\s+src\s*=\s*"(\S+)".*>$/i) 
		|| (/^\s*<img\s+src\s*=\s*(\S+).*>$/i)) {
			push(@ret,$1,$_);
		}	else {
			next;
		}
	}
	return(@ret);
}


sub parse_html_tag {
	local(*opts,*tag,*close_tag,$html) = @_;
	#local(*opts,*tag,*close_tag,$html) = (shift,shift,shift,shift);
	local(@parsed,@save,$ret);

#	unless ($opts{'_tag'} eq $open) {
#		$ret = &newlines(*opts,2);
#	}
	if (&has_tag($html)) {
		@parsed = ();
		&parse_html(*parsed,$html);

		$html = &process_html(*opts,*parsed);
	} else {
		&display_text(*opts,$html);
	}

	$ret .= $html;
	return($ret);
}

sub unescape {
	local($html) = @_;
#print STDERR "unescape($html)\n";
#print STDERR "caller(0)=(",join(',',caller(0)),")\n";
#print STDERR "caller(1)=(",join(',',caller(1)),")\n";
#print STDERR "caller(2)=(",join(',',caller(2)),")\n";
	local($ndx,$ctr,$amp,$c,$len,$_,$replace);
	return($html) if (($ndx = index($html,'&')) == -1);
	unless($html_ampersand_len) {
	# Set the max len of the keys - so I can abort early if no match
		for (keys %html_ampersand) { 
			$len = length($_);
			if ($len > $html_ampersand_len) {
				$html_ampersand_len = $len;
			}
		}
	}
	$len = length($html);
	do {
		$ctr = 1;
		$amp = '';
		$replace = 1;

		while (($replace) && (!$html_ampersand{$amp})) {
			$c = substr($html,$ndx+$ctr,1);
			if (($c eq '&') || ($c eq ';')) { last; }
			if ($ctr > $html_ampersand_len) { $replace = 0; next; }
			$c =~ tr/A-Z/a-z/;
			$amp .= $c;
			$ctr++;
			if ($ndx+$ctr >= $len) { $replace = 0; next; }
		}

		if ($replace && $html_ampersand{$amp}) {
			if (substr($html,$ndx+$ctr,1) eq ';') {
				substr($html,$ndx,$ctr+1) = $html_ampersand{$amp};
			} else {
				substr($html,$ndx,$ctr) = $html_ampersand{$amp};
			}
		}
	} while ((($ndx = index($html,'&',$ndx+1)) != -1) && ($ndx+$ctr < $len));
#print STDERR "unescape = $html\n";
	return($html);
}

package main;

1;
----------
X-Sun-Data-Type: default
X-Sun-Data-Description: default
X-Sun-Data-Name: text_tags.pl
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 330

package w3bbc_html;

sub text_tag_pre {
	local(*opts,*tag,*close,$html) = @_;
	local(@parsed,@save,$ret);

	#unless ($opts{'_tag'} eq 'pre') {
	unless ($opts{'_tag'}) {
		$html =~ s/^\s*[\r\n]+//;
		$html =~ s/\s*[\r\n]+\s*$//;
		$ret .= &newlines(*opts,2);
	}
	if (&has_tag($html)) {
		@parsed = ();
		&parse_html(*parsed,$html);

		@save = ($opts{'_tag'},$opts{'font_type'},$opts{'font_emphasis'});
		$opts{'font_type'} = 'plain';
		$opts{'font_emphasis'} = '*';
		$opts{'_tag'} = 'pre';
		$html = &process_html(*opts,*parsed);
		($opts{'_tag'},$opts{'font_type'},$opts{'font_emphasis'}) = @save;
	} else {
		&display_text(*opts,$html);
	}

	$ret .= $html;
	#unless ($opts{'_tag'} eq 'pre') {
	unless ($opts{'_tag'}) {
		$ret .= &newlines(*opts,2);
	}
	return($ret);
}

sub text_tag_address {
	local(*opts,*tag,*close,$html) = @_;
	local(@parsed,@save,$ret);

	$ret .= &newlines(*opts,1);
	if ($opts{'a_href'}) {
    # Rather than handle it myself, wrap it with the URL and pass it on..
    local(%open2,%close2);
    $open2{'a_href'} = $opts{'a_href'};
    $open2{'a_name'} = $opts{'a_name'};
    local($routine) = $w3bbc_html'tags_block{'a'};
    $ret .= &$routine(*opts,*open2,*close2,$html);
	} elsif (&has_tag($html)) {
		@parsed = ();
		&parse_html(*parsed,$html);

		$ret .= &process_html(*opts,*parsed);
	} else {
		&display_text(*opts,$html);
		$ret .= $html;
	}

	$ret .= &newlines(*opts,1);
	return($ret);
}

sub text_tag_a {
	local(*opts,*open,*close,$html) = @_;
	local($ret) = &parse_html_tag(*opts,*tag,*close,$html);
	return $ret;
}

sub text_tag_img {
	return("[IMAGE]");
}

sub text_tag_h {
	local(*opts,*open,*close,$html) = @_;
	$html =~ s/\s+/ /g; # Convert multiple space/newlines into single space
	local($ret,$text,@parsed);
	$ret .= &newlines(*opts,2); # 2 newlines before..
	if (&has_tag($html)) {
		@parsed = ();
		&parse_html(*parsed,$html);
		$html = &process_html(*opts,*parsed);
	}
	local($start) = $open;
	$start =~ s/^h//i;
	$text = "Header level #$start: $html";
	$ret .= $text; # put the text in the stream
	&display_text(*opts,$text); # register the displayed text length
	$ret .= &newlines(*opts,2); # 2 newlines after..
	return($ret);
}

sub text_tag_cite {
#
# bug in routine: will capitalize HTML tags, like 
# <a href="/case/sensitive/path">
#
	local(*opts,*tag,*close,$html) = @_;
	if ($open eq 'b') {
		$html =~ tr/a-z/A-Z/;
	} elsif ($open eq 'i') {
		$html = "{$html}";
	}
	local(@save) = ($opts{'_tag'});
	local($ret) = &parse_html_tag(*opts,*tag,*close,$html);
	($opts{'_tag'}) = @save;
	return $ret;
}

sub text_tag_p {
	local(*opts,*tag) = @_;
	return(&newlines(*opts,2));
}

sub text_tag_br {
	local(*opts,*tag) = @_;
	return(&newlines(*opts,1));
}

sub text_tag_li { # in case they forget the <ul> or <ol>
	local(*opts,*open) = @_;
	local($ret,$r);
	$ret = &w3bbc_html'newlines(*opts,1,1);
	$r = "o".substr($opts{'_textspace'},1);
	&w3bbc_html'display_text(*opts,$r);
	$ret .= $r;
	return($ret);
}

sub text_tag_dt { # in case they forget the <ul> or <ol>
	local(*opts,*open) = @_;
	local($ret,$r);
	return &w3bbc_html'newlines(*opts,2,2);
}

sub text_tag_hr {
	local(*opts,*open) = @_;
	local($ret,$hr);
	#$ret .= &newlines(*opts,2);
	$ret .= &newlines(*opts,1,1);
	$hr = '-' x $opts{'_max_len'};
	&display_text(*opts,$hr);
	#$ret .= $hr . &newlines(*opts,2);
	$ret .= $hr . &newlines(*opts,1,1);
	return($ret);
}


sub text_tag_list {
	local(*opts,*open,*close,$html) = @_;
#print STDERR "text_tag_list($opts{'_tag'},$open,$close,$html)\n";
#print STDERR "caller(0)=(",join(',',caller(0)),")\n";
#print STDERR "caller(1)=(",join(',',caller(1)),")\n";
#print STDERR "caller(2)=(",join(',',caller(2)),")\n";

	return unless($html =~ /\S/);
	local($text,$n,$oldn,$ret,$ol_num,$space,@ol_num,$noud,$type,$indent);
	local($_,$subtype,$indent_space,$nl,$processed,@li,@parsed,@save);
	local($len,$prefix,@text,$escaped,$tmp,$indent_space1,$indent_space2);

	if ($opts{'_tag'}) {
		if (&has_tag($html)) {
			@parsed = ();
			&parse_html(*parsed,$html);
			$html = &process_html(*opts,*parsed);
			return($html);
		} else {
			$html =~ s/\s+/ /g;
			$indent = $opts{'_indent'};
			$indent_space = ($opts{'_textspace'} x ($indent-1));
			$indent_space1 = ($opts{'_textspace'} x ($indent));
			$indent_space2 = ($opts{'_textspace'} x ($indent+1));
			$len = $opts{'_max_len'} - length($indent_space2);
			@text = &wordwrap($opts{'_line_len'},$len,$html);
			for $text (0..$#text) { 
				if (($text == 0) && ($opts{'list_first'})) {
					$escaped = '';
				} elsif ($opts{'_line_len'}) { 
					$escaped = '';
				} elsif (($opts{'_tag'} eq 'dl') || ($opts{'tag_subtype'} eq 'dd')) {
					$escaped = $indent_space1;
				} else {
					$escaped = $indent_space2;
					#$escaped = $indent_space;
				}
				$opts{'list_first'} = 0;
				local(@save) = ($opts{'_tag'});
				$opts{'_tag'} = '';
				$ret .= $escaped . &w3bbc_html'process_html_text(*opts,$text[$text]);
				($opts{'_tag'}) = @save;
				$ret .= &newlines(*opts,1) unless($text == $#text);
			}
		}
		return($ret);
	}
	
	@li = &parse_html_list($open,$html);
	$ret .= &newlines(*opts,2);
	local($otype,$osubtype); # otype/osubtype is old type/subtype
	while (@li) {
		$oldn = $n;
		$otype = $type;
		$osubtype = $subtype;
		$noud = shift(@li);
		($type,$subtype,$n) = split(/\//,$noud);
		if (($oldn) && ($oldn > $n)) { # tag change, end of current tag
			$ol_num = pop(@ol_num);
			$ret .= &newlines(*opts,1,1);
			$ret .= &newlines(*opts,1) if ($otype eq 'dl');
			#$opts{'_indent'}--;
			$opts{'_indent'}-=$oldn-$n;
		} elsif (($oldn) && ($oldn < $n)) { # tag change, begin of new tag
			push(@ol_num,$ol_num);
			$ol_num = 0;
			$ret .= &newlines(*opts,1,1);
			$ret .= &newlines(*opts,1) if ($type eq 'dl');
			#$opts{'_indent'}++;
			$opts{'_indent'}+=$n-$oldn;
		}
		$indent = $n + $opts{'_indent'};

		$text = shift(@li);
		if (&has_tag($text)) {
			@parsed = ();
			&parse_html(*parsed,$text);
			local(@save) = 
				($opts{'_indent'},$opts{'_tag'},$opts{'tag_type'},$opts{'tag_subtype'});
			$opts{'_indent'} = $indent;
			$opts{'tag_subtype'} = $subtype;
			$opts{'tag_type'} = $type;
			$opts{'_tag'} = $open;
			#$opts{'list_first'} = 1;
			$opts{'list_first'} = 1 unless($type eq '');
			$text = &process_html(*opts,*parsed);
			$opts{'list_first'} = 0;
			($opts{'_indent'},$opts{'_tag'},$opts{'tag_type'},$opts{'tag_subtype'}) 
				= @save;
			$processed = 1;
		} else {
			$text =~ s/\s+/ /g;
			$processed = 0;
		}
		$indent_space = (($indent > 1) ? ($opts{'_textspace'} x ($indent-1)):undef);
		$indent_space1 = ($opts{'_textspace'} x ($indent));
		$indent_space2 = ($opts{'_textspace'} x ($indent+1));
		$space = $opts{'_textspace'};
		$opts{'_space'} = 1;
		if ($type eq 'ol') { # <li> tag
			$ol_num++;
			substr($space,0,length($ol_num)+1) = "$ol_num.";
			$prefix = $indent_space . $opts{'_textspace'} . $space;
		} elsif (($type eq 'ul') || ($type eq 'menu') || ($type eq 'dir')) {# <li>
			substr($space,0,1) = "o";
			$prefix = $indent_space . $opts{'_textspace'} . $space;
		} elsif (($type eq 'dl') && ($subtype eq 'dt')) { # <dt> tag
			$prefix = $indent_space;
		} elsif (($type eq 'dl') && ($subtype eq 'dd')) { # <dd> tag
			$prefix = $indent_space . $opts{'_textspace'};
		} elsif ($type eq ' ') { # Continuation
			#$prefix = $indent_space . $opts{'_textspace'};
			$prefix = $indent_space2;
			#$prefix = $indent_space2 . $opts{'_textspace'};
		} else {
			warn 
				"Unknown type '$type' subtype '$subtype' noud='$noud' text='$text'\n";
		}
		if ($text) {
			$len = $opts{'_max_len'} - length($indent_space);
			if ($processed) {
				$ret .= $prefix . $text;
			} else {
				if ($opts{'_space'}) { $text =~ s/^ //; $opts{'_space'} = 0; }
				@text = &wordwrap($opts{'_line_len'},$len,$text);
				for $text (0..$#text) { 
					if ($text == 0) {
						$escaped = $prefix;
					} elsif ($type eq 'dl' && $subtype eq 'dt') {
						$escaped = $indent_space;
					} elsif ($type eq 'dl' && $subtype eq 'dd') {
						$escaped = $indent_space1;
					} else {
						$escaped = $indent_space2;
					}
					$ret .= $escaped . 
						&w3bbc_html'process_html_text(*opts,$text[$text],1);
					$ret .= &newlines(*opts,1) unless($text == $#text);
				}
			}
		}
		$ret .= &newlines(*opts,1,1);
	}
	$ret .= &newlines(*opts,1);
	return($ret);
}

sub html_tag_pre {
  local(*opts,*tag,*close_tag,$html) = @_;
  return unless($html);
#print STDERR "html_tag_pre($opts{'_tag'},$tag,$close_tag,$html)\n";
  local(@parsed,@save,$ret);
 
  if (&w3bbc_html'has_tag($html)) {
    @parsed = ();
    &w3bbc_html'parse_html(*parsed,$html);
 
    @save = ($opts{'_tag'},$opts{'font_type'},$opts{'font_emphasis'});
    $opts{'_tag'} = 'pre';
    $ret .= &w3bbc_html'process_html(*opts,*parsed);
    ($opts{'_tag'},$opts{'font_type'},$opts{'font_emphasis'}) = @save;
  } elsif ($html =~ /\S/) {
    if ($html =~ /[\r\n]/) {
      local($h);
      while ($html =~ /[\r\n]/) {
        ($h,$html) = split(/[\r\n]/,$html,2);
				$ret .= $h;
        &w3bbc_html'display_text(*opts,$h);
        $ret .= &w3bbc_html'newlines(*opts,1);
     }
    }
    $ret .= &tag_display_label(*opts,$html);
  } else {
    $ret .= $html;
    &w3bbc_html'display_text(*opts,$html);
  }
 
  return($ret);
}



package main;

1;
----------
X-Sun-Data-Type: default
X-Sun-Data-Description: default
X-Sun-Data-Name: w3bbc_tags.pl
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 399

# This is the html tags for w3bbc_html.pl

sub html_tag_img {
  local(*opts,*open) = @_;
  local($url) = &wwwurl'absolute($gl_base_url,$open{'img_src'});
	$gl_page_url{$url} = 1;
  $open{'img_src'} = $url;
	#&url_seen($url);
	local($msg,$pic,$relief,$old,$content_type,$_);

  delete $opts{'_first'}; # cuz image is first..
	$pic = $gl_image_cache{$url};
  unless ($pic && -e $pic) {
		local($headers,@headers,%headers,$content);
print STDERR "\n\t\Retrieving $url\n";
    $resp = &www'lrequest('GET',*url,*headers,*content, $gl_document_timeout);
		if ($resp =~ /^2/) { # response ok
    	$gl_picnum++;
    	if ($headers{'content-type'} eq 'image/gif') {
      	$pic = "$gl_dir_tmp/pic.$$.".time.".$gl_picnum.gif";
    	} elsif ($headers{'content-type'} eq 'image/x-xbitmap') {
      	$pic = "$gl_dir_tmp/pic.$$.".time.".$gl_picnum.xbm";
    	}
   		open(OUT, ">$pic"); print OUT $content; close(OUT);
   		$gl_image_cache{$url} = $pic;
   		$gl_image_cache_type{$url} = $headers{'content-type'};
			open(OUT,">>$gl_cache_file");
			print OUT "$url\t$pic\t$headers{'content-type'}\n";
			close(OUT);
		} else { $pic = $gl_image_error_img; }
  }
#print STDERR "headers{'content-type'}='$headers{'content-type'}'\n";
#print STDERR "gl_image_cache{$url}='$gl_image_cache{$url}'\n";
#print STDERR "gl_image_cache_type{$url}='$gl_image_cache_type{$url}'\n";
	$content_type = $headers{'content-type'} || $gl_image_cache_type{$url};
#print STDERR "content_type='$content_type'\n";
  #if (($content_type eq 'image/gif') || ($content_type eq 'image/x-xbitmap')) {
	if ($pic) {
    $gl_htext_widget_num++;

    $msg .= <<EOF;
%%
if('$opts{'a_href'}') { \$relief = \$gl_image_link_style; }
else { \$relief = 'flat' } 
&tk'destroy("\$root_win_html.view.$gl_htext_widget_num");
EOF

    $msg .= <<EOF;
&tk'label("\$root_win_html.view.$gl_htext_widget_num",
  '-borderwidth',    2,
  '-relief',         "\$relief",
  );
&tk'do("\$root_win_html.view", 'append', "\$root_win_html.view.$gl_htext_widget_num");
EOF

		$gl_msg_after .= <<EOF;
%%
&tk'do("\$root_win_html.view.$gl_htext_widget_num",'configure',
  '-bitmap',         "@$pic",
	);
EOF

    $gl_msg_after .= '%%';

    if ($opts{'a_href'} || $open{'img_alt'}) {
      local($footer);
      if ($open{'img_alt'}) { $footer = "($open{'img_alt'}) "; }
			$footer .= $opts{'a_href'};
      $msg .= <<EOF;
&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<Any-Enter>',"+&set_window_status('$footer')");
&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<Any-Leave>',"+&set_window_status('')");
if ("$opts{'a_href'}") {
	if ("$open{'img_ismap'}") {
		&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<1>',"&goto_url('$opts{'a_href'}',0,'%x','%y')");
	} else {
		&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<1>',"&goto_url('$opts{'a_href'}')");
	}
	&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<3>',"&set_window_url('$opts{'a_href'}')");
}
EOF
    }
    $msg .= "%%";
  }
	$opts{'_newlines'} = 0;
	#&set_window_status($old) if ($old =~ /\S/);
#print STDERR "html_tag_img returns\n$msg\n";
#print STDERR "gl_msg_after='$gl_msg_after'\n";
  return($msg);
}


sub html_tag_em {
  local(*opts,*open,*close,$html) = @_;
	return unless($html);
  local($msg,$emphasis,@parsed);
  # Bold tag
  if (&w3bbc_html'has_tag($html)) {
    @parsed = ();
    &w3bbc_html'parse_html(*parsed,$html);
    local(@save) = ($opts{'font_emphasis'});
    if (($open eq 'b') || ($open eq 'strong') || ($open eq 'dfn')) {
      $emphasis = 'bold';
    } elsif (($open eq 'i') || ($open eq 'cite') || ($open eq 'em')) {
      $emphasis = 'italic';
    } else {
      die "Don't recognize open tag $open\n";
    }
    ($opts{'font_emphasis'}) = $emphasis;
    $msg = &w3bbc_html'process_html(*opts,*parsed);
    ($opts{'font_emphasis'}) = @save;
  } elsif ($opts{'a_href'}) {
    # Rather than handle it myself, wrap it with the URL and pass it on..
    local(%open2,%close2);
    $open2{'a_href'} = $opts{'a_href'};
    local($routine) = $w3bbc_html'tags_block{'a'};
    local(@save) = ($opts{'font_emphasis'});
    if (($open eq 'b') || ($open eq 'strong') || ($open eq 'dfn')) {
      $emphasis = 'bold';
    } elsif (($open eq 'i') || ($open eq 'cite') || ($open eq 'em')) {
      $emphasis = 'italic';
    } else {
      die "Don't recognize open tag $open\n";
    }
    ($opts{'font_emphasis'}) = $emphasis;
    $msg = &$routine(*opts,*open2,*close2,$html);
    ($opts{'font_emphasis'}) = @save;
  } else {
    $gl_htext_widget_num++;

    if (($open eq 'b') || ($open eq 'strong') || ($open eq 'dfn')) {
      $emphasis = 'bold';
    } elsif (($open eq 'i') || ($open eq 'cite') || ($open eq 'em')) {
      $emphasis = 'italic';
    } else {
      die "Don't recognize open tag $open\n";
    }

		local(@save) = ($opts{'font_emphasis'});
		$opts{'font_emphasis'} = $emphasis;
    local(@text,$text);
		@text = &w3bbc_html'wordwrap($opts{'_line_len'},$opts{'_max_len'},$html);
		local($indent_space) = ($opts{'_textspace'} x ($opts{'_indent'}+1));
    for $text (0..$#text) {
    	$msg .= $indent_space if ($opts{'_indent'} && $text);
    	$msg .= &tag_display_label(*opts,$text[$text]);
    	$msg .= &w3bbc_html'newlines(*opts,1) unless($text == $#text);
    }
		($opts{'font_emphasis'}) = @save;
  }
  return($msg);
}

sub html_tag_title {
  local(*opts,*open,*close,$html) = @_;

	return unless($html =~ /\S/);

	if (&w3bbc_html'has_tag($html)) {
		@parsed = ();
		&w3bbc_html'parse_html(*parsed,$html);
		$html = &w3bbc_html'process_html(*opts,*parsed);
	}

	$html = "Title has internal tags that can't be displayed." 
		if ($html =~ /%%/);
	$html = &w3bbc_html'unescape($html);
	$html =~ s/\\/\\\\/g;
	$html =~ s/'/\\'/g;
	return "%%\n\&set_window_title('$html');\n%%";
}

sub html_tag_a {
  local(*opts,*open,*close,$html) = @_;
  local($url,$msg,@parsed);
  if ($open{'a_href'}) {
    $url = &wwwurl'absolute($gl_base_url,$open{'a_href'});
		$gl_page_url{$url} = 1;
    $open{'a_href'} = $url;
  } else {
    $url = $opts{'a_href'};
		$gl_page_url{$url} = 1;
  }
	&url_seen($url) unless($opts{'_tag'} eq 'a');

  if (&w3bbc_html'has_tag($html)) {
    @parsed = ();
    &w3bbc_html'parse_html(*parsed,$html);
    local(@save) = ($opts{'a_href'},$opts{'_tag'});
    $opts{'a_href'} = $url;
    $opts{'_tag'} = 'a' if ($url);
    $msg = &w3bbc_html'process_html(*opts,*parsed);
    ($opts{'a_href'},$opts{'_tag'}) = @save;
  } else {
    local($footer,$color);
    if ($open{'a_alt'}) { $footer = "($open{'a_alt'}) $url"; } 
    else { $footer = $url; }
    $html =~ s/\s+/ /g;
		if ($opts{'_tag'} eq 'pre') {
			$msg .= &msg_tag_a(*opts,$html,$url,$footer);
		} else {
    	local(@text,$text);
			local($indent_space) = ($opts{'_textspace'} x ($opts{'_indent'}+1));

			@text = &w3bbc_html'wordwrap($opts{'_line_len'},$opts{'_max_len'},$html);
    	for $text (0..$#text) {
    		$msg .= $indent_space if ($opts{'_indent'} && $text);
#print STDERR "a_href text[$text]='$text[$text]'\n";
				$msg .= &msg_tag_a(*opts,$text[$text],$url,$footer) if($text[$text]);
    		$msg .= &w3bbc_html'newlines(*opts,1) unless($text == $#text);
    	}
		}
  }
  return($msg);
}

sub msg_tag_a {
	local(*opts,$html,$url,$footer) = @_;
	local($msg);

  &w3bbc_html'display_text(*opts,$html);
  $html =~ s/\\/\\\\/g; # Escape backslash
  $html =~ s/'/\\'/g; # Escape 'single quotes'
  $gl_htext_widget_num++;
	$html = &w3bbc_html'unescape($html);
  $msg .= <<EOF;
%%
local(\$font) = 
	\$gl_display_font{\$gl_display_font_type,'$opts{'font_type'}','$opts{'font_emphasis'}'} 
  || 
	\$gl_display_font{\$gl_display_font_type,'$opts{'font_type'}','none'};
EOF
	if ($opts{'a_href'} || $open{'a_href'}) {
  	$msg .= <<EOF;
local(\$color) = &'url_color('$url');
EOF
	}
  $msg .= <<EOF;
&tk'destroy("\$root_win_html.view.$gl_htext_widget_num");
EOF
	if ($opts{'a_href'} || $open{'a_href'}) {
  	$msg .= <<EOF;
&tk'button("\$root_win_html.view.$gl_htext_widget_num",
EOF
	} else {
  	$msg .= <<EOF;
&tk'label("\$root_win_html.view.$gl_htext_widget_num",
EOF
	}
  $msg .= <<EOF;
  '-text',             '$html',
  '-relief',           "\$gl_link_style",
EOF
	if ($opts{'a_href'} || $open{'a_href'}) {
  	$msg .= <<EOF;
  '-command',          "&goto_url('$url')",
EOF
	}
  $msg .= <<EOF;
  '-font',             "\$font",
EOF
	if ($opts{'a_href'} || $open{'a_href'}) {
  	$msg .= <<EOF;
  '-fg',               "\$color",
  '-activeforeground', "\$color",
EOF
	}
  	$msg .= <<EOF;
);
&tk'do("\$root_win_html.view", 'append', "\$root_win_html.view.$gl_htext_widget_num");
EOF
	if ($opts{'a_href'} || $open{'a_href'}) {
  	$msg .= <<EOF;
&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<Any-Enter>',"&display_footer_url('$footer','%W')");
&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<Any-Leave>',"&clear_footer_url('%W')");
&tk'bind("\$root_win_html.view.$gl_htext_widget_num",'<3>',"&set_window_url('$url')");
EOF
	}
  $msg .= '%%';
	return($msg);			
}

sub html_tag_h {
  local(*opts,*open,*close,$html) = @_;
  local($ret,$type,@parsed);
  local($start) = $open;
	unless ($opts{'_tag'} =~ /^h/i) {
  	$html =~ s/^\s+//;
  	$html =~ s/\s+$//;
  	$ret = &w3bbc_html'newlines(*opts,2);
	}
  $html =~ s/\s+/ /g;
  $start =~ s/^h//i;
  $type = 'header' . int($start);
  if (&w3bbc_html'has_tag($html)) {
    local(@save) = ($opts{'font_type'},$opts{'font_emphasis'},$opts{'_tag'});
    $opts{'font_type'} = $type;
    $opts{'font_emphasis'} = 'none';
    $opts{'_tag'} = $open;
    @parsed = ();
    &w3bbc_html'parse_html(*parsed,$html);
    $ret .= &w3bbc_html'process_html(*opts,*parsed);
    ($opts{'font_type'},$opts{'font_emphasis'},$opts{'_tag'}) = @save;
  } else {
    local(@save) = ($opts{'font_type'},$opts{'font_emphasis'});
    $opts{'font_type'} = $type;
    $opts{'font_emphasis'} = 'none';
    local(@text,$text);
		@text = &w3bbc_html'wordwrap($opts{'_line_len'},$opts{'_max_len'},$html);
    for $text (0..$#text) {
    	$ret .= &tag_display_label(*opts,$text[$text]);
    	$ret .= &w3bbc_html'newlines(*opts,1) unless($text == $#text);
    }
    ($opts{'font_type'},$opts{'font_emphasis'}) = @save;
  }
  $ret .= &w3bbc_html'newlines(*opts,2) unless($opts{'_tag'} =~ /^h/i);
  return($ret);
}
sub html_tag_isindex {
  local(*opts,*open) = @_;
	return "%%\n\&cb_disp_isindex_win('$gl_current_url');\n%%";
}

sub html_tag_pre {
  local(*opts,*tag,*close_tag,$html) = @_;
	return unless($html);
  local(@parsed,@save,$ret);
 
  if (&w3bbc_html'has_tag($html)) {
    @parsed = ();
    &w3bbc_html'parse_html(*parsed,$html);
 
    @save = ($opts{'_tag'},$opts{'font_type'},$opts{'font_emphasis'});
    $opts{'font_type'} = 'plain';
    $opts{'font_emphasis'} = 'none';
    $opts{'_tag'} = 'pre';
    $ret .= &w3bbc_html'process_html(*opts,*parsed);
    ($opts{'_tag'},$opts{'font_type'},$opts{'font_emphasis'}) = @save;
	} else {
		if ($html =~ /[\r\n]/) {
#print STDERR "html\n$html\n";
			local($h);
			while ($html =~ /[\r\n]/) {
				($h,$html) = split(/[\r\n]/,$html,2);
				$ret .= &w3bbc_html'unescape($h);
				&w3bbc_html'display_text(*opts,$h);
				$ret .= &w3bbc_html'newlines(*opts,1);
     }
		}
		if ($html) {
			local($esc) = &w3bbc_html'unescape($html);
			&w3bbc_html'display_text(*opts,$esc);
			$ret .= $esc;
		}
  }
 
  return($ret);
}



sub tag_display_label {
  local(*opts,$html) = @_;
#print STDERR "tag_display_label(opts,$font,$html)\n";
#print STDERR "opts{'font_type'}='$opts{'font_type'}'\n";
#print STDERR "opts{'font_emphasis'}='$opts{'font_emphasis'}'\n";
#print STDERR "caller(0)=(",join(',',caller(0)),")\n";
#print STDERR "caller(1)=(",join(',',caller(1)),")\n";
#print STDERR "caller(2)=(",join(',',caller(2)),")\n";

return unless($html);
  local($ret);
	$html = &w3bbc_html'unescape($html);
 
  $gl_htext_widget_num++;
  &w3bbc_html'display_text(*opts,$html);
  $html =~ s/\\/\\\\/g; # Escape backslash
  $html =~ s/'/\\'/g; # Escape 'single quotes'
 
  $ret .= <<EOF;
%%
local(\$font) = 
	\$gl_display_font{\$gl_display_font_type,'$opts{'font_type'}','$opts{'font_emphasis'}'} 
|| 
	\$gl_display_font{\$gl_display_font_type,'$opts{'font_type'}','none'};

&tk'destroy("\$root_win_html.view.$gl_htext_widget_num");
&tk'label("\$root_win_html.view.$gl_htext_widget_num",
'-text',        '$html',
'-font',        "\$font",
);
&tk'do("\$root_win_html.view", 'append', "\$root_win_html.view.$gl_htext_widget_num");
EOF
  $ret .= '%%';
  return($ret);
}



1;
----------
X-Sun-Data-Type: default-app
X-Sun-Data-Description: default
X-Sun-Data-Name: testhtml
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 38

#!/usr/local/bin/perl

$url = shift(@ARGV);

unless($url) { die "requires argument of url\n"; }

require 'w3bbc_html.pl';

$libwww_perl_version = '0.30';

unshift(@INC, $ENV{'LIBWWW_PERL'} ||
	"/homes/bcutter/dev/tkperl/libwww/libwww-perl-$libwww_perl_version");

require 'www.pl';

&www'set_def_header('http','User-Agent',"testhtml/0.1 $www'Library");


$response = &www'request('GET',$url,*headers,*content,30);

&w3bbc_html'reset_tags('text');
&w3bbc_html'parse_html(*links,$content);
%opts = (
#'_unknown','preserve', # preserve or unknown (def)
'_debug',  1,
);
$processed = &w3bbc_html'process_html(*opts,*links);

print STDERR <<EOF;
                                     HTML
----------------------------------------------------------------------
$content
----------------------------------------------------------------------
                                Processed Text
----------------------------------------------------------------------
$processed
----------------------------------------------------------------------
EOF
----------
X-Sun-Data-Type: default-app
X-Sun-Data-Description: default
X-Sun-Data-Name: testhtml2
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 73

#!/usr/local/bin/perl

$home_url = shift(@ARGV);

unless($home_url) { die "requires argument of url\n"; }

require 'w3bbc_html.pl';

$libwww_perl_version = '0.30';

unshift(@INC, $ENV{'LIBWWW_PERL'} ||
	"/homes/bcutter/dev/tkperl/libwww/libwww-perl-$libwww_perl_version");

require 'www.pl';

&www'set_def_header('http','User-Agent',"testhtml/0.1 $www'Library");


$response = &www'request('GET',$home_url,*headers,*content,30);

&w3bbc_html'reset_tags('none');
%html_block = (
	'a',"'collect_url",
);
%html_single = (
	'img',"'collect_url",
);
&w3bbc_html'set_tags(*html_block,*html_single);

&w3bbc_html'parse_html(*links,$content);
%opts = (
#'_unknown','preserve', # preserve or ignore (def)
#'_debug',  1,
);
$processed = &w3bbc_html'process_html(*opts,*links);
$urls = join("\n",@global_urls);
print "#global_urls=$#global_urls\n";

print STDERR <<EOF;
                                     HTML
----------------------------------------------------------------------
$content
----------------------------------------------------------------------
                                URLs
----------------------------------------------------------------------
$urls
----------------------------------------------------------------------
EOF


sub collect_url {
  local(*opts,*open,*close,$html) = @_;
  local($url) = $open{'a_href'} || $open{'img_src'};
  if ($html && &w3bbc_html'has_tag($html)) {
    local(@parsed);
    &w3bbc_html'parse_html(*parsed,$html);
    &w3bbc_html'process_html(*opts,*parsed);
  }
  return unless($url);
  $url = &wwwurl'absolute($home_url,$url);
  ($scheme,$address,$port,$path,$query,$frag) = &wwwurl'parse($url);
  $port = 80 unless($port);
  $url = &wwwurl'compose($scheme,$address,$port,$path,$query,$frag);

  return if ($global_urls{$url});
  $global_urls{$url}++;
  if ($open{'img_src'}) {
    unshift(@global_urls,$url); # inlined images get priority
  } else {
    push(@global_urls,$url);
  }
}

----------
X-Sun-Data-Type: default-app
X-Sun-Data-Description: default
X-Sun-Data-Name: testhtml3
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 65

#!/usr/local/bin/perl

$home_url = shift(@ARGV);

unless($home_url) { die "requires argument of url\n"; }

require 'w3bbc_html.pl';

$libwww_perl_version = '0.30';

unshift(@INC, $ENV{'LIBWWW_PERL'} ||
	"/homes/bcutter/dev/tkperl/libwww/libwww-perl-$libwww_perl_version");

require 'www.pl';

&www'set_def_header('http','User-Agent',"testhtml/0.1 $www'Library");


$response = &www'request('GET',$home_url,*headers,*content,30);

&w3bbc_html'reset_tags('none');
%html_block = (
	'a',"'insert_img",
);
%html_single = (
);
&w3bbc_html'set_tags(*html_block,*html_single);

&w3bbc_html'parse_html(*links,$content);
%opts = (
'_unknown','preserve', # preserve or ignore (def)
#'_debug',  1,
);
$processed = &w3bbc_html'process_html(*opts,*links);
$urls = join("\n",@global_urls);
print "#global_urls=$#global_urls\n";

print STDERR <<EOF;
                                     HTML
----------------------------------------------------------------------
$content
----------------------------------------------------------------------
                                Processed
----------------------------------------------------------------------
$processed
----------------------------------------------------------------------
EOF


sub insert_img {
  local(*opts,*open,*close,$html) = @_;
	local($_,$args,$val);
	for (keys %open) {
		$val = $open{$_};
		s/${open}_//; # delete a_ from tag
		$args .= " $_=\"$val\"";
	}
	if ($open{'a_href'} =~ /\.(xbm|gif)/i) {
		$html = "<$open$args>".'<img src="'.$open{'a_href'}.'">'.$html."</$open>";
	} else {
		$html = "<$open$args>".$html."</$open>";
	}
	return($html);
}