Re: HTML::Parser 3.13 Bug ?

Gisle Aas (gisle@activestate.com)
08 Jan 2001 11:52:50 -0800


Christian Recktenwald <chris@citecs.de> writes:

> I just recognized that "<p><b> Text </b></p>" seems 
> to be not parsed correctly, while "<p> <b> Text </b> </p>" is.
> 
> More precisely, the start event for <p> and the end event for </b>
> are not generated.

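They are generated.  It's just that you only print them out when there
is a "text" event.  That makes your code fail unless there is text
after each start tag.  Whitespace is text.  In "<p><b>" nothing comes
between the two tags, so the text handler installed for <p> is replaced
by the one for <b> before it ever runs.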
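
One way around this is to print from the start and end handlers
themselves, and keep a single text handler that only deals with text.
A rough sketch, assuming a plain depth counter is enough for the
indentation; the names $depth and @out are made up for illustration and
are not from your script:

```perl
#!/usr/bin/perl
# Sketch: emit each tag from its own start/end event instead of
# waiting for a later "text" event.
use strict;
use warnings;
use HTML::Parser;

my $indentstr = "  ";
my $depth     = 0;    # current nesting level (illustrative name)
my @out;              # collected output lines (illustrative name)

my $p = HTML::Parser->new(api_version => 3);

# The start tag is recorded right away, then the nesting level grows.
$p->handler(start => sub {
    my ($tag) = @_;
    push @out, ($indentstr x $depth) . "<$tag>";
    $depth++;
}, 'tagname');

# The end tag first drops back one level, then is recorded.
$p->handler(end => sub {
    my ($tag) = @_;
    $depth-- if $depth > 0;
    push @out, ($indentstr x $depth) . "</$tag>";
}, 'tagname');

# Text is recorded only when something is left after trimming.
$p->handler(text => sub {
    my ($text) = @_;
    $text =~ s/\s+/ /g;      # collapse runs of whitespace
    $text =~ s/^ | $//g;     # trim the ends
    push @out, ($indentstr x $depth) . $text if length $text;
}, 'dtext');

$p->parse("<p><b> Text </b></p>");
$p->eof;

print "$_\n" for @out;
```

With this arrangement the output no longer depends on whether there is
whitespace between adjacent tags, since no handler waits for a later
event before printing.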

Regards,
Gisle


> -- check.pl ------
> #!/usr/bin/perl
> # some html pretty printer
> 
> use HTML::Parser;
> 
> $indent = -1;
> $indentstr = "  ";
> 
> $p = HTML::Parser->new(api_version => 3);
> $p->xml_mode(1);
> 
> $p->handler(start => \&start_handler, 'tagname,self' );
> $p->handler(end => \&end_handler, 'tagname,self' );
> 
> sub start_handler {
> 	my $tag = shift;
> 	my $self = $self;

Try to print "$tag" here to verify that the right thing actually
happens.

> 	shift->handler(text => sub { $indent ++;
> 				     my $text = shift;
> 				     my $ind = $indentstr x $indent; 
> 				     chomp($text);
> 				     $text =~ s/^\s*//gs;
> 				     $text =~ s/\r?\n/ /gs;
> 	                             print $ind, "<$tag>","\n", 
> 				     	($text ne "" )?($ind." $text"."\n"):(""); }, 
> 					"dtext");
> }
> 
> sub end_handler {
> 	my $tag = shift;
> 	my $self = shift;
> 	$self->handler(text => sub { print $indentstr x $indent , "</$tag>\n";
> 	                             $indent --;
> 				   });
> }
> 
> $p->parse_file("test.html");
> 
> ------------------
> -- test.html -----
> <HTML>
> <HEAD>
> <TITLE>
> text
> </TITLE>
> </HEAD>
> <BODY BGCOLOR=#ffffff>
> <H1> ue1 </H1>
> text
> <H2> ue2 </H2>
> text
> <H2> ue3 </H2>
> text
> <p><b> text1 </b></p>
> <p> <b> text2 </b> </p>
> 
> </BODY>
> </HTML>