Re: <p> in tables

Harald Joerg (Harald.Joerg@mch.sni.de)
Wed, 17 Dec 1997 09:49:19 +0100


wisse e. wrote:
> When parsing an HTML file, some unexpected changes are made to the HTML.
> The following Perl script:
> [script deleted]
> Chops up tables when it finds <p> tags present in the table elements.
> A <p> in the original html file results in a </table> element to be added
> before the <p> element.
> [more deleted]
> 
> I am not aware that <p> is forbidden inside tables. Is there any way to switch
> this behaviour off?

Not a switch, as far as I know. I ran into a similar problem, but
since I had to subclass the TreeBuilder anyway, I "fixed" it directly
by writing my own &start routine, "stealing" most of the original.
   In libwww-perl 5.15, the relevant lines of HTML::TreeBuilder are:

lines 185-188
	# Handle implicit endings and insert based on <tag> and position
	if ($tag eq 'p' || $tag =~ /^h[1-6]/ || $tag eq 'form') {
	    # Can't have <p>, <h#> or <form> inside these
-	    $self->end([qw(p h1 h2 h3 h4 h5 h6 pre textarea)], 'li');
+	    $self->end([qw(p h1 h2 h3 h4 h5 h6 pre textarea)], qw(li td));
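To illustrate why adding td to that list matters, here is a toy model
in plain Perl. It is not the library's code: toy_end is a made-up
helper, the stack of open tags is invented, and it assumes that the
extra arguments to end act as "stop" tags that halt the search for an
element to close.

```perl
use strict;
use warnings;

# Toy model only (hypothetical helper, not HTML::TreeBuilder code).
# @$open lists the currently open tags, outermost first.
# Walk from the innermost tag outward: if a tag from @$close is met,
# close it together with everything nested inside it; if a tag from
# @stop is met first, give up and close nothing.
sub toy_end {
    my ($open, $close, @stop) = @_;
    my %close = map { $_ => 1 } @$close;
    my %stop  = map { $_ => 1 } @stop;
    for (my $i = $#$open; $i >= 0; $i--) {
        my $tag = $open->[$i];
        return [ @$open[0 .. $i - 1] ] if $close{$tag};
        return [ @$open ]              if $stop{$tag};
    }
    # Nothing matched; what the real module does here is not modelled.
    return [ @$open ];
}

my @closers = qw(p h1 h2 h3 h4 h5 h6 pre textarea);
my @open    = qw(html body p table tr td);   # invented example stack

print "stop at li:    @{ toy_end(\@open, \@closers, 'li') }\n";
print "stop at li,td: @{ toy_end(\@open, \@closers, qw(li td)) }\n";
```

In this model, with only li as a stop tag the search climbs out of the
table cell and closes the outer p, taking the table with it. With td
added, the search ends at the cell and the table stays open.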

BTW: The other change I made regarding "unexpected changes" is to
comment out line 338 of HTML::TreeBuilder (subroutine &text):
	$text =~ s/\s+/ /g;  # canonical space
This leaves spaces and line breaks as they were before - which I
need because I use the TreeBuilder to make global changes in HTML
files but don't want to confuse the original authors of the files.
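For anyone wondering what that substitution does to the text, here is
a small stand-alone sketch. The helper name collapse_space and the
sample string are my own invention; only the s/\s+/ /g line is taken
from the module.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution quoted above.
sub collapse_space {
    my ($text) = @_;
    $text =~ s/\s+/ /g;   # every run of whitespace becomes one space
    return $text;
}

my $original  = "Eerste veld\n    Tweede  veld\n";
my $collapsed = collapse_space($original);

print "[$collapsed]\n";   # prints "[Eerste veld Tweede veld ]"
```

Line breaks and indentation are gone after the substitution, which is
why commenting the line out keeps the source layout intact.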

With these two changes (to be exact: with my own class used instead
of HTML::TreeBuilder) the as_HTML method prints:
---------------------------------------

This is the heading

This is the first paragraph which contains a link and an image.


Eerste veld Tweede veld
Zoveelste veld Laatste veld

Met daarin een heel nieuwe paragraaf


---------------------------------------

--
Oook,
--haj--