wisse e. wrote:
> When parsing an HTML file, some unexpected changes are made to the HTML.
> The following Perl script:
> [script deleted]
> chops up tables when it finds <p> tags inside the table elements.
> A <p> in the original HTML file results in a </table> tag being added
> before the <p> element.
> [more deleted]
>
> I am not aware that <p> is forbidden inside tables. Is there any way to switch
> this behaviour off?

There is no switch for this, as far as I know. I ran into a similar
problem, but since I had to subclass HTML::TreeBuilder anyway, I "fixed"
it directly by writing my own &start routine, "stealing" most of the
original.
In libwww-perl 5.15, the relevant lines of HTML::TreeBuilder are
lines 185-188:

      # Handle implicit endings and insert based on <tag> and position
      if ($tag eq 'p' || $tag =~ /^h[1-6]/ || $tag eq 'form') {
          # Can't have <p>, <h#> or <form> inside these
  -       $self->end([qw(p h1 h2 h3 h4 h5 h6 pre textarea)], 'li');
  +       $self->end([qw(p h1 h2 h3 h4 h5 h6 pre textarea)], qw(li td));

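To see why adding td to the second argument helps, here is a toy model of the implicit-end check. It is only a sketch, not the library's code: implicit_end is a hypothetical helper, and it assumes that end() walks up from the current element, closes the first tag it finds from the first list, and gives up as soon as it meets one of the stop tags. It also assumes that an earlier <p> is still open when the table starts, which is my guess at what happens with your file.

```perl
use strict;
use warnings;

# Toy model (hypothetical helper, not part of HTML::TreeBuilder).
# $open     : open elements, outermost first
# $closable : tags that get implicitly ended
# $stop     : tags the search must not climb above
# Returns the list of elements still open after the check.
sub implicit_end {
    my ($open, $closable, $stop) = @_;
    my %closable = map { $_ => 1 } @$closable;
    my %stop     = map { $_ => 1 } @$stop;

    for (my $i = $#$open; $i >= 0; $i--) {
        my $tag = $open->[$i];
        # Found a closable tag: it and everything inside it get closed.
        return [ @{$open}[0 .. $i - 1] ] if $closable{$tag};
        # Found a stop tag first: leave everything open.
        return [ @$open ] if $stop{$tag};
    }
    return [ @$open ];    # nothing to close
}

my @open     = qw(html body p table tr td);    # a <p> is about to start here
my @closable = qw(p h1 h2 h3 h4 h5 h6 pre textarea);

# Original stop list: the search climbs past td, tr and table up to the
# open p, so the table is closed along with it.
print join(' ', @{ implicit_end(\@open, \@closable, ['li']) }), "\n";

# Patched stop list: the search stops at td, the table stays open.
print join(' ', @{ implicit_end(\@open, \@closable, [qw(li td)]) }), "\n";
```

With the original stop list the model leaves only "html body" open; with td added, all six elements stay open and the new <p> ends up inside the cell.
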
BTW: The other change I made regarding "unexpected changes" is to
comment out line 338 of HTML::TreeBuilder (subroutine &text):

      $text =~ s/\s+/ /g; # canoncial space

This leaves spaces and line breaks as they were before - which I
need because I use the TreeBuilder to make global changes in HTML
files but don't want to confuse the original authors of the files.
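
For illustration, this is what that substitution does to a piece of text. It is a stand-alone sketch: canonical_space is a name I made up for the demo, and only the s/\s+/ /g inside it comes from the module.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub canonical_space {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # every run of spaces, tabs, newlines -> one space
    return $text;
}

my $original = "First line\n    indented second line\n\nlast   line";

print canonical_space($original), "\n";
# prints: First line indented second line last line
```

With the line commented out, the text is kept as in $original, including the indentation and the blank line.
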
With these two changes (to be exact: with my own class used instead
of HTML::TreeBuilder) the as_HTML method prints:
---------------------------------------
This is the first paragraph which contains a
link and an
.
| Eerste veld | Tweede veld |
| Zoveelste veld | Laatste veld Met daarin een heel nieuwe paragraaf |