Bugs in HTML::TreeBuilder?
WWW projekt (wwwproj@dna.lth.se)
Wed, 31 Jul 1996 12:41:22 +0200
Hi,
I believe that I have found two bugs in HTML::TreeBuilder. The first
should be easy to fix, but the other one might be difficult.
Possible bug 1
--------------
The first bug is shown by this example:
% perl -MHTML::TreeBuilder
$ht = new HTML::TreeBuilder;
$p = << "END";
<FORM>
<TEXTAREA NAME="EvenMoreText" ROWS=20 COLS=40>
Here's some text!
Next line.
  Third with two spaces before.
</TEXTAREA>
<FORM>
END
$ht->parse($p);
print $ht->as_HTML;
^D
The TEXTAREA tag should be treated like XMP and LISTING: it should
preserve its contents. This can be fixed by changing one line in
TreeBuilder. In sub text, change
    HTML::Entities::decode($text) unless $ignore_text;
!   if ($pos->is_inside(qw(pre xmp listing))) {
        return if $ignore_text;
        $pos->push_content($text);
    } else {
to
    HTML::Entities::decode($text) unless $ignore_text;
!   if ($pos->is_inside(qw(pre xmp listing textarea))) {
        return if $ignore_text;
        $pos->push_content($text);
    } else {
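
To illustrate the idea behind this change outside of TreeBuilder, here
is a minimal stand-alone sketch. It assumes that the else branch (not
shown in the excerpt above) collapses runs of whitespace; the names
normalize_text and %PRESERVE are made up for this example and are not
part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Elements whose text content should be kept verbatim.
# Hypothetical table for this sketch, with textarea added.
my %PRESERVE = map { $_ => 1 } qw(pre xmp listing textarea);

# normalize_text($text, @open_tags)
# @open_tags stands in for the stack of currently open elements;
# in TreeBuilder this question is answered by $pos->is_inside(...).
sub normalize_text {
    my ($text, @open_tags) = @_;

    # Inside a preserving element: return the text untouched.
    return $text if grep { $PRESERVE{lc $_} } @open_tags;

    # Assumption: everywhere else, whitespace runs become one space.
    $text =~ s/\s+/ /g;
    return $text;
}

my $text = "Next line.\n  Third with two spaces before.\n";
print normalize_text($text, qw(html body form textarea));
print normalize_text($text, qw(html body form)), "\n";
```

The first call keeps the line break and the two leading spaces; the
second one folds them into single spaces.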
Possible bug 2
--------------
This is a quotation from the HTML 2.0 internet draft:
"The XMP and LISTING elements are similar to the PRE element, but they
have a different syntax. Their content is declared as CDATA, which means
that no markup except the end-tag open delimiter-in-context is
recognized (see 9.6 "Delimiter Recognition" of SGML)."
So, XMP and LISTING should not recognize any tags but their own
end-tags. In HTML::TreeBuilder this is not true:
% perl -MHTML::TreeBuilder
$ht = new HTML::TreeBuilder;
$p = << "END";
It both adds an end-tag for FORM and changes the order of its
attributes.
A quotation from the draft once again, section 'Preformatted Text: PRE':
"Within preformatted text:
- Line breaks within the text are rendered as a move to the
beginning of the next line. (15)
- Anchor elements and phrase markup may be used. (16)
- Elements that define paragraph formatting (headings, address,
etc.) must not be used. (17)
- The horizontal tab character (code position 9 in the HTML
document character set) must be interpreted as the smallest
positive nonzero number of spaces which will leave the number
of characters so far on the line as a multiple of 8.
Documents should not contain tab characters, as they are not
supported consistently"
So, inside a PRE, anchors may be used, but paragraph formatting may
not. (I wonder what happens with FORM and such; are they
allowed?)
Maybe such elements should be deleted instead of parsed when inside a
preformatted text.
So, if the user uses PRE, we seem to be able to do as we like with tags
other than its own end-tag, but in XMP and LISTING other tags should
*not* be parsed.
The same treatment that applies to XMP and LISTING should apply to
TEXTAREA as well.
The user can work around this herself by using entities (&lt; and
so on), but the behaviour of TreeBuilder is still not correct.
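
The entity workaround amounts to escaping the markup characters before
the text is placed inside the element. A minimal sketch, using a
hand-written helper (escape_markup is a name invented for this example)
rather than any particular library call:

```perl
use strict;
use warnings;

# Replace the three characters that can start markup or an entity.
# The ampersand must be handled first, otherwise the '&' in the
# freshly inserted '&lt;' and '&gt;' would be escaped a second time.
sub escape_markup {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print "<XMP>", escape_markup('<FORM ACTION="x">'), "</XMP>\n";
```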
I had a look at the code and discovered that this might be difficult to
solve: HTML::Parser cannot know inside which tags everything should be
treated as text.
I thought of two solutions:
1. Parser has a virtual method, is_text_tag, that is used to determine
   whether the current tag is one of the ones above. If it is, everything
   up to the end-tag is treated as plain text.
2. Parser saves the eaten text for each tag and has a method,
   get_eaten_text, to get it. In the TreeBuilder methods start and end
   one could then easily find out whether the tag was inside any of the
   above tags, and call text with the eaten text returned by
   get_eaten_text.
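
The first solution could look roughly like the following. This is a
toy tokenizer written only to show the is_text_tag hook; it is not the
real HTML::Parser, and the package names SketchParser and SketchBuilder
as well as the event list are assumptions made for this example.

```perl
use strict;
use warnings;

package SketchParser;    # hypothetical stand-in for the parser

sub new { my ($class) = @_; return bless { events => [] }, $class }

# Virtual method: subclasses say which elements hold only text.
sub is_text_tag { return 0 }

# Callbacks just record what was seen.
sub start { my ($self, $tag)  = @_; push @{ $self->{events} }, "start:$tag" }
sub end   { my ($self, $tag)  = @_; push @{ $self->{events} }, "end:$tag" }
sub text  { my ($self, $text) = @_; push @{ $self->{events} }, "text:$text" }

sub parse {
    my ($self, $html) = @_;
    while (length $html) {
        if ($html =~ s{^</([a-zA-Z][a-zA-Z0-9]*)\s*>}{}) {
            $self->end(lc $1);
        }
        elsif ($html =~ s{^<([a-zA-Z][a-zA-Z0-9]*)[^>]*>}{}) {
            my $tag = lc $1;
            $self->start($tag);
            # For a text tag, swallow everything up to its own end-tag
            # and report it as text, without looking for markup in it.
            if ($self->is_text_tag($tag)
                && $html =~ s{^(.*?)(?=</\Q$tag\E\s*>)}{}si) {
                $self->text($1) if length $1;
            }
        }
        elsif ($html =~ s{^([^<]+)}{}) {
            $self->text($1);
        }
        else {
            # A '<' that does not start a tag: pass it on as text.
            $html =~ s{^(.)}{}s;
            $self->text($1);
        }
    }
    return $self;
}

package SketchBuilder;    # hypothetical stand-in for the tree builder
our @ISA = ('SketchParser');

my %TEXT_TAG = map { $_ => 1 } qw(xmp listing textarea);
sub is_text_tag { my ($self, $tag) = @_; return $TEXT_TAG{$tag} ? 1 : 0 }

package main;

my $p = SketchBuilder->new->parse('<xmp><b>bold</b></xmp>');
print join("\n", @{ $p->{events} }), "\n";
```

With the override in place the B tags arrive as part of a single text
event; with the base class they would be reported as start and end
events of their own.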
I had a look in the Hypermail Archive before I wrote this, so I hope I
am not reposting an old bug report.
---
Stefan Eriksson, Lund university, Sweden
wwwproj@dna.lth.se, dat93ser@ludat.lth.se