Bugs in HTML::TreeBuilder?

WWW projekt (wwwproj@dna.lth.se)
Wed, 31 Jul 1996 12:41:22 +0200


Hi,

I believe that I have found two bugs in HTML::TreeBuilder. The first
should be easy to fix, but the other one might be difficult.

Possible bug 1
--------------
The first bug is shown by this example:

% perl -MHTML::TreeBuilder
$ht = new HTML::TreeBuilder;
$p = << "END";
<FORM>
<TEXTAREA NAME="EvenMoreText" ROWS=20 COLS=40>
Here's some text!
Next line.
  Third with two spaces before.
</TEXTAREA>

</FORM>
END
$ht->parse($p);
print $ht->as_HTML;
^D

The TEXTAREA tag should be treated like XMP and LISTING: its contents
should be preserved. This can be fixed by changing one line in
TreeBuilder. In sub text, change

      HTML::Entities::decode($text) unless $ignore_text;

!     if ($pos->is_inside(qw(pre xmp listing))) {
          return if $ignore_text;
          $pos->push_content($text);
      } else {

to

      HTML::Entities::decode($text) unless $ignore_text;

!     if ($pos->is_inside(qw(pre xmp listing textarea))) {
          return if $ignore_text;
          $pos->push_content($text);
      } else {

Possible bug 2
--------------
This is a quotation from the HTML 2.0 internet draft:

  "The XMP and LISTING elements are similar to the PRE element, but
  they have a different syntax. Their content is declared as CDATA,
  which means that no markup except the end-tag open
  delimiter-in-context is recognized (see 9.6 "Delimiter Recognition"
  of SGML)."

So, XMP and LISTING should not recognize any tags but their own
end-tags. In HTML::TreeBuilder this is not true:

% perl -MHTML::TreeBuilder
$ht = new HTML::TreeBuilder;
$p = << "END";
<FORM METHOD=POST ACTION="cgi-bin/act.cgi">
END
$ht->parse($p);
print $ht->as_HTML;
^D
<FORM ACTION="cgi-bin/act.cgi" METHOD="POST"> </FORM>

It both adds an end-tag for FORM and changes the order of its
attributes.

Quotation from the draft once again, section 'Preformatted Text: PRE':

  "Within preformatted text:

   - Line breaks within the text are rendered as a move to the
     beginning of the next line. (15)

   - Anchor elements and phrase markup may be used. (16)

   - Elements that define paragraph formatting (headings, address,
     etc.) must not be used. (17)

   - The horizontal tab character (code position 9 in the HTML
     document character set) must be interpreted as the smallest
     positive nonzero number of spaces which will leave the number
     of characters so far on the line as a multiple of 8. Documents
     should not contain tab characters, as they are not supported
     consistently"

So, inside a PRE, anchors may be used, and paragraph formatting may
not be used.
(I wonder what happens with FORM and such, are they allowed?) Maybe
such elements should be deleted instead of parsed when inside
preformatted text.

So, if the user uses PRE, we seem to be able to do as we like with
tags other than its own end-tag, but in XMP and LISTING other tags
should *not* be parsed. The same treatment that applies to XMP and
LISTING should apply to TEXTAREA as well. The user can work around
this herself by using the entity set (&lt; and so on), but the
behaviour of TreeBuilder is not correct.

I had a look at the code and discovered that this might be difficult
to solve: HTML::Parser cannot know inside which tags everything should
be treated as text. I thought of two solutions:

- Parser having a virtual method, is_text_tag, that is used to
  determine whether the current tag is one of the ones above. If it
  is, all text up to the end-tag is treated as plain text.

- Parser saving the eaten text for each tag and having a method,
  get_eaten_text, to get it. In the TreeBuilder methods start and end
  one could then easily find out whether the tag was inside any of the
  above tags, and call text with the eaten text from get_eaten_text.

I had a look in the Hypermail Archive before I wrote this, so I hope
I am not reposting an old bug report.

---
Stefan Eriksson, Lund university, Sweden
wwwproj@dna.lth.se, dat93ser@ludat.lth.se