Re: Weirdness in HTML::TreeBuilder ...
Dave Ruderman (daver@pathfinder.com)
Thu, 19 Sep 1996 12:08:03 -0400 (EDT)
Charlie Stross <charles@fma.com> wrote:
> I suspect the HTML parser in LWP 5.00 (and 5.01) is getting a little
> confused when it runs across a <SCRIPT> tag in the head of a document.
>
> Reason: I'm hacking on a homebrew module that recursively descends a
> web, building a parse tree of each document as it goes (via
> HTML::TreeBuilder), and spitting out tokens.
>
> If I feed it a document like this ...
>
> <html>
> <head>
> <META NAME="foo" CONTENT="bar">
> <SCRIPT LANGUAGE="JavaScript">
> <!-- some scripting nonsense goes here -->
> </SCRIPT>
> <title>some title</title>
> </head>
> <body>
>
> When I do the traversal, what comes out is this:
>
> <HTML>
> <HEAD>
> <META CONTENT="bar" NAME="foo">
> </HEAD>
> <BODY>
> <SCRIPT LANGUAGE="JavaScript">
> </SCRIPT>
> <TITLE>
> some
> title
> </TITLE>
> <P>
>
> Note that (a) the attributes of the META tag are in reverse order,
> and (b) the SCRIPT tag has somehow been shunted down out of the <HEAD>
> section! (a) I can live with, but (b) is somewhat disturbing. Am I
> missing something obvious, or is this a parser 'feature'?
>
> (NB: I initially suspected the reversed attributes of the META tag
> arose from the way I'd built a queue (as a buffer) in my module, but
> as far as I can tell the queue is innocent: push() at one end, shift()
> at the other. I'm currently scratching my head ...)
>
A) All attributes in all tags come out in alphabetical (or Perl hash) order.
Since attributes are kept in a hash, the order they were typed in is lost.
This can be more annoying for readability in a case like this:
<IMG HEIGHT=30 ...(other attributes) SRC="abc.gif" WIDTH=30>
where it is common to key in the image dimensions next to each other.
In general, one cannot rely on attribute ordering.
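To show the hash effect in miniature, here is a rough sketch in plain Perl (no LWP required). The start_tag() helper and the @typed list are made up for illustration and are not part of TreeBuilder; the sketch only assumes that attributes end up in a hash and get written out with sorted keys, which would match the alphabetical output seen above.

```perl
use strict;
use warnings;

# Attributes in the order the page author typed them (illustrative data).
my @typed = (SRC => 'abc.gif', WIDTH => 30, HEIGHT => 30);

# Copying the pairs into a hash discards the source order.
my %attr = @typed;

# Hypothetical helper: build a start tag with attribute names sorted.
sub start_tag {
    my ($tag, $attr) = @_;
    my @pairs = map { sprintf '%s="%s"', $_, $attr->{$_} } sort keys %$attr;
    return join(' ', "<$tag", @pairs) . '>';
}

print start_tag('IMG', \%attr), "\n";
# prints: <IMG HEIGHT="30" SRC="abc.gif" WIDTH="30">
```

If the original order matters to your module, you would have to keep the names in a separate list alongside the hash; the hash alone cannot give it back.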
Not sure why B) happens.
I do know that %isBodyElement in TreeBuilder.pm specifically lists the
<SCRIPT> tag as both a body and a head element;
I presume this comes from the HTML DTD.
-dave
-------------------------------------------------------------------
David Ruderman daver@pathfinder.com (212) 522-9919
Core Technology Engineer - Pathfinder - Time Inc., New Media
You simply must try: http://pathfinder.com/pathfinder/staff/daver