Re: possible bug in HTML::Parser comment handler
Sean M. Burke (sburke@spinn.net)
Fri, 12 Jan 2001 00:35:26 -0700
At 11:21 PM 2001-01-11 +0100, Bjoern Hoehrmann wrote:
>At 15:28 11.01.01 -0500, you wrote:
>>It seems that the parser is not properly detecting multi-line HTML
>>comments. I was trying to print out the dtext of a html document and
>>noticed that comments kept showing up in the output. Upon further
>>examination, the single line comments were being ignored but ones like
>>this:
>>
>><!--
>>td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}
>>-->
>
>Well, the content model of the style element is CDATA, your "comments"
>may look like comments but they are no comments in HTML and SGML
>terms. That's not a bug.
I don't see what's wrong with that comment.
ISO 8879, Section 10.3, defines a "comment declaration" (yes, a horrible
term for it) as:
  comment declaration =
      "<!",
      (comment
       (s | comment)*
      )?
      ">"

  comment =
      "--",
      SGML_character*
      "--"
And in section 6.2.1, there's the explanation of "s":
s = SPACE | RE | RS | SEPCHAR
and in the concrete syntax, that means [\x20\cm\cj\t] (space, CR, LF, tab).
And as to "SGML_character", section 9.2 basically says that aside from any
characters that you go and reserve as being impermissible, anything is an
SGML_character. (I'm getting this from the /SGML Handbook/, which contains
the full text of ISO 8879, plus annotation, etc.)
So I don't see a problem with
<!--
td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}
-->
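
As a rough check, the two productions quoted from Section 10.3 can be transcribed into a regular expression and run against that comment. This is only a sketch in Python, not anything HTML::Parser itself does. The names `S`, `COMMENT`, `COMMENT_DECL` and `is_comment_declaration` are made up for illustration, "s" is taken as the four characters given for the concrete syntax, SGML_character is approximated as "any character", and the comment body is assumed to stop at the first "--".

```python
import re

# s = SPACE | RE | RS | SEPCHAR, taken here as space, CR, LF and tab.
S = r"[\x20\r\n\t]"

# comment = "--", SGML_character*, "--"
# Assumption: the body ends at the first "--", so it may not contain "--".
COMMENT = r"--(?:(?!--).)*--"

# comment declaration = "<!", (comment (s | comment)*)?, ">"
COMMENT_DECL = re.compile(rf"<!(?:{COMMENT}(?:{S}|{COMMENT})*)?>", re.DOTALL)


def is_comment_declaration(text: str) -> bool:
    """Return True if the whole string matches the transcribed production."""
    return COMMENT_DECL.fullmatch(text) is not None


style_comment = (
    "<!--\n"
    "td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}\n"
    "-->"
)

print(is_comment_declaration(style_comment))         # the multi-line comment
print(is_comment_declaration("<!-- a -- -- b -->"))  # two comments in one declaration
print(is_comment_declaration("<!-- a -- b -->"))     # stray text between comments
```

Under those assumptions the multi-line comment is accepted, since neither the line breaks nor the single hyphens in "font-family" and "sans-serif" end the comment body.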
BTW, the XML spec's definition is even clearer, er, sort of:
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
To this they add: "Note that the grammar does not allow a comment ending
in --->. The following example is not well-formed: '<!-- B+, B, or B--->'".
I'm a bit unclear on whether this really falls out of the grammar, but
anyway.
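
One way to probe that is to transcribe the XML production literally into a regular expression and try the spec's example against it. This is a sketch under stated assumptions: `XML_COMMENT` and `is_xml_comment` are hypothetical names, and Char is approximated as "any character", which is looser than the XML spec's Char production.

```python
import re

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Assumption: Char is approximated by "any character", so (Char - '-')
# becomes [^-].
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_xml_comment(text: str) -> bool:
    """Return True if the whole string matches the transcribed production."""
    return XML_COMMENT.fullmatch(text) is not None


print(is_xml_comment("<!-- B+, B, or B--->"))  # the spec's counterexample
print(is_xml_comment("<!-- B+, B, or B -->"))  # same text, space before -->
print(is_xml_comment("<!-- a -- b -->"))       # "--" inside the body
```

In this transcription the counterexample is rejected: every '-' in the body must be followed by a non-'-' character, so the body cannot end in '-' immediately before the closing '-->'. That suggests the restriction does follow from the grammar as written, at least under the approximation of Char used here.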
--
Sean M. Burke sburke@cpan.org http://www.spinn.net/~sburke/