Re: possible bug in HTML::Parser comment handler

Sean M. Burke (sburke@spinn.net)
Fri, 12 Jan 2001 00:35:26 -0700


At 11:21 PM 2001-01-11 +0100, Bjoern Hoehrmann wrote:
>At 15:28 11.01.01 -0500, you wrote:
>>It seems that the parser is not properly detecting multi-line HTML
>>comments.  I was trying to print out the dtext of a html document and
>>noticed that comments kept showing up in the output.  Upon further
>>examination, the single line comments were being ignored but ones like
>>this:
>>
>><!--
>>td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}
>>-->
>
>Well, the content model of the style element is CDATA, your "comments"
>may look like comments but they are no comments in HTML and SGML
>terms. That's not a bug.

I don't see what's wrong with that comment.

ISO 8879, section 10.3, defines a "comment declaration" (yes, horrible
term for it) as:

 comment declaration =
  "<!",
  (comment
    (s | comment)*
  )?
  ">"

 comment =
  "--",
  SGML_character*
  "--"

And in section 6.2.1, there's the explanation of "s":

 s = SPACE | RE | RS | SEPCHAR
  and in the concrete syntax, that means [\x20\cm\cj\t]

And as to "SGML_character", section 9.2 basically says that aside from any
characters that you go and reserve as being impermissible, anything is an
SGML_character.  (I'm getting this from the /SGML Handbook/, which contains
the full text of ISO 8879, plus annotation, etc.)


So I don't see a problem with 
  <!--
  td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}
  -->
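
To make that concrete, here is a rough sketch in Python that turns the two
productions quoted above into a regular expression and checks the example
against it. The names (S, COMMENT, COMMENT_DECL, is_comment_declaration) are
mine, purely for illustration, and it assumes that "s" is just space, CR, LF,
and tab, and that SGML_character can be approximated as "any character". It
only tests the comment syntax in isolation; it says nothing about how a
parser treats the content of a style element.

```python
import re

# "s" in the concrete syntax: SPACE, RE (CR), RS (LF), SEPCHAR (TAB).
S = r"[\x20\r\n\t]"

# comment = "--", SGML_character*, "--"
# The body is cut off at the first "--", so it may not contain "--" itself.
COMMENT = r"--(?:(?!--).)*--"

# comment declaration = "<!", (comment, (s | comment)*)?, ">"
COMMENT_DECL = re.compile(rf"<!(?:{COMMENT}(?:{S}|{COMMENT})*)?>", re.DOTALL)


def is_comment_declaration(text):
    """Return True if the whole string is one comment declaration."""
    return COMMENT_DECL.fullmatch(text) is not None


example = (
    "<!--\n"
    "td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}\n"
    "-->"
)
print(is_comment_declaration(example))  # True
```

Under those assumptions the multi-line example matches, since the newlines
are ordinary characters inside the comment body.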



BTW, the XML spec's definition is even clearer, er, sort of:

   Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

To this they add:  "Note that the grammar does not allow a comment ending
in --->. The following example is not well-formed: '<!-- B+, B, or B--->'".
I'm a bit unclear on whether this really falls out of the grammar, but
anyway.
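
A quick way to poke at that is to transcribe the production into a regex and
try both forms. This is only a sketch: XML_COMMENT and is_xml_comment are
made-up names, and Char is approximated as "any character" rather than the
spec's exact character ranges.

```python
import re

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Each step of the body is either a non-hyphen, or a hyphen followed by a
# non-hyphen, so the body can never end with a hyphen.
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_xml_comment(text):
    """Return True if the whole string matches the Comment production."""
    return XML_COMMENT.fullmatch(text) is not None


print(is_xml_comment("<!-- B+, B, or B-->"))   # True
print(is_xml_comment("<!-- B+, B, or B--->"))  # False
```

So, at least under this reading, the restriction does seem to follow from the
grammar: a lone "-" in the body has to be followed by a non-hyphen, which
leaves nothing to account for the extra "-" in "--->".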


--
Sean M. Burke  sburke@cpan.org  http://www.spinn.net/~sburke/