Re: possible bug in HTML::Parser comment handler
Dave (dave.olszewski@andover.net)
Fri, 12 Jan 2001 13:42:26 -0500 (EST)
I have solved the problem I was having thanks to the info here. All I had
to do was pass is_cdata as an arg to the handler and only print if it was
false. Thanks very much.
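
A rough analogue of that fix, sketched in Python's standard html.parser rather
than in Perl's HTML::Parser, so none of the names below are HTML::Parser's API.
Python's parser has no is_cdata argument; the TextOnly class, its inside_cdata
flag, and visible_text() are made up here to play the same role, on the
assumption that only <style> and <script> carry CDATA content.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect document text, skipping the CDATA content of <style>/<script>."""

    CDATA_ELEMENTS = ("style", "script")  # assumption: the only CDATA elements

    def __init__(self):
        super().__init__()
        self.inside_cdata = False  # hypothetical stand-in for Perl's is_cdata
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CDATA_ELEMENTS:
            self.inside_cdata = True

    def handle_endtag(self, tag):
        if tag in self.CDATA_ELEMENTS:
            self.inside_cdata = False

    def handle_data(self, data):
        # Keep text only when it is not CDATA ("only print if it was false").
        if not self.inside_cdata:
            self.chunks.append(data)


def visible_text(html):
    """Return the concatenated non-CDATA text of an HTML string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

Fed a document whose <style> element wraps its rules in <!-- ... -->, this
returns the paragraph text only, because the style content arrives as plain
data and is dropped.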
dave
On Fri, 12 Jan 2001, Sean M. Burke wrote:
> At 11:21 PM 2001-01-11 +0100, Bjoern Hoehrmann wrote:
> >At 15:28 11.01.01 -0500, you wrote:
> >>It seems that the parser is not properly detecting multi-line HTML
> >>comments. I was trying to print out the dtext of an HTML document and
> >>noticed that comments kept showing up in the output. Upon further
> >>examination, the single-line comments were being ignored but ones like
> >>this:
> >>
> >><!--
> >>td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}
> >>-->
> >
> >Well, the content model of the style element is CDATA; your "comments"
> >may look like comments, but they are not comments in HTML and SGML
> >terms. That's not a bug.
>
> I don't see what's wrong with that comment.
>
> ISO 8879 Section 10.3 defines a "comment declaration" (yes, horrible
> term for it) as:
>
>   comment declaration =
>     "<!",
>     (comment
>       (s | comment)*
>     )?
>     ">"
>
>   comment =
>     "--",
>     SGML_character*
>     "--"
>
> And in section 6.2.1, there's the explanation of "s":
>
> s = SPACE | RE | RS | SEPCHAR
> and in the concrete syntax, that means [\x20\cm\cj\t]
>
> And as to "SGML_character", section 9.2 basically says that aside from any
> characters that you go and reserve as being impermissible, anything is an
> SGML_character. (I'm getting this from the /SGML Handbook/, which contains
> the full text of ISO 8879, plus annotation, etc.)
>
>
> So I don't see a problem with
> <!--
> td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}
> -->
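
The grammar quoted above can be checked mechanically. What follows is a
minimal sketch in Python (not Perl), assuming that a comment body runs up to
the first "--" and that SGML_character is approximated by "any character";
the names S, COMMENT, COMMENT_DECL and is_comment_declaration() are invented
for the illustration.

```python
import re

S = r"[\x20\r\n\t]"            # s = SPACE | RE | RS | SEPCHAR (concrete syntax)
COMMENT = r"--(?:(?!--).)*--"  # assumption: the body ends at the first "--"
COMMENT_DECL = re.compile(
    rf"<!(?:{COMMENT}(?:{S}|{COMMENT})*)?>",
    re.DOTALL,                 # let "." span line breaks inside a comment
)


def is_comment_declaration(text):
    """True if the whole string is one comment declaration under this sketch."""
    return COMMENT_DECL.fullmatch(text) is not None


multi_line = (
    "<!--\n"
    "td {font-family: Arial,Geneva,Helvetica,sans-serif; color: #000000;}\n"
    "-->"
)
print(is_comment_declaration(multi_line))         # the example from this thread
print(is_comment_declaration("<!-- a -- b -->"))  # stray text between comments
```

Under those assumptions the multi-line example is accepted, while
"<!-- a -- b -->" is rejected, because "b" is neither an s separator nor a
comment.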
>
>
>
> BTW, the XML spec's definition is even clearer, er, sort of:
>
> Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
>
> To this they add: "Note that the grammar does not allow a comment ending
> in --->. The following example is not well-formed: '<!-- B+, B, or B--->'".
> I'm a bit unclear on whether this really falls out of the grammar, but
> anyway.
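
On whether the "--->" restriction falls out of the grammar: it does, since
every "-" inside the body must be followed by a character that is not "-",
so the body can never end in "-". A small sketch in Python (not Perl) that
transcribes the production into a regular expression, approximating Char as
"any character"; XML_COMMENT and is_xml_comment() are names invented here.

```python
import re

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Assumption: Char is approximated by "any character", newlines included.
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_xml_comment(text):
    """True if the whole string matches the Comment production above."""
    return XML_COMMENT.fullmatch(text) is not None


print(is_xml_comment("<!-- B+, B, or B--->"))  # the spec's ill-formed example
print(is_xml_comment("<!--\ntd {color: #000000;}\n-->"))
```

The first call reports no match and the second reports a match, so the
multi-line style comment from this thread is fine under the XML rule as well.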
>
>
> --
> Sean M. Burke sburke@cpan.org http://www.spinn.net/~sburke/
>
>