HTML::Parser ... is there a bug in option parsing?
Charlie Stross (charles@fma10.fma.com)
Mon, 30 Sep 1996 12:41:06 +0100
I've been playing with LWP 5.01 (writing an HTML prettifier, to make
ancient code more maintainable) and ran into a curious problem. My
program calls parse_htmlfile to build a parse tree, then walks it
spitting out elements and text (hopefully indented in a readable
manner). However, when I feed it a document containing a PICS declaration
in a header like this ...
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
blah
What comes out of $entity->starttag (where $entity is an HTML::Element
containing the PICS bumph) looks like this:
Blah
What's happening is that the PICS content uses single-quoting to
encapsulate information that includes double-quotes. The single-quoted
double quotes appear to be mangled into &quot; entities.
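
To make the effect concrete, here is a minimal sketch (in Python, not Perl, and not HTML::Parser's actual code) of what any serializer has to do when it writes an attribute value between double quotes. The label value is a hypothetical stand-in, since the real PICS content is not reproduced above.

```python
import html

# Hypothetical PICS-style label value; the real one from the post is not shown.
label = '(PICS-1.1 "http://example.org/ratings" l r (n 0))'

# Writing the attribute with double-quote delimiters forces the inner
# double quotes to be escaped, otherwise the value would end early.
escaped = html.escape(label, quote=True)
start_tag = f'<meta http-equiv="PICS-Label" content="{escaped}">'
print(start_tag)

# Decoding the entities recovers the original value unchanged.
assert html.unescape(escaped) == label
```

If something similar is going on here, the output would differ from the input only in how the quotes are spelled, not in the value a conforming reader sees.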
I note from the nearest available DTD that in a META tag, "content"
is defined as CDATA. I thought CDATA was exempt from interpolation?
(Please excuse me if I'm wrong: my SGML is at the pidgin-speak level).
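
On the CDATA point, my understanding (hedged, and based on the HTML specs rather than on the parser source) is that entity references are still recognized inside CDATA attribute values; it is CDATA element content that is left alone. The sketch below uses Python's standard html.parser as a stand-in to show the two spellings parsing to the same value. The AttrCollector class and the label text are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects the decoded 'content' attribute of every META start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives with quotes removed and entity references replaced.
        for name, value in attrs:
            if tag == "meta" and name == "content":
                self.values.append(value)


# Same hypothetical value, once single-quoted with raw double quotes,
# once double-quoted with the inner quotes written as entities.
single_quoted = """<meta content='(PICS-1.1 "http://example.org/ratings" l r (n 0))'>"""
entity_quoted = '<meta content="(PICS-1.1 &quot;http://example.org/ratings&quot; l r (n 0))">'

collector = AttrCollector()
collector.feed(single_quoted)
collector.feed(entity_quoted)
collector.close()

# Both forms decode to the identical attribute value.
assert collector.values[0] == collector.values[1]
```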
Is this (a) a problem with HTML::Parser's handling of quoted text, (b) a
problem with the DTD, or (c) a figment of my fevered imagination?
-- Charlie Stross
(PS: the html prettyprinter is a rather crude prototype. Yes, I know I
should be using HTML::Parser and overloading it with methods to accomplish
my goal rather than do it this way; but I'm not quite object-minded
enough to do that without prototyping the idea first. Copy available on
request for anyone who wants it ...)