HTML::Parser ... is there a bug in option parsing?

Charlie Stross (charles@fma10.fma.com)
Mon, 30 Sep 1996 12:41:06 +0100


I've been playing with LWP 5.01 (writing an HTML prettifier, to make 
ancient code more maintainable) and ran into a curious problem. My
program calls parse_htmlfile to build a parse tree, then walks it
spitting out elements and text (hopefully indented in a readable
manner). However, when I feed it a document containing a PICS declaration
in a header like this ...

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
blah

What comes out of $entity->starttag (where $entity is an HTML::Element
containing the PICS bumph) looks like this:

Blah

What's happening is that the PICS content uses single-quoting to
encapsulate information that includes double-quotes. The single-quoted
double quotes appear to be getting mangled into &quot; entities.

I note from the nearest available DTD that in a META tag, "content" is
defined as CDATA. I thought CDATA was exempt from interpolation? (Please
excuse me if I'm wrong: my SGML is at the pidgin-speak level.)

Is this (a) a problem with HTML::Parser's handling of quoted text,
(b) a problem with the DTD, or (c) a figment of my fevered imagination?

-- Charlie Stross

(PS: the HTML prettyprinter is a rather crude prototype. Yes, I know I
should be using HTML::Parser and overloading it with methods to
accomplish my goal rather than do it this way; but I'm not quite
object-minded enough to do that without prototyping the idea first.
Copy available on request for anyone who wants it ...)
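
The behaviour described above can be reproduced with a short script. This
is only a sketch under stated assumptions: it uses HTML::TreeBuilder and
its new_from_content constructor from a current HTML-Tree distribution
(not the parse_htmlfile interface of LWP 5.01), the helper name
inspect_meta is made up for illustration, and the label in $label is a
hypothetical stand-in, since the original PICS label is not shown. It
prints the parsed attribute value and the regenerated start tag, then
parses the regenerated tag again to see whether the value survives the
round trip.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for a PICS label (an assumption, not the label
# from the original document). All that matters is the double quotes.
my $label = '(PICS-1.1 "http://example.org/ratings" l r (s 0))';

# inspect_meta is an illustrative helper, not part of any library.
# It returns the CONTENT value of the first META element in $html,
# plus the start tag that HTML::Element regenerates for that element.
sub inspect_meta {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $meta = $tree->look_down(_tag => 'meta')
        or die "no META element found\n";
    my @result = ($meta->attr('content'), $meta->starttag);
    $tree->delete;    # free the parse tree
    return @result;
}

# The attribute is single-quoted in the source because its value
# contains double quotes.
my $source = qq{<html><head><meta http-equiv="PICS-Label" }
           . qq{content='$label'></head><body>blah</body></html>};

my ($parsed, $tag) = inspect_meta($source);
print "parsed value: $parsed\n";
print "regenerated:  $tag\n";

# Parse the regenerated tag and compare the value with the original.
my ($reparsed) =
    inspect_meta("<html><head>$tag</head><body>blah</body></html>");
print "round trip ",
    ($reparsed eq $label ? "preserves" : "changes"),
    " the attribute value\n";
```

If the round trip preserves the value, the regenerated tag is a different
spelling of the same attribute rather than a corrupted one: starttag
writes attribute values inside double quotes, so any double quote within
the value has to be written as an entity.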