HTML tables, DOM and SGMLS

John Whelan (whelan@itp.unibe.ch)
Sun, 18 Apr 1999 09:22:47 +0200


stacy-lacy@worldnet.att.net, k.ueno@psynet.net
Bcc: whelan

>> I want to parse a html page having table format into
>> simple text. I tried the following code, but

>You can subclass HTML::Parser. I did a HTML::ParseTables which is such a
>subclass, but it has no documentation yet, and has not been tested much.

Last summer/fall I got most of the way through writing a HyperTable.pm
module to manipulate HTML tables, but when I tried to implement THEAD,
TBODY, TFOOT, COLGROUP, etc., I realized the grid model I was using was
not a good idea and decided to switch over to some sort of DOM-like
representation, which never got written.  There are two other existing
HTML table packages I know of, HTML::TableLayout by Steve Farrell
<sfarrell@healthquiz.com> and HTML::Table by Stacy Lacy
<stacy-lacy@worldnet.att.net>, which I believe are/were both on CPAN.
Neither of them had support for the various HTML 4.0 table constructs,
though.
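
To illustrate the difference between a flat grid and a tree-like
representation, here is a minimal sketch in Python (not Perl, and not
the HyperTable.pm code, which was never finished).  It uses the
standard html.parser module; the names TableTree and table_to_text are
made up for this example.  It assumes well-formed markup with explicit
closing tags, ignores nested tables, COLGROUP and cell spans, and simply
keeps rows grouped under THEAD/TBODY/TFOOT before dumping them as text.

```python
from html.parser import HTMLParser


class TableTree(HTMLParser):
    """Collect a table as sections -> rows -> cells instead of a flat grid."""

    SECTIONS = ("thead", "tbody", "tfoot")

    def __init__(self):
        super().__init__()
        self.sections = []   # list of (section_name, rows)
        self._rows = None    # rows of the section currently being filled
        self._cell = None    # text fragments of the open cell, or None

    def _open_section(self, name):
        self._rows = []
        self.sections.append((name, self._rows))

    def handle_starttag(self, tag, attrs):
        if tag in self.SECTIONS:
            self._open_section(tag)
        elif tag == "tr":
            if self._rows is None:
                # A row outside any explicit section: treat as implied TBODY.
                self._open_section("tbody")
            self._rows.append([])
        elif tag in ("td", "th") and self._rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag in self.SECTIONS or tag == "table":
            self._rows = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_text(html):
    """Return the table as text: one line per row, cells separated by tabs."""
    parser = TableTree()
    parser.feed(html)
    parser.close()
    lines = []
    for name, rows in parser.sections:
        lines.append("[" + name + "]")
        lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines)
```

Calling table_to_text() on a page's HTML gives one labelled block per
row group, which is roughly the "simple text" the original question
asked for, while keeping the header/body/footer distinction that a
plain grid loses.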

This brings up two other points I've been meaning to raise: first of
all, has anyone written a Perl module to implement the HTML (rather
than XML) version of the W3C's Document Object Model?  None existed
last fall, and I was almost at the point of trying to write one myself
when more pressing matters intervened.

Second, one of the problems with HTML::Parser is that it relies upon
information hard-coded into the module source to know which tags are
stand-alone, which have implied closing tags, etc., and thus chokes on
newer stand-alone tags such as COL.  Has anyone worked with the SGMLS
module, which uses the (n)sgmls parser to read the Document Type
Definition and determine the syntax of a document before parsing it?
That would seem like a way to build a document tree or
DOM without having to update the parser source code every time the
HTML spec changes.  (Emacs psgml mode has this advantage; the version
put out before the HTML 4.0 spec handles HTML 4.0 documents just
fine.)
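
As a rough illustration of the idea (in Python, and in no way a
substitute for SGMLS or nsgmls), the sketch below pulls the set of
stand-alone elements out of ELEMENT declarations instead of hard-coding
it.  The DTD fragment is hand-written in the style of the HTML 4 DTD,
not copied from it, and read_element_rules and HARD_CODED_EMPTY are
hypothetical names.  A real SGML parser also has to expand parameter
entities, skip comments and handle inclusions/exclusions, none of which
is attempted here.

```python
import re

# Hand-written fragment in the style of an SGML DTD (an assumption for
# this example, not the real HTML 4 DTD text).
DTD_FRAGMENT = """
<!ELEMENT BR - O EMPTY>
<!ELEMENT (COL|HR) - O EMPTY>
<!ELEMENT TD - O (#PCDATA)*>
<!ELEMENT TABLE - - (TR)+>
"""

# name or (name|name), start-tag minimization, end-tag minimization, content
DECL = re.compile(r"<!ELEMENT\s+(\([^)]*\)|\S+)\s+([-O])\s+([-O])\s+([^>]+)>")


def read_element_rules(dtd_text):
    """Return (empty_elements, optional_end_tag_elements) as lowercase sets."""
    empty, optional_end = set(), set()
    for names, _start, end, content in DECL.findall(dtd_text):
        for name in names.strip("()").split("|"):
            name = name.strip().lower()
            if content.strip() == "EMPTY":
                empty.add(name)          # stand-alone: never has an end tag
            elif end == "O":
                optional_end.add(name)   # end tag may be omitted
    return empty, optional_end


# What a parser with a built-in table might know (hypothetical list).
HARD_CODED_EMPTY = {"br", "hr"}

empty, optional_end = read_element_rules(DTD_FRAGMENT)
missing = sorted(empty - HARD_CODED_EMPTY)   # elements the fixed list lacks
```

With this fragment, missing comes out as ['col']: the DTD-driven set
picks up COL without anyone touching the parser source, which is the
advantage psgml gets from reading the DTD.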
					John T. Whelan
					whelan@iname.com
					http://www.slack.net/~whelan/