Re: SGML Parser in Perl?

Earl Hood (ehood@imagine.convex.com)
Thu, 13 Oct 1994 12:25:43 -0500


> I've just come back off holiday, and while on the beach I've been
> thinking about HTML and Perl 5 (yeah, sad case). I have been
> considering "htmls": an HTML parser that will check and rewrite HTML
> into different forms (like indented; smallest; fully expanded
> etc). This would have the HTML DTD built-in; and not actually parse it
> every time. This could mean a single C program is all you need to
> verify and manage HTML (unlike the entire sgmls setup). Of course I'd
> like to prototype it in Perl5.

Sgmls documentation is poor, but once you figure out how to use
it, it isn't too bad.  One can write a shell wrapper to always
include the SGML declaration and HTML DTD before parsing an HTML
document.
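
A rough sketch of such a wrapper, written in Python purely for
illustration. It assumes sgmls treats the files named on its command
line, in order, as a single document entity. The file names html.decl
(the SGML declaration) and html-prolog.sgml (a DOCTYPE line that pulls
in the HTML DTD) are hypothetical placeholders, not files shipped with
sgmls.

```python
import subprocess


def sgmls_command(document, declaration="html.decl", prolog="html-prolog.sgml"):
    """Build the argument list for running sgmls on one HTML document.

    Assumption: sgmls reads the named files in order as one document
    entity, so the declaration and prolog come before the document.
    The default file names are hypothetical.
    """
    return ["sgmls", declaration, prolog, document]


def parse_html(document):
    """Run sgmls (it must be installed) and return its output as text."""
    result = subprocess.run(
        sgmls_command(document), capture_output=True, text=True
    )
    return result.stdout
```

With this in place, parse_html("page.html") would run the parser with
the declaration and prolog prepended, so individual documents need not
carry them.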


> I don't think full SGML compliance (ie those features that browsers
> don't implement) is all that important or even desirable for that
> purpose, and this program could pick things out that SGML is too
> general for.
> 
> Have you got any thoughts on this? 

I think 'full' SGML compliance is important.  Reason: Future versions
of HTML will definitely adopt other SGML features.  (Note: Technically
HTML already allows them, since the HTML 2.0 spec claims HTML is an
SGML application -- it is up to Web software to catch up.)

Therefore, I can see HTML documents of the future using marked sections,
general entities, and other features of SGML.  Sgmls can handle this.
Plus, it outputs a 'normalized' version of the parsed document which is
easy to parse with a (Perl) post-processor.
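
To give an idea of how little a post-processor has to do, here is a
minimal sketch (in Python for illustration; the same logic ports
directly to Perl). It assumes the line-oriented sgmls output where the
first character of each line is a command: "(" for a start tag, ")"
for an end tag, "-" for data, and "A" for an attribute belonging to
the next start tag. Only the \n and \\ data escapes are decoded here;
other escapes and other line types are left alone or skipped. The
function names are my own.

```python
import re


def unescape(text):
    """Decode the two escapes handled in this sketch: \\n and \\\\."""
    return re.sub(
        r"\\(n|\\)",
        lambda m: "\n" if m.group(1) == "n" else "\\",
        text,
    )


def parse_esis(lines):
    """Yield (kind, payload) events from sgmls-style output lines."""
    pending_attrs = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # Attribute lines come before the start tag they belong to.
            name, _, tail = rest.partition(" ")
            _type, _, value = tail.partition(" ")
            pending_attrs[name] = value
        elif code == "(":
            yield ("start", (rest, pending_attrs))
            pending_attrs = {}
        elif code == ")":
            yield ("end", rest)
        elif code == "-":
            yield ("data", unescape(rest))
        # Any other line type is ignored in this sketch.


sample = [
    "AHREF CDATA http://example.org/",
    "(A",
    "-link text",
    ")A",
    "C",
]
events = list(parse_esis(sample))
```

Each start event carries the element name and a dictionary of its
attributes, so a rewriting tool (indenting, minimizing, and so on)
only needs to walk the event stream.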


If you build a good HTML/SGML parser in Perl, then I'd definitely be
interested in using such a beast (and be willing to help, if desired).
However, you must consider the following:  Is it worth building a
parser from scratch when a tool (sgmls) that does the job already
exists?

Later,

	--ewh