Re: SGML Parser in Perl?
Earl Hood (ehood@imagine.convex.com)
Thu, 13 Oct 1994 12:25:43 -0500
> I've just come back off holiday, and while on the beach I've been
> thinking about HTML and Perl 5 (yeah, sad case). I have been
> considering "htmls": an HTML parser that will check and rewrite HTML
> into different forms (like indented; smallest; fully expanded
> etc). This would have the HTML DTD built-in; and not actually parse it
> every time. This could mean a single C program is all you need to
> verify and manage HTML (unlike the entire sgmls setup). Of course I'd
> like to prototype it in Perl5.
Sgmls documentation is poor, but once you figure out how to use
it, it isn't too bad. One can write a shell wrapper to always
include the SGML declaration and HTML DTD before parsing an HTML
document.
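
A minimal sketch of such a wrapper, as a shell function. It relies on
sgmls treating the files named on its command line as one concatenated
document entity. The function name htmlparse and the two paths are
hypothetical, and the file called html.dtd here is assumed to hold the
DOCTYPE declaration that pulls in the HTML DTD:

```shell
# Hypothetical locations; adjust to wherever these files are installed.
SGMLS="${SGMLS:-sgmls}"
HTML_DECL="${HTML_DECL:-/usr/local/lib/sgml/html.decl}"
HTML_DTD="${HTML_DTD:-/usr/local/lib/sgml/html.dtd}"

# Parse each HTML document with the SGML declaration and the DTD
# prolog placed in front of it.
htmlparse() {
    if [ "$#" -lt 1 ]; then
        echo "usage: htmlparse file.html ..." >&2
        return 2
    fi
    for doc in "$@"; do
        # Stop at the first document that sgmls rejects.
        "$SGMLS" "$HTML_DECL" "$HTML_DTD" "$doc" || return $?
    done
}
```

With this in place, "htmlparse index.html" runs the parser without
having to name the declaration and DTD each time.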
> I don't think full SGML compliance (ie those features that browsers
> don't implement) is all that important or even desirable for that
> purpose, and this program could pick things out that SGML is too
> general for.
>
> Have you got any thoughts on this?
I think 'full' SGML compliance is important. Reason: Future versions
of HTML will definitely adopt other SGML features (Note: Technically
HTML already does, since the HTML 2.0 spec claims HTML is an SGML
application -- it is up to Web software to catch up).
Therefore, I can see HTML documents of the future using marked sections,
general entities, and other features of SGML. Sgmls can handle this.
Plus, it outputs a 'normalized' version of the parsed document which is
easily parseable by a (Perl) post processor.
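
To give an idea of how simple the post-processing can be, here is a
rough sketch that reads sgmls output on standard input and prints an
indented outline of the element structure. It uses awk from a shell
function rather than Perl, only to keep the example short; the function
name esis_outline is made up. It relies on the sgmls output format in
which a line starting with "(" opens an element and a line starting
with ")" closes one; every other line (data, attributes, and so on) is
ignored here:

```shell
# Print one line per element, indented by nesting depth.
esis_outline() {
    awk '
        /^\(/ {
            # Start of element: build the indent, print the name.
            indent = ""
            for (i = 0; i < depth; i++) indent = indent "  "
            print indent substr($0, 2)
            depth++
            next
        }
        /^\)/ {
            # End of element: go back up one level.
            depth--
            next
        }
    '
}
```

It would be used at the end of a pipeline, reading whatever sgmls
writes to standard output.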
If you build a good HTML/SGML parser in Perl, then I'd definitely be
interested in using such a beast (and would be willing to help, if desired).
However, you must consider the following: Is it worth building a
parser from scratch when a tool (sgmls) already exists that does
the job?
Later,
--ewh