Re: Perl HTML parsing library

bcutter@pdn.paradyne.com
Tue, 6 Sep 94 13:24:25 EDT


> > Included below is documentation on an HTML parsing library I've been
> > working on, and the libraries themselves.  I've seen what exists out
> > there and I haven't found anything that did what I needed...
> 
> You should look at the 'htmltoc' program that is part of the
> perlWWW package <URL:http://www.oac.uci.edu/indiv/ehood/> which
> has been out for a while.  htmltoc must be able to parse HTML
> in order to modify HTML when generating a Table of Contents and
> inserting anchors into the document for linking.

Thanks for the tip.  I should have described it more clearly: the
package I included is more than a parsing library.  It also provides
a way to process and display HTML, with a uniform method for
registering routines that handle individual tags, and it hides from
each routine any issues that are unrelated to the tag it handles.

The SGMLread_sgml routine you included seems equivalent to the parse_html
routine I included; both are trivial.  The meat of my library is the
process_html routine, which, once the application has registered routines
to manipulate the tags, returns a processed stream suitable for use by
the application.  ("Suitable" because the application registered
beforehand what to do with the data, so the processed HTML is customized
for that application.)
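
To make the split between parsing and processing concrete, here is a
minimal sketch of the register-then-process idea.  It is written in
Python rather than Perl purely for illustration, and every name in it
(split_html, Processor, register, process) is hypothetical; none of it
is taken from the library discussed here.  It assumes a handler
receives the already-processed contents of its element and returns
replacement text, and that unregistered tags are simply dropped.

```python
import re

# Hypothetical sketch: names and behavior are assumptions for illustration,
# not the API of the library discussed in this message.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def split_html(text):
    """The 'trivial' step: split text into tag and non-tag tokens."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        # (kind, lowercased tag name, is_closing_tag)
        tokens.append(("tag", m.group(2).lower(), m.group(1) == "/"))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


class Processor:
    """Dispatches element contents to routines registered per tag."""

    def __init__(self):
        self.handlers = {}

    def register(self, tag, handler):
        # handler: function taking the element's inner text, returning text
        self.handlers[tag.lower()] = handler

    def process(self, text):
        # Each stack entry is (tag name, output pieces collected so far);
        # the bottom entry collects the final document output.
        stack = [(None, [])]
        for tok in split_html(text):
            if tok[0] == "text":
                stack[-1][1].append(tok[1])
                continue
            _, name, is_close = tok
            if name not in self.handlers:
                continue  # unregistered tags are dropped from the output
            if not is_close:
                stack.append((name, []))
            elif stack[-1][0] == name:
                _, parts = stack.pop()
                stack[-1][1].append(self.handlers[name]("".join(parts)))
            # a close tag that does not match the open element is ignored
        # Elements left unclosed pass their contents through unprocessed.
        while len(stack) > 1:
            _, parts = stack.pop()
            stack[-1][1].extend(parts)
        return "".join(stack[0][1])


p = Processor()
p.register("b", lambda s: s.upper())
p.register("i", lambda s: "_" + s + "_")
result = p.process("<p>Hello <b>big <i>wide</i></b> world</p>")
```

The handlers never see nesting, the surrounding document, or tags they
did not ask about; that bookkeeping stays in process(), which is the
sense in which such a layer does more than split HTML into tags and
not-tags.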

I'm still interested in a complete package that handles more than just
splitting HTML into tags/not-tags...  (The only thing comparable that
I've seen is the CERN libwww, which is written in C...)

-Brooks
bcutter@paradyne.com