Re: Trouble understanding how HTML::TokeParser works
Gary Nielson (gnielson@charlotte.infi.net)
Tue, 13 Feb 2001 13:57:40 -0500 (EST)
That works great. You ought to post this on a Web site somewhere. I could
not find a good example of HTML::TokeParser in any Google search. Why did
you use dumpvar.pl? Thanks.
Gary
On Tue, 13 Feb 2001, Tim Allwine wrote:
> Gary Nielson wrote:
> >
> > I can get by programming in Perl, but my head hurts trying to
> > understand how object-oriented modules such as TokeParser work.
> > Basically, I want to parse an html file where each entry looks like
> > this:
> >
> > <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> > <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
> > genome studies find</A>
> > </B></FONT></DT>
> > <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
> > — The first in-depth look into the
> > human genome shows it is much more complicated.. <P>
> > </FONT></DD>
> >
>
> Here is one way to do it. Assume you have the following file
> called 'sample.html'.
>
> <html>
> <head><title>Tutorial</title></head>
> <body>
>
> <dl>
> <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
> genome studies find</A>
> </B></FONT></DT>
> <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
> — The first in-depth look into the
> human genome shows it is much more complicated.. <P>
> </FONT></DD>
>
> <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> <A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/lib/HTML/TokeParse
> The HTML::TokeParser is an
> alternative interface to the HTML::Parser class.
> </A>
> </B></FONT></DT>
> <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Sebastopol -
> It basically turns the HTML::Parser inside out.
> You associate a file (or any IO::Handle object or
> string) with the parser at construction
> time and then repeatedly call $parser->get_token
> to obtain the tags and text found in the parsed
> document.
> <P>
> </FONT></DD>
>
> <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> <A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/Parser.pm">
> This is the new XS based HTML::Parser</A>
> </B></FONT></DT>
> <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Boston -
> Objects of the HTML::Parser class will recognize
> markup and separate it from plain text (alias data
> content) in HTML documents. As different kinds of
> markup and text are recognized, the corresponding
> event handlers are invoked.
> <p>
> </FONT></DD>
> <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> <A HREF="http://search.cpan.org/doc/GAAS/libwww-perl-5.10/lib/HTML/TreeBuild
> This is a parser that builds (and actually itself is) a HTML syntax tree.
> </A>
> </B></FONT></DT>
> <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>PITTSBURG -
> Objects of this class inherit the methods of both
> HTML::Parser and HTML::Element. After parsing has
> taken place it can be regarded as the syntax tree
> itself.
> <P>
> </FONT></DD>
>
> </dl>
> </body>
> </html>
>
> Run the following code.
>
> use strict;
> use HTML::TokeParser;
> require 'dumpvar.pl';    # provides dumpValue(), used for debugging below
>
> my $p = HTML::TokeParser->new("sample.html")
>     or die "Can't open sample.html: $!";
> my $rss;                 # becomes an array ref of records
>
> while (my $token = $p->get_token) {
>     # Skip ahead to the next <dt> start tag.
>     next unless $token->[0] eq 'S' and
>                 $token->[1] eq 'dt';
>     my $rec = {};
>     while (my $token = $p->get_token) {
>         # The closing </dd> ends the record.
>         last if $token->[0] eq 'E' and
>                 $token->[1] eq 'dd';
>         if ($token->[0] eq 'S' and
>             $token->[1] eq 'a') {
>             $rec->{url}      = $token->[2]{href};
>             $rec->{headline} = $p->get_trimmed_text('/a');
>         } elsif ($token->[0] eq 'S' and
>                  $token->[1] eq 'dd') {
>             $rec->{summary} = $p->get_trimmed_text('/dd');
>         }
>     }
>     push(@$rss, $rec);
> }
> #dumpValue(\$rss);
>
> for my $rec (@$rss) {
>     print join('||', $rec->{url}, $rec->{headline}, $rec->{summary}), "\n\n";
> }
>
> __END__
> HTML::TokeParser parses an HTML document and hands it to you as a
> stream of tokens. You pull tokens off that stream with the methods
> of the class, such as get_token. The tokens themselves are
> represented by references to arrays.
>
> The above program parses the document and begins the winnowing
> process. The outer while loop skips every token that is not a
> start ('S') tag named 'dt'. Once a <dt> tag is found, we create a
> hash ref that will hold the data for that record. The inner loop
> jumps out when it sees the closing </dd>. If we see the starting
> <a> tag, we grab the url: the third element of the token is a hash
> ref of attributes, and we want the value whose key is 'href'. Then
> we grab all the text up to the closing </a> tag. If we see the
> <dd> tag, we grab all the text up to the closing </dd> tag. When
> we jump out, we push $rec onto the array and go back for more.
>
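
For anyone reading this in the archive, here is a minimal sketch of the
token layout described above. The one-entry HTML string and the @seen
array are made up for illustration and are not part of Tim's program.
The token shapes in the comments follow the HTML::TokeParser
documentation. It parses from a string reference rather than a file so
that it runs on its own.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical one-entry document (illustration only), parsed from a
# string reference instead of a file.
my $html = '<dl><DT><A HREF="/x.htm">Headline</A></DT>'
         . '<DD>Summary text<P></DD></dl>';
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @seen;    # one short description per token, in document order
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        # Start tag: ['S', $tag, \%attr, \@attrseq, $origtext]
        # Tag and attribute names arrive lowercased.
        my ($tag, $attr) = @{$token}[1, 2];
        my $href = exists $attr->{href} ? " href=$attr->{href}" : '';
        push @seen, "S $tag$href";
    } elsif ($type eq 'E') {
        # End tag: ['E', $tag, $origtext]
        push @seen, "E $token->[1]";
    } elsif ($type eq 'T') {
        # Text: ['T', $text, $is_data]
        push @seen, "T $token->[1]";
    }
}
print "$_\n" for @seen;
```

Printing every token like this is a quick way to see why the program
tests $token->[0] and $token->[1] before reaching into $token->[2].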
--
Gary Nielson
gary@garynielson.com