Re: Trouble understanding how HTML::TokeParser works

Gary Nielson (gnielson@charlotte.infi.net)
Tue, 13 Feb 2001 13:57:40 -0500 (EST)


That works great. You ought to post this on a Web site somewhere. I could
not find a good example of using HTML::TokeParser in any Google search.
Why did you use dumpvar.pl? Thanks.

Gary

 On Tue, 13 Feb 2001, Tim Allwine wrote:

> Gary Nielson wrote:
> > 
> > I can get by programming in Perl, but my head hurts trying to
> > understand how object-oriented modules such as TokeParser work.
> > Basically, I want to parse an html file where each entry looks like
> > this:
> > 
> > <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> > <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
> > genome studies find</A>
> > </B></FONT></DT>
> > <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
> > &#0151; The first in-depth look into the
> > human genome shows it is much more complicated.. <P>
> > </FONT></DD>
> > 
> 
> Here is one way to do it. Assume you have the following file
> called 'sample.html'.
> 
> <html>
> <head><title>Tutorial</title></head>
> <body>
> 
> <dl>
>     <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
>     <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
>     genome studies find</A>
>     </B></FONT></DT>
>     <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
>     &#0151; The first in-depth look into the
>     human genome shows it is much more complicated.. <P>
>     </FONT></DD>
> 
>     <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
>     <A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/lib/HTML/TokeParse
>     The HTML::TokeParser is an
>     alternative interface to the HTML::Parser class.
>     </A>
>     </B></FONT></DT>
>     <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Sebastopol -
>     It basically turns the HTML::Parser inside out.
>     You associate a file (or any IO::Handle object or
>     string) with the parser at construction
>     time and then repeatedly call $parser->get_token
>     to obtain the tags and text found in the parsed
>     document.
>     <P>
>     </FONT></DD>
> 
>     <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
>     <A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/Parser.pm">
>     This is the new XS based HTML::Parser</A>
>     </B></FONT></DT>
>     <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Boston -
>     Objects of the HTML::Parser class will recognize
>     markup and separate it from plain text (alias data
>     content) in HTML documents. As different kinds of
>     markup and text are recognized, the corresponding
>     event handlers are invoked.
>     <p>
>     </FONT></DD>
>     <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
>     <A HREF="http://search.cpan.org/doc/GAAS/libwww-perl-5.10/lib/HTML/TreeBuild
>     This is a parser that builds (and actually itself is) a HTML syntax tree.
>     </A>
>     </B></FONT></DT>
>     <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>PITTSBURG -
>     Objects of this class inherit the methods of both
>     HTML::Parser and HTML::Element. After parsing has
>     taken place it can be regarded as the syntax tree
>     itself.
>      <P>
>     </FONT></DD>
> 
> </dl>
> </body>
> </html>
> 
> Run the following code.
> 
> use strict;
> use HTML::TokeParser;
> require 'dumpvar.pl';
> 
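> # new() returns undef if the file cannot be opened, so fail loudly.
> my $p = HTML::TokeParser->new("sample.html")
>     || die "Can't open sample.html: $!";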
> my $rss;
> 
> while(my $token = $p->get_token) {
>     next unless $token->[0] eq 'S' and
>         $token->[1] eq 'dt';
>     my $rec = {};
>     while(my $token = $p->get_token) {
>         last if $token->[0] eq 'E' and
>             $token->[1] eq 'dd';
>         if($token->[0] eq 'S' and
>                 $token->[1] eq 'a') {
>             $rec->{url} = $token->[2]{href};
>             $rec->{headline} = $p->get_trimmed_text('/a');
>         } elsif($token->[0] eq 'S' and
>                 $token->[1] eq 'dd') {
>             $rec->{summary} = $p->get_trimmed_text('/dd');
>         }
>     }
>     push(@$rss,$rec);
> }
> #dumpValue(\$rss);
> 
> for my $rec (@$rss) {
>     print join('||',$rec->{url},$rec->{headline},$rec->{summary}),"\n\n";
> }
> 
> __END__
> HTML::TokeParser parses an HTML document and hands it to you as a
> stream of tokens, one per call to get_token. You step through the
> stream with the various methods in the class. Each token is a
> reference to an array whose first element is the token type.
> 
> The above program parses the document and begins the winnowing
> process. The outer while loop skips every token that is not an 'S'
> (start) tag named 'dt'. Once a <dt> tag is found, we create a
> hash ref that will hold the data for that record. The inner loop
> jumps out when it sees the closing </dd>. If we see the starting
> <a> tag, we grab the url: the third element of the token is a
> hash ref of attributes, and we want the value whose key is
> 'href'. Then we grab all the text up to the closing </a> tag. If we
> see the <dd> tag, we grab all the text up to the closing </dd>
> tag. When we jump out, we push $rec onto the array and go back
> for more.
> 
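
For anyone reading this in the archive, here is a minimal sketch of what
the tokens described above look like. It is only an illustration: the
one-entry document in $html and the @seen array are made up for the
example, and the document is parsed from a string (by passing a scalar
reference to new) rather than from sample.html.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical one-entry document, parsed from a string instead of a file.
my $html = '<dt><a href="/docs/1.htm">Headline</a></dt><dd>Summary text</dd>';
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @seen;    # one human-readable line per token
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        # Start tag: [ 'S', $tag, \%attr, \@attrseq, $text ]
        my $attr = $token->[2];
        push @seen, "S $token->[1]"
            . join('', map { " $_=$attr->{$_}" } sort keys %$attr);
    } elsif ($type eq 'E') {
        # End tag: [ 'E', $tag, $text ]
        push @seen, "E $token->[1]";
    } elsif ($type eq 'T') {
        # Text: [ 'T', $text, $is_data ]
        push @seen, "T $token->[1]";
    }
}
print "$_\n" for @seen;
```

The start-tag line for <a> shows where the script above gets its url:
the attributes arrive as a hash ref in the third slot of the token.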

-- 
Gary Nielson
gary@garynielson.com