Re: Trouble understanding how HTML::TokeParser works

Tim Allwine (tallwine@oreilly.com)
Tue, 13 Feb 2001 09:33:12 -0800


Gary Nielson wrote:
> 
> I can get by programming in Perl, but my head hurts trying to
> understand how object-oriented modules such as TokeParser work.
> Basically, I want to parse an html file where each entry looks like
> this:
> 
> <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
> genome studies find</A>
> </B></FONT></DT>
> <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
> &#0151; The first in-depth look into the
> human genome shows it is much more complicated.. <P>
> </FONT></DD>
> 

Here is one way to do it. Assume you have the following file
called 'sample.html'.

Tutorial
Junk DNA may not be such junk, genome studies find
WASHINGTON - — The first in-depth look into the human genome shows it is much more complicated..

Sebastopol - It basically turns the HTML::Parser inside out. You associate a file (or any IO::Handle object or string) with the parser at construction time and then repeatedly call $parser->get_token to obtain the tags and text found in the parsed document.

This is the new XS based HTML::Parser
Boston - Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked.

PITTSBURG - Objects of this class inherit the methods of both HTML::Parser and HTML::Element. After parsing has taken place it can be regarded as the syntax tree itself.


Run the following code.

use strict;
use HTML::TokeParser;
require 'dumpvar.pl';

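# new() returns undef if the file cannot be opened; $! says why.
my $p = HTML::TokeParser->new("sample.html")
    || die "Can't open sample.html: $!";
my $rss;

# Outer loop: skip ahead to the next <dt> start tag.
while(my $token = $p->get_token) {
    next unless $token->[0] eq 'S' and
        $token->[1] eq 'dt';
    my $rec = {};
    # Inner loop: collect one record, stopping at the closing </dd>.
    while(my $token = $p->get_token) {
        last if $token->[0] eq 'E' and
            $token->[1] eq 'dd';
        if($token->[0] eq 'S' and
                $token->[1] eq 'a') {
            # The attributes are in the hash ref at index 2.
            $rec->{url} = $token->[2]{href};
            $rec->{headline} = $p->get_trimmed_text('/a');
        } elsif($token->[0] eq 'S' and
                $token->[1] eq 'dd') {
            $rec->{summary} = $p->get_trimmed_text('/dd');
        }
    }
    push(@$rss,$rec);
}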
#dumpValue(\$rss);

for my $rec (@$rss) {
    print join('||',$rec->{url},$rec->{headline},$rec->{summary}),"\n\n";
}

__END__
The TokeParser parses an HTML document and hands you its tokens
one at a time. You get at the tokens through the methods of the
class, such as get_token. The tokens themselves are represented
by references to arrays.
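
To make the "references to arrays" part concrete, here is a small sketch that builds a few tokens by hand in the shapes the HTML::TokeParser documentation gives for start tags, text, and end tags. Nothing is parsed here, and describe_token is a hypothetical helper of mine, not part of the module.

```perl
use strict;
use warnings;

# Hand-built tokens, shaped the way HTML::TokeParser documents them:
#   start tag: ['S', $tag, \%attr, \@attrseq, $original_text]
#   text:      ['T', $text, $is_data]
#   end tag:   ['E', $tag, $original_text]
my @tokens = (
    [ 'S', 'a', { href => '/rc/news/docs/07073706.htm' }, ['href'],
      '<A HREF="/rc/news/docs/07073706.htm">' ],
    [ 'T', 'Junk DNA may not be such junk', 0 ],
    [ 'E', 'a', '</A>' ],
);

# describe_token is a hypothetical helper for illustration only.
# Element 0 is always the token type; the rest depends on the type.
sub describe_token {
    my ($token) = @_;
    my $type = $token->[0];
    if ($type eq 'S') {
        my $href = $token->[2]{href};
        $href = '(none)' unless defined $href;
        return "start <$token->[1]> href=$href";
    }
    return "end </$token->[1]>"    if $type eq 'E';
    return "text \"$token->[1]\""  if $type eq 'T';
    return "other ($type)";
}

print describe_token($_), "\n" for @tokens;
```

Note that the tag and attribute names come back in lower case, which is why the program above compares against 'dt' and 'href' even though the document uses <DT> and HREF.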

The above program parses the document and begins the winnowing
process. The outer while loop skips every token until it finds an
'S' (start) token whose tag name is 'dt'. Once a <dt> tag is
found we create a hash ref that will hold the data for that
record. In the inner loop we jump out if we see the closing
</dd>. If we see the starting <a> tag, we grab the url: the third
element of the token is a hash ref of attributes, and we want the
value whose key is 'href'. Then we grab all the text up to the
closing </a> tag. If we see the <dd> tag, we grab all the text up
to the closing </dd> tag. get_trimmed_text stops in front of that
</dd> without consuming it, so it is the next token the inner
loop sees. When we jump out, we push $rec onto the array and go
back for more.
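
The same walk can also be written with get_tag, which skips straight to the next tag of a given name. This is only a sketch under a few assumptions: the HTML is a cut-down copy of the entry from the question (FONT tags and the entity left out), it is parsed from a string via a scalar reference so no file is needed, and @records is my own name. A token returned by get_tag has no leading 'S', so the attribute hash ref is at index 1 rather than 2.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Simplified copy of one entry from the question, kept in a string.
my $html = <<'HTML';
<DL>
<DT><B><A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
genome studies find</A></B></DT>
<DD>WASHINGTON - The first in-depth look into the
human genome shows it is much more complicated.. <P>
</DD>
</DL>
HTML

# Passing a scalar reference parses the string instead of a file.
my $p = HTML::TokeParser->new(\$html)
    || die "Can't create parser";

my @records;
while ($p->get_tag('dt')) {
    # get_tag returns [$tag, \%attr, \@attrseq, $text] for a start tag.
    my $link = $p->get_tag('a') or last;
    my %rec = (url => $link->[1]{href});
    $rec{headline} = $p->get_trimmed_text('/a');

    $p->get_tag('dd') or last;
    $rec{summary} = $p->get_trimmed_text('/dd');
    push @records, \%rec;
}

for my $rec (@records) {
    print join('||', $rec->{url}, $rec->{headline}, $rec->{summary}), "\n\n";
}
```

This is shorter, but less forgiving than the token loop: if a <dt> has no link inside it, get_tag('a') will run on into the next entry. For pages as regular as the one in the question that is probably acceptable.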