Re: Can LWP look for specific HTML tags?

Martijn Koster (mak@webcrawler.com)
Fri, 7 Nov 1997 11:33:20 +0000


On Fri, Nov 07, 1997 at 01:13:22AM -0600, Nathaniel Good wrote:

> currently use LWP to connect to the server and download the page to
> a file,where I then grep out the info I need. I do this by starting
> from the '<OL>' tag, going to each '<LI>' tag,get the info in
> the'<A>' tag and start the next item after the '</LI>' tag. I do
> this process until I see the '</OL>' tag.

Of course you can do this in Perl, in any number of ways. The simplest
would be to do the same "grep" you now do outside Perl inside Perl,
using m//.
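
A minimal sketch of that m// approach, under some loud assumptions: the
page has a single, non-nested <OL>, and every href value is
double-quoted. The name list_links and the sample markup are made up
for illustration; real-world HTML will break a regex like this sooner
or later.

```perl
use strict;
use warnings;

# Return the href of every <a> inside the first <ol>...</ol> block.
# Assumes double-quoted href values and a single, non-nested list.
sub list_links {
    my ($html) = @_;

    # /s lets "." match newlines, /i ignores the case of the tags.
    my ($list) = $html =~ m{<ol\b[^>]*>(.*?)</ol>}is
        or return;

    # With /g in list context, the match returns every captured href.
    my @links = $list =~ m{<a\b[^>]*\bhref\s*=\s*"([^"]*)"}gi;
    return @links;
}

my $html = <<'EOM';
<a href="no.html">No</a>
<ol>
<li><a href="foo.html">Foo</a>
<li><a href="bar.html">Bar</a>
</ol>
<a href="no.html">No</a>
EOM

print "$_\n" for list_links($html);
```

Only the links between <ol> and </ol> are printed; the two "no.html"
links outside the list are skipped.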

But there are fancier ways, and you sound interested, so appended
you'll find an example that uses LWP's HTML::Parser. It's not perfect
(it doesn't absolutify URLs or deal with nested lists), but you should
get the idea.

> 2) How can I get more detailed information on LWP functions and
> methods?  If someone could point me to a URL or mention a book that
> has this information

No good book that I know of... The best place to look is the source
code: just read the POD sections and you can learn a lot. The
appended example is based on HTML::LinkExtor.

Have fun,

-- Martijn Koster, m.koster@pobox.com

package ListParser;

require HTML::Parser;
@ISA = qw(HTML::Parser);

sub new
{
    my($class) = @_;
    my $self = $class->SUPER::new;
    $self->{inlist} = 0;          # true while we are inside an <ol>
    $self;
}

sub start
{
    my($self, $tag, $attr) = @_;  # $attr is reference to a HASH
    if ($tag eq 'ol') {
        $self->{inlist} = 1;
    }
    elsif ($tag eq 'a') {
        # only print anchors that are inside the list and have an href
        print $attr->{'href'}, "\n"
            if $self->{'inlist'} && defined $attr->{'href'};
    }
}

sub end
{
    my($self, $tag) = @_;
    if ($tag eq 'ol') {
        $self->{inlist} = 0;
    }
}

package main;

$p = ListParser->new;
$p->parse(<<EOM);
<title>hi</title>
<a href="no.html">No</a>
<ol>
<li><a href="foo.html">Foo</a>
<li><a href="bar.html">Bar</a>
</ol>
<a href="no.html">No</a>
EOM
$p->eof;  # signal end of document so the parser flushes anything buffered
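
The two limitations mentioned in the message (no absolute URLs, no
nested lists) could be handled roughly as follows. This is only a
sketch, not part of the original example: the class name
NestedListParser, the links method and the www.example.com base URL are
invented for illustration. It replaces the on/off flag with a depth
counter, so an inner </ol> no longer ends the outer list, and it
resolves each href against a base with URI->new_abs from the URI
module.

```perl
package NestedListParser;

use strict;
use warnings;
use HTML::Parser ();
use URI ();
our @ISA = ('HTML::Parser');

sub new {
    my ($class, $base) = @_;
    # No arguments to SUPER::new, so start() and end() are still
    # called as methods, as in the example above.
    my $self = $class->SUPER::new;
    $self->{base}  = $base;   # base URL for relative links (may be undef)
    $self->{depth} = 0;       # number of <ol> elements currently open
    $self->{links} = [];      # collected instead of printed
    return $self;
}

sub start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'ol') {
        $self->{depth}++;
    }
    elsif ($tag eq 'a' && $self->{depth} > 0 && defined $attr->{href}) {
        my $href = $attr->{href};
        $href = URI->new_abs($href, $self->{base})->as_string
            if defined $self->{base};
        push @{ $self->{links} }, $href;
    }
}

sub end {
    my ($self, $tag) = @_;
    # Only leave "list mode" once the outermost <ol> has been closed.
    $self->{depth}-- if $tag eq 'ol' && $self->{depth} > 0;
}

sub links { @{ $_[0]{links} } }

package main;

my $np = NestedListParser->new('http://www.example.com/dir/');
$np->parse(<<'EOM');
<a href="no.html">No</a>
<ol>
<li><a href="foo.html">Foo</a>
  <ol><li><a href="sub/baz.html">Baz</a></ol>
<li><a href="bar.html">Bar</a>
</ol>
<a href="no.html">No</a>
EOM
$np->eof;

print "$_\n" for $np->links;
```

With the simple flag, "bar.html" would be missed here, because the
inner </ol> switches the flag off before the outer list is finished.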