Re: Can LWP look for specific HTML tags?
Martijn Koster (mak@webcrawler.com)
Fri, 7 Nov 1997 11:33:20 +0000
On Fri, Nov 07, 1997 at 01:13:22AM -0600, Nathaniel Good wrote:
> currently use LWP to connect to the server and download the page to
> a file, where I then grep out the info I need. I do this by starting
> from the '<OL>' tag, going to each '<LI>' tag, get the info in
> the '<A>' tag and start the next item after the '</LI>' tag. I do
> this process until I see the '</OL>' tag.
Of course you can do this in Perl, in any number of ways. The simplest
would be to do the same "grep" you now do outside Perl inside Perl,
using m///.
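To make the m/// route concrete, here is a minimal sketch. It assumes the
page is already in a string, that the list is not nested, and that every
href is quoted; list_links is a made-up helper name, not part of LWP.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href values of all <a> tags that
# appear between <ol> and </ol>. Assumes quoted hrefs, no nested lists.
sub list_links {
    my ($html) = @_;
    my @links;
    # Each <ol>...</ol> block: non-greedy, case-insensitive, "." matches newlines.
    while ($html =~ m{<ol\b[^>]*>(.*?)</ol>}gis) {
        my $list = $1;
        # Every quoted href attribute on an <a> tag inside that block.
        while ($list =~ m{<a\b[^>]*\bhref\s*=\s*(["'])(.*?)\1}gis) {
            push @links, $2;
        }
    }
    return @links;
}

my $page = <<'EOM';
<a href="no.html">No</a>
<ol>
<li><a href="foo.html">Foo</a>
<li><a href="bar.html">Bar</a>
</ol>
<a href="no.html">No</a>
EOM

print "$_\n" for list_links($page);   # prints foo.html, then bar.html
```

This is quick to write but brittle: unquoted attributes, comments, or
nested lists will confuse it, which is the main reason to reach for a
real parser instead.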
But there are fancier ways, and you sound interested, so I've appended
an example that does it with HTML::Parser from LWP. It's not perfect
(it doesn't absolutify URLs or deal with nested lists), but you should
get the idea.
> 2) How can I get more detailed information on LWP functions and
> methods? If someone could point me to a URL or mention a book that
> has this information
No good book that I know of... The best place to look is the source
code -- just read the POD sections and you can learn a lot. The
appended example is based on HTML::LinkExtor.pm.
Have fun,
-- Martijn Koster, m.koster@pobox.com
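package ListParser;

# Print the href of every <a> tag found inside an <ol> list.

require HTML::Parser;
@ISA = qw(HTML::Parser);

sub new
{
    my($class, $base) = @_;   # $base is not used yet (no absolutifying)
    my $self = $class->SUPER::new;
    $self->{inlist} = 0;      # true while we are between <ol> and </ol>
    $self;
}

# Called by HTML::Parser for every start tag; $tag is lowercased.
sub start
{
    my($self, $tag, $attr) = @_;  # $attr is reference to a HASH
    if ($tag eq 'ol') {
        $self->{inlist} = 1;
    }
    elsif ($tag eq 'a') {
        # Skip anchors without an href (e.g. <a name="...">)
        print $attr->{'href'}, "\n"
            if $self->{'inlist'} && defined $attr->{'href'};
    }
}

# Called by HTML::Parser for every end tag.
sub end
{
    my($self, $tag) = @_;
    if ($tag eq 'ol') {
        $self->{inlist} = 0;
    }
}

package main;

my $p = ListParser->new;
$p->parse(<<EOM);
<title>hi</title>
<a href="no.html">No</a>
<ol>
<li><a href="foo.html">Foo</a>
<li><a href="bar.html">Bar</a>
</ol>
<a href="no.html">No</a>
EOM
$p->eof;   # signal end of document so the parser flushes its buffer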
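
The two shortcomings mentioned above (no absolute URLs, no nested lists)
could be handled roughly as follows. This is only a sketch: the package
name NestedListParser, the links method and the example.com base URL are
made up for illustration, and it assumes the URI module is installed
alongside HTML::Parser. A depth counter replaces the single flag, and
URI->new_abs resolves each href against the base.

```perl
package NestedListParser;   # hypothetical name, for illustration only
use strict;
use warnings;
use HTML::Parser ();
use URI ();
our @ISA = qw(HTML::Parser);

sub new {
    my ($class, $base) = @_;
    my $self = $class->SUPER::new;
    $self->{base}  = $base;   # base URL used to absolutify hrefs
    $self->{depth} = 0;       # how many <ol> elements we are inside
    $self->{links} = [];      # collected absolute URLs
    return $self;
}

sub start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'ol') {
        $self->{depth}++;
    }
    elsif ($tag eq 'a' && $self->{depth} > 0 && defined $attr->{href}) {
        # Resolve the (possibly relative) href against the base URL.
        push @{ $self->{links} },
            URI->new_abs($attr->{href}, $self->{base})->as_string;
    }
}

sub end {
    my ($self, $tag) = @_;
    # Leaving a list: never let the counter drop below zero.
    $self->{depth}-- if $tag eq 'ol' && $self->{depth} > 0;
}

# Accessor for the collected links.
sub links { @{ $_[0]{links} } }

package main;

my $np = NestedListParser->new('http://www.example.com/dir/');
$np->parse(<<'EOM');
<a href="no.html">No</a>
<ol>
<li><a href="foo.html">Foo</a>
  <ol><li><a href="sub.html">Sub</a></ol>
<li><a href="bar.html">Bar</a>
</ol>
<a href="no.html">No</a>
EOM
$np->eof;

print "$_\n" for $np->links;
```

With a plain on/off flag, the inner </ol> would switch collection off and
bar.html would be missed; counting depth keeps the outer list open until
its own </ol> arrives. Collecting into an array rather than printing also
makes the parser easier to reuse from other code.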