HTML::Parser (HTML::LinkExtor) Broken?
Matthew Keller (keller57@potsdam.edu)
Sat, 06 Mar 1999 11:32:35 -0500
HTML::Parser (or maybe HTML::LinkExtor) is *NOT* returning all of the
tags it finds inside of <MAP> tags. The code below is taken from the
HTML::LinkExtor POD, on how to extract links. When run, it connects to a
page I wrote that contains a client-side image map.
The script returns all of the AREA elements marked with 'SHAPE="circle"',
but none of the 'rect' areas. The page has a total of *27* link elements,
but only *7* are returned, because the other 20 are AREA elements with
'SHAPE="rect"'.
I can reproduce these results on ANY client-side image map (I've tested
on over 30 different sites). Of the two AREA elements below, only
the first one is treated as a link element.
*ANY* assistance would be most appreciated.
-- Begin HTML Snippet --
<AREA SHAPE="circle" COORDS="582,149,51"
HREF="http://mattwork.potsdam.edu/friends.htm" ALT="My Friends">
<AREA SHAPE="rect" COORDS="3,401,198,440",
HREF="http://mattwork.potsdam.edu/Me/" ALT="Me Stuff">
-- End HTML Snippet --
-- Begin Perl Code --
#!/usr/local/bin/perl -w
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
$url = "http://mattwork.potsdam.edu/zippy.htm"; # for instance
$ua = new LWP::UserAgent;
# Set up a callback that collects link URLs (the img-only filter is disabled)
my @imgs = ();
sub callback {
my($tag, %attr) = @_;
#return if $tag ne 'img'; # we only look closer at <img ...>
push(@imgs, values %attr);
}
# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);
# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
sub {$p->parse($_[0])});
# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { url($_, $base)->abs } @imgs;
# Print them out
print join("\n", @imgs), "\n";
-- End Perl Code --
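A smaller test case that takes the network out of the picture might look
like the sketch below. It feeds the two AREA elements from the snippet
above straight into HTML::LinkExtor and prints the tag name along with
each link attribute, so it is easy to see which elements the parser
reports. The extract_links helper and the surrounding <MAP NAME="test">
wrapper are my own assumptions for illustration; they are not part of
HTML::LinkExtor or of the original page. The AREA lines are copied
verbatim, including the comma after the COORDS value on the 'rect' one.

```perl
#!/usr/local/bin/perl -w
use strict;
use HTML::LinkExtor;

# Hypothetical helper (not part of HTML::LinkExtor): parse an HTML string
# and return one [tag, {attr => url}] pair per link element reported.
sub extract_links {
    my ($html) = @_;
    my @found;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @found, [$tag, +{ %attr }];
    });
    $p->parse($html);
    $p->eof;    # flush anything the parser is still buffering
    return @found;
}

# The two AREA elements from the snippet, unchanged. The MAP wrapper and
# its NAME are assumptions added only to make a complete fragment.
my $html = <<'HTML';
<MAP NAME="test">
<AREA SHAPE="circle" COORDS="582,149,51"
HREF="http://mattwork.potsdam.edu/friends.htm" ALT="My Friends">
<AREA SHAPE="rect" COORDS="3,401,198,440",
HREF="http://mattwork.potsdam.edu/Me/" ALT="Me Stuff">
</MAP>
HTML

# One output line per link element the parser reports
for my $link (extract_links($html)) {
    my ($tag, $attr) = @$link;
    print "$tag: ",
          join(", ", map { "$_=$attr->{$_}" } sort keys %$attr), "\n";
}
```

If only one "area:" line comes out, the problem is reproducible without
LWP, and editing the second AREA element one attribute at a time should
show what the parser is tripping over.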
--
-> Matthew Keller <-
Distributed Computing
Windows/UNIX Support
and Host Services
Kellas Hall
State University of New York at Potsdam
http://mattwork.potsdam.edu/