Re: [Q] Ok... further questions when using Link Extractor
Martijn Koster (mak@webcrawler.com)
Fri, 5 Dec 1997 09:25:34 +0000
On Thu, Dec 04, 1997 at 01:37:26AM -0600, Mike Grommet wrote:
> when link extractor tries to absolutize the silly thing I get
> http://www.insolwwb.net/../d0001/g0000100.htm
> instead of
> http://www.insolwwb.net/~rholler/d0001/g0000100.htm which is
> what I want...
>
> any suggestions for fixing the problem???
Ehr. Quiet, I need to concentrate... yes, it's coming through now. I
can feel the code,... yes, I can sense the results. Aha! -- the
problem is.. [poof]. Damn, no it's gone, perhaps a solar flare. Oh
well...
Perhaps I've missed a previous post, or am simply dense, but can you
post your code, with example input, example incorrect output, a
description of your expected output, and the version numbers of perl
and LWP?
I find the attached code, a minimally changed copy from `perldoc
HTML::LinkExtor`, gives absolute URLs, as does `lwp-request -o links`,
using perl5.004_04 and LWP 5.17.
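For what it's worth, the symptom looks like what you get when a relative link is resolved against the wrong base. A minimal sketch, assuming the link in the page is `../d0001/g0000100.htm` (inferred from the output above, not taken from the actual HTML) and using the URI module's `new_abs`. The second base is purely a guess at what might be going wrong:

```perl
use strict;
use warnings;
use URI;

# Assumed relative link; the real page source was not posted.
my $link = "../d0001/g0000100.htm";

# Base = the URL of the document itself.
my $good_base = "http://www.insolwwb.net/~rholler/d0001/g0000087.htm";

# Hypothetical wrong base: just the server root.
my $bad_base = "http://www.insolwwb.net/";

my $good = URI->new_abs($link, $good_base);
my $bad  = URI->new_abs($link, $bad_base);

# Resolving against the document URL walks up from /~rholler/d0001/.
print "document base: $good\n";

# By default URI keeps excess ".." segments that cannot be resolved,
# so this result should still contain "/../".
print "root base:     $bad\n";
```

If the second line resembles the broken URL, the thing to check is which base is being passed to the absolutizing step.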
Obl-LWP-Dev: Gisle, the following philosophically unclean change would
allow me to just cut and paste from `perldoc`.
--- LinkExtor.pm Tue Dec 2 12:50:03 1997
+++ /tmp/LinkExtor.pm Fri Dec 5 09:19:45 1997
@@ -154,12 +154,13 @@
     push(@imgs, values %attr);
  }

- # Make the parser. Unfortunately, we don't know the base yet (it might
- # be diffent from $url)
+ # Make the parser. Unfortunately, we don't know the base yet
+ # (it might be diffent from $url)
  $p = HTML::LinkExtor->new(\&callback);

  # Request document and parse it as it arrives
- $res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])});
+ $res = $ua->request(HTTP::Request->new(GET => $url),
+                     sub {$p->parse($_[0])});

  # Expand all image URLs to absolute ones
  my $base = $res->base;
Cheers,
-- Martijn Koster, m.koster@pobox.com
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

$url = "http://www.insolwwb.net/~rholler/d0001/g0000087.htm#I1327"; # for instance
$ua = LWP::UserAgent->new;

# Set up a callback that collects <a> links
# (the @imgs name is kept from the perldoc example)
my @imgs = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag ne 'a'; # we only look closer at <a ...>
    push(@imgs, values %attr);
}

# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});

# Expand all collected URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;

# Print them out
print join("\n", @imgs), "\n";