Re: [Q] Ok... further questions when using Link Extractor

Martijn Koster (mak@webcrawler.com)
Fri, 5 Dec 1997 09:25:34 +0000


On Thu, Dec 04, 1997 at 01:37:26AM -0600, Mike Grommet wrote:

> when link extractor tries to absolutize the silly thing I get
> http://www.insolwwb.net/../d0001/g0000100.htm
> instead of 
> http://www.insolwwb.net/~rholler/d0001/g0000100.htm which is
> what I want...
> 
> any suggestions for fixing the problem???

Ehr. Quiet, I need to concentrate... yes, it's coming through now. I
can feel the code,... yes, I can sense the results. Aha! -- the
problem is.. [poof]. Damn, no it's gone, perhaps a solar flare. Oh
well...

Perhaps I've missed a previous post, or am simply dense, but can you
post your code, with example input, example incorrect output, a
description of your expected output, and the version numbers of perl
and LWP?

I find the attached code, a minimally changed copy from `perldoc
HTML::LinkExtor`, gives absolute URLs, as does `lwp-request -o links`,
using perl5.004_04 and LWP 5.17.
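
For what it's worth, here is a minimal sketch of how the base affects
the result. The relative link "../d0001/g0000100.htm" and the helper
name absolutize() are my own assumptions for illustration, since I
haven't seen the real input; url() and abs are the same URI::URL calls
the attached code uses.

```perl
use strict;
use warnings;
use URI::URL;   # exports url() by default

# Hypothetical helper: resolve $link against $base and return a string.
sub absolutize {
    my ($link, $base) = @_;
    return url($link, $base)->abs->as_string;
}

# Assumed relative link; the real one in the page may differ.
my $link = "../d0001/g0000100.htm";

# Base is the full URL of the page the link was found on.
my $page = "http://www.insolwwb.net/~rholler/d0001/g0000087.htm#I1327";
print absolutize($link, $page), "\n";

# Same link, but with only the site root as the base. There is no
# parent directory for the ".." to climb into here, so compare this
# line of output with the one above.
print absolutize($link, "http://www.insolwwb.net/"), "\n";
```

With the full page URL as base the first line should be the ~rholler
URL. If your output looks more like the second line, checking which
base is actually being passed in is probably the first thing to do.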

Obl-LWP-Dev: Gisle, the following philosophically unclean change would
allow me to just cut and paste from `perldoc`.

--- LinkExtor.pm        Tue Dec  2 12:50:03 1997
+++ /tmp/LinkExtor.pm   Fri Dec  5 09:19:45 1997
@@ -154,12 +154,13 @@
      push(@imgs, values %attr);
   }

-  # Make the parser.  Unfortunately, we don't know the base yet (it might
-  # be diffent from $url)
+  # Make the parser.  Unfortunately, we don't know the base yet
+  # (it might be diffent from $url)
   $p = HTML::LinkExtor->new(\&callback);

   # Request document and parse it as it arrives
-  $res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])});
+  $res = $ua->request(HTTP::Request->new(GET => $url),
+                      sub {$p->parse($_[0])});

   # Expand all image URLs to absolute ones
   my $base = $res->base;

Cheers,

-- Martijn Koster, m.koster@pobox.com

            use LWP::UserAgent;
            use HTML::LinkExtor;
            use URI::URL;

            $url = "http://www.insolwwb.net/~rholler/d0001/g0000087.htm#I1327";  # for instance
            $ua = new LWP::UserAgent;

            # Set up a callback that collects links from <a ...> tags
            # (the array keeps its @imgs name from the perldoc example)
            my @imgs = ();
            sub callback {
               my($tag, %attr) = @_;
               return if $tag ne 'a';  # we only look closer at <a ...>
               push(@imgs, values %attr);
            }

            # Make the parser.  Unfortunately, we don't know the base yet
            # (it might be different from $url)
            $p = HTML::LinkExtor->new(\&callback);

            # Request document and parse it as it arrives
            $res = $ua->request(HTTP::Request->new(GET => $url),
                sub {$p->parse($_[0])});

            # Expand all collected URLs to absolute ones
            my $base = $res->base;
            @imgs = map { $_ = url($_, $base)->abs; } @imgs;

            # Print them out
            print join("\n", @imgs), "\n";