Re: [Q] Code problem using Link Extractor

Gisle Aas (gisle@aas.no)
04 Dec 1997 15:54:27 +0100


Mike Grommet <mgrommet@insolwwb.net> writes:

> I am attempting to use Link Extractor to pull links from a url,
> then absoluteize them to convert from relative links, and then keep
> the links which are at the same base as the url I am checking...
> 
> Suppose I am checking http://www.insolwwb.net
> so basically I want all pages that look like http://www.insolwwb.net/*
> but not http://www.microsoft.com
> 
> Here is my call back function for link extractor so far:
> 
> ---------- SNIP ---------------
> sub picklinks
> {
>   my($tag, %attr) = @_;
>   return if $tag ne 'a';            #return if not anchor tag
>   @value = values(%attr);    #get the url info
>   $val = $value[0];                 #@value is 1 element array... get the value
>   $val = url($val,$base)->abs; #absolutize the url
>   push(@links,$val);                 #push it on the @links array
> }
> 
> --------- SNIP -------------
> 
> This code seems to work fine for absolutizing the links before
> it pushes them on the links array...
> 
> $base is a global value of the base of the url we are checking.
> 
> 
> Now how do I make it so I chunk out the urls that are not under the same
> base?
> 
> BTW is there a better way to do this?  I am very open to criticism
> on this code (like I said, I am very very new to libwww)


I would do something like this:

 sub picklinks
 {
   my($tag, %attr) = @_;
   return if $tag ne 'a';            #return if not anchor tag
   for my $url (values(%attr)) {
       $url = url($url,$base)->abs;       #absolutize against $base
       next unless $url =~ /^\Q$base/o;   #skip urls not under $base
       push(@links, $url);
   }
 }
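
One caveat with a plain prefix match: if $base has no trailing slash,
a host such as www.insolwwb.net.example.com would also pass the test.
A minimal sketch of a stricter check follows, using only core Perl and
plain strings. The helper name under_base and the sample URLs are
hypothetical and not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $url is the base itself or lies below it.
# The base is normalised to end in exactly one slash before comparing,
# so "http://www.insolwwb.net" cannot match a longer host name.
sub under_base {
    my ($url, $base) = @_;
    (my $prefix = $base) =~ s{/*\z}{/};       # force a single trailing slash
    return 1 if "$url/" eq $prefix;           # base written without the slash
    return index($url, $prefix) == 0 ? 1 : 0; # literal prefix test, no regexp
}

my $base  = 'http://www.insolwwb.net';
my @found = (
    'http://www.insolwwb.net/index.html',
    'http://www.insolwwb.net',
    'http://www.insolwwb.net.example.com/',   # made-up look-alike host
    'http://www.microsoft.com/',
);

# Keep only the URLs under $base.
my @links = grep { under_base($_, $base) } @found;
print "$_\n" for @links;
```

Because index() compares literal strings, no \Q quoting is needed here.
Also note that the /o flag compiles the pattern only once, so the
regexp version assumes $base never changes during the run.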

-- 
Gisle Aas