Re: URL-ification
Earl Hood (ehood@imagine.convex.com)
Tue, 09 Jan 1996 22:00:07 -0600
> URLs in text like:
>
> Been to http://www.xor.com/? Then try http://internet-plaza.net/.
>
> to something like
>
> Been to <A HREF="http://www.xor.com/">http://www.xor.com/</A>?
> Then try <A HREF="http://internet-plaza.net/">http://internet-plaza.net/</A>
...
> s{
> \b # start at word boundary
> ( # begin $1 {
> $urls : # need resource and a colon
> [$any] +? # followed by on or more
> # of any valid character, but
> # be conservative and take only
> # what you need to....
> ) # end $1 }
> (?= # look-ahead non-comsumptive assertion
> [$punc]+ # either 0 or more puntuation
> [^$any] # followed by a non-url char
> | # or else
> $ # then end of the string
> )
> }{<A HREF="$1">$1</A>}igox;
What you had did not handle lines like the following:
http://www.xor.com/foo?bar <http://internet-plaza.net/>.
Been to http://www.xor.com/foo?bar <http://internet-plaza.net/>.
(BTW, I had to edit out all the whitespace in the regular expression
to get it to find any matches on strings under 5.001m).
Here's something that works pretty well (perl 4 or 5):
----snip---
#! /usr/local/bin/perl
$Url = '(http://|ftp://|afs://|wais://|telnet://|gopher://|' .
       'news:|nntp:|mid:|cid:|mailto:|prospero:)';
while (<>) {
    print STDOUT &text2html($_);
}
exit 0;
## text2html converts a string to HTML by converting URLs to
## anchors and converting special characters into entity references.
##
sub text2html {
    local($str) = $_[0];
    local($item, $item2, $item2h, $endch, @array);
    if (@array =
        split(m%($Url[^\s\(\)\|<>"']*[^\.;,"'\|\[\]\(\)\s<>])%o, $str)) {
        $str = '';
        while($#array > 0) {
            $item = &entify(shift @array);  # Get non-URL text
            $item2 = shift @array;          # Get URL
            $endch = chop $item2;           # Check if '?' at end of URL
            if ($endch ne '?') {            # '?' is normally valid,
                $item2 .= $endch;           # but probably not part of
                $endch = '';                # URL if last character
            }
            $item2h = &entify($item2);      # Variable for <A> content
            $str .= join('',
                         $item,
                         '<A HREF="', $item2, '">', $item2h, '</A>',
                         $endch);
            # The next line is needed since Perl's split function also
            # returns extra entries for nested ()'s in the split pattern.
            shift @array if $array[0] =~ m%^$Url$%o;
        }
        $item = &entify(shift @array);      # Last item in array
        $str .= $item;
    }
    $str;                                   # Return converted string
}
sub entify {
    local($txt) = $_[0];
    $txt =~ s/&/&amp;/g;    # Must come first so later entities are not re-escaped
    $txt =~ s/>/&gt;/g;
    $txt =~ s/</&lt;/g;
    $txt;
}
----snip---
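The extra "shift @array" in the loop follows from how split treats capturing parentheses: every group in the pattern contributes its own field to the returned list. A small illustration, assuming Perl 5 and a made-up two-scheme pattern and sample string (neither is from the script above):

```perl
use strict;
use warnings;

# The outer group captures the whole URL; the inner group captures
# only the scheme. split returns both captures between the pieces
# of ordinary text.
my @parts = split(/((http:|ftp:)\S+)/, "see http://x.org now");

# @parts holds four fields:
#   "see ", "http://x.org", "http:", " now"
print join('|', @parts), "\n";
```

The third field ("http:") is the one text2html discards with its extra shift.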
A while-loop version built on $` could probably do the same thing
(and maybe faster), but the basic implementation of this routine
was written two years ago and has rarely been revisited.
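
For what it's worth, here is a rough sketch of what such a loop might look like. It is untimed, so no speed claim is made. It assumes Perl 5 (it uses "my" and a non-capturing group, so it will not run under perl 4), and the name text2html_loop is made up for this example. The character classes are copied from the split pattern in the script.

```perl
use strict;
use warnings;

# Same scheme list as $Url, but non-capturing since split is not used here.
my $Url = '(?:http://|ftp://|afs://|wais://|telnet://|gopher://|'
        . 'news:|nntp:|mid:|cid:|mailto:|prospero:)';

sub entify {
    my ($txt) = @_;
    $txt =~ s/&/&amp;/g;
    $txt =~ s/>/&gt;/g;
    $txt =~ s/</&lt;/g;
    return $txt;
}

# Hypothetical loop-based variant: find the next URL, emit the text
# before it ($`) and the anchor, then continue with the remainder ($').
sub text2html_loop {
    my ($rest) = @_;
    my $out = '';
    while ($rest =~ m%${Url}[^\s()|<>"']*[^.;,"'|\[\]()\s<>]%) {
        # Copy the match variables before any other regex runs.
        my ($pre, $url, $post) = ($`, $&, $');
        my $endch = '';
        # A trailing '?' is probably sentence punctuation, not part of the URL.
        $endch = '?' if $url =~ s/\?$//;
        $out .= entify($pre)
              . '<A HREF="' . $url . '">' . entify($url) . '</A>'
              . $endch;
        $rest = $post;
    }
    return $out . entify($rest);    # Trailing non-URL text
}

print text2html_loop("Been to http://www.xor.com/? Then try it.\n");
```

Like the split version, this leaves the HREF value unescaped and only entifies the visible text.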
--ewh