Re: URL-ification

Earl Hood (ehood@imagine.convex.com)
Tue, 09 Jan 1996 22:00:07 -0600


> URLs in text like:
> 
>  Been to http://www.xor.com/? Then try http://internet-plaza.net/. 
> 
> to something like
> 
>  Been to <A HREF="http://www.xor.com/">http://www.xor.com/</A>? 
>  Then try <A HREF="http://internet-plaza.net/">http://internet-plaza.net/</A>
...
> s{
>     \b                          # start at word boundary
>     (                           # begin $1  {
>       $urls     :               # need resource and a colon
>       [$any] +?                 # followed by one or more
>                               #  of any valid character, but
>                               #  be conservative and take only
>                               #  what you need to....
>     )                           # end   $1  }
>     (?=                         # look-ahead non-consumptive assertion
>           [$punc]+            # either 0 or more punctuation
>           [^$any]             #   followed by a non-url char
>       |                       # or else
>           $                   #   then end of the string
>     )
> }{<A HREF="$1">$1</A>}igox;

What you had did not handle lines like the following:

http://www.xor.com/foo?bar <http://internet-plaza.net/>.
Been to http://www.xor.com/foo?bar <http://internet-plaza.net/>.

(BTW, I had to edit out all the whitespace in the regular expression
 to get it to find any matches on strings under 5.001m).

Here's something that works pretty well (perl 4 or 5):

----snip---
#! /usr/local/bin/perl

$Url = '(http://|ftp://|afs://|wais://|telnet://|gopher://|'.
	'news:|nntp:|mid:|cid:|mailto:|prospero:)';

while (<>) {
    print STDOUT &text2html($_);
}
exit 0;

##      text2html converts a string to HTML by converting URLs to
##      anchors and converting special characters into entity references.
##
sub text2html {
    local($str) = $_[0];
    local($item, $item2, $item2h, $endch, @array);
    if (@array =
	split(m%($Url[^\s\(\)\|<>"']*[^\.;,"'\|\[\]\(\)\s<>])%o, $str)) {

	$str = '';
	while($#array > 0) {
	    $item = &entify(shift @array);	# Get non-URL text
	    $item2 = shift @array;		# Get URL
	    $endch = chop $item2;		# Check if '?' at end of URL
	    if ($endch ne '?') {		#   '?' is normally valid,
		$item2 .= $endch;		#   but probably not part of
		$endch = '';			#   URL if last character
	    }
	    $item2h = &entify($item2);		# Variable for <A> content

	    $str .= join('',
			 $item,
			 '<A HREF="', $item2, '">', $item2h, '</A>',
			 $endch);

	    # The next line is needed since Perl's split function also
	    # returns extra entries for nested ()'s in the split pattern.
	    shift @array  if $array[0] =~ m%^$Url$%o;
	}
	$item = &entify(shift @array);		# Last item in array
	$str .= $item;
    }
    $str;	# Return converted string
}

sub entify {
    local($txt) = $_[0];
    $txt =~ s/&/\&amp;/g;
    $txt =~ s/>/&gt;/g;
    $txt =~ s/</&lt;/g;
    $txt;
}
----snip---
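
The comment about split and nested ()'s is easier to see on a toy
string.  This is only an illustration (the pattern and the input are
made up, not taken from the script): split returns one extra entry per
capturing group for every separator it finds, which is why text2html
has to throw away the scheme-only entry that follows each URL.

```perl
use strict;
use warnings;

# Made-up pattern with an outer and a nested capturing group,
# shaped like ($Url...) in the real split pattern.
my @parts = split /((a|b)\d)/, "xa1yb2z";

# Each separator contributes two entries: the whole match and the
# nested group.  The text between separators comes through as usual.
print join("|", @parts), "\n";   # prints x|a1|a|y|b2|b|z
```

In text2html the "a1" entries correspond to the full URL and the "a"
entries to the bare scheme (e.g. "http://") that the extra shift drops.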

A while-loop version that uses $` could probably do the same thing
(and maybe faster).  But the basic implementation of this routine was
written 2 years ago and has rarely been revisited.
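
For what it's worth, here is a rough sketch of what such a while
version might look like.  It is untuned and makes some assumptions:
the name text2html_loop is hypothetical, the scheme list is assumed to
be the same as in $Url, and it uses perl 5 syntax (my, qr), so it will
not run under perl 4.  No claim that it is actually faster; in older
perls, mentioning $`, $& or $' anywhere slows down every match in the
program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed scheme list, mirroring the $Url alternation in the script.
my $schemes = 'http://|ftp://|afs://|wais://|telnet://|gopher://|'
            . 'news:|nntp:|mid:|cid:|mailto:|prospero:';

# Same shape as the split pattern: a scheme, a run of URL characters,
# and a final character that is not trailing punctuation.
my $url_re = qr{(?:$schemes)[^\s()|<>"']*[^.;,"'|\[\]()\s<>]};

# Convert the characters that are special in HTML text.
sub entify {
    my ($txt) = @_;
    $txt =~ s/&/&amp;/g;
    $txt =~ s/>/&gt;/g;
    $txt =~ s/</&lt;/g;
    return $txt;
}

# Hypothetical name: same job as text2html, but it walks the string
# with a match loop instead of split.
sub text2html_loop {
    my ($rest) = @_;
    my $out = '';
    while ($rest =~ $url_re) {
        # Copy the match variables first; any later regex resets them.
        my ($before, $url, $after) = ($`, $&, $');
        # A trailing '?' is probably sentence punctuation, not URL.
        my $endch = ($url =~ s/\?$//) ? '?' : '';
        $out .= entify($before)
              . '<A HREF="' . $url . '">' . entify($url) . '</A>'
              . $endch;
        $rest = $after;          # continue after the URL
    }
    return $out . entify($rest); # whatever is left has no URL in it
}

print text2html_loop(
    "Been to http://www.xor.com/? Then try <http://internet-plaza.net/>.\n");
```

Since there are no capturing groups in the pattern, there is nothing
extra to shift off, which is the main simplification over the split
version.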

	--ewh