Re: Converting HTML to plain text

Daniel Reeves (dreeves@eecs.umich.edu)
Mon, 26 Jul 1999 18:41:09 -0400 (EDT)


Lynx is better at this than any existing Perl module, so I ended up using
that.
I dump the html to a temp file, run lynx -dump on it, and capture the
output, which is a nicely formatted text version of the html, tables
and all.
(Using the temp file is probably not necessary for most applications.)
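
A minimal sketch of the no-temp-file variant, for illustration only: it
feeds the html to lynx on standard input instead. The name html2text_pipe
and the optional @cmd argument are hypothetical, and it assumes the
installed lynx is on the PATH and supports the -stdin option. Link
rewriting is left out, since there is no temp directory prefix to replace.

```perl
use strict;
use warnings;
use IPC::Open2;

# Hypothetical helper: pipe $rawhtml through a command and return what the
# command prints. With no command given it runs "lynx -dump -stdin"
# (assumed to be available on the PATH).
sub html2text_pipe
{
   my($rawhtml, @cmd) = @_;
   @cmd = ('lynx', '-dump', '-stdin') unless @cmd;

   # open2 hands back a read handle on the command's stdout and a write
   # handle on its stdin.
   my $pid = open2(my $from_cmd, my $to_cmd, @cmd);
   print $to_cmd $rawhtml;
   close($to_cmd);                  # send EOF so the command can finish

   local $/;                        # slurp all of the output at once
   my $textdata = <$from_cmd>;
   close($from_cmd);
   waitpid($pid, 0);                # reap the child process
   return defined $textdata ? $textdata : "";
}
```

All of the input is written before any output is read, so a command that
streams output while it is still reading could block on very large input.
The temp-file version follows.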

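sub html2text
{
   my($rawhtml, $url_prefix) = @_;
   my $tmpfile = $ENV{'HOME'}."/trash/tmp-html2text-$$.html";

   # Earlier attempt with the HTML::Parse and HTML::FormatText modules:
   #my $htmldata = HTML::Parse::parse_html($rawhtml);
   #my $formatter = new HTML::FormatText;
   #my $textdata = $formatter->format($htmldata);
   #return $textdata;

   # Write the raw html to a temp file for lynx to read.
   open(my $html, '>', $tmpfile) or die "can't write $tmpfile: $!";
   print $html $rawhtml;
   close($html) or die "can't close $tmpfile: $!";

   # Run lynx -dump on the temp file and collect its output.
   open(my $lynx, '-|', '/usr/gnu/bin/lynx', '-dump', $tmpfile)
     or die "can't run lynx: $!";
   my $textdata = "";
   while (<$lynx>) { $textdata .= $_; }
   close($lynx);
   unlink($tmpfile);

   # Links in the output point into the temp directory (hard-coded here
   # for my account); replace that prefix with the real URL prefix.
   $textdata =~
     s[file://localhost/n/flip/h/dreeves/trash/][$url_prefix]g;
   return $textdata;
}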
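
For the original question (a large group of html files, each written out
as text to another file), a driver loop along these lines might do. This
is only a sketch: the name convert_files, the .txt naming scheme, and the
output directory argument are assumptions made for illustration. The
converter is passed in as a code reference, so html2text can be swapped
for anything else.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical driver: read each file, run $convert on its contents, and
# write the result to NAME.txt in $outdir. Returns the files written.
sub convert_files
{
   my($convert, $outdir, @files) = @_;
   my @written;
   foreach my $file (@files) {
      # Slurp the whole html file.
      open(my $in, '<', $file) or die "can't read $file: $!";
      my $rawhtml = do { local $/; <$in> };
      close($in);
      $rawhtml = "" unless defined $rawhtml;

      # foo.html (or foo.htm) becomes $outdir/foo.txt
      (my $name = basename($file)) =~ s/\.html?$//i;
      my $outfile = "$outdir/$name.txt";
      open(my $out, '>', $outfile) or die "can't write $outfile: $!";
      print $out $convert->($rawhtml);
      close($out) or die "can't close $outfile: $!";
      push @written, $outfile;
   }
   return @written;
}
```

With the sub above it could be called as
convert_files(sub { html2text($_[0], $url_prefix) }, "text", glob("*.html")),
where $url_prefix is whatever base URL the links should get and "text" is
an existing output directory.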

--    --    --    --    --    --    --    --    --    --    --    -- 
Daniel Reeves               http://ai.eecs.umich.edu/people/dreeves/

   "In the last 10 years, we have come to realize that humans are more
   like worms than we ever imagined."  -- Bruce Alberts, president of
     the National Academy of Sciences, after mapping the DNA of a 
     microscopic roundworm.

On Fri, 23 Jul 1999, Balisteri, Peter wrote:

> I need to read a large group of html files, one at a time, and output only
> the text to another file.  I am new to using modules in perl.  I have been
> putting off learning anything about them until I needed it,  and from what I
> could glean from my web searches HTML::Parser could make this easy.   
> 
> Is there any sample code that anyone knows of to do what I described above?
> Or can someone give me some pointers? 
> 
> Thanks in advance. 
> ________________________________________________
> Peter Balisteri
> 
> Fisher-Price
> 636 Girard Avenue
> East Aurora, NY 14052
> 716-687-3822 (voice)
> 716-687-5040 (fax)