Re: Converting HTML to plain text
Daniel Reeves (dreeves@eecs.umich.edu)
Mon, 26 Jul 1999 18:41:09 -0400 (EDT)
Lynx is better at this than any existing Perl module, so I ended up using
that.
I dump the HTML to a temp file, run lynx -dump on it, and capture
the output, which is a nicely formatted text version of the HTML, tables
and all.
(Using the temp file is probably not necessary for most applications.)
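sub html2text
{
    my ($rawhtml, $url_prefix) = @_;
    my $tmpfile = $ENV{'HOME'}."/trash/tmp-html2text-$$.html";

    # Earlier attempt with HTML::FormatText, left commented out:
    #my $htmldata = HTML::Parse::parse_html($rawhtml);
    #my $formatter = new HTML::FormatText;
    #my $textdata = $formatter->format($htmldata);
    #return $textdata;

    # Write the HTML to a temp file for lynx to read.
    open(my $html, '>', $tmpfile) or die "Can't write $tmpfile: $!";
    print $html $rawhtml;
    close($html) or die "Can't close $tmpfile: $!";

    # List-form pipe open: no shell is involved, so odd characters
    # in the file name can't be misread as shell syntax.
    open(my $lynx, '-|', '/usr/gnu/bin/lynx', '-dump', $tmpfile)
        or die "Can't run lynx: $!";
    my $textdata = join('', <$lynx>);
    close($lynx);
    unlink($tmpfile);    # don't leave temp files behind

    # Relative links come out of lynx as file://localhost/<temp dir>/...;
    # rewrite them to $url_prefix. The path below is specific to my
    # machine, so change it to match your own temp directory.
    $textdata =~
        s[file\://localhost/n/flip/h/dreeves/trash/][$url_prefix]g;
    return $textdata;
}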
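
For the original question (a group of HTML files, text written to another file), a loop around a sub like the one above could look roughly like the sketch below. The name convert_files and its argument order are made up for illustration and do not come from any module. The converter is passed in as a code reference so the loop does not care whether it calls lynx or something else.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): run each HTML file through
# a converter and write all of the resulting text to one output file.
# $convert is a code reference taking (html, url_prefix), e.g. \&html2text.
sub convert_files
{
    my ($convert, $outfile, $url_prefix, @htmlfiles) = @_;

    open(my $out, '>', $outfile) or die "Can't write $outfile: $!";
    foreach my $file (@htmlfiles) {
        # Slurp the whole file; local $/ turns off line-by-line reading.
        open(my $in, '<', $file) or die "Can't read $file: $!";
        my $rawhtml = do { local $/; <$in> };
        close($in);
        $rawhtml = '' unless defined $rawhtml;

        print $out $convert->($rawhtml, $url_prefix);
    }
    close($out) or die "Can't close $outfile: $!";

    return scalar @htmlfiles;    # number of files processed
}
```

Assuming html2text is defined as above, a call might be
convert_files(\&html2text, "all.txt", "http://www.example.com/", glob("*.html"));
where the output name and URL prefix are placeholders.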
-- -- -- -- -- -- -- -- -- -- -- --
Daniel Reeves http://ai.eecs.umich.edu/people/dreeves/
"In the last 10 years, we have come to realize that humans are more
like worms than we ever imagined." -- Bruce Alberts, president of
the National Academy of Sciences, after mapping the DNA of a
microscopic roundworm.
On Fri, 23 Jul 1999, Balisteri, Peter wrote:
> I need to read a large group of html files, one at a time, and output only
> the text to another file. I am new to using modules in perl. I have been
> putting off learning anything about them until I needed it, and from what I
> could glean from my web searches HTML::Parser could make this easy.
>
> Is there any sample code that anyone knows of to do what I described above?
> Or can someone give me some pointers?
>
> Thanks in advance.
> ________________________________________________
> Peter Balisteri
>
> Fisher-Price
> 636 Girard Avenue
> East Aurora, NY 14052
> 716-687-3822 (voice)
> 716-687-5040 (fax)