HTML::FormatText
Andreas Gustafsson (gson@araneus.fi)
Sun, 20 Oct 1996 20:38:26 +0300 (EET DST)
I have been using HTML::FormatText from libwww-perl 5.03 to generate
plain-text versions of HTML documents, and I've run into some cases
where the text gets formatted in a less than optimal way. The
attached Perl code attempts to illustrate a few of them.
Please take this as a "suggestion for improvement".
--
Andreas Gustafsson, gson@araneus.fi
================================ Cut here ================================
use HTML::Parse;
require HTML::FormatText;

# Parse an HTML string and print its plain-text rendering.
sub test_it {
    my $html = parse_html(shift);
    my $formatter = HTML::FormatText->new;
    print $formatter->format($html);
}
&test_it('
<P>This first paragraph will be indented by an extra space
because the leading newline in the HTML source is not stripped.</P>
<P>Next, we will try some fixed-width text. Testing:
<TT>test test test test</TT>. Note how the line is broken
between the last "test" and the period following it.
</P>
<P>There is an awfully large amount of vertical space between the
paragraphs. A single empty line would be enough.</P>
<P>The right margin setting is apparently treated as a minimum line length,
not a maximum like I would have expected. This means that if some
much-longer-than-usual word happens to fall at the end of
the line, it will stick out like a sore thumb.</P>
<UL>
<LI>The first item in an unnumbered list gets the asterisk wrong.
<LI>Subsequent items are fine,
<LI>as you can see.
</UL>
');
================================ Cut here ================================
Here is the output of the above Perl script:
================================ Cut here ================================
This first paragraph will be indented by an extra space because the
leading newline in the HTML source is not stripped.
Next, we will try some fixed-width text. Testing: test test test test
. Note how the line is broken between the last "test" and the period
following it.
There is an awfully large amount of vertical space between the paragraphs.
A single empty line would be enough.
The right margin setting is apparently treated as a minimum line length,
not a maximum like I would have expected. This means that if some much-longer-than-usual
word happens to fall at the end of the line, it will stick out like a
sore thumb.
*The first item in an unnumbered list gets the asterisk wrong.
* Subsequent items are fine,
* as you can see.
================================ Cut here ================================
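To make the right-margin point concrete, here is a minimal sketch of the wrapping I would have expected, assuming a simple greedy fill in which the width is a maximum. The name wrap_max and the width of 40 are made up for illustration; this is not how HTML::FormatText is implemented, only a standalone model of the desired behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::FormatText.
# Greedy fill that treats $width as a MAXIMUM line length: a word that
# would push the current line past $width starts a new line instead.
# A single word longer than $width still gets a line of its own.
sub wrap_max {
    my ($text, $width) = @_;
    my @lines;
    my $line = '';
    for my $word (split ' ', $text) {
        if ($line eq '') {
            $line = $word;                 # first word on an empty line
        } elsif (length($line) + 1 + length($word) <= $width) {
            $line .= " $word";             # word fits, append it
        } else {
            push @lines, $line;            # word does not fit, break first
            $line = $word;
        }
    }
    push @lines, $line if $line ne '';
    return @lines;
}

print "$_\n" for wrap_max(
    'This means that if some much-longer-than-usual word happens to fall '
  . 'at the end of the line, it will stick out like a sore thumb.', 40);
```

With this rule the long hyphenated word moves to the next line rather than sticking out past the margin, at the cost of a shorter line before it.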