HTML::FormatText

Andreas Gustafsson (gson@araneus.fi)
Sun, 20 Oct 1996 20:38:26 +0300 (EET DST)


I have been using HTML::FormatText from libwww-perl 5.03 to generate
plain-text versions of HTML documents, and I've run into some cases
where the text gets formatted in a less than optimal way.  The
attached Perl code attempts to illustrate a few of them. 

Please take this as a "suggestion for improvement".
-- 
Andreas Gustafsson, gson@araneus.fi

================================ Cut here ================================

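use strict;
use warnings;
use HTML::Parse;        # provides parse_html()
use HTML::FormatText;

# Parse an HTML string and print its plain-text rendering.
sub test_it {
    my $html = parse_html(shift);
    my $formatter = HTML::FormatText->new;
    print $formatter->format($html);
}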

&test_it('
<P>This first paragraph will be indented by an extra space
because the leading newline in the HTML source is not stripped.</P>

<P>Next, we will try some fixed-width text.  Testing:
<TT>test test test test</TT>.  Note how the line is broken
between the last "test" and the period following it.
</P>

<P>There is an awfully large amount of vertical space between the
paragraphs.  A single empty line would be enough.</P>

<P>The right margin setting is apparently treated as a minimum line length,
not a maximum like I would have expected.  This means that if some
much-longer-than-usual word happens to fall at the end of
the line, it will stick out like a sore thumb.</P>

<UL>
<LI>The first item in an unnumbered list gets the asterisk wrong.
<LI>Subsequent items are fine,
<LI>as you can see.
</UL>
');

================================ Cut here ================================

Here is the output of the above Perl script:

================================ Cut here ================================
    This first paragraph will be indented by an extra space because the
   leading newline in the HTML source is not stripped.

   

   Next, we will try some fixed-width text. Testing: test test test test
   . Note how the line is broken between the last "test" and the period
   following it.

   

   There is an awfully large amount of vertical space between the paragraphs.
   A single empty line would be enough.

   

   The right margin setting is apparently treated as a minimum line length,
   not a maximum like I would have expected. This means that if some much-longer-than-usual
   word happens to fall at the end of the line, it will stick out like a
   sore thumb.

     

      *The first item in an unnumbered list gets the asterisk wrong.

     * Subsequent items are fine,

     * as you can see.
================================ Cut here ================================
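
The right-margin behaviour asked for in the fourth test paragraph can be
sketched as a greedy word wrap that treats the margin as a maximum: a word
is appended to the current line only if the result still fits, otherwise a
new line is started.  This is only an illustration under stated
assumptions, not how HTML::FormatText is implemented; the function name
wrap_max and its $max parameter are hypothetical names invented for this
example.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::FormatText.
# Greedy word wrap that treats $max as the maximum line length.
sub wrap_max {
    my ($text, $max) = @_;
    my @lines;
    my $line = '';

    # split ' ' discards leading whitespace (including a leading newline)
    # and splits on runs of whitespace.
    for my $word (split ' ', $text) {
        if ($line eq '') {
            # An empty line always takes the next word, even an overlong one.
            $line = $word;
        }
        elsif (length($line) + 1 + length($word) <= $max) {
            # The word still fits within the margin, so append it.
            $line .= " $word";
        }
        else {
            # The word would cross the margin: break before it.
            push @lines, $line;
            $line = $word;
        }
    }
    push @lines, $line if $line ne '';
    return @lines;
}

print "$_\n" for wrap_max(
    'This means that if some much-longer-than-usual word happens to '
  . 'fall at the end of the line, it will not stick out.', 40);
```

With this approach a long word such as "much-longer-than-usual" is moved to
the next line rather than being appended past the margin.  A line can still
exceed $max only when a single word is itself longer than the margin.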