Patches for HTML::Format*
Andreas Gustafsson (gson@araneus.fi)
Sat, 11 Oct 1997 13:39:36 +0300 (EET DST)
About a year ago, I reported a number of cases of bad formatting by
HTML::FormatText. Not having had any response to that report, I
finally decided to try to fix these bugs myself.
Most of these problems were related to the treatment of whitespace
occurring in places other than flowing text, e.g. between paragraphs.
The Formatter class apparently tries to deal with this issue using
the subroutine eat_leading_space, but it is not entirely successful.
I have tried to fix the problem by replacing eat_leading_space with
a mechanism which defers the output of any horizontal space until
there is text not immediately preceded by vertical space.
I also made the corresponding changes to the FormatPS class. Here,
deferring the output of horizontal space is complicated by the need to
output spaces of the correct width for the current font. This
necessitated some changes to the font mechanisms, decoupling the
notion of the "current font" in the HTML input stream from that in the
PostScript output stream.
This patch also includes (and thereby replaces) my recent patch for a
benign typo in Formatter.pm.
I found it a bit hard to debug Formatter code because
Formatter::format trapped runtime errors in its subroutines using
eval { } and ignored them, even if they happened to represent actual
programming errors on my part. To ease the detection of such errors,
I rewrote this code in Formatter::format to use UNIVERSAL::can()
instead of eval { }.
Attached are my test case, its output using an unmodified
libwww-perl 5.13, its output after applying my patches, and the
patches themselves.
Regards,
--
Andreas Gustafsson, gson@araneus.fi
The test case:
================================ Cut here ================================
#!/usr/bin/perl
use HTML::Parse;
require HTML::FormatText;
sub test_it {
$html = parse_html(shift);
$formatter = new HTML::FormatText;
print $formatter->format($html);
}
&test_it('
<P>This first paragraph will be indented by an extra space
because the leading newline in the HTML source is not stripped.</P>
<P>Next, we will try some fixed-width text. Testing:
<TT>test test test test</TT>. Note how the line is broken
between the last "test" and the period following it.
</P>
<P>There is an awfully large amount of vertical space between the
paragraphs. A single empty line would be enough.</P>
<P>The right margin setting is apparently treated as a minimum line length,
not a maximum like I would have expected. This means that if some
much-longer-than-usual word happens to fall at the end of
the line, it will stick out like a sore thumb.</P>
<UL>
<LI>The first item in an unnumbered list gets the asterisk wrong.
<LI>Subsequent items are fine,
<LI>as you can see.
</UL>
');
================================ Cut here ================================
Here is the output of the above Perl script using stock libwww-perl 5.13:
================================ Cut here ================================
This first paragraph will be indented by an extra space because the
leading newline in the HTML source is not stripped.
Next, we will try some fixed-width text. Testing: test test test test
. Note how the line is broken between the last "test" and the period
following it.
There is an awfully large amount of vertical space between the paragraphs.
A single empty line would be enough.
The right margin setting is apparently treated as a minimum line length,
not a maximum like I would have expected. This means that if some much-longer-than-usual
word happens to fall at the end of the line, it will stick out like a
sore thumb.
*The first item in an unnumbered list gets the asterisk wrong.
* Subsequent items are fine,
* as you can see.
================================ Cut here ================================
Here's the output after applying the patches:
================================ Cut here ================================
This first paragraph will be indented by an extra space because the
leading newline in the HTML source is not stripped.
Next, we will try some fixed-width text. Testing: test test test test.
Note how the line is broken between the last "test" and the period
following it.
There is an awfully large amount of vertical space between the
paragraphs. A single empty line would be enough.
The right margin setting is apparently treated as a minimum line
length, not a maximum like I would have expected. This means that if
some much-longer-than-usual word happens to fall at the end of the
line, it will stick out like a sore thumb.
* The first item in an unnumbered list gets the asterisk wrong.
* Subsequent items are fine,
* as you can see.
================================ Cut here ================================
And, finally, the patches.
================================ Cut here ================================
diff -rc3 libwww-perl-5.13.orig/lib/HTML/FormatPS.pm libwww-perl-5.13/lib/HTML/FormatPS.pm
*** libwww-perl-5.13.orig/lib/HTML/FormatPS.pm Sat Aug 16 13:34:32 1997
--- libwww-perl-5.13/lib/HTML/FormatPS.pm Sat Oct 11 12:38:52 1997
***************
*** 212,217 ****
--- 212,227 ----
}
}
$self->{title} = "";
+
+ # The font ID last sent to the PostScript output (this may be
+ # temporarily different from the "current font" as read from
+ # the HTML input). Initially none.
+ $self->{psfontid} = "";
+
+ # Pending horizontal space. A list [ " ", $fontid, $width ],
+ # or undef if no space is pending.
+ $self->{hspace} = undef;
+
$self;
}
***************
*** 238,245 ****
$FontSizes[$size] * $self->{fontscale};
}
! sub findfont
{
my($self, $plain_with_size) = @_;
my $index = 0;
--- 248,259 ----
$FontSizes[$size] * $self->{fontscale};
}
+ # Determine the current font and set font-related members.
+ # If $plain_with_size is given (a number), use a plain font
+ # of that size. Otherwise, use the font specified by the
+ # HTML context. Returns the "font ID" of the current font.
! sub setfont
{
my($self, $plain_with_size) = @_;
my $index = 0;
***************
*** 254,260 ****
my $font = $FontFamilies{$family}[$index];
my $font_with_size = "$font-$size";
if ($self->{currentfont} eq $font_with_size) {
! return "";
}
$self->{currentfont} = $font_with_size;
$self->{pointsize} = $size;
--- 268,274 ----
my $font = $FontFamilies{$family}[$index];
my $font_with_size = "$font-$size";
if ($self->{currentfont} eq $font_with_size) {
! return $self->{currentfontid};
}
$self->{currentfont} = $font_with_size;
$self->{pointsize} = $size;
***************
*** 269,277 ****
$self->{fonts}{$font_with_size} = $fontID;
$fontID;
};
! "$font SF";
}
sub width
{
--- 283,316 ----
$self->{fonts}{$font_with_size} = $fontID;
$fontID;
};
! $self->{currentfontid} = $font;
! return $font;
! }
!
! # Construct PostScript code for setting the current font according
! # to $fontid, or an empty string if no font change is needed.
! # Assumes the return string will always be output as PostScript if
! # nonempty, so that our notion of the current PostScript font
! # stays in sync with that of the PostScript interpreter.
!
! sub switchfont
! {
! my($self, $fontid) = @_;
! if ($self->{psfontid} eq $fontid) {
! return "";
! } else {
! $self->{psfontid} = $fontid;
! return "$fontid SF";
! }
}
+ # Like setfont + switchfont.
+
+ sub findfont
+ {
+ my($self, $plain_with_size) = @_;
+ return $self->switchfont($self->setfont($plain_with_size));
+ }
sub width
{
***************
*** 438,444 ****
# If we are close enough to be bottom of the page, start a new page
# instead of this:
$self->vspace(1 + (6-$level) * 0.4);
- $self->eat_leading_space;
$self->{bold}++;
push(@{$self->{font_size}}, 8 - $level);
1;
--- 477,482 ----
***************
*** 482,487 ****
--- 520,526 ----
}
$self->{xpos} = $self->{lm};
$self->{vspace} = undef;
+ $self->{hspace} = undef;
}
}
***************
*** 613,630 ****
$self->{title} .= $text;
return;
}
$self->skip_vspace;
! my $font = $self->findfont();
! if (length $font) {
$self->show;
! $self->{line} .= "$font\n";
}
! my $w = $self->width($text);
! if ($self->{xpos} + $w > $self->{rm}) {
! $self->showline;
! return if $text =~ /^\s*$/;
! };
$self->{xpos} += $w;
$self->{showstring} .= $text;
$self->{largest_pointsize} = $self->{pointsize}
--- 652,695 ----
$self->{title} .= $text;
return;
}
+
+ my $fontid = $self->setfont();
+ my $w = $self->width($text);
+
+ if ($text =~ /^\s*$/) {
+ $self->{hspace} = [ " ", $fontid, $w ];
+ return;
+ }
+
$self->skip_vspace;
! # determine spacing / line breaks needed before text
! if ($self->{hspace}) {
! my ($stext, $sfont, $swidth) = @{$self->{hspace}};
! if ($self->{xpos} + $swidth + $w > $self->{rm}) {
! # line break
! $self->showline;
! } else {
! # no line break; output a space
! $self->show_with_font($stext, $sfont, $swidth);
! }
! $self->{hspace} = undef;
! }
!
! # output the text
! $self->show_with_font($text, $fontid, $w);
! }
!
!
! sub show_with_font {
! my ($self, $text, $fontid, $w) = @_;
!
! my $fontps = $self->switchfont($fontid);
! if (length $fontps) {
$self->show;
! $self->{line} .= "$fontps\n";
}
!
$self->{xpos} += $w;
$self->{showstring} .= $text;
$self->{largest_pointsize} = $self->{pointsize}
diff -rc3 libwww-perl-5.13.orig/lib/HTML/FormatText.pm libwww-perl-5.13/lib/HTML/FormatText.pm
*** libwww-perl-5.13.orig/lib/HTML/FormatText.pm Mon Nov 25 17:00:46 1996
--- libwww-perl-5.13/lib/HTML/FormatText.pm Sat Oct 11 12:39:00 1997
***************
*** 47,55 ****
my $self = shift;
$self->HTML::Formatter::begin;
$self->{lm} = 3; # left margin
! $self->{rm} = 70; # right margin
$self->{curpos} = 0; # current output position.
$self->{maxpos} = 0; # highest value of $pos (used by header underliner)
}
sub end
--- 47,56 ----
my $self = shift;
$self->HTML::Formatter::begin;
$self->{lm} = 3; # left margin
! $self->{rm} = 72; # right margin (actually, maximum text width)
$self->{curpos} = 0; # current output position.
$self->{maxpos} = 0; # highest value of $pos (used by header underliner)
+ $self->{hspace} = 0; # horizontal space pending flag
}
sub end
***************
*** 63,69 ****
my($self, $level, $node) = @_;
$self->vspace(1 + (6-$level) * 0.4);
$self->{maxpos} = 0;
- $self->eat_leading_space;
1;
}
--- 64,69 ----
***************
*** 95,107 ****
# should really handle bold/italic etc.
if (defined $self->{vspace}) {
if ($self->{out}) {
! $self->nl() while $self->{vspace}-- > -0.5;
$self->{vspace} = undef;
}
}
my $indent = ' ' x $self->{lm};
my $pre = shift;
! $pre =~ s/\n/\n$indent/g;
$self->collect($pre);
$self->{out}++;
}
--- 95,107 ----
# should really handle bold/italic etc.
if (defined $self->{vspace}) {
if ($self->{out}) {
! $self->nl() while $self->{vspace}-- >= 0;
$self->{vspace} = undef;
}
}
my $indent = ' ' x $self->{lm};
my $pre = shift;
! $pre =~ s/^/$indent/mg;
$self->collect($pre);
$self->{out}++;
}
***************
*** 111,142 ****
my $self = shift;
my $text = shift;
if (defined $self->{vspace}) {
if ($self->{out}) {
$self->nl while $self->{vspace}-- >= 0;
! $self->goto_lm;
! } else {
! $self->goto_lm;
! }
! $self->{vspace} = undef;
! }
!
! if ($self->{curpos} > $self->{rm}) { # line is too long, break it
! return if $text =~ /^\s*$/; # white space at eol is ok
! $self->nl;
$self->goto_lm;
}
! if ($self->{pending_space}) {
! $self->{pending_space} = 0;
! $self->collect(' ');
! my $pos = ++$self->{curpos};
! $self->{maxpos} = $pos if $self->{maxpos} < $pos;
}
- $self->{pending_space} = 1 if $text =~ s/\s+$//;
- return unless length $text;
-
$self->collect($text);
my $pos = $self->{curpos} += length $text;
$self->{maxpos} = $pos if $self->{maxpos} < $pos;
--- 111,143 ----
my $self = shift;
my $text = shift;
+ if ($text =~ /^\s*$/) {
+ $self->{hspace} = 1;
+ return;
+ }
+
if (defined $self->{vspace}) {
if ($self->{out}) {
$self->nl while $self->{vspace}-- >= 0;
! }
$self->goto_lm;
+ $self->{vspace} = undef;
+ $self->{hspace} = 0;
}
! if ($self->{hspace}) {
! if ($self->{curpos} + length($text) > $self->{rm}) {
! # word will not fit on line; do a line break
! $self->nl;
! $self->goto_lm;
! } else {
! # word fits on line; use a space
! $self->collect(' ');
! ++$self->{curpos};
! }
! $self->{hspace} = 0;
}
$self->collect($text);
my $pos = $self->{curpos} += length $text;
$self->{maxpos} = $pos if $self->{maxpos} < $pos;
***************
*** 158,164 ****
{
my $self = shift;
$self->{'out'}++;
- $self->{pending_space} = 0;
$self->{curpos} = 0;
$self->collect("\n");
}
--- 159,164 ----
diff -rc3 libwww-perl-5.13.orig/lib/HTML/Formatter.pm libwww-perl-5.13/lib/HTML/Formatter.pm
*** libwww-perl-5.13.orig/lib/HTML/Formatter.pm Thu Jul 3 09:50:12 1997
--- libwww-perl-5.13/lib/HTML/Formatter.pm Sat Oct 11 12:41:03 1997
***************
*** 42,47 ****
--- 42,48 ----
use strict;
use Carp;
+ use UNIVERSAL qw(can);
sub new
{
***************
*** 59,68 ****
if (ref $node) {
my $tag = $node->tag;
my $func = $tag . '_' . ($start ? "start" : "end");
! # We protect the call by eval, so we can recover if
# a handler is not defined for the tag.
! my $retval = eval { $self->$func($node); };
! return $@ ? 1 : $retval;
} else {
$self->textflow($node);
}
--- 60,72 ----
if (ref $node) {
my $tag = $node->tag;
my $func = $tag . '_' . ($start ? "start" : "end");
! # Use UNIVERSAL::can so that we can recover if
# a handler is not defined for the tag.
! if (can($self, $func)) {
! return $self->$func($node);
! } else {
! return 1;
! }
} else {
$self->textflow($node);
}
***************
*** 88,96 ****
$self->{font_size} = [3]; # last element is current size
$self->{basefont_size} = [3];
! $self->{makers} = []; # last element is current marker
! $self->{vspace} = undef; # vertical space
! $self->{eat_leading_space} = 0;
$self->{output} = [];
}
--- 92,99 ----
$self->{font_size} = [3]; # last element is current size
$self->{basefont_size} = [3];
! $self->{markers} = []; # last element is current marker
! $self->{vspace} = undef; # vertical space (dimension)
$self->{output} = [];
}
***************
*** 140,146 ****
{
my $self = shift;
$self->vspace(0);
- $self->eat_leading_space;
}
--- 143,148 ----
***************
*** 148,154 ****
{
my $self = shift;
$self->vspace(1);
- $self->eat_leading_space;
}
sub img_start
--- 150,155 ----
***************
*** 298,304 ****
{
my $self = shift;
$self->vspace(1);
- $self->eat_leading_space;
1;
}
--- 299,304 ----
***************
*** 333,339 ****
{
my $self = shift;
$self->vspace(1);
- $self->eat_leading_space;
$self->adjust_lm( +2 );
$self->adjust_rm( -2 );
1;
--- 333,338 ----
***************
*** 351,357 ****
{
my $self = shift;
$self->vspace(1);
- $self->eat_leading_space;
$self->i_start(@_);
1;
}
--- 350,355 ----
***************
*** 387,393 ****
my $self = shift;
$self->bullet($self->{markers}[-1]);
$self->adjust_lm(+2);
- $self->eat_leading_space;
1;
}
--- 385,390 ----
***************
*** 453,459 ****
{
my $self = shift;
$self->vspace(1);
- $self->eat_leading_space;
1;
}
--- 450,455 ----
***************
*** 466,472 ****
my $self = shift;
$self->adjust_lm(+6);
$self->vspace(0);
- $self->eat_leading_space;
1;
}
--- 462,467 ----
***************
*** 486,510 ****
{
my $self = shift;
if ($self->{pre}) {
$self->pre_out($_[0]);
} else {
for (split(/(\s+)/, $_[0])) {
next unless length $_;
- if ($self->{eat_leading_space}) {
- $self->{eat_leading_space} = 0;
- next if /^\s/;
- }
$self->out($_);
}
}
}
-
-
- sub eat_leading_space
- {
- shift->{eat_leading_space} = 1;
- }
sub vspace
--- 481,500 ----
{
my $self = shift;
if ($self->{pre}) {
+ # strip leading and trailing newlines so that the <pre> tags
+ # may be placed on lines of their own without causing extra
+ # vertical space as part of the preformatted text
+ $_[0] =~ s/\n$//;
+ $_[0] =~ s/^\n//;
$self->pre_out($_[0]);
} else {
for (split(/(\s+)/, $_[0])) {
next unless length $_;
$self->out($_);
}
}
}
sub vspace
================================ Cut here ================================