Re: HTML::Parser line numbers

Brian Slesinsky (bslesins@best.com)
Thu, 22 Jan 1998 17:53:26 -0800 (PST)


> Yeah, this would be a worthwhile feature-add.  Can you post the patch?

Here are diffs for HTML::Parser version 2.12:

49c49,52
<       $self->text($$buf) if length $$buf;
---
>       if(length $$buf) {
>           $self->text($$buf);
>           $self->count($$buf);
>       }
75a79
>               $self->count($text);
78a83
>               $self->count($1);
83,84c88,90
<           if ($$buf =~ s|^<!--(.*?)-->||s) {
<               $self->comment($1);
---
>           if ($$buf =~ s|^(<!--(.*?)-->)||s) {
>               $self->comment($2);
>               $self->count($1);
113a120
>               $self->count($eaten);
127a135
>               $self->count("</$1$2");
133a142
>               $self->count("</");
139a149
>           $self->count($1);
193a204
>                   $self->count("$eaten>");
196a208
>                   $self->count($eaten);
203a216
>               $self->count($eaten);
217a231,232
> # hook for HTML::CountingParser
> sub count {}
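
The pattern the diff relies on: each handler is called first, then count() receives exactly the text that was just consumed, so the concatenation of all count() arguments reproduces the input. Below is a minimal sketch of that pattern. ToyParser, its tag() method, and the `seen` field are hypothetical names made up for illustration; this is not HTML::Parser's code and its tokenizing is far cruder.

```perl
use strict;
use warnings;

# Hypothetical toy tokenizer, NOT HTML::Parser itself. It only illustrates
# the hook pattern: run the handler, then pass the consumed text to count().
package ToyParser;

sub new { my $class = shift; return bless { seen => '' }, $class }

sub parse {
    my ($self, $buf) = @_;
    while (length $buf) {
        if ($buf =~ s/^(<[^>]*>)//) {       # something tag-like
            $self->tag($1);
            $self->count($1);
        } elsif ($buf =~ s/^([^<]+)//) {    # a run of plain text
            $self->text($1);
            $self->count($1);
        } else {                            # unterminated '<': treat the rest as text
            $self->text($buf);
            $self->count($buf);
            $buf = '';
        }
    }
    return $self;
}

sub tag   {}    # handlers are no-ops here; a subclass would override them
sub text  {}

# Assumption for this sketch: count() just records what it was given.
sub count { my ($self, $eaten) = @_; $self->{seen} .= $eaten }

package main;

my $input = "<p>one\ntwo</p>\n<!-- c -->tail";
my $p = ToyParser->new->parse($input);
print $p->{seen} eq $input ? "ok\n" : "mismatch\n";
```

Because count() runs after the handler, a handler that asks for the position sees the location where its own token begins.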



Here's my subclass that counts lines:



package HTML::CountingParser;
use strict;
use HTML::Parser;
use vars qw($VERSION @ISA);

@ISA = qw(HTML::Parser);
$VERSION = '0.01';

sub new {
    my $class = shift;
    my $self = $class->SUPER::new(@_);
    $self->{'_lineno'} = 1;
    $self->{'_offset'} = 0;
    $self->{'_chars'} = 0;
    return $self;
}

sub count {
    my($self,$buf) = @_;

#    print $buf; # debug: this should echo the input exactly

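    $self->{'_chars'} += length($buf);
    my $nl_count = ($buf =~ tr/\n//);    # number of newlines consumed
    if($nl_count==0) {
        # still on the same line: just advance the offset
        $self->{'_offset'} += length($buf);
    } else {
        $self->{'_lineno'} += $nl_count;
        # offset restarts after the last newline in this chunk
        $self->{'_offset'} = length($buf) - rindex($buf,"\n") - 1;
    }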
}

sub get_counts {
    my($self) = @_;
    return ($self->{'_lineno'},$self->{'_offset'},$self->{'_chars'});
}

1;
__END__


=head1 NAME

HTML::CountingParser - parser that counts characters and lines

=head1 SYNOPSIS

Used the same way as HTML::Parser.

=head1 DESCRIPTION

Just like HTML::Parser, with the addition of the get_counts() method.
Handler methods that you override in a subclass, such as start(),
end(), and text(), can call get_counts() to find out the current
location in the file.  This is useful for reporting syntax errors and
the like.  It requires an HTML::Parser that has been patched to call
the count() hook.

=over

=item ($lineno, $offset, $characters) = $self->get_counts;

Returns the current location in the file.  For text, the location
returned is the beginning of the text.  For tags and comments, it is
the '<' character at the beginning.

The location is returned in two forms: as a line number plus an offset
within that line, and as the number of characters from the beginning
of the file.  Line numbers start from 1; the offset and the character
count start from 0.

=back

=head1 AUTHOR

January 14, 1998 - created by Brian Slesinsky for NuvoMedia.

January 22, 1998 - donated to libwww-perl.

=head1 SEE ALSO

HTML::Parser

=cut
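
To make the arithmetic in count() concrete, here is a standalone sketch that applies the same bookkeeping to a few chunks. It runs without the patched parser. The `count_chunk` name, the `%pos` hash, and the chunk boundaries are made up for illustration; a real parser decides where the chunks fall.

```perl
use strict;
use warnings;

# Standalone copy of the count() arithmetic, kept outside any parser so it
# can be run on its own. Names and chunk boundaries are illustrative only.
my %pos = (lineno => 1, offset => 0, chars => 0);

sub count_chunk {
    my ($pos, $buf) = @_;
    $pos->{chars} += length $buf;
    my $nl_count = ($buf =~ tr/\n//);
    if ($nl_count == 0) {
        $pos->{offset} += length $buf;
    } else {
        $pos->{lineno} += $nl_count;
        $pos->{offset} = length($buf) - rindex($buf, "\n") - 1;
    }
}

for my $chunk ("<html>", "\n<body>", "hello\nwor", "ld") {
    # Position BEFORE counting: this is what a handler would see for the chunk.
    printf "chunk starts at line %d, offset %d, char %d\n",
        @pos{qw(lineno offset chars)};
    count_chunk(\%pos, $chunk);
}

# After all 24 characters: line 3, offset 5, char 24.
printf "end of input at line %d, offset %d, char %d\n",
    @pos{qw(lineno offset chars)};
```

A chunk with no newline only advances the offset. A chunk containing newlines resets the offset to the number of characters after its last newline, which is why "hello\nwor" leaves the offset at 3.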