An XS implementation of HTML::Parser
Gisle Aas (gisle@aas.no)
04 Nov 1999 13:31:43 +0100
I wanted HTML::Parser to become fast, so I wrote a new XS-based
implementation and have uploaded a test release of it. You will find
HTML-Parser-XS-2.99_01.tar.gz in my CPAN directory. It now passes the
old HTML-Parser test suite for me, but it would be nice if some others
could try it out too.
The XS-tarball only contains a replacement for the HTML::Parser
module, so you probably want to have some older plain release
installed as well to get the other HTML::* modules.
One thing that kills the performance of the old parser is its use of
method callbacks. Subroutine calls (and especially method calls) are
very expensive in Perl. The new parser can be initialised with
callbacks in the constructor:
    HTML::Parser->new(start => sub { .... },
                      end   => sub { .... },
                      text  => sub { .... },
                     );
When you do it like this, no methods are ever called as callbacks.
This also allows the parser to skip some work for the things you have
no interest in.
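
To illustrate the idea only (this is not the XS code), here is a toy
dispatcher in plain Perl. ToyParser, its run() method and the
pre-split event list are all made up for this sketch; the point is
just that a missing callback means the event is skipped, and that a
supplied one is called directly as a code ref with no method lookup.

```perl
use strict;
use warnings;

# ToyParser is a hypothetical stand-in, not the HTML::Parser XS code.
package ToyParser;

sub new {
    my ($class, %callbacks) = @_;
    # Store only the code refs the caller passed to the constructor.
    return bless { cb => \%callbacks }, $class;
}

# Takes a list of already-split events, each one [ type, payload ].
# Returns the number of events that were actually dispatched.
sub run {
    my ($self, @events) = @_;
    my $dispatched = 0;
    for my $event (@events) {
        my ($type, $payload) = @$event;
        my $cb = $self->{cb}{$type} or next;   # no handler: skip this event
        $cb->($payload);                       # direct code-ref call
        $dispatched++;
    }
    return $dispatched;
}

package main;

my @texts;
my $parser = ToyParser->new(text => sub { push @texts, $_[0] });

my $dispatched = $parser->run(
    [ start => 'p' ],
    [ text  => 'hello' ],
    [ end   => 'p' ],
    [ text  => 'world' ],
);
```

Only the two text events reach a callback here; the start and end
events cost one hash lookup each and nothing more.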
I will probably also add configuration options that just make it
collect tokens in an array, so that they can be obtained without any
subroutine call at all. This should allow me to implement a very fast
HTML::TokeParser.
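
A rough sketch of what "tokens in an array" could mean, again in
plain Perl and not the planned interface: toy_tokens() is a
hypothetical helper that handles only simple tags and text, and the
[ type, value ] token layout is an assumption made for the example.
The consumer walks the array afterwards, so no callback is invoked
per token.

```perl
use strict;
use warnings;

# toy_tokens() is a hypothetical helper for illustration. It understands
# only plain start tags, end tags and text; no comments, no attributes.
sub toy_tokens {
    my ($html) = @_;
    my @tokens;
    # Walk the string left to right, matching either a tag or a text run.
    while ($html =~ m{\G(?:<(/?)(\w+)[^>]*>|([^<]+))}gc) {
        if (defined $2) {
            # Tag: 'E' for an end tag, 'S' for a start tag.
            push @tokens, [ ($1 ? 'E' : 'S'), lc $2 ];
        }
        else {
            # Text between tags.
            push @tokens, [ 'T', $3 ];
        }
    }
    return \@tokens;
}

my $tokens = toy_tokens('<p>Hi <b>there</b></p>');

# Consume the collected tokens with ordinary list operations.
my @tags = map { $_->[1] } grep { $_->[0] eq 'S' } @$tokens;
```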
Also new in this implementation: it recognises processing
instructions, and declaration lines are parsed into individual
tokens.
Other things I have been considering are to make it parse
<xmp>...</xmp> in the old deprecated 'literal' mode and perhaps even
"marked sections". Another possibility is a limited XML parsing mode.
Is there anything else anybody has missed?
I am not sure how I would like:
    <script> <!--
    ....
    --></script>
to be parsed. Should "..." always be a comment?
The good thing about XS is that we can make many options without
sacrificing performance (much).
Regards,
Gisle