Re: LWP 5.0: HTML::Parse turns memory hog
Gisle Aas (aas@bergen.sn.no)
Tue, 18 Jun 1996 14:04:57 +0200
In message <Pine.SGI.3.93.960618115332.29859A-100000@ebor.york.ac.uk>,
Mike Brudenell writes:
> I'm Mike Brudenell, a Systems Programmer in the Computing Service at the
> University of York in the UK. I'm currently writing some software to
> traverse links in Web pages (ultimately to turn into a link checker and
> report generator), and using Perl + libwwwperl-5.00 to do this.
>
> I've just been running my program "in anger" for the first time, and it
> was working very well...
>
> ...Until, that is, it hit a small number of pages belonging to one of my
> colleagues. These are documentation for a utility called "xgen", and the
> HTML files themselves are produced by a utility called LaTeX2HTML.
>
> These cause the HTML::Parse module's "parse_html" function to turn into a
> *terrific* memory hog. For example, one of the documentation files is
> 150Kb. Feeding it to parse_html takes an incredibly long time, during
> which time the process' VM usage shoots up horrendously (eg from 2500
> pages to 5000 pages, where 1 "page" = 4Kb).
>
> Even more interestingly if, having checked one of these pages I now
> obtain the next page and parse that within the same process memory shoots
> up even further in even bigger steps (eg, from 5000 up to 50000 pages!)
>
> I was wondering if you would be interested in receiving a copy of these
> problem files to test against parse_html on your own system?
>
> My gut feeling is that there is some significant problem in parse_html.
> And I'm not 100% convinced that using "delete" on the parsed object tree
> properly releases the space used by the object tree.

I believe the problem is that the parser builds a huge number of
interlinked HTML::Element objects. Each element would be an anonymous
hash looking like this:

    bless {
        _tag     => 'body',
        _parent  => $parent,
        _content => [$child1, $child2, ...],
    }, 'HTML::Element';

I am not surprised that this turns out to eat a lot of memory for
large HTML documents with many tags. Does anybody know what the
typical perl memory consumption for an object like the HTML::Element
would be?
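
On the doubt about whether the tree is properly released: because each
element points at its parent and the parent points back at its children,
the objects form reference cycles, and perl's reference counting cannot
free a cycle on its own. The following is a minimal sketch of that
effect using a toy class. ToyNode, unlink_tree and build_and_drop are
hypothetical names invented for this illustration; this is not the
actual HTML::Element code, and it only assumes the hash layout shown
above.

```perl
use strict;
use warnings;

# Toy stand-in for an element with parent/child links (hypothetical,
# not the real HTML::Element implementation).
package ToyNode;

our $destroyed = 0;    # how many nodes perl has actually freed

sub new {
    my ($class, $tag, $parent) = @_;
    my $self = bless { _tag => $tag, _parent => $parent, _content => [] }, $class;
    push @{ $parent->{_content} }, $self if $parent;
    return $self;
}

# Break the parent/child cycles, recursively, so that reference
# counts can drop to zero.
sub unlink_tree {
    my ($self) = @_;
    $_->unlink_tree for @{ $self->{_content} };
    $self->{_parent}  = undef;
    $self->{_content} = [];
}

sub DESTROY { $destroyed++ }

package main;

# Build a root with three children, optionally unlink it, then let
# the only outside reference ($root) go out of scope.
sub build_and_drop {
    my ($unlink) = @_;
    my $root = ToyNode->new('body');
    ToyNode->new('p', $root) for 1 .. 3;
    $root->unlink_tree if $unlink;
    return;
}

$ToyNode::destroyed = 0;
build_and_drop(0);
print "freed without unlinking: $ToyNode::destroyed\n";

$ToyNode::destroyed = 0;
build_and_drop(1);
print "freed after unlinking:   $ToyNode::destroyed\n";
```

In the first call nothing is freed until the program exits, which is the
kind of growth described above when several pages are parsed in one
process. If the tree really is needed, the links have to be broken
explicitly before the last reference to the root is dropped.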

What you should do is try to use the new HTML::Parser object
directly. It allows you to look for links without consuming any
memory for building a parse tree. We don't really need that tree for
this kind of application.

Untested code below:

    package LinkExtractor;
    require HTML::Parser;
    @ISA = qw(HTML::Parser);

    sub start    # called when start tags are recognized
    {
        my($self, $tag, $attr) = @_;
        # Should really use something like the %HTML::Element::linkElements
        # hash to extract links.
        if ($tag eq 'a') {
            # <a name="..."> anchors carry no href, so check first
            print "Link: $attr->{href}\n" if defined $attr->{href};
        } elsif ($tag eq 'img') {
            print "Link: $attr->{src}\n" if defined $attr->{src};
        }
        # ...
    }

    package main;
    $p = LinkExtractor->new;
    $p->parse_file("foo.html");

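The comment in start() suggests driving the extraction from a table
rather than a chain of if/elsif tests. A rough sketch of that idea
follows. The %link_attr table and the links_in_tag helper are
hypothetical and deliberately incomplete; the real
%HTML::Element::linkElements may be laid out differently, so check it
before relying on this.

```perl
use strict;
use warnings;

# Hypothetical table: tag name => attributes that may hold a URL.
# Incomplete on purpose; extend it for the tags you care about.
my %link_attr = (
    a    => ['href'],
    img  => ['src'],
    body => ['background'],
    base => ['href'],
);

# Return the link values found in one start tag. $tag and $attr are
# the same two arguments the start() method above receives.
sub links_in_tag {
    my ($tag, $attr) = @_;
    my $names = $link_attr{$tag} or return;
    return grep { defined } map { $attr->{$_} } @$names;
}

print "Link: $_\n" for links_in_tag('a',   { href => 'foo.html' });
print "Link: $_\n" for links_in_tag('img', { src => 'x.gif', alt => 'x' });
print "Link: $_\n" for links_in_tag('p',   { align => 'center' });  # no links
```

Inside the LinkExtractor class the body of start() would then shrink to
a single loop over links_in_tag($tag, $attr).
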
Regards,
Gisle.