Re: LWP 5.0: HTML::Parse turns memory hog

Gisle Aas (aas@bergen.sn.no)
Tue, 18 Jun 1996 14:04:57 +0200


In message <Pine.SGI.3.93.960618115332.29859A-100000@ebor.york.ac.uk>,
Mike Brudenell writes:
> I'm Mike Brudenell, a Systems Programmer in the Computing Service at the
> University of York in the UK.  I'm currently writing some software to
> traverse links in Web pages (ultimately to turn into a link checker and
> report generator), and using Perl + libwwwperl-5.00 to do this. 
> 
> I've just been running my program "in anger" for the first time, and it
> was working very well...
> 
> ...Until, that is, it hit a small number of pages belonging to one of my
> colleagues.  These are documentation for a utility called "xgen", and the
> HTML files themselves are produced by a utility called LaTeX2HTML.
> 
> These cause the HTML::Parse module's "parse_html" function to turn into a
> *terrific* memory hog.  For example, one of the documentation files is
> 150Kb.  Feeding it to parse_html takes an incredibly long time, during
> which time the process' VM usage shoots up horrendously (eg from 2500
> pages to 5000 pages, where 1 "page" = 4Kb).
> 
> Even more interestingly, if having checked one of these pages I now
> obtain the next page and parse that within the same process, memory shoots
> up even further in even bigger steps (eg, from 5000 up to 50000 pages!)
> 
> I was wondering if you would be interested in receiving a copy of these
> problem files to test against parse_html on your own system?
> 
> My gut feeling is that there is some significant problem in parse_html. 
> And I'm not 100% convinced that using "delete" on the parsed object tree
> properly releases the space used by the object tree.

I believe the problem to be that the parser builds a huge number of
interlinked HTML::Element objects.  Each element would be an anonymous
hash looking like this:

   bless {
      _tag     => 'body',
      _parent  => \$parent,
      _content => [\$child1, \$child2, ...],
   }, 'HTML::Element';

I am not surprised that this turns out to eat a lot of memory for
large HTML documents with many tags.  Does anybody know what the
typical perl memory consumption for an object like the HTML::Element
would be?
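
To make the "interlinked" part concrete, here is a minimal sketch in
plain Perl.  The Node class, its delete_tree method and the $live
counter are made up for illustration; this is not the real
HTML::Element.  Because each child points back at its parent, the tree
contains reference cycles, and Perl's reference counting does not
reclaim such a tree when the last outside reference goes away; the
links have to be broken first.  Whether this accounts for the growth
described above is only a guess.

```perl
package Node;   # hypothetical stand-in, NOT the real HTML::Element
use strict;
use warnings;

our $live = 0;  # number of Node objects currently alive

sub new {
    my ($class, $tag, $parent) = @_;
    my $self = bless { _tag => $tag, _parent => $parent, _content => [] }, $class;
    push @{ $parent->{_content} }, $self if $parent;   # parent -> child link
    $live++;
    return $self;
}

# Break the parent/child cycles so reference counting can free the nodes.
sub delete_tree {
    my $self = shift;
    for my $child (@{ delete $self->{_content} || [] }) {
        $child->delete_tree if ref $child;
    }
    delete $self->{_parent};
}

sub DESTROY { $live-- }

package main;

# Build a tiny tree: one 'body' node with three 'p' children.
sub build_tree {
    my $root = Node->new('body');
    Node->new('p', $root) for 1 .. 3;
    return $root;
}

{ my $tree = build_tree(); }                       # dropped without cleanup
print "live after plain drop: $Node::live\n";      # the cycles keep all 4 alive

{ my $tree = build_tree(); $tree->delete_tree; }   # links broken first
print "live after delete_tree: $Node::live\n";     # still 4: only the leaked ones
```

If something like this is going on, every tree that is dropped without
an explicit cleanup call stays allocated for the life of the process.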


What you should do is try to use the new HTML::Parser object
directly.  It allows you to look for links without consuming any
memory for building a parse tree.  We don't really need that tree for
this kind of application.

Untested code below:

  package LinkExtractor;
  require HTML::Parser;
  @ISA=qw(HTML::Parser);

  sub start  # called when start tags are recognized
  {
     my($self,$tag,$attr) = @_;
     # Should really use something like the %HTML::Element::linkElements
     # hash to extract links.
     if ($tag eq 'a') {
        print "Link: $attr->{href}\n" if defined $attr->{href};
     } elsif ($tag eq 'img') {
        print "Link: $attr->{src}\n" if defined $attr->{src};
     }
     # ... (other link-bearing tags elided)
  }

  package main;
  $p = new LinkExtractor;
  $p->parse_file("foo.html");
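
The comment in start() about driving the extraction from a table could
be sketched roughly as follows.  This assumes an installed HTML::Parser
that still supports the subclass-with-start() style used above.  The
package name TableLinkExtractor, the %LINK_ATTR table, the links()
method and the _links key are all made up for illustration; the table
is a small hand-written stand-in, not the contents of
%HTML::Element::linkElements.

```perl
package TableLinkExtractor;   # hypothetical name
use strict;
use warnings;
require HTML::Parser;
our @ISA = qw(HTML::Parser);

# Hypothetical table: tag name => attributes that may carry a URL.
my %LINK_ATTR = (
    a     => ['href'],
    img   => ['src'],
    link  => ['href'],
    frame => ['src'],
    form  => ['action'],
);

# Called for each start tag; collects URLs instead of building a tree.
sub start {
    my ($self, $tag, $attr) = @_;
    my $names = $LINK_ATTR{$tag} or return;
    for my $name (@$names) {
        push @{ $self->{_links} }, $attr->{$name} if defined $attr->{$name};
    }
}

# Return the links collected so far.
sub links { @{ $_[0]{_links} || [] } }

package main;

my $p = TableLinkExtractor->new;
$p->parse('<a href="a.html">x</a><a name="top"></a><img src="b.gif">');
$p->eof;
print "Link: $_\n" for $p->links;
```

Adding another link-bearing tag then only means adding a row to the
table, and anchors without an href are skipped rather than printed as
empty links.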

Regards,
Gisle.