Re: Hacking HTML::TreeBuilder and HTML::Element

Justin Mason (jm@jmason.org)
Thu, 01 Feb 2001 12:04:21 +0000


Jason Henry Parker said:

> I'm working on a module that will be used to intelligently extract the
> content from HTML pages like slashdot, lwn, or CNN---sites that use
> large tables to sandwich content between columns of mostly static and
> uninteresting text.
> [...]
> Has anyone on the list been here before?

Yep!  Sitescooper (http://sitescooper.org/) has some code for "table
smarts", which effectively means removing any tables that are narrower
than a certain number of pixels or percentage of the page width.

This trims out most "sidebars", which on most sites contain the content
you don't want.  You'll still get any "wide" tables that appear above and
below the main content text, but:

  - most sites have recognised that these should be kept to a minimum for
    usability purposes;

  - often they aren't even rendered as tables, they're just part of the
    <td> that the text is in, so table cleverness may not help. :(

But the sitescooper implementation is a hell of a lot more simple-minded
than what you describe, so it may not be helpful.

Anyway, take a look at
http://sitescooper.org/dist/lib/Sitescooper/StripTablesFilter.pm to see
the HTML::Filter object which does this.
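
For illustration, here is a rough sketch of the narrow-table idea in
Python, using only the standard library.  It is not the Sitescooper code
(which is Perl); the class name, the function name, and the 400px / 50%
thresholds are all assumptions made up for the example.  It drops any
<table> whose width attribute falls below the threshold, along with
everything nested inside it, and keeps tables that declare no width.
Comments and doctype declarations are dropped too, to keep it short.

```python
from html.parser import HTMLParser


class NarrowTableStripper(HTMLParser):
    """Re-emit HTML, omitting tables whose declared width is too small.

    Thresholds are hypothetical defaults, not values taken from Sitescooper.
    """

    def __init__(self, min_px=400, min_percent=50):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.min_px = min_px
        self.min_percent = min_percent
        self.out = []
        self.skip_depth = 0  # > 0 while inside a table being removed

    def _is_narrow(self, attrs):
        width = dict(attrs).get("width")
        if not width:
            return False  # no declared width: keep the table
        width = width.strip()
        try:
            if width.endswith("%"):
                return float(width[:-1]) < self.min_percent
            return float(width) < self.min_px
        except ValueError:
            return False  # unparseable width: keep the table

    def handle_starttag(self, tag, attrs):
        # Count nested tables so we stop skipping at the right </table>.
        if tag == "table" and (self.skip_depth or self._is_narrow(attrs)):
            self.skip_depth += 1
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "table":
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_narrow_tables(html, min_px=400, min_percent=50):
    parser = NarrowTableStripper(min_px=min_px, min_percent=min_percent)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_narrow_tables() on a page with a 120-pixel sidebar table
nested inside a 100%-wide layout table removes the sidebar and leaves the
outer table and the story text alone.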


BTW sitescooper also includes descriptions of patterns in the HTML which
act as upper or lower bounds for content areas, in its "site files".  You
may want to think about doing it this way; it's pretty simple, but it
does mean you have to update the patterns whenever a site is redesigned.
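
As a rough sketch of the bounding-pattern approach, again in Python and
not in Sitescooper's actual site file format: the SITE_RULES table, the
"example-news" entry, and its marker comments below are invented for
illustration.  Each site gets a start and an end regular expression, and
the content is whatever lies between the first match of each.  Returning
None when a marker is missing gives you a cheap signal that the site has
probably been redesigned and the patterns need updating.

```python
import re

# Hypothetical per-site rules; real sites need their own patterns.
SITE_RULES = {
    "example-news": {
        "start": r"<!--\s*story begins\s*-->",
        "end": r"<!--\s*story ends\s*-->",
    },
}


def extract_content(html, rule):
    """Return the HTML between the start and end patterns, or None."""
    start = re.compile(rule["start"], re.IGNORECASE).search(html)
    if start is None:
        return None  # upper bound not found
    # Look for the lower bound only after the end of the upper bound.
    end = re.compile(rule["end"], re.IGNORECASE).search(html, start.end())
    if end is None:
        return None  # lower bound not found
    return html[start.end():end.start()]
```

For a page such as 'nav<!-- story begins --><p>Hello</p><!-- story ends
-->footer', extract_content(page, SITE_RULES["example-news"]) gives back
just the '<p>Hello</p>' part.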

g'luck,

--j.