auto-parsing web pages from templates
Daniel Reeves (dreeves@eecs.umich.edu)
Mon, 26 Jul 1999 17:18:11 -0400 (EDT)
I'm working on a general approach to parsing data out of web pages.
The idea is to first grab a copy of the HTML and run Perl's quotemeta on it,
then tag the dynamic parts to produce a template of the web page you are
interested in.
You could then call a function with the template and the actual page as
input, and it would return a hash keyed by the tags in the template, with
each value taken from the corresponding text in the HTML.
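As a rough illustration of that function, here is a minimal sketch in Python rather than Perl, where re.escape plays roughly the role of quotemeta. The {{tag}} marker syntax and the names template_to_regex and parse_page are assumptions made for this example, not part of any existing implementation. Instead of escaping the whole page and tagging afterwards, the sketch escapes the literal text between tags, which should amount to the same thing.

```python
import re

# Hypothetical tag syntax: {{name}} marks a dynamic part of the page.
TAG = re.compile(r"\{\{([A-Za-z_]\w*)\}\}")

def template_to_regex(template):
    """Escape the literal text and turn each {{tag}} into a named group.

    Tag names must be unique, since each one becomes a regex group name.
    """
    parts, pos = [], 0
    for m in TAG.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal HTML
        parts.append("(?P<%s>.*?)" % m.group(1))          # dynamic part
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)

def parse_page(template, page):
    """Return a dict of tag -> text, or None if the page does not fit."""
    m = template_to_regex(template).fullmatch(page)
    return m.groupdict() if m else None

template = "<h1>{{title}}</h1><p>Price: ${{price}}</p>"
page = "<h1>Widget</h1><p>Price: $9.99</p>"
print(parse_page(template, page))  # {'title': 'Widget', 'price': '9.99'}
```

The groups are non-greedy so that each tag captures as little text as possible before the next literal piece of the template; a page whose fixed text has changed simply fails to match.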
One tricky thing is that a web page sometimes has lists or other repeating
content. You can't tag each item individually because the list can be
arbitrarily long, so the template needs a special syntax for repeated
sections.
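One possible shape for such a syntax, sketched in Python under stated assumptions: {{tag}} marks a single dynamic value, and a made-up {{#name}} ... {{/name}} pair wraps a row template that may repeat zero or more times. Both markers and the function names are hypothetical, and the sketch handles only one repeated section per template.

```python
import re

# Hypothetical syntax: {{tag}} is one value, {{#name}}...{{/name}} a repeated row.
TAG = re.compile(r"\{\{([A-Za-z_]\w*)\}\}")
REPEAT = re.compile(r"\{\{#([A-Za-z_]\w*)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)

def to_pattern(template, named=True):
    """Escape literal text; turn each {{tag}} into a lazy (optionally named) group."""
    out, pos = [], 0
    for m in TAG.finditer(template):
        out.append(re.escape(template[pos:m.start()]))
        out.append("(?P<%s>.*?)" % m.group(1) if named else "(?:.*?)")
        pos = m.end()
    out.append(re.escape(template[pos:]))
    return "".join(out)

def parse_with_repeat(template, page):
    """Return tag -> text, with the repeated section as a list of dicts."""
    rep = REPEAT.search(template)
    if rep is None:
        m = re.fullmatch(to_pattern(template), page, re.DOTALL)
        return m.groupdict() if m else None
    name, row = rep.group(1), rep.group(2)
    before, after = template[:rep.start()], template[rep.end():]
    # First pass: capture the whole run of rows as a single block.
    whole = (to_pattern(before)
             + "(?P<%s>(?:%s)*)" % (name, to_pattern(row, named=False))
             + to_pattern(after))
    m = re.fullmatch(whole, page, re.DOTALL)
    if m is None:
        return None
    result = m.groupdict()
    # Second pass: pull the tagged fields out of each row in that block.
    rows = re.finditer(to_pattern(row), m.group(name), re.DOTALL)
    result[name] = [r.groupdict() for r in rows]
    return result

template = "<h1>{{title}}</h1><ul>{{#items}}<li>{{name}}: {{qty}}</li>{{/items}}</ul>"
page = "<h1>Cart</h1><ul><li>apple: 3</li><li>pear: 5</li></ul>"
print(parse_with_repeat(template, page))
```

Matching in two passes keeps the group names inside a row from clashing with each other across repetitions. Nested lazy groups inside a repeated group can backtrack heavily on pages that almost match, so this is a starting point rather than something tuned for large pages.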
Another potential difficulty (though I'm not too concerned about it right
now) is that a page can have a lot of dynamic information, most of which
you don't care about (banner ads, for example). It would be nice to build
the template automatically from multiple instances of the page. It could
then put dummy tags on everything that changes from one instance to the
next, and you would manually change the dummy tags to real tags for the
parts you care about.
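A minimal sketch of that template-building step, in Python and limited to two instances of a page: it splits each page into HTML tags, words, and whitespace, diffs the two token lists with the standard difflib module, and replaces every differing span with a numbered dummy tag. The {{dummyN}} naming, the tokenizer, and the function name infer_template are all assumptions made for this example.

```python
import difflib
import re

# Tokens: a whole HTML tag, a run of non-space text, a run of whitespace,
# or a stray "<" (kept so that joining the tokens reproduces the page).
TOKEN = re.compile(r"<[^>]*>|[^<\s]+|\s+|<")

def infer_template(page_a, page_b):
    """Keep what two page instances share; put a dummy tag where they differ."""
    a, b = TOKEN.findall(page_a), TOKEN.findall(page_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out, n = [], 0
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[a1:a2])            # static text, copied as-is
        else:
            n += 1
            out.append("{{dummy%d}}" % n)   # something changed here
    return "".join(out)

page_a = "<h1>Widget</h1><p>Price: $9.99</p>"
page_b = "<h1>Gadget</h1><p>Price: $12.50</p>"
print(infer_template(page_a, page_b))
# <h1>{{dummy1}}</h1><p>Price: {{dummy2}}</p>
```

Diffing tokens rather than characters keeps a dummy tag from landing in the middle of an HTML tag or a word. Parts that happen to be identical in both instances are treated as static, so comparing more than two instances would give a more reliable template.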
If anyone has done something like this, or thought about it, or would be
interested in using my implementation when I finish, please let me know.
Thanks,
Daniel
-- -- -- -- -- -- -- -- -- -- -- --
Daniel Reeves http://ai.eecs.umich.edu/people/dreeves/
Whenever anyone says, "theoretically", they really mean, "not really".
-- Dave Parnas