HTML::Element
Bill Moseley (moseley@hank.org)
Thu, 09 Nov 2000 09:50:42 -0800
I just read Sean Burke's fine article in The Perl Journal and was
wondering: "Can I use HTML::TreeBuilder and HTML::Element to highlight
phrases in HTML?"
Sorry if this is a bit long.
I have a search engine (http://lii.org) that highlights search keywords or
phrases in search results. Since the search can use stemming (search for
"running" and it finds "run" and "runs" in addition to "running"), I have
to parse the HTML, extract out all the text, convert it into "words" as the
search engine defines them, and then convert the words into their stems. I
then compare those words against the search query.
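A minimal sketch of the stemmed comparison described above, for illustration only. The `%stem_of` table and the `stem` and `phrase_positions` helpers are hypothetical stand-ins; a real search engine would supply its own word definition and stemmer.

```perl
use strict;
use warnings;

# Toy stemmer, purely illustrative: the real engine defines its own stems.
my %stem_of = (running => 'run', runs => 'run', cars => 'car');

sub stem {
    my $word = lc shift;
    return $stem_of{$word} // $word;
}

# Return the 0-based word positions at which the stemmed phrase starts.
sub phrase_positions {
    my ($text, $phrase) = @_;
    my @words  = map { stem($_) } $text   =~ /(\w+)/g;
    my @wanted = map { stem($_) } $phrase =~ /(\w+)/g;
    my @hits;
    return @hits unless @wanted;

    START: for my $start (0 .. @words - @wanted) {
        for my $k (0 .. $#wanted) {
            next START if $words[$start + $k] ne $wanted[$k];
        }
        push @hits, $start;
    }
    return @hits;
}

my @hits = phrase_positions('She runs and he was running', 'run');
print "@hits\n";
```

Because both the document text and the query are reduced to stems before comparison, "runs" and "running" both match a query for "run".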
So, currently I use HTML::Parser to tokenize the HTML into a long list of
records, and along the way extract out another list of just the text that
may need highlighting. This allows me to assign word position to each word
so that I can match phrases. This second list of words has references back
to the original HTML list so when I find a phrase to highlight I set a flag
on the first word to turn on highlighting and a flag on the last word to
turn off highlighting. Then I have another asHTML() routine that prints
out the HTML and inserts highlighting where needed. Does that make sense?
It's a bit too simple, as it generates broken HTML when the highlight spans
link text and non-link text.
For example, source HTML of:
<p>Learn more about low-polluting cars at
<a href="http://honda.com">Honda of America</a>.
And the search phrase might be "cars at honda", so I need to be able to end
up with this:
<p>Learn more about low-polluting
<highlight>cars at</highlight>
<a href="http://honda.com"><highlight>Honda</highlight>
of America</a>.
Currently, in my broken code I end up with
<p>Learn more about low-polluting <highlight>cars at
<a href="http://honda.com">Honda</highlight>
of America</a>.
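One possible way to avoid the overlap, sketched under assumptions: instead of a start flag and a stop flag, every text token carries its own `hl` flag, and the printer closes any open highlight before emitting a tag or unhighlighted text. The token layout and the `render_tokens` name are hypothetical, not taken from the existing code.

```perl
use strict;
use warnings;

# Hypothetical token records: type is 'tag' or 'text'; text tokens that
# belong to a matched phrase have hl set by the phrase matcher.
sub render_tokens {
    my @tokens = @_;
    my $out  = '';
    my $open = 0;    # true while a <highlight> element is open

    for my $tok (@tokens) {
        my $want = $tok->{type} eq 'text' && $tok->{hl};
        # Close before any tag or plain text, so a highlight never spans a tag.
        if ($open && !$want) { $out .= '</highlight>'; $open = 0; }
        if (!$open && $want) { $out .= '<highlight>';  $open = 1; }
        $out .= $tok->{text};
    }
    $out .= '</highlight>' if $open;
    return $out;
}

my @tokens = (
    { type => 'tag',  text => '<p>' },
    { type => 'text', text => 'Learn more about low-polluting ' },
    { type => 'text', text => 'cars at', hl => 1 },
    { type => 'text', text => ' ' },
    { type => 'tag',  text => '<a href="http://honda.com">' },
    { type => 'text', text => 'Honda', hl => 1 },
    { type => 'text', text => ' of America' },
    { type => 'tag',  text => '</a>' },
    { type => 'text', text => '.' },
);

print render_tokens(@tokens), "\n";
```

The cost is one extra comparison per token, since the highlight is closed and reopened around every tag that falls inside a matched phrase.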
I can fix this in my current code but at the expense of a bit of processing
time. But, I'm wondering if the tree structure of HTML::Element might offer
a solution. It might be faster to insert new HTML::Elements to turn on and
off the highlighting. And, with less logic perhaps, I might be able to
avoid spanning <A> tags with highlighting as in the example above.
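A rough sketch of how the tree approach might look, not a tested solution. It assumes the phrase matcher has already reduced a match to the pieces that fall inside individual text segments (here the hypothetical pattern `cars at|Honda`), and it keeps the `<highlight>` placeholder tag from the example. Each text segment in the tree has exactly one parent element, so wrapping pieces segment by segment cannot span an <A> tag.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

my $html = '<p>Learn more about low-polluting cars at '
         . '<a href="http://honda.com">Honda of America</a>.</p>';

# Assumption: the matcher supplies per-segment pieces of the phrase.
my $pieces = qr/cars at|Honda/;

my $tree = HTML::TreeBuilder->new_from_content($html);

# Turn plain text strings into ~text pseudo-elements so look_down finds them.
$tree->objectify_text;

for my $node ($tree->look_down('_tag', '~text')) {
    my $text = $node->attr('text');
    next unless $text =~ $pieces;

    my @new;
    my $i = 0;
    # split with a capture group alternates plain text and matched pieces.
    for my $part (split /($pieces)/, $text) {
        my $is_match = $i++ % 2;
        next unless length $part;
        if ($is_match) {
            my $hl = HTML::Element->new('highlight');
            $hl->push_content($part);
            push @new, $hl;
        }
        else {
            push @new, $part;
        }
    }
    $node->replace_with(@new)->delete;
}

# Convert the untouched ~text pseudo-elements back to plain strings.
$tree->deobjectify_text;

my $out = $tree->as_HTML(undef, undef, {});
print $out, "\n";
$tree->delete;
```

Deciding which segments belong to a phrase still needs the word-position logic; the tree only guarantees that each inserted element stays inside a single parent.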
I'm not experienced with HTML::Element, so I'm looking for comments on whether
this problem could be solved with this module, and if so, some pointers.
Thanks very much,
Bill Moseley
mailto:moseley@hank.org