Re: HTML::Element -- traversing and highlighting in a tree

Bill Moseley (moseley@hank.org)
Fri, 10 Nov 2000 09:59:31 -0800


At 05:48 PM 11/09/00 -0700, Sean M. Burke wrote:
>sub highlighting {
>  # Visit every node in the given tree, applying highlighting
>  # to things matching the RE

Hi Sean,

Your code gives me some ideas about how to traverse the tree and insert
the elements I need.

I don't think I can use a single regular expression to match the words in
a phrase.  I need to check each word one by one, making sure that all the
words match before deciding that this is a phrase to highlight.  And the
words may be in (span) different HTML::Elements.  For example, try
highlighting "driving a nitro" in the example code as a phrase, that is,
as \b(driving a nitro)\b, and not just as three individual words as in
\b(driving|a|nitro)\b.
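
To illustrate the point above, here is a minimal sketch in plain Perl.  It
assumes the text has already been pulled out of the tree into a list of
segments; the @segments data and the find_phrase() helper are hypothetical
and are not part of HTML::Element.  The phrase regex fails on every
individual segment, while a word-by-word check over the flattened word
list finds the phrase.

```perl
use strict;
use warnings;

# Hypothetical text segments, as they might come from separate text
# nodes in the tree (for example, "nitro" sitting inside a <b>).
my @segments = ('He was driving a ', 'nitro', ' truck.');

# The phrase regex, tried on one segment at a time, never matches.
my $re = qr/\b(driving a nitro)\b/i;
my $per_segment_hits = grep { /$re/ } @segments;

# Flatten the segments into one word list, remembering the segment
# each word came from.
my @words;
for my $i (0 .. $#segments) {
    while ($segments[$i] =~ /(\w+)/g) {
        push @words, { word => lc($1), seg => $i };
    }
}

# Return the index of the first word of every occurrence of the phrase.
sub find_phrase {
    my ($words, @phrase) = @_;
    return unless @phrase;
    my @hits;
    START: for my $start (0 .. @$words - @phrase) {
        for my $k (0 .. $#phrase) {
            next START if $words->[$start + $k]{word} ne lc($phrase[$k]);
        }
        push @hits, $start;
    }
    return @hits;
}

my @hits = find_phrase(\@words, qw(driving a nitro));
print "per-segment regex hits: $per_segment_hits\n";
print "word-list hits start at word(s): @hits\n";
```

Because each word record keeps its segment index, a hit can later be
traced back to the pieces of the tree it came from.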

And once I find a phrase to highlight I want to start highlighting at the
first word and turn it off after the last word -- so white space and
punctuation in between are also highlighted.  I don't want to just
highlight the words in the phrase, but the entire phrase.
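
A rough sketch of that behavior on a single plain string, ignoring tags
entirely.  The highlight_phrase() helper and the <b> markers are
assumptions made for illustration.  The text is split into word and
non-word tokens, the phrase is matched against the word tokens only, and
the markers are placed once around the whole span so that the punctuation
in between is covered too.

```perl
use strict;
use warnings;

# Wrap the first occurrence of @phrase in <b>...</b>, including any
# white space and punctuation between its words.
sub highlight_phrase {
    my ($text, @phrase) = @_;
    return $text unless @phrase;

    # The capturing group makes split keep the separators as tokens.
    my @tokens   = grep { length } split /(\W+)/, $text;
    my @word_idx = grep { $tokens[$_] =~ /\w/ } 0 .. $#tokens;

    for my $s (0 .. @word_idx - @phrase) {
        my @span = @word_idx[$s .. $s + $#phrase];
        next if grep { lc($tokens[$span[$_]]) ne lc($phrase[$_]) } 0 .. $#phrase;

        my ($first, $last) = @span[0, -1];
        splice @tokens, $last + 1, 0, '</b>';   # insert the end marker first
        splice @tokens, $first, 0, '<b>';       # so $first is still valid
        last;
    }
    return join '', @tokens;
}

my $out = highlight_phrase('He was driving, a nitro-powered car.',
                           qw(driving a nitro));
print "$out\n";
```

Here the comma after "driving" falls inside the highlighted span even
though it is not part of any word in the phrase.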

Again, I have that much working in my model, but it breaks the HTML when
the phrase is partly inside the text of an <a> tag and partly outside.  It
probably breaks in lots of other cases where there are tags within the
phrase.

Seems like I could subclass HTML::Element so that the text segment is
really a list of words.  Then write a routine that extracts the words
into a list, assigns word positions, and provides references back to the
Elements.  Really, it's harder than that in my case, as I need to parse
the text and divide it up into words and non-words as the search engine
defines them.
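
One way such a word list might look, sketched in plain Perl without
subclassing anything.  Everything here is an assumption for illustration:
the $WORD_CHARS definition stands in for whatever the search engine really
uses, the segment hashes stand in for text held by HTML::Element objects,
and index_words() is a hypothetical helper.

```perl
use strict;
use warnings;

# Assumption: the search engine treats letters, digits and apostrophes
# as word characters.  Substitute the engine's real definition.
my $WORD_CHARS = qr/[A-Za-z0-9']+/;

# Hypothetical stand-ins for text segments; "owner" plays the role of
# the element that holds the text.
my @segments = (
    { owner => 'p', text => "Don't try " },
    { owner => 'a', text => 'driving a' },
    { owner => 'p', text => ' nitro car.' },
);

# Build one record per word: its position in the document, a reference
# back to its segment, and where it sits inside that segment's text.
sub index_words {
    my ($word_re, @segs) = @_;
    my @index;
    my $pos = 0;
    for my $seg (@segs) {
        while ($seg->{text} =~ /($word_re)/g) {
            push @index, {
                word   => lc($1),
                pos    => $pos++,       # word position across all segments
                seg    => $seg,         # reference back to the owner
                offset => $-[1],        # start of the word in the text
                length => length($1),
            };
        }
    }
    return @index;
}

my @index = index_words($WORD_CHARS, @segments);
printf "%d %-8s owner=%s offset=%d\n",
    $_->{pos}, $_->{word}, $_->{seg}{owner}, $_->{offset} for @index;
```

Anything the regex skips counts as a non-word, and the stored offset and
length are enough to find each word again in its original text.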

Then, finally, the trick is to find a phrase and then go back to the
HTML::Elements and insert start and end highlighting codes only at the
start and end of the phrase.  But the key is that I'd need to traverse
along the phrase and check the parent of each word, and if the parent
changes then I need to insert highlighting-off and highlighting-on codes
at the change in parents.  Whew.
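
The parent-change rule can be sketched as follows, again in plain Perl
with hypothetical data.  It assumes a phrase hit has already been found
and is described by its first and last word (segment index, offset,
length); the highlight_across() helper and the <b> markers are made up for
this example.  Highlighting is switched off at the end of each text
segment and back on at the start of the next, so no marker pair ever
straddles a tag boundary.

```perl
use strict;
use warnings;

# Hypothetical segments in document order; "owner" stands in for the
# parent element of each piece of text.
my @segments = (
    { owner => 'p', text => 'He was driving ' },
    { owner => 'a', text => 'a nitro' },
    { owner => 'p', text => ' car.' },
);

# Return the segment texts with highlighting applied from the first
# word to the last word of a hit, closed and reopened per segment.
sub highlight_across {
    my ($segs, $first, $last) = @_;
    my ($on, $off) = ('<b>', '</b>');
    my @out = map { $_->{text} } @$segs;

    for my $i ($first->{seg} .. $last->{seg}) {
        my $text = $segs->[$i]{text};
        # Start at the first word in the first segment, else at 0.
        my $from = $i == $first->{seg} ? $first->{offset} : 0;
        # Stop after the last word in the last segment, else at the end.
        my $to   = $i == $last->{seg}
                 ? $last->{offset} + $last->{length}
                 : length($text);
        $out[$i] = substr($text, 0, $from)
                 . $on . substr($text, $from, $to - $from) . $off
                 . substr($text, $to);
    }
    return @out;
}

# "driving a nitro": starts at offset 7 of segment 0 and ends with
# the five-letter word at offset 2 of segment 1.
my @marked = highlight_across(
    \@segments,
    { seg => 0, offset => 7 },
    { seg => 1, offset => 2, length => 5 },
);
print "[$segments[$_]{owner}] $marked[$_]\n" for 0 .. $#marked;
```

Each marked-up string would then have to be written back into its own
place in the tree, which is the part that depends on the real
HTML::Element structure.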



Bill Moseley
mailto:moseley@hank.org