Re: HTML::Parser removing &nbsp

Gisle Aas (gisle@activestate.com)
11 Jan 2001 10:34:19 -0800


John Aughey <ajh4@cec.wustl.edu> writes:

> I'm using HTML::Parser to process an HTML file to re-write URL's and such.
> I've discovered that it seems to be changing &nbsp to a space character
> instead of passing the actual "&nbsp" text.  It also appears to be
> escaping non-printable characters too.

So what do you actually do?  Can you show some code?

It is true that the hash passed to you with the 'attr' argspec
automatically has entities (like &nbsp;) decoded in its attribute
values.  I don't understand what you mean by "escaping non-printable
characters" though.

> Can I turn this feature off?

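You can ask for 'tokens' or perhaps even 'tokenpos' in argspec instead
of 'attr'.  'tokens' gives you the strings exactly as they appear in
the source, with no entity decoding, and 'tokenpos' gives you their
offsets and lengths within the original text.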
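
Here is a rough sketch of the difference, in case it helps.  The
sample <img> tag and the %decoded and %raw hashes are made up for
illustration; only the 'attr' and 'tokens' argspecs come from the
module itself.  It assumes the version 3 API.

```perl
use strict;
use warnings;
use HTML::Parser ();

my (%decoded, %raw);    # illustrative names, not part of the module

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($attr, $tokens) = @_;
            %decoded = %$attr;    # values already have entities decoded

            # $tokens is [tagname, name1, value1, name2, value2, ...]
            my (undef, @pairs) = @$tokens;
            while (my ($name, $value) = splice(@pairs, 0, 2)) {
                # token values are the source text, so drop any quotes
                $value =~ s/^(["'])(.*)\1\z/$2/s;
                $raw{lc $name} = $value;
            }
        },
        'attr, tokens',
    ],
);

$p->parse('<img src="x.png" alt="a&nbsp;b">');
$p->eof;

print "tokens: $raw{alt}\n";
# show the decoded value as character codes so the non-breaking
# space is visible
print "attr:   ", join(' ', map { ord } split //, $decoded{alt}), "\n";
```

The 'tokens' value should still contain the literal "&nbsp;" text,
while the 'attr' value should contain the decoded character (code 160)
in its place.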

>                   And if I cannot, what would be the best way
> to parse the HTML so I can re-write selected tags.

The HTML::Parser distribution comes with two example scripts
('eg/hrefsub' and 'eg/hstrip') that do this.  You might look at them
for ideas for approaches you can take.
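
One possible approach, as a minimal sketch and not a copy of either
script: pass everything through as the original 'text', and only patch
the attribute value you care about using the offsets from 'tokenpos'.
The rewrite_url() and rewrite_html() functions and the example.com
host names are hypothetical, chosen just for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical rewrite rule, for illustration only.
sub rewrite_url {
    my ($url) = @_;
    $url =~ s{^http://old\.example\.com/}{http://new.example.com/};
    return $url;
}

# Hypothetical wrapper: returns the document with <a href> values rewritten.
sub rewrite_html {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(api_version => 3);

    # Events without a handler of their own are copied through untouched.
    $p->handler(default => sub { $out .= $_[0] }, 'text');

    $p->handler(start => sub {
        my ($tagname, $pos, $text) = @_;
        if ($tagname eq 'a') {
            my @pos = @$pos;
            # Take attributes right to left so earlier offsets stay valid.
            while (@pos >= 4) {
                my ($k_off, $k_len, $v_off, $v_len) = splice(@pos, -4);
                next unless lc(substr($text, $k_off, $k_len)) eq 'href';
                next unless $v_off;    # attribute without a value
                my $v = substr($text, $v_off, $v_len);
                # remember the quote character, if any, and strip it
                my $quote = $v =~ s/^(["'])(.*)\1\z/$2/s ? $1 : '';
                substr($text, $v_off, $v_len)
                    = $quote . rewrite_url($v) . $quote;
            }
        }
        $out .= $text;
    }, 'tagname, tokenpos, text');

    $p->parse($html);
    $p->eof;
    return $out;
}
```

Since nothing is ever decoded and re-encoded, entities such as
"&nbsp;" in text and in untouched attributes should come out exactly
as they went in.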

Regards,
Gisle