Re: Changes to HTML-Parser callbacks interface

Michael A. Chase (mchase@ix.netcom.com)
Thu, 25 Nov 1999 07:44:19 -0800


Comments below.

I had been planning to include some questions and comments about marked
sections and parameter entities with my next patch, but I'd better get them
out the door while you are considering the changed interface.
--
Mac :})
----- Original Message -----
From: Gisle Aas <gisle@aas.no>
To: Michael A. Chase <mchase@ix.netcom.com>
Cc: libwww <libwww@perl.org>
Sent: 11/25/99 5:21
Subject: Re: [PATCH]HTML::Parser-XS-2.9913_mac-1

> BTW, I have decided that I want to modify the new callback interface.
> The main new thing is that you should always tell the parser what
> arguments you want the parser to pass to the callback handlers.
>
> Example:
>
>   $p->callback(start => "self,tagname,line", sub { ... });

Does this mean new() can't be used to define callbacks?  That actually may
not be a bad idea since they do complicate the explanation and code.  On the
other hand, being able to set everything up in the constructor is handy,
too.

How about
    $p->callback(start => [qw(self tagname line), sub { ... }]);

Any keywords found would set options or enable parameters to the callback, a
coderef would be saved as the callback, and an arrayref would be saved as
the accumulator array.  Keywords or references could appear in any order,
but any values passed to the coderef or accumulator array would always be in
the same order when they are present.
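
To make the idea concrete, here is a rough sketch in plain Perl of how such a
mixed list could be sorted out.  Nothing here is HTML::Parser code:
parse_callback_spec(), build_args() and the contents of @ARG_ORDER are names
and choices I made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Parser: split a mixed spec list
# into argument keywords, a callback coderef, and an accumulator arrayref.
sub parse_callback_spec {
    my ($spec) = @_;
    my (%want, $code, $accum);
    for my $item (@$spec) {
        if    (ref $item eq 'CODE')  { $code  = $item }
        elsif (ref $item eq 'ARRAY') { $accum = $item }
        elsif (!ref $item)           { $want{$item} = 1 }
        else  { die "unexpected reference in callback spec: " . ref($item) . "\n" }
    }
    return (\%want, $code, $accum);
}

# Fixed order in which values are handed over, whatever order the
# keywords were written in (this particular order is an assumption).
my @ARG_ORDER = qw(self tagname origtext line);

sub build_args {
    my ($want, $values) = @_;
    # Keep only the requested keywords, in the fixed order.
    return map { $values->{$_} } grep { $want->{$_} } @ARG_ORDER;
}

# The coderef comes first and the keywords are "out of order" on purpose.
my ($want, $code) = parse_callback_spec([ sub { join ",", @_ }, qw(line tagname) ]);
print $code->(build_args($want, { tagname => "a", line => 7, origtext => "<a>" })), "\n";
```

The callback above receives the tag name before the line number even though
the spec listed them the other way round, which is the "always in the same
order" behaviour I had in mind.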

> This would set up a callback for start tags and tell the parser that
> it should pass $p, the name of the tag and the line number where the
> tag starts to the subroutine given.
>
> The argspec allows me to get rid of several of the new parser options
> (decode_text_entities, v2_compat, pass_self, attr_pos) and allow
> further optimizations as we don't have to build the stuff to
> represents arguments that the parser user don't need.  The interface
> also become much easier extensible.

I don't see this as getting rid of the options, more like making them
specific to each callback.

> The things that I think can go into argspec are:
>
>   self
>   tagname (element_name, gi)
>   origtext
>   decodedtext
>   cdata_flag
>   attr_arrayref
>   attrpos_arrayref
>   attr_hashref
>   attrpos_hashref
>   tokens_arrayref
>   tokens  # separate arguments (tagname @$attr_arrayref)
>   charpos
>   line
>
> Does anybody have some other ideas of how the argspec interface might
> look?  An array?  Just peek at the prototype of the callback function?

Just some more ideas:
   v2_compat: qw( self tagname attr_arrayref attr_hashref origtext ),
      sub { shift->start(@_) }
   keep_tag: don't force tag names to lowercase
   keep_attr: don't force attribute names to lowercase
   keep: qw( keep_tag keep_attr )
   no_decode_value: don't decode attribute values

If no keywords are given, it should be equivalent to qw( tagname
tokens_arrayref origtext ) or whatever is finally agreed on.
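
A small sketch of how shorthand keywords and the default could be expanded.
The alias table, the default list and expand_keywords() are only my
illustration of the suggestion above, not existing parser code, and the
v2_compat entry covers just its keyword half (not the implied sub).

```perl
use strict;
use warnings;

# Hypothetical alias table: a shorthand keyword stands for several others.
my %ALIAS = (
    keep      => [qw(keep_tag keep_attr)],
    v2_compat => [qw(self tagname attr_arrayref attr_hashref origtext)],
);

# Assumed default when no keywords are given at all.
my @DEFAULT = qw(tagname tokens_arrayref origtext);

sub expand_keywords {
    my @in = @_ ? @_ : @DEFAULT;
    my (@out, %seen);
    while (@in) {
        my $kw = shift @in;
        # Replace an alias by its members and look at those next.
        if (my $alias = $ALIAS{$kw}) { unshift @in, @$alias; next }
        push @out, $kw unless $seen{$kw}++;   # drop duplicates
    }
    return @out;
}

print join(" ", expand_keywords(qw(keep line))), "\n";
print join(" ", expand_keywords()), "\n";
```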

> This stuff probably also mean that the $p->accum() stuff should go.
> One idea would be to allow a array ref as the third $p->callback
> argument and then push stuff instead of doing a call.

You wouldn't have to abandon $p->accum, it would just take the same options
as $p->callback.  Or it could be invoked if an arrayref is one of the
elements provided to $p->callback (not just the third):

   $p->callback( declaration => [ qw(tagname tokens_arrayref), \@accum ] );

This would allow some elements to be handled by callbacks and others by the
array.
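
Roughly what I mean by mixing the two styles, as a standalone sketch.
deliver() and the %handlers layout are hypothetical and only stand in for
whatever the parser would do internally.

```perl
use strict;
use warnings;

# Hypothetical dispatcher: each event type maps to either a coderef
# (called with the values) or an arrayref (the values are pushed onto it).
sub deliver {
    my ($handlers, $event, @values) = @_;
    my $h = $handlers->{$event} or return;   # no handler: ignore the event
    if (ref $h eq 'CODE') { $h->(@values) }
    else                  { push @$h, [ $event, @values ] }
    return;
}

my (@accum, @seen);
my %handlers = (
    declaration => \@accum,                      # accumulate declarations
    start       => sub { push @seen, $_[0] },    # call back for start tags
);

deliver(\%handlers, start       => "a");
deliver(\%handlers, declaration => "doctype");
deliver(\%handlers, end         => "a");         # no handler, dropped
print scalar(@accum), " accumulated, ", scalar(@seen), " called back\n";
```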

==== Marked Sections support

Do you have an estimate of how much the penalty is if marked sections are
compiled in?  I suspect that at a lot of sites the system administrators
won't compile it in, and users will then find they need it.  Since the
penalty is likely to be low, I think it would be better to always include
marked section support.

==== Parameter Entities

It also looks like marked sections won't be as useful as they might be
unless you can decode parameter entities.  That would also require you to
either parse parameter entity declarations (<!ENTITY % ...>) or provide a
means for assigning values to parameter entities or both.  I was thinking it
might be useful to add a couple more callbacks:

   unknown_parameter:  Called when a parameter entity (%param;) is
encountered whose value is not known.  The callback would return the
parameter entity's value.

   parameter_entity:  Called when an <!ENTITY % ...> element is recognized
prior to any other processing so the program can determine the value of the
parameter entity.  The default action would save the parameter name and
value for future use.  A user provided callback would normally call
$p->parameter_entity() to save the parameter entity name and value, but it
could accommodate more complex situations.

This would require an access method $p->parameter_entity( $key, $value )
which could be used to set and get the parameter entity names and values.
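
Here is a minimal sketch of how the accessor and the unknown_parameter
callback could fit together.  The ParamEntities package is a stand-in for the
parser object, resolve() is a name I invented, and caching the callback's
answer is my own assumption; none of this exists in HTML::Parser.

```perl
use strict;
use warnings;

package ParamEntities;   # hypothetical stand-in for the parser object

sub new {
    my ($class, %opt) = @_;
    return bless { pe => {}, unknown => $opt{unknown_parameter} }, $class;
}

# Combined accessor: stores the value if one is given, and always
# returns the current value (undef if the entity is not known).
sub parameter_entity {
    my ($self, $key, @value) = @_;
    $self->{pe}{$key} = $value[0] if @value;
    return $self->{pe}{$key};
}

# Look up %name; and fall back to the unknown_parameter callback.
sub resolve {
    my ($self, $key) = @_;
    my $val = $self->parameter_entity($key);
    if (!defined $val && $self->{unknown}) {
        $val = $self->{unknown}->($key);
        # Assumption: remember the answer so the callback runs only once.
        $self->parameter_entity($key, $val) if defined $val;
    }
    return $val;
}

package main;

my $p = ParamEntities->new(
    unknown_parameter => sub { $_[0] eq 'draft' ? 'IGNORE' : undef },
);
$p->parameter_entity(final => 'INCLUDE');
print join(",", $p->resolve('final'), $p->resolve('draft')), "\n";
```

A marked section such as <![ %draft; [ ... ]]> could then be kept or skipped
depending on what resolve('draft') returns.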