Re: Changes to HTML-Parser callbacks interface

Gisle Aas (gisle@aas.no)
25 Nov 1999 20:47:50 +0100


"Michael A. Chase" <mchase@ix.netcom.com> writes:

> > Example:
> >
> >   $p->callback(start => "self,tagname,line", sub { ... });
> 
> Does this mean new() can't be used to define callbacks?  That actually may
> not be a bad idea since they do complicate the explanation and code.  On the
> other hand, being able to set everything up in the constructor is handy,
> too.

I agree.  We also need to have some constructor arguments in order to
not fall back into v2 compatibility mode.

 HTML::Parser->new(start_cb => ["self,tagname,line", sub { ... }]);

> How about
>     $p->callback(start => [qw(self tagname line), sub { ... }]);

Something like this will work.  I still like the stringified attrspec
better.  Wrapping it up in an array also makes it easier to explain the
$p->callback return value.

> Any keywords found would set options or enable parameters to the callback, a
> coderef would be saved as the callback, and an arrayref would be saved as
> the accumulator array.  Keywords or references could appear in any order,
> but any values
> passed to the coderef or accumulator array would always be in the same order
> when they are present.

Hmm.  Perhaps we need some more syntax to make it clear what is what.
One possibility is to introduce a ":" and let everything before it be
arguments and everything after be callback specific options:

   "tagname,attr: keep_case"
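
To make the ":" idea concrete, here is a minimal plain-Perl sketch of how such a spec string could be split.  The helper name parse_argspec, the returned hash layout, and the choice of whitespace as the option separator are all assumptions for illustration; none of this is existing HTML::Parser code.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Hypothetical helper: split "arg,arg: option option" into two lists.
# Everything before the first ":" is the argument list; everything after
# it is the (optional) list of callback-specific options.
sub parse_argspec {
    my ($spec) = @_;
    my ($args, $opts) = split /:/, $spec, 2;
    $args = '' unless defined $args;
    $opts = '' unless defined $opts;

    my @args = grep { length } map { trim($_) } split /,/, $args;
    my @opts = grep { length } split ' ', $opts;   # assumption: space separated

    return { args => \@args, options => \@opts };
}

my $parsed = parse_argspec("tagname,attr: keep_case");
print "args:    @{ $parsed->{args} }\n";
print "options: @{ $parsed->{options} }\n";
```

A spec with no ":" simply yields an empty option list, so the plain "self,tagname,line" form would keep working unchanged under this scheme.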


> Just some more ideas:
>    v2_compat: qw( self tagname attr_arrayref attr_hashref origtext ),

My idea was that the code in &HTML::Parser::new that sets up the
compatibility callbacks should just ask for exactly those arguments
that used to be passed to the old method callbacks.  Is this different,
or are you suggesting that something like "v2_compat" be recognized
directly?

We could probably also special-case "self" as the first argument and
make a direct method call from XS.  The code-ref argument should then
be just a plain string (the method name) if you want method resolution
to take place.
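
The difference between a code ref and a plain method name can be sketched in pure Perl, without any XS.  The package MyParser and the invoke helper below are hypothetical names used only for this illustration; the sketch assumes the first delivered argument is the parser object ("self").

```perl
use strict;
use warnings;

{
    # Hypothetical parser-like class with a "start" method.
    package MyParser;
    sub new   { my $class = shift; return bless { log => [] }, $class }
    sub start {
        my ($self, $tagname) = @_;
        push @{ $self->{log} }, "start $tagname";
    }
}

# Hypothetical dispatcher: a code ref is called directly; a plain string
# is treated as a method name and resolved on the first argument, so
# subclasses can override it in the usual way.
sub invoke {
    my ($cb, @args) = @_;
    return $cb->(@args) if ref $cb eq 'CODE';
    my $self = shift @args;
    return $self->$cb(@args);
}

my $p = MyParser->new;
invoke("start", $p, "title");                       # method resolution
invoke(sub { my ($self, $tag) = @_;
             push @{ $self->{log} }, "code $tag" }, $p, "body");
print "$_\n" for @{ $p->{log} };
```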

>    keep_tag: don't force tag names to lowercase
>    keep_attr: don't force attribute names to lowercase
>    keep: qw( keep_tag keep_attr )

Perhaps?  Can you think of any reason anybody would want one and not
the other?

>    no_decode_value: don't decode attribute values

This has merit.  Perhaps it should also keep the quotes, if there are
any, in this case.  Or should that be a separate option?

> If no keywords are given, it should be equivalent to qw( tagname
> tokens_arrayref origtext ) or whatever is finally agreed on.

I originally wanted to make empty callbacks the default.  Otherwise we
would need different default rules for the different callback types, I
think, i.e. more confusing documentation.

> > This stuff probably also means that the $p->accum() stuff should go.
> > One idea would be to allow an array ref as the third $p->callback
> > argument and then push stuff instead of doing a call.
> 
> You wouldn't have to abandon $p->accum, it would just take the same options
> as $p->callback.  Or it could be invoked if an arrayref is one of the
> elements provided to $p->callback (not just the third):
> 
>    $p->callback( declaration => [ qw(tagname tokens_arrayref), \@accum ] );
> 
> This would allow some elements to be handled by callbacks and others by the
> array.

This is the way I think we will go.  It should probably also be
possible to put literals into argspec so we could easily add those
"S", "E", "C" strings that we used to add to accum before.

  $p->callback( start => q("S",tagname,attr_hash,origtext), \@accum);

This is probably also handy if you want to use the same callback
procedure to handle multiple types of markup:

  $p->callback( start => q("start",tagname), \&handler);
  $p->callback( end   => q("end",tagname),   \&handler);
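
A rough plain-Perl sketch of how literals in the argspec and the array-ref target could behave together.  The helpers build_args and deliver, and the event hash passed to them, are hypothetical stand-ins for what the parser would do internally; pushing one array ref per event is an assumption modelled on the old accum behaviour.

```perl
use strict;
use warnings;

# Hypothetical: expand an argspec into the values for one event.
# Items in double quotes are literals; anything else names a field of
# the event.  (Literals containing "," are not handled in this sketch.)
sub build_args {
    my ($argspec, $event) = @_;
    my @out;
    for my $item (split /,/, $argspec) {
        if ($item =~ /^"(.*)"$/) {
            push @out, $1;                 # literal, passed through as-is
        }
        else {
            push @out, $event->{$item};    # named value from the event
        }
    }
    return @out;
}

# Hypothetical: push onto an array ref, or call a code ref.
sub deliver {
    my ($argspec, $target, $event) = @_;
    my @args = build_args($argspec, $event);
    if (ref $target eq 'ARRAY') {
        push @$target, [@args];
    }
    else {
        $target->(@args);
    }
}

my (@accum, @seen);
my $handler = sub { push @seen, join ":", @_ };

deliver(q("S",tagname),     \@accum,  { tagname => "a" });
deliver(q("start",tagname), $handler, { tagname => "a" });
deliver(q("end",tagname),   $handler, { tagname => "a" });

print "accum: @{ $accum[0] }\n";
print "seen:  @seen\n";
```

With the literal as the first argument, one handler can tell the event types apart without any extra state.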


> ==== Marked Sections support
> 
> Do you have an estimate of how much the penalty is if marked sections are
> compiled in?

No.

> I suspect that at a lot of sites, it won't be included by the
> system administrators and then will be needed by users.  Since the penalty
> is likely to be low, I think it would be better to always include marked
> section support.

I think I will do that.  The reason I made it a compile-time option is
that I did not think it would be of use to many, since few browsers
support it properly, but it is nice to be able to say that we actually
do support it.

> ==== Parameter Entities
> 
> It also looks like marked sections won't be as useful as they might be
> unless you can decode parameter entities.

<![CDATA[...]]> might be useful anyway.  Parameter entities are kind of
complex, I think.  They will at least not be in the 3.00 release if I
have to implement them.

>  That would also require you to
> either parse parameter entity declarations (<!ENTITY % ...>) or provide a
> means for assigning values to parameter entities or both.  I was thinking it
> might be useful to add a couple more callbacks:
> 
>    unknown_parameter:  Called when a parameter entity (%param;) is
> encountered whose value is not known.  The callback would return the
> parameter entity's value.
> 
>    parameter_entity:  Called when an <!ENTITY % ...> element is recognized
> prior to any other processing so the program can determine the value of the
> parameter entity.  The default action would save the parameter name and
> value for future use.  A user provided callback would normally call
> $p->parameter_entity() to save the parameter entity name and value, but it could
> accommodate more complex situations.
> 
> This would require an access method $p->parameter_entity( $key, $value )
> which could be used to set and get the parameter entity names and values.

Not for now I think.

Regards,
Gisle