Re: Parsing user-defined tags for customized use.....

WWW projekt (wwwproj@dna.lth.se)
Tue, 06 Aug 1996 09:29:43 +0200


Moshiul Shovon wrote:
> I am not sure what I should use: HTML::Parse or HTML::Parser to parse
> documents. The man page says I should use Parser. Can any one please
> tell me where I can look at some examples? I need some pointers. Man
> pages alone don't work for me, I need examples (beyond synopsis of the
> man page!).

> ### I am stuck here ....
> ### How do I Parse $res and find out the value (content) of MyTag1 and
> MyTag2?
> ###

Use HTML::TreeBuilder, like this:

#!/usr/common/bin/perl
use HTML::TreeBuilder;

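# Sample input. The markup of this here-document was lost in the archive;
# the tags below are an assumed reconstruction based on the output shown later.
$doc = <<'END';
<title>Test 1</title>
<MyTag1>This is Tag 1</MyTag1>
<MyTag2>This is Tag 2</MyTag2>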
END

$tb = new HTML::TreeBuilder;

$tb->ignore_unknown(0);          # Do not ignore unknown tags
$tb->parse($doc);
print $tb->as_HTML;
$tb->traverse(
   sub
   {
       my($node, $start, $depth) = @_;

       return 1 unless $start;    # Do nothing on end-tags
       if (ref $node and $node->tag =~ /^mytag/) {
           print $node->tag, " = \n";
           foreach (@{$node->content}) {
               print "\t$_\n";
           }
       }
       return 1;                  # traverse stops on a false return value
    }, 0);                        # 0 = text segments are visited too (hence the ref check)

This gives the output:

Test 1 This is Tag 1 This is Tag 2
mytag1 =
	This is Tag 1
mytag2 =
	This is Tag 2

Not very well formatted HTML code, but the tags are extracted.

TreeBuilder works fine with tags that have an end-tag, but to make it work
with simple tags, you will have to take TreeBuilder out of the 'implicit
tags mode' (see documentation), and then some weird things can happen,
such as tags being ended by an end-tag in bad places.

I made a subclass of TreeBuilder, changed some hardcoded things in
TreeBuilder and Element, and made a class that recognizes my own language.
This might be a good idea if you are going to use it a lot.

Should MyTagX be a variable name, and this be the way of setting the
variable? I have done the same thing with tags corresponding to INPUT and
TEXTAREA, the latter wrapping a multi-line value:

Here's my value
It's multiline.

LETTEXT does the same thing as your tags, but is easier to parse because
it always has the same tag name.

I hope this helps. Mail me if you would like to have a look at my classes.

---
Stefan Eriksson, Lund university, Sweden
wwwproj@dna.lth.se, dat93ser@ludat.lth.se
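
If the MyTagX tags are meant to set variables, the values could be collected into a hash keyed by tag name. What follows is only a minimal sketch, not the subclass mentioned above. It assumes an HTML::TreeBuilder release that provides look_down, as_text and eof. The sample document and the helper name extract_mytags are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input; the tag names follow the question above.
my $doc = <<'END';
<html><head><title>Test 1</title></head>
<body>
<mytag1>This is Tag 1</mytag1>
<mytag2>This is Tag 2</mytag2>
</body></html>
END

# extract_mytags is an illustrative helper, not part of any module.
# It returns "tag name => text content" pairs for every element
# whose tag name starts with "mytag".
sub extract_mytags {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_unknown(0);      # keep unknown tags in the tree
    $tree->parse($html);
    $tree->eof;                    # signal end of input

    my %vars;
    for my $node ($tree->look_down(_tag => qr/^mytag/)) {
        $vars{ $node->tag } = $node->as_text;
    }

    $tree->delete;                 # release the tree explicitly
    return %vars;
}

my %vars = extract_mytags($doc);
print "$_ = $vars{$_}\n" for sort keys %vars;
```

Tag names come back lowercased, as in the output above, so the hash keys are mytag1 and mytag2 whatever case the document uses. If the same tag occurs twice, the later value overwrites the earlier one in this sketch.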