Boris' parsing woes ..

JP May (jpm@rootworks.com)
Sun, 12 Apr 1998 22:23:22 -0600


Boris ... you could easily parse it using regular expressions in Perl.
Would that do it for you?

Email me offlist if you need help creating the munge in question (or try
http://rootworks/rx). You did not make it exactly clear what text you are
trying to get at.

Actually the following would probably work for you as a cheap solution:

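# /s lets . match across newlines; /i also accepts <BODY ...>
if ($all_the_text =~ /<body[^>]*>(.*)<\/body>/is) {
    $what_boris_wants = $1;
    $what_boris_wants_sans_any_tags = $what_boris_wants;
    # [^>]* (unlike .*? without /s) also strips tags that span lines
    $what_boris_wants_sans_any_tags =~ s/<[^>]*>//g;
}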

(HTML is such a simple language that it's often unnecessary to build a
whole parse tree when you can easily get what you want heuristically.)
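
For what it's worth, here is a slightly fuller sketch of the same cheap
approach, run against a trimmed copy of the page quoted below. The helper
name body_text is made up for illustration, and the whitespace cleanup is my
own addition. It is a heuristic, not a parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the tag-free text found between <body> and
# </body>, or undef when no such pair exists.
sub body_text {
    my ($html) = @_;

    # /s lets "." match newlines; /i accepts <BODY> as well as <body>.
    # [^>]* tolerates attributes on the opening tag.
    return unless $html =~ /<body[^>]*>(.*)<\/body>/is;
    my $text = $1;

    $text =~ s/<[^>]*>//g;   # drop anything that looks like a tag
    $text =~ s/\s+/ /g;      # collapse runs of whitespace
    $text =~ s/^ //;         # trim a leading space
    $text =~ s/ $//;         # trim a trailing space
    return $text;
}

# Trimmed copy of the frameset page from the quoted message.
my $page = <<'HTML';
<html>
<frameset cols="16%,48%">
    <frame src="menu.html" name="menu">
    <noframes>
    <body>
    <p>This web page uses frames, but your browser doesn't support
them.</p>
    </body>
    </noframes>
</frameset>
</html>
HTML

my $text = body_text($page);
print defined $text ? "$text\n" : "no body found\n";
```

This will be fooled by a ">" inside an attribute value or by tags hidden in
comments, which is the price of skipping a real parser.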

Also, FWIW, I think the HTML below is WRONG.  You need to have the
</frameset> BEFORE the <noframes> tag, NOT enclosing the body pair.

>I've asked earlier about parsing out link text...  To be sure, the example
>that Gisle quotes (for the reasons below, I suppose) does not parse this
>(real-world and, I believe, perfectly valid) page:
>
>--------------
>
><html>
>
><head>
><title>Elli Angelopoulou's home page</title>
></head>
>
><frameset cols="16%,48%">
>    <frame src="menu.html" name="menu"
>    marginwidth="10" marginheight="5">
>    <frame src="contact.html" name="main"
>    marginwidth="25" marginheight="5">
>    <noframes>
>    <body>
>    <p>This web page uses frames, but your browser doesn't support
>them.</p>
>    </body>
>    </noframes>
></frameset>
></html>
>
>--------------
>
>LinkExtor will parse the links if the tag that I specify is not "a" but
>"frame", but it cannot parse out the text (it throws it away I suppose).
>Any suggestions?
>
>Boris
>

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rootworks                                                2400