I found a script written by Tom Christiansen called striphtml. Unfortunately, I get an error when I run it, and I was wondering whether any regular expression folks out there could help me out.

When I run 'cat htmlfile | perl striphtml', I get "regexp *+ operand could be empty at striphtml line 66".

htmlfile contains the following:

<!-- This is a comment -->
Now is the time for all good men to...

The snippet of code in question is:
55: s{ < # opening angle bracket
56:
57: (?: # Non-backreffing grouping paren
58: [^>'"] * # 0 or more things that are neither > nor ' nor "
59: | # or else
60: ".*?" # a section between double quotes (stingy match)
61: | # or else
62: '.*?' # a section between single quotes (stingy match)
63: ) + # repetire ad libitum
64: # hm.... are null tags <> legal? XXX
65: > # closing angle bracket
66: }{}gsx; # mutate into nada, nothing, and niente
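
One possible reading of the message, which I have not verified against the Perl source, is that it complains about the nested quantifiers: the `*` on line 58 lets the group match the empty string, and the `+` on line 63 then repeats that possibly-empty group. The sketch below is only a guess at a workaround under that assumption: it changes the inner `*` to `+` and leaves everything else alone. The `strip_tags` wrapper and the `$sample` string are hypothetical names added for illustration; they are not part of striphtml.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of striphtml): remove anything that
# looks like an HTML tag or comment from a string.
sub strip_tags {
    my ($html) = @_;
    $html =~ s{
        <               # opening angle bracket
        (?:             # non-capturing group
            [^>'"] +    # 1 or more characters that are not >, ' or "
                        # (was "*", which let the group match nothing)
          |             # or else
            ".*?"       # a double-quoted section (non-greedy)
          |             # or else
            '.*?'       # a single-quoted section (non-greedy)
        ) +             # one or more of the above
        >               # closing angle bracket
    }{}gsx;
    return $html;
}

# Same content as the htmlfile described in this post.
my $sample = "<!-- This is a comment -->\nNow is the time for all good men to...\n";
print strip_tags($sample);
```

One side effect of this change is that an empty tag `<>` is no longer matched, so it would be left in the output; whether that matters depends on the answer to the "null tags" question in the original comment on line 64.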
The full source of Tom's script can be found on CPAN at
http://www.perl.com/CPAN-local/authors/Tom_Christiansen/scripts/striphtml.gz
I do have libwww and have used HTML::Parse, but I am also looking for an
alternative way to do a simple "tag strip".
Any help would be appreciated.
Regards,
Bob