how to sanitize MS Word HTML output?
Matt Graham
danceswithcrows at usa.net
Mon May 4 11:55:54 MST 2009
From: "Steven A. DuChene" <linux-clusters at mindspring.com>
> It is filled with a lot of un-needed style and formatting tags
> as well as all kinds of stupid extra characters due to some MS
> "standard" character formatting stuff. Things like breaking lines
> in the middle of words and then adding an equal sign at the end
> of the broken line or replacing equal signs in the html code with
> "=3D"
That's not HTML. That's quoted-printable encoding: "=3D" is an
encoded equal sign, and a "=" at the end of a line is a soft line
break. The mail client should've automatically decoded that when
it saved the file. If you have MIME::QuotedPrint installed, you
can decode that with a Perl one-liner and see if it looks any better.
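
For anyone without MIME::QuotedPrint handy, here is a rough sketch of
the same decoding step in Python, using the standard-library quopri
module rather than Perl. The function name, the sample string, and the
default charset are illustrative assumptions, not anything taken from
the original message.

```python
import quopri


def decode_quoted_printable(raw: bytes, charset: str = "utf-8") -> str:
    """Undo quoted-printable encoding, then decode the bytes as text.

    The charset is an assumption; the real one is whatever the mail's
    Content-Type header declares.
    """
    decoded = quopri.decodestring(raw)
    return decoded.decode(charset, errors="replace")


# Made-up sample: "=3D" is an encoded "=", and the trailing "=" is a
# soft line break that splits a word in two.
sample = b'<p class=3D"MsoNormal">This line was bro=\nken mid-word.</p>'
print(decode_quoted_printable(sample))
```

If the result still looks wrong, the charset guess is the first thing
to check.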
> Does anyone know of a tool that will clean this crappy excuse for
> html code up into something more standard?
"Demoroniser" is probably not what you want. I've seen a few things
like that over the years, and have gotten rid of most of the junk
with a bunch of regular expressions. Without seeing the mangled
HTML, I couldn't give you a list of sed commands to feed this data
through.
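
As a rough illustration of the regular-expression approach, here is a
sketch in Python's re module rather than sed. The patterns are guesses
at junk that Word-exported HTML often contains (conditional comments,
<o:p> tags, Mso* classes, inline styles); they are assumptions, not
rules derived from the actual file, and would need adjusting once the
real markup is visible.

```python
import re

# Hypothetical patterns; each pair is (compiled regex, replacement).
WORD_JUNK_PATTERNS = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE), ""),
    # Office namespace paragraph tags: <o:p> and </o:p>
    (re.compile(r"</?o:p[^>]*>", re.IGNORECASE), ""),
    # class attributes whose value starts with "Mso", quoted or not
    (re.compile(r'\s+class="?Mso[^"\s>]*"?', re.IGNORECASE), ""),
    # Quoted inline style attributes
    (re.compile(r"\s+style=(\"[^\"]*\"|'[^']*')", re.IGNORECASE), ""),
]


def strip_word_junk(html: str) -> str:
    """Apply each cleanup pattern in order and return the result."""
    for pattern, replacement in WORD_JUNK_PATTERNS:
        html = pattern.sub(replacement, html)
    return html


sample = '<p class="MsoNormal" style="margin:0in">Hello<o:p></o:p></p>'
print(strip_word_junk(sample))
```

Regular expressions are not an HTML parser, so this only suits
one-off cleanup where the output gets checked by eye afterwards.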
--
Matt G / Dances With Crows
The Crow202 Blog: http://crow202.org/wordpress/
There is no Darkness in Eternity/But only Light too dim for us to see