how to sanitize MS Word HTML output?

Matt Graham danceswithcrows at usa.net
Mon May 4 11:55:54 MST 2009


From: "Steven A. DuChene" <linux-clusters at mindspring.com>
> It is filled with a lot of unneeded style and formatting tags
> as well as all kinds of stupid extra characters due to some MS
> "standard" character formatting stuff. Things like breaking lines
> in the middle of words and then adding an equal sign at the end
> of the broken line or replacing equal signs in the html code with
> "=3D".

That's not HTML.  That's quoted-printable encoding, which is what
the mail was wrapped in for transport.  The mail client should've
automatically decoded that back to the original text when it saved
the file.  If you have MIME::QuotedPrint installed, you can decode
that with a Perl one-liner and see if it looks any better.
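
For anyone without the Perl module handy, here is a rough sketch of
the same decoding step using Python's standard quopri module instead
of MIME::QuotedPrint.  The sample bytes are made up for illustration
and are not taken from the original poster's file.

```python
import quopri

# Hypothetical quoted-printable fragment: "=3D" stands for a literal "=",
# and a bare "=" at the end of a line is a soft line break that should vanish.
raw = b'<p class=3D"MsoNormal">A long para=\ngraph of text</p>\n'

# decodestring() undoes both the "=XX" escapes and the soft line breaks.
decoded = quopri.decodestring(raw)

# Assumes the decoded bytes are UTF-8; a real file may use another charset
# (Word output is often windows-1252).
print(decoded.decode("utf-8"))
```

To run it over a whole saved file, read the file in binary mode and
pass the bytes to quopri.decodestring() the same way.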

> Does anyone know of a tool that will clean this crappy excuse for
> html code up into something more standard?

"Demoroniser" is probably not what you want.  I've seen a few files
like that over the years, and have gotten rid of most of the junk
with a handful of regular expressions.  Without a look at the
mangled HTML itself, though, I can't give you a list of sed commands
to feed this data through.
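
As a rough idea of what the regular-expression approach can look
like, here is a minimal Python sketch.  The patterns are assumptions
about typical Word-generated markup (conditional comments, <o:p> tags,
class/style/lang attributes), not a list derived from the file in
question, and strip_word_junk is a hypothetical helper name.

```python
import re

# Each (pattern, replacement) pair is an assumption about common Word junk.
WORD_JUNK = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S), ""),
    # Office-namespace paragraph markers: <o:p> and </o:p>
    (re.compile(r"</?o:p\b[^>]*>", re.I), ""),
    # Quoted class, style and lang attributes
    (re.compile(r"""\s+(?:class|style|lang)=(?:"[^"]*"|'[^']*')""", re.I), ""),
]

def strip_word_junk(html: str) -> str:
    """Apply each cleanup pattern in order and return the result."""
    for pattern, replacement in WORD_JUNK:
        html = pattern.sub(replacement, html)
    return html

sample = '<p class="MsoNormal" style="margin:0in">Hello<o:p></o:p></p>'
print(strip_word_junk(sample))
```

This is plain text substitution, not HTML parsing, so the attribute
pattern will also hit matching text that happens to sit outside a tag.
Decode the quoted-printable layer first, then check the output by eye.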

-- 
Matt G / Dances With Crows
The Crow202 Blog:  http://crow202.org/wordpress/
There is no Darkness in Eternity/But only Light too dim for us to see




More information about the PLUG-discuss mailing list