how to sanitize MS Word HTML output?

Joseph Sinclair plug-discussion at stcaz.net
Mon May 4 13:44:35 MST 2009


The default save-as format for many full Outlook installations is MHT, which is, more or less, an MS-specific MIME archive format.
Usually you can run the extracted source through a MIME decoder to get message-plus-attachments output, and then pull the HTML doc from there.
Once you have "clean" MSHTML (it's not HTML, it's an MS-specific XML format that just looks close enough to HTML that browsers can figure it out in quirks mode), you can usually pass it through one of several "cleaner" apps.  All of them work to some extent, but none are perfect...  Tidy is probably the most complete, but it can be a bit of a pain to get all the options set the way you want.
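
As a rough sketch of the MIME-decode step described above: Python's standard email package can serve as the decoder. This assumes the saved file really is a well-formed MIME archive; the function name extract_html_from_mht is made up for illustration, and real Outlook exports may need extra handling.

```python
import email
from email import policy


def extract_html_from_mht(raw_bytes):
    """Return the first text/html part of a MIME (MHT) archive, or None.

    Hypothetical helper: the name and behavior are assumptions for this
    sketch, not part of any existing tool.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # walk() visits the container first, then each nested part in order.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (quoted-printable
            # or base64) and decodes the bytes using the declared charset.
            return part.get_content()
    return None
```

Typical use would be extract_html_from_mht(open("message.mht", "rb").read()), where message.mht is a placeholder filename. Since the transfer encoding is undone along the way, this should also take care of quoted-printable leftovers such as "=3D".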

Links:
  HTML Tidy [http://tidy.sourceforge.net/] with the "word-2000" configuration option set to "yes" will go to great lengths to remove
  MS-Word garbage while doing all of its other nifty cleanups of the HTML.
Other options:
  Quick cleaner written in C# (Requires Mono) [http://www.codinghorror.com/blog/archives/000485.html]
  JavaScript-based cleaner [http://ethilien.net/websoft/wordcleaner/cleaner.htm]
  Online service to clean docs; it may store and retain documents, so don't use it for anything you care about [http://www.wordhtmlcleaner.co.uk/]
  Another service, only for small documents [http://textism.com/wordcleaner/]
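
For the Tidy route, here is a minimal sketch of driving the command-line tool with "word-2000" turned on. It assumes a tidy binary is installed and on the PATH; the helper names build_tidy_command and clean_word_html are made up for illustration, and the extra flags (-quiet, -utf8, --show-warnings no) are just one plausible starting set, not a recommended configuration.

```python
import subprocess


def build_tidy_command(tidy_path="tidy"):
    """Build a tidy command line with the word-2000 option enabled."""
    return [
        tidy_path,
        "-quiet",                 # suppress the summary banner
        "-utf8",                  # treat input and output as UTF-8
        "--word-2000", "yes",     # strip the extra markup Word adds
        "--show-warnings", "no",  # keep stderr short
    ]


def clean_word_html(html, tidy_path="tidy"):
    """Pipe HTML text through tidy and return the cleaned markup."""
    result = subprocess.run(
        build_tidy_command(tidy_path),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,
    )
    # tidy exits with 0 for success, 1 for warnings and 2 for errors,
    # so only 2 and above is treated as a failure here.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    return result.stdout.decode("utf-8")
```

The same options can go into a tidy config file instead if you end up tuning a lot of them.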



Matt Graham wrote:
> From: Lisa Kachold <lisakachold at obnosis.com>
>> http://www.toastedspam.com/decodeqp 
>> It should be noted that quoted-printable encoded text is generally
>> associated with EMAIL, not Word?
> 
> The composer of the message probably wrote the message in Word, then
> pasted it into a mail client.  Word->HTML->quoted-printable-encoded
> HTML in the message body.  I don't know why the quoted-printable
> encoding wasn't decoded when the OP did "save as" in his mail client,
> though.
> 

