how to sanitize MS Word HTML output?

Steven A. DuChene linux-clusters at mindspring.com
Mon May 4 14:15:25 MST 2009


Thanks for all of the links and info. If I run into this again I will
know what to do.

Until then I did find a Firefox extension called UnMHT that did a pretty
damn good job at rendering the original file from the MCC website.

-----Original Message-----
>From: Joseph Sinclair <plug-discussion at stcaz.net>
>Sent: May 4, 2009 4:44 PM
>To: Main PLUG discussion list <plug-discuss at lists.plug.phoenix.az.us>
>Subject: Re: how to sanitize MS Word HTML output?
>
>Default save-as for many Outlook (full) installations is MHT, which is an MS-specific MIME archive format, sort of.
>Usually, you can run the extracted source through a MIME decoder to get a message-plus-attachments output, and then pull the HTML doc from there.
>Once you have "clean" MSHTML (it's not HTML, it's an MS-specific XML format that just looks close enough to HTML that browsers can figure it out in quirks mode), you can usually pass it through one of several "cleaner" apps available. All work to some extent, but none are perfect... Tidy is probably the most complete, but it can be a bit of a pain to get all the options set the way you want.
>
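
The MIME-decoder step described above can be sketched with Python's
standard email package. This is only a rough illustration, not what
anyone in the thread actually ran: the function name extract_html_parts
is made up here, and it assumes the .mht file parses as an ordinary
MIME message with one or more text/html parts.

```python
import email
from email import policy


def extract_html_parts(mht_bytes):
    """Return the decoded text of every text/html part in a MIME archive.

    Hypothetical helper for illustration; assumes the input parses as a
    normal MIME message (which MHT files generally do).
    """
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    pages = []
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding
        raw = part.get_payload(decode=True)
        # Fall back to UTF-8 if the part does not declare a charset
        charset = part.get_content_charset() or "utf-8"
        pages.append(raw.decode(charset, errors="replace"))
    return pages
```

Reading the saved file in binary mode and passing the bytes to
extract_html_parts should give back the Word HTML as a list of strings,
ready to be written out and fed to one of the cleaners.
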
>Links:
>  Using HTML tidy from [http://tidy.sourceforge.net/] with the "word-2000" configuration option set to "yes" will go to great lengths to remove
>  MS-Word garbage while doing all of its other nifty cleanups of the HTML.
>Other options:
>  Quick cleaner written in C# (Requires Mono) [http://www.codinghorror.com/blog/archives/000485.html]
>  JavaScript-based cleaner [http://ethilien.net/websoft/wordcleaner/cleaner.htm]
>  Service to clean docs, may store and retain documents, so don't use for anything you care about [http://www.wordhtmlcleaner.co.uk/]
>  Another service, only for small documents [http://textism.com/wordcleaner/]
>
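
For the Tidy route mentioned above, here is a minimal sketch of driving
the command-line tool from Python with the word-2000 option switched on.
The helper names tidy_command and run_tidy are invented for this
example, and it assumes a tidy binary is on the PATH; the other Tidy
options are left at their defaults, which may well not be what you want.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build a tidy command line with the word-2000 option enabled.

    -q keeps tidy quiet, -o names the output file.
    """
    return ["tidy", "-q", "--word-2000", "yes", "-o", dst, src]


def run_tidy(src, dst):
    """Run tidy if it is installed.

    Returns tidy's exit status (0 = clean, 1 = warnings, 2 = errors),
    or None when no tidy binary can be found.
    """
    if shutil.which("tidy") is None:
        return None
    return subprocess.run(tidy_command(src, dst)).returncode
```

An exit status of 1 is normal for Word output, since Tidy warns about
nearly everything it has to repair; only a status of 2 means it gave up
on the document.
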