Re: how to sanitize MS Word HTML output?

Author: Matt Graham
Date:  
To: Steven A. DuChene, Main PLUG discussion list
Subject: Re: how to sanitize MS Word HTML output?
From: "Steven A. DuChene" <>
> It is filled with a lot of unneeded style and formatting tags
> as well as all kinds of stupid extra characters due to some MS
> "standard" character formatting stuff. Things like breaking lines
> in the middle of words and then adding an equal sign at the end
> of the broken line or replacing equal signs in the html code with
> "=3D"


That's not HTML. That's quoted-printable encoding, which is a MIME
transfer encoding: "=3D" is an escaped equal sign, and a trailing "="
marks a soft line break. The mail client should've decoded that
automatically when it saved the file. If you have MIME::QuotedPrint
installed, you can decode it with a Perl one-liner and see if it
looks any better.
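
For anyone without Perl handy, here is a minimal sketch of the same
decoding step using Python's standard quopri module instead of
MIME::QuotedPrint. The function name, the sample string, and the
default charset are assumptions for illustration; Word output is often
windows-1252 rather than UTF-8, so check the message headers.

```python
import quopri


def decode_quoted_printable(raw: bytes, charset: str = "utf-8") -> str:
    """Undo quoted-printable encoding.

    Soft line breaks ('=' at end of line) are joined and =XX escapes
    are turned back into the bytes they stand for. The charset default
    is an assumption; pass the one named in the mail headers.
    """
    decoded = quopri.decodestring(raw)
    return decoded.decode(charset, errors="replace")


# Hypothetical sample showing both symptoms from the original question.
sample = b'<p class=3D"MsoNormal">A long para=\ngraph</p>\n'
print(decode_quoted_printable(sample))
```

If the output still has stray "=3D" sequences after this, the file was
probably encoded twice or is not quoted-printable at all.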

> Does anyone know of a tool that will clean this crappy excuse for
> html code up into something more standard?


"Demoroniser" is probably not what you want. I've seen a few things
like that over the years, and have gotten rid of most of the junk
with a bunch of regular expressions. Without seeing the mangled HTML
itself, I couldn't give you a list of sed commands to feed this data
through.
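
As a rough illustration of the regular-expression approach, the sketch
below uses Python's re module in place of sed. The patterns are
assumptions about markup Word commonly emits (conditional comments,
<o:p> tags, Mso* classes, inline styles), not a list derived from the
actual file, so they would need adjusting after looking at the real
HTML. Run it only after the quoted-printable decoding is done.

```python
import re

# Hypothetical cleanup rules; each pair is (pattern, replacement).
WORD_JUNK = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I), ""),
    # Office namespace tags such as <o:p> and </o:p>
    (re.compile(r"</?o:p[^>]*>", re.I), ""),
    # class attributes such as class=MsoNormal or class="MsoNormal"
    (re.compile(r"\s+class=[\"']?Mso\w+[\"']?", re.I), ""),
    # Inline style attributes, double- or single-quoted
    (re.compile(r"\s+style=(\"[^\"]*\"|'[^']*')", re.I), ""),
]


def strip_word_junk(html: str) -> str:
    """Apply each cleanup rule in order and return the result."""
    for pattern, replacement in WORD_JUNK:
        html = pattern.sub(replacement, html)
    return html


messy = '<p class="MsoNormal" style="margin:0in">Hi<o:p></o:p></p>'
print(strip_word_junk(messy))
```

Regular expressions are a blunt tool for HTML, so this is only suited
to one-off cleanup where the result gets checked by eye afterwards.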

--
Matt G / Dances With Crows
The Crow202 Blog: http://crow202.org/wordpress/
There is no Darkness in Eternity/But only Light too dim for us to see


---------------------------------------------------
PLUG-discuss mailing list -
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss