Author: Victor Odhner
Date:  
To: Main PLUG discussion list
Subject: OT: Copying text from PDF to text on Mac corrupts words containing "t"
OK, I have an amusing OSX puzzle for y’all.

I am taking messages from Thunderbird and pasting them as text into Word.

Many words containing “ti” or “tt” or some other combinations with the letter “t” get corrupted when I use copy and paste, from PDF text that looks normal. Some software interprets the PDF correctly for display and printing, and some software fails to understand this encoding involving the letter “t”.

The same corruption happens when I paste into TextEdit and even into MacVim, and when I open it as input to LibreOffice under OSX and Linux Mint.

The Linux “Document Viewer” program displays it without the corruption. So apparently this funky interpretation of the letter “t” is a Thing[TM] in PDF, understood by some software other than the Preview application.

One sender has sent his messages in rich text from HotMail, multipart/alternative; I’m working with the HTML version: Content-Type: text/html; charset="iso-8859-1", Content-Transfer-Encoding: quoted-printable. When I view the HTML message source, the text in question doesn’t show any funky encoding for the words that get corrupted. If I open the PDF in MacVim it’s all encoded into gibberish. The Preview application displays it correctly, but corrupts my selected text going into the clipboard.

What I’m trying to do: To pick up the text with headers from Thunderbird, I do Print > PDF > Open PDF in Preview. Then I select the message (which appears nicely formatted), copy it, and paste it into the Word document. It mostly works, but . . .

Here’s what I get if I select and copy the text, or open the PDF in software that’s not in on the joke: The word “attempting” becomes “a:emp4ng”, “painting” becomes “pain<ng”, “putting” becomes “pu?ng”, etc. I have not encountered any corruptions that did not involve the letter “t”.
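For the cleanup itself, the substitutions seen so far can be reversed mechanically. Here is a minimal Python sketch, assuming the only mappings are the ones visible in the three examples above (":" for "tt", "4" and "<" for "ti", "?" for "tti"). The names OBSERVED and repair are made up for illustration, and another PDF or font may well use different stand-in characters.

```python
import re

# Substitutions inferred from the three examples in this message only.
# Other documents or fonts may corrupt text differently (assumption).
OBSERVED = {":": "tt", "4": "ti", "<": "ti", "?": "tti"}

# Only touch a suspect character when it sits directly between two letters,
# so times like 10:45 and a "?" at the end of a sentence are left alone.
_SUSPECT = re.compile(r"(?<=[A-Za-z])[:4<?](?=[A-Za-z])")


def repair(text):
    """Replace each suspect character with the letters it stood in for."""
    return _SUSPECT.sub(lambda m: OBSERVED[m.group(0)], text)


if __name__ == "__main__":
    for word in ("a:emp4ng", "pain<ng", "pu?ng"):
        print(word, "->", repair(word))
```

The letter-on-both-sides rule is a guess at a safe heuristic. It will still mangle legitimate text such as "a:b", so the result needs a read-through.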

After cleaning up the corrupted text, I saved the file as .docx and opened it with LibreOffice, then exported it to a new PDF. Since LibreOffice doesn’t grok the whole “t” thing, of course that PDF can be copied and pasted without corruption.

I’m almost done with this collecting project, but this is too interesting to ignore, so I figured I should post it. I’ve blown about 90 minutes pinning down what is happening so that I could toss some red meat into the group.

Of course, if someone can tell me a better way to save Thunderbird messages with headers into a document, without going through PDF, I’m open to that as well. All the methods I can find give me just the body, or only some of the header fields I want in the document (Date, From, To and Subject). Maybe there’s a plug-in or something.
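One route that skips PDF entirely might be to parse the raw message with Python's standard email package. The following is only a sketch, assuming each message has first been saved from Thunderbird as a raw .eml file. The function name message_to_text and the FIELDS tuple are made up for illustration.

```python
from email import message_from_string, policy

# The header fields wanted in the final document.
FIELDS = ("Date", "From", "To", "Subject")


def message_to_text(raw):
    """Return the chosen headers plus the message body as plain text."""
    msg = message_from_string(raw, policy=policy.default)
    # Skip any header that is missing from this particular message.
    headers = [f"{name}: {msg[name]}" for name in FIELDS if msg[name] is not None]
    # Prefer the text/plain part; fall back to text/html if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    content = body.get_content() if body is not None else ""
    return "\n".join(headers) + "\n\n" + content
```

For a file on disk, email.message_from_binary_file(fp, policy=policy.default) accepts an open binary file in place of the string. If only the HTML part exists, the output will contain HTML markup that still has to be stripped.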

But the PDF mystery is what brought me here.

;-)


---------------------------------------------------
PLUG-discuss mailing list -
To subscribe, unsubscribe, or to change your mail settings:
http://lists.phxlinux.org/mailman/listinfo/plug-discuss