Re: data conversion strangeness

Author: A LeDonne
Date:
To: Main PLUG discussion list
Subject: Re: data conversion strangeness

> > --- Craig White <craigwhite@azapple.com> wrote:
> >
> > > I have a text file which I exported in tab delimited format from
> > > Filemaker Pro on Windows, cleaned up in openoffice.org and want to
> > > import into postgres.
> > >
> > > the first few characters in the file are killing me and I haven't a clue
> > > on how to rid the file of them...
> > >
> > > 0000000 357 273 277 1 \t B l o o d B o r n e
> > >
> > > it's the 357 273 277 that don't belong...the data should start with "1"
> > >
> > > where did they come from and how do I get rid of them?
> > >
> > > Craig

To answer part one of your question: those three bytes (octal 357 273 277
in your dump, hex EF BB BF) are a UTF-8 encoded Byte Order Mark
( http://www.unicode.org/faq/utf_bom.html#BOM ). They're an indicator
that the file you're looking at is, in fact, UTF-8-encoded Unicode
text, rather than text in some other local codepage. Notepad.exe
adds them as a matter of course when saving as UTF-8 text; perhaps
OO.o is adding them when it exports to UTF-8 text as well.
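
As a quick check of the byte values, here is a minimal Python sketch
(Python is just a convenient choice for illustration; nothing in the
original question uses it). It encodes the BOM code point U+FEFF as
UTF-8 and prints the result in both hex and octal, the latter being
the notation used in the dump above.

```python
import codecs

# U+FEFF is the Byte Order Mark code point; encode it as UTF-8.
bom = "\ufeff".encode("utf-8")

# The same three bytes written in hex and in octal.
hex_form = " ".join(f"{b:02X}" for b in bom)
octal_form = " ".join(f"{b:o}" for b in bom)

print(hex_form)                # EF BB BF
print(octal_form)              # 357 273 277
print(bom == codecs.BOM_UTF8)  # True
```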

Unicode-compliant text processors will ignore the BOM when considering
text. If there's a way to tell the Postgres import process that the
file is UTF-8, the import *should* ignore those bytes completely.
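
To illustrate what "ignoring the BOM" looks like in a BOM-aware
decoder, here is a small Python sketch. It says nothing about how
Postgres itself behaves; it only shows the difference between Python's
plain "utf-8" codec and its "utf-8-sig" codec. The sample bytes are
made up to resemble the start of the exported file.

```python
# Hypothetical sample resembling the first line of the exported file.
raw = b"\xef\xbb\xbf1\tBlood Borne\n"

plain = raw.decode("utf-8")        # keeps the BOM as the character U+FEFF
cleaned = raw.decode("utf-8-sig")  # drops a leading BOM, if one is present

print(repr(plain[0]))    # '\ufeff'
print(repr(cleaned[0]))  # '1'
```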

Or you can safely remove them whenever they appear at the start of a
text stream, if you no longer need the stream to signal that it is
UTF-8 encoded. (The BOM is "default ignorable", and should never appear
in the midst of Unicode text.)
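
One way to do the removal, sketched in Python under the assumption
that the file is small enough to read into memory. The function names
and the file names in the usage note are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; all other bytes are untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM if present."""
    # Binary mode so that no decoding or newline translation takes place.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_utf8_bom(data))
```

For example, strip_bom_from_file("export.txt", "export-nobom.txt")
would write a copy that starts with "1" rather than with the three BOM
bytes. A file that has no BOM is copied unchanged.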

-A
---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
