data conversion strangeness

Fri Jun 16 07:10:06 MST 2006

> > --- Craig White <craigwhite at azapple.com> wrote:
> >
> > > I have a text file which I exported in tab delimited format from
> > > Filemaker Pro on Windows, cleaned up in openoffice.org and want to
> > > import into postgres.
> > >
> > > the first few characters in the file are killing me and I haven't a clue
> > > on how to rid the file of them...
> > >
> > > 0000000 357 273 277   1  \t   B   l   o   o   d       B   o   r   n   e
> > >
> > > it's the 357 273 277 that don't belong...the data should start with "1"
> > >
> > > where did they come from and how do I get rid of them?
> > >
> > > Craig

To answer part one of your question, those three bytes (octal 357 273
277 in your dump, hex EF BB BF) are a UTF-8 encoded Byte Order Mark
(http://www.unicode.org/faq/utf_bom.html#BOM). They're an indicator
that the file you're looking at is, in fact, UTF-8-encoded Unicode
text, rather than something in some other local codepage. Notepad.exe
adds them as a matter of course when saving as UTF-8; perhaps OO.o is
adding them when it exports to UTF-8 text as well.
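
If you want to convince yourself that the octal values in your dump
really are the BOM, here is a small Python sketch. It only uses the
three values from your od output and the standard codecs module; the
variable names are mine.

```python
import codecs

# The three octal values from the od dump in the original question.
od_octal = ["357", "273", "277"]
prefix = bytes(int(value, 8) for value in od_octal)

# Same bytes written in hex.
print(prefix.hex(" "))                    # ef bb bf

# Python ships the UTF-8 BOM as a constant, so we can compare directly.
print(prefix == codecs.BOM_UTF8)          # True

# Decoded as UTF-8, the three bytes are the single code point U+FEFF.
print(hex(ord(prefix.decode("utf-8"))))   # 0xfeff
```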

Unicode-compliant text processors will ignore the BOM when considering
text. If there's a way to tell the Postgres import process that the
file is UTF-8, the import *should* ignore those bytes completely.
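
I can't say offhand what the Postgres side offers, but as an
illustration of what "BOM-aware" means, Python has two UTF-8 codecs:
plain "utf-8" keeps the BOM as a character, while "utf-8-sig" drops it
if present. The sample bytes below are just the start of your file as
shown in the dump.

```python
# First bytes of the file as shown in the od dump: BOM, "1", tab, text.
raw = b"\xef\xbb\xbf1\tBlood Borne"

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
plain = raw.decode("utf-8")
print(repr(plain[0]))      # '\ufeff'

# The BOM-aware codec strips a leading BOM, so the data starts with "1".
aware = raw.decode("utf-8-sig")
print(repr(aware))         # '1\tBlood Borne'
```

The same codec name works with open(path, encoding="utf-8-sig"), and it
is harmless on files that have no BOM.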

Or you can safely remove them any time they appear in a text stream,
if you no longer need the signal that the stream is UTF-8 encoded.
(The BOM is "default ignorable", and should never appear in the midst
of Unicode text.)
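
If you'd rather clean the file before importing it, one way to remove
the bytes is a few lines of Python. This is only a sketch: the function
names and the file paths are made up, and it reads the whole file into
memory, which is fine for an export of modest size.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_utf8_bom(data))


# Hypothetical usage; substitute your own file names:
# strip_bom_file("export.txt", "export-nobom.txt")
```

Working on bytes rather than decoded text means nothing else in the
file is touched, so the tabs and line endings come through unchanged.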

-A