data conversion strangeness

Craig White craigwhite at azapple.com
Fri Jun 16 08:44:17 MST 2006


On Fri, 2006-06-16 at 10:10 -0400, A LeDonne wrote:
> > > --- Craig White <craigwhite at azapple.com> wrote:
> > >
> > > > I have a text file which I exported in tab delimited format from
> > > > Filemaker Pro on Windows, cleaned up in openoffice.org and want to
> > > > import into postgres.
> > > >
> > > > the first few characters in the file are killing me and I haven't a clue
> > > > on how to rid the file of them...
> > > >
> > > > 0000000 357 273 277   1  \t   B   l   o   o   d       B   o   r   n   e
> > > >
> > > > it's the 357 273 277 that don't belong...the data should start with "1"
> > > >
> > > > where did they come from and how do I get rid of them?
> > > >
> > > > Craig
> 
> To answer part one of your question, those three bytes (Hex: EF BB BF)
> are a UTF-8 encoded Byte Order Mark. (
> http://www.unicode.org/faq/utf_bom.html#BOM ). They're an indicator
> that the file you're looking at is, in fact, UTF-8-encoded Unicode
> text, rather than something in some other local codepage. Notepad.exe
> adds them as a matter of course when saving as Unicode text; perhaps
> OO.o is adding them when it exports to UTF-8 text as well.
> 
> Unicode-compliant text processors will ignore the BOM when considering
> text. If there's a way to tell the Postgres import process that the
> file is UTF-8, the import *should* ignore those bytes completely.
> 
> Or you can safely remove them any time they appear in a text stream,
> if you no longer need signalling in the stream that it is UTF-8
> encoded. (The BOM is "default ignorable", and should never appear in
> the midst of Unicode text.)
----
Thanks for the info. I was able to remove them (the UTF-8 BOM) with vi,
whereas kate/emacs/etc. gave no indication that they were there. When
postgres gagged on the start of the file, 'od' was a good viewer to tell
me what I was dealing with.
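
For anyone who hits the same thing and would rather script it than open vi,
here is a minimal Python sketch of the removal. It is not what I actually
ran; the function names and the file paths passed to them are made up for
illustration. It checks whether the data starts with the three bytes that
od showed (octal 357 273 277, hex EF BB BF) and drops them if so.

```python
import codecs

# codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf",
# i.e. the octal 357 273 277 that od printed at offset 0.


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(src_path: str, dst_path: str) -> bool:
    """Copy src_path to dst_path minus any leading BOM.

    Returns True if a BOM was found and removed.
    Both paths are hypothetical; the whole file is read into memory,
    which is fine for a modest export but not for huge files.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    cleaned = strip_utf8_bom(data)
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return len(cleaned) != len(data)
```

If the file is being read as text in Python anyway, opening it with
encoding="utf-8-sig" should also skip a leading BOM while decoding.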

I have changed the methodology of cleaning up the exported text and
thankfully, Notepad.exe is no longer part of the process  ;-)

I only brought in Notepad.exe because of something that I can't explain
within openoffice.org... In OOo's 'Replace' field I could use the regular
expression "\n" to insert a [return/linefeed], but I couldn't figure out
how to 'Find' "\n", and I finally gave up. It does have a really nice
feature, '^$', to find blank lines though, so I shifted my thinking and
now I am up and running.
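
The same blank-line cleanup can also be sketched outside OOo. The Python
below is only an illustration under my own assumptions (the function names
are invented, and Python's regex dialect is not OOo's): it treats a blank
line as a line start followed immediately by a line ending, and it shows
that finding "\n" itself is straightforward in this setting.

```python
import re

# With re.MULTILINE, ^ matches at the start of every line, so
# "^\r?\n" matches a line that contains nothing but its line ending
# (\r\n for files that came from Windows, \n otherwise).
EMPTY_LINE = re.compile(r"^\r?\n", re.MULTILINE)


def drop_empty_lines(text: str) -> str:
    """Remove every completely empty line from text."""
    return EMPTY_LINE.sub("", text)


def count_newlines(text: str) -> int:
    """Count literal "\n" characters, i.e. 'Find' on the line feed."""
    return len(re.findall(r"\n", text))
```

Lines that contain only spaces or tabs are left alone here; matching
those as well would need a looser pattern.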

Craig


