[PLUG-Devel] tokenizing a string

Mon Jun 9 11:31:51 MST 2008

moin moin,

I have a bunch of data that's been glopped together as a string by someone
else. I need to pull it apart, do mysterious things to it (the "magic
happens" step), and then turn it over to others.

Just to make it more fun for everyone involved, it's actually a glopping
of other text fields that have already been glopped together. The
different gloppings use different tokens and potentially different rules
for defining the data[0].

The primary glopping is mostly the same every time, but there are a couple
of variables in how to determine the number of fields and the type of each
field. Some of the parts also vary. In each case the first field can be
broken up to determine the layout for the rest of the data.

Once I've separated the fields, I need to apply different parts of magic,
depending on the field.

Any suggestions for a good, maintainable way to break the data into its
various pieces?

I will be using Perl. That's not negotiable. Bash is an option, but
wouldn't fit in with the rest of the code base.

I'm certain this can be done better than how I'd do it, so I'm soliciting
suggestions :).

I'm certain the code will need to be adapted to future similar
formats. Heck, I'm pretty certain we'll find the current format doesn't
quite conform to the spec, so I'll have to adjust just to fit the actual
data...

The data has had delimiters turned into HTML entities, so I theoretically
don't have to worry about a token being in the content, but I might have
to de-entify each of the sub-sub-fields.
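
A minimal sketch of that order of operations: split on the raw token
first, then decode each piece. It assumes the delimiters were encoded as
numeric entities such as &#124; for the pipe, which is not stated above.
The decode_numeric helper and the sample string are made up for
illustration; for named entities, HTML::Entities on CPAN provides
decode_entities.

```perl
use strict;
use warnings;

# Hypothetical helper: decode only numeric entities (&#124; or &#x7C;).
# Each entity is decoded exactly once, so decoded text is never re-scanned.
sub decode_numeric {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/defined($1) ? chr(hex $1) : chr($2)/ge;
    return $text;
}

# Assumed sample: two fields contain an encoded pipe.
my $raw = 'a|b&#124;c|d&#x7C;e';

# Split on the real pipes first, then decode inside each field.
my @fields = map { decode_numeric($_) } split /\|/, $raw, -1;

print join(' / ', @fields), "\n";    # prints: a / b|c / d|e
```

Decoding before splitting would turn the protected pipes back into
separators, so the order matters.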

Bogus example of potential data:

f2:1234|fred=uwe,stuff,$georg=foo,$bar=$georg,udo=$HOME/some/path,:r
War_and_Peace.txt|some~tilde~seperated~text|v1_5678;but;why;stick;with;a,single,scheme,when,you,can;change;midstream;for;fun;and;profit|do@not@forget@special@characters@and@@@empty@fields

The primary tokenizer is the pipe.
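
A minimal sketch of the primary split, using a shortened, made-up variant
of the sample record with an empty trailing field added to show the
effect of the limit argument:

```perl
use strict;
use warnings;

# Shortened variant of the sample record, with an empty last field.
my $record = 'f2:1234|fred=uwe,stuff|some~tilde~seperated~text|';

# The pipe is a regex metacharacter, so it has to be escaped.
# A limit of -1 keeps trailing empty fields, which split drops by default.
my @fields = split /\|/, $record, -1;

printf "%d fields\n", scalar @fields;
print "[$_]\n" for @fields;
```

Without the -1, the empty last field would silently disappear and the
field count would no longer match the layout.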

The first field is split on a colon. The first character of the first
sub-field indicates the record type (f == fred, of course) and the second
character indicates the version, so we have a version 2 fred. The second
sub-field is the record ID.
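
A sketch of pulling that header apart. The parse_header name is
hypothetical, and it assumes the type and the version are exactly one
character each, which is all the description above covers:

```perl
use strict;
use warnings;

# Hypothetical helper: parse a header such as "f2:1234" into its parts.
sub parse_header {
    my ($header) = @_;

    # Limit of 2 so any further colons stay inside the record ID.
    my ($tag, $id) = split /:/, $header, 2;

    # Assumption: first character is the record type, second the version.
    my ($type, $version) = $tag =~ /\A(.)(.)\z/
        or die "unexpected header tag '$tag'\n";

    return { type => $type, version => $version, id => $id };
}

my $h = parse_header('f2:1234');
print "type=$h->{type} version=$h->{version} id=$h->{id}\n";
# prints: type=f version=2 id=1234
```

Dying on an unexpected tag makes it obvious early when the real data
stops matching the spec.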

The second field is a comma delimited set of data with unknown rules of
what might be in the fields. I just need to break those up and display
them. Well, they might ask for more, but that's all they're going to get.

The 3rd field uses tildes as the field separator.

The 4th field also has a record type entry, but the record type might
or might not actually break up differently than the primary record type
entry. It'll definitely be a different type of record. No recursive
records and I'm not going to bring that concept up to them :). The 4th
field has semi-colon separators, but one of the sub-fields is in turn a
collection of comma delimited entries.
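
A sketch of the two-level split for that field. It assumes any sub-field
containing a comma is a nested list, which is a guess rather than a rule
from the spec, and it leaves the embedded record type entry unparsed:

```perl
use strict;
use warnings;

my $field = 'v1_5678;but;why;stick;with;a,single,scheme,when,you,can;'
          . 'change;midstream;for;fun;and;profit';

# First level: semicolons. The first entry is the embedded record type.
my ($header, @parts) = split /;/, $field, -1;

# Assumption: a part containing a comma is itself a comma-delimited list.
my @parsed = map { /,/ ? [ split /,/, $_, -1 ] : $_ } @parts;

print "header: $header\n";
for my $p (@parsed) {
    if (ref($p)) {
        print '  list: ', join(' ', @$p), "\n";
    }
    else {
        print "  item: $p\n";
    }
}
```

Keeping nested lists as array references lets later code tell a plain
sub-field from a sub-list without re-scanning the text.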

And finally, the last field uses @s, just to make things annoying for Perl
syntax highlighting :), especially with empty fields leading to multiple
@@@@ combos.
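
A sketch for that last field, with every separator written as a literal
@. The main traps are array interpolation and losing the empty fields:

```perl
use strict;
use warnings;

# Single quotes keep Perl from interpolating @not, @forget, etc. as arrays.
my $field = 'do@not@forget@special@characters@and@@@empty@fields';

# An unescaped @ can interpolate inside a regex too, so it is escaped.
# The -1 limit also preserves any trailing empty fields.
my @parts = split /\@/, $field, -1;

# Empty strings mark the empty fields between consecutive @ signs.
print join('|', map { length($_) ? $_ : '<empty>' } @parts), "\n";
```

The three @ in a row yield two empty strings between "and" and "empty".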

[0] The gloppings with undefined data descriptions will get passed on as
is, but I still need to be able to separate the parts at the tokens.
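
One possible shape for keeping this maintainable is a layout table keyed
by the record type and version, so a new format means a new table entry
rather than new parsing code. This is only a sketch: the %layout table,
the parse_record name, and the short sample record are all made up, and
only the "f2" layout is filled in.

```perl
use strict;
use warnings;

# Hypothetical layout table: type+version => separator for each field
# that follows the header.
my %layout = (
    f2 => [ ',', '~', ';', '@' ],
);

sub parse_record {
    my ($record) = @_;
    my ($header, @fields) = split /\|/, $record, -1;
    my ($tag, $id) = split /:/, $header, 2;
    my $seps = $layout{$tag} or die "no layout for record type '$tag'\n";

    my @parsed;
    for my $i (0 .. $#fields) {
        my $sep = $seps->[$i];
        # Fields without a known separator are passed through untouched.
        push @parsed, defined($sep)
            ? [ split /\Q$sep\E/, $fields[$i], -1 ]
            : $fields[$i];
    }
    return { tag => $tag, id => $id, fields => \@parsed };
}

my $rec = parse_record('f2:1234|a=b,c|x~y~z|v1_5;p;q|m@@n');
print "$rec->{tag} $rec->{id}: ", scalar @{ $rec->{fields} }, " fields\n";
```

The \Q...\E quoting means a separator never needs hand-escaping in the
table. Nested separators, such as commas inside the semicolon field,
would need a second level in the table.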

ciao,

der.hans
-- 
#  http://www.LuftHans.com/              https://LOPSA.org/
#  But getting smart is a tricky business. The smartest people I've ever met
#  are the ones who knew exactly what they were ignorant of. -- Alan Alda
#  Southamton commencement speech, 2007May18