[PLUG-Devel] tokenizing a string

Mon Jun 9 12:52:03 MST 2008

If I were solving this, I'd create a collection of objects to model the data.

Specifically I'd:

1) Create a base class that has methods to set/get a chunk of data, and a
method for parsing the data.   Call the method parseFld.
     a) This method would take an integer value as its only argument
and return the data from the corresponding field.
     b) parseFld doesn't do anything in the base class, but will be
overridden in subclasses.

2) Create a class that represents a single line of data, as a subclass
of the class from step 1.
    a) The method, parseFld, is implemented so that it returns an
instance of the objects created in step 3.
    b) The logic to determine the order and type of each sub-field can go here

3) Create a subclass for each sub-field, and implement the
parseFld method appropriately.
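
A rough sketch of the three steps, not a finished implementation. The
class names FldBase, CommaFld and AtFld, the field-to-class table, and
the shortened sample line are all made up for illustration. It also
assumes the " at " in the archived example data stands for "@".

```perl
use strict;
use warnings;

# Step 1: base class holding one chunk of data.
package FldBase;
sub new      { my ($class) = @_; return bless { data => '' }, $class }
sub setData  { my ($self, $data) = @_; $self->{data} = $data; return $self }
sub getData  { my ($self) = @_; return $self->{data} }
sub parseFld { return undef }    # does nothing here; subclasses override it

# Step 3: one subclass per sub-field type (both are hypothetical).
package CommaFld;
our @ISA = ('FldBase');
sub parseFld {
    my ($self, $n) = @_;
    # The -1 limit keeps trailing empty fields; fields are numbered from 1.
    my @parts = split /,/, $self->getData(), -1;
    return $parts[$n - 1];
}

package AtFld;
our @ISA = ('FldBase');
sub parseFld {
    my ($self, $n) = @_;
    my @parts = split /\@/, $self->getData(), -1;
    return $parts[$n - 1];
}

# Step 2: a whole line; parseFld returns a sub-field object.
package LineObj;
our @ISA = ('FldBase');
# Assumed layout: which class handles which pipe-delimited field.
my %class_for = (2 => 'CommaFld', 5 => 'AtFld');
sub parseFld {
    my ($self, $n) = @_;
    my @parts = split /\|/, $self->getData(), -1;
    my $raw   = $parts[$n - 1];
    my $class = $class_for{$n} || 'FldBase';
    return $class->new->setData(defined $raw ? $raw : '');
}

package main;
my $line = LineObj->new->setData(
    'f2:1234|fred=uwe,stuff|some~text|v1_5678;but;why|do@not@forget@special@characters'
);
print $line->parseFld(5)->parseFld(4), "\n";
```

In real code the table in LineObj would be built from the first field,
since that is what determines the layout of the rest of the line.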

When you're done, the code will be clean and easy to maintain.
Creating a new sub-field parser means creating a new class and
implementing the parseFld method.

Any strange parsing rules can be separated out into individual
parseFld routines, so you won't end up with miles of spaghetti code.
Because the code for parsing each type of field is separated, you can
easily test the parsing code.   I'd strongly recommend unit testing
with the Test::More module.
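
For example, a test for one sub-field parser might look roughly like
this. TildeFld is a hypothetical class, defined inline only so the
sketch runs on its own; normally it would live in its own module.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# Hypothetical sub-field parser for tilde-delimited data.
package TildeFld;
sub new { my ($class, $data) = @_; return bless { data => $data }, $class }
sub parseFld {
    my ($self, $n) = @_;
    my @parts = split /~/, $self->{data}, -1;
    return $parts[$n - 1];    # undef when the field does not exist
}

package main;
my $fld = TildeFld->new('some~tilde~~text');
is($fld->parseFld(2), 'tilde', 'second sub-field');
is($fld->parseFld(3), '',      'empty sub-field stays an empty string');
is($fld->parseFld(9), undef,   'missing sub-field is undef');
```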

When finished, getting the 4th sub-field from the 5th field would look like:

my $line_obj = LineObj->new;
$line_obj->setData($line_data);
my $the_thing_you_want = $line_obj->parseFld(5)->parseFld(4);

Some potential issues:
   a) Perl does not have a consistent way to store empty values.  Choose
ahead of time whether you want to use empty strings or undef to
represent empty fields.
   b) This solution parses the data when it is retrieved, so beware of
performance issues if you're retrieving the same field over and over
again.
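
One way to handle both issues, sketched under assumptions: CachedFld is
a hypothetical class that splits once and keeps the pieces, and it
picks undef as the single representation for empty and missing fields.

```perl
use strict;
use warnings;

# Hypothetical field class that parses once and caches the pieces.
package CachedFld;
sub new {
    my ($class, $data, $delim) = @_;
    return bless { data => $data, delim => $delim }, $class;
}
sub parseFld {
    my ($self, $n) = @_;
    # Split only on the first call. The -1 limit keeps trailing empty
    # fields, which split would otherwise drop.
    my $delim = quotemeta $self->{delim};
    $self->{parts} ||= [ split /$delim/, $self->{data}, -1 ];
    my $val = $self->{parts}[$n - 1];
    # One convention, picked up front: empty and missing are both undef.
    return (defined $val && length $val) ? $val : undef;
}

package main;
my $fld = CachedFld->new('and@@@empty@', '@');
for my $n (1 .. 5) {
    my $val = $fld->parseFld($n);
    printf "%d: %s\n", $n, defined $val ? "'$val'" : 'undef';
}
```

If the class also had a setData method, it would need to throw away
the cached pieces whenever the data changes.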

Just my 2 cents...

On Mon, Jun 9, 2008 at 11:36 AM, Jeffrey Uurtamo <kd7jny at gmail.com> wrote:
> I'd say use some form of regex to split the string into its four sub-strings
> and then split the substrings into fields you need.
>
> On Mon, Jun 9, 2008 at 11:31 AM, der.hans <PLUG at lufthans.com> wrote:
>>
>> moin moin,
>>
>> I have a bunch of data that's been glopped together as a string by someone
>> else. I need to pull it apart, do mysterious things on it ( the magic
>> happens step ) and then turn it over to others.
>>
>> Just to make it more fun for everyone involved, it's actually a glopping
>> of other text fields that have already been glopped together. The
>> different gloppings use different tokens and potentially different rules
>> for defining the data[0].
>>
>> The primary glopping is mostly the same every time, but there are a couple
>> of variables in how to determine the number of fields and the type of each
>> field. Some of the parts also vary. In each case the first field can be
>> broken up to determine the layout for the rest of the data.
>>
>> Once I've separated the fields I need to apply different parts of magic,
>> depending on the field.
>>
>> Any suggestions for a good, maintainable way to break the data into its
>> various pieces?
>>
>> I will be using Perl. That's not negotiable. Bash is an option, but
>> wouldn't fit in with the rest of the code base.
>>
>> I'm certain this can be done better than how I'd do it, so I'm soliciting
>> suggestions :).
>>
>> I'm certain the code will need to be adapted to future similar
>> formats. Heck, I'm pretty certain we'll find the current format doesn't
>> quite conform to the spec, so I'll have to adjust just to fit the actual
>> data...
>>
>> The data has had delimiters turned into HTML entities, so I theoretically
>> don't have to worry about a token being in the content, but I might have
>> to de-entify each of the sub-sub-fields.
>>
>> Bogus example of potential data:
>>
>> f2:1234|fred=uwe,stuff,$georg=foo,$bar=$georg,udo=$HOME/some/path,:r
>>
>> War_and_Peace.txt|some~tilde~seperated~text|v1_5678;but;why;stick;with;a,single,scheme,when,you,can;change;midstream;for;fun;and;profit|do@not@forget@special@characters@and@@@empty@fields
>>
>> The primary tokenizer is the pipe.
>>
>> The first field is broken up by a colon with the first character of the
>> first sub-field indicating the record type ( f == fred, of course ), the
>> second character indicates the version type, so we have a version 2 fred.
>> The second subfield is the record ID.
>>
>> The second field is a comma delimited set of data with unknown rules of
>> what might be in the fields. I just need to break those up and display
>> them. Well, they might ask for more, but that's all they're going to get.
>>
>> The 3rd field uses tildes as the field separator.
>>
>> The 4th field also has a record type entry, but the record type might
>> or might not actually break up differently than the primary record type
>> entry. It'll definitely be a different type of record. No recursive
>> records and I'm not going to bring that concept up to them :). The 4th
>> field has semi-colon separators, but one of the sub-fields is in turn a
>> collection of comma delimited entries.
>>
>> And finally using @s just to make things annoying for Perl syntax
>> highlighting :), especially with empty fields leading to multiple @@@@
>> combos.
>>
>> [0] The gloppings with undefined data descriptions will get passed on as
>> is, but I still need to be able to separate the parts at the tokens.
>>
>> ciao,
>>
>> der.hans
>> --
>> #  http://www.LuftHans.com/              https://LOPSA.org/
>> #  But getting smart is a tricky business. The smartest people I've ever
>> #  met are the ones who knew exactly what they were ignorant of. -- Alan Alda
>> #  Southampton commencement speech, 2007May18
>> _______________________________________________
>> PLUG-devel mailing list  -  PLUG-devel at lists.PLUG.phoenix.az.us
>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-devel