[PLUG-Devel] tokenizing a string
Michael R. Crusoe
michael at qrivy.net
Mon Jun 9 12:38:28 MST 2008
Looks like ANTLR 3 has a Perl target:
http://www.antlr.org/wiki/display/ANTLR3/Antlr3PerlTarget
I'm using it on a current project and am in love.
On Mon, Jun 9, 2008 at 11:31 AM, der.hans <PLUG at lufthans.com> wrote:
> moin moin,
>
> I have a bunch of data that's been glopped together as a string by someone
> else. I need to pull it apart, do mysterious things to it (the magic
> happens step) and then hand it over to others.
>
> Just to make it more fun for everyone involved, it's actually a glopping
> of other text fields that have already been glopped together. The
> different gloppings use different tokens and potentially different rules
> for defining the data[0].
>
> The primary glopping is mostly the same every time, but there are a couple
> of variables in how to determine the number of fields and the type of each
> field. Some of the parts also vary. In each case the first field can be
> broken up to determine the layout for the rest of the data.
>
> Once I've separated the fields I need to apply different parts of magic,
> depending on the field.
>
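One maintainable shape for this is a layout table keyed on the record tag from the first field, so a new format means a new table entry rather than new parsing code. The following is only a sketch: `%layout` and `split_record` are hypothetical names, and the delimiters listed for `f2` are read off the bogus example further down in this message, not taken from any real spec.

```perl
use strict;
use warnings;

# Hypothetical layout table: key is record type + version, value lists the
# delimiter for each field after the first. Only "f2" appears in this thread.
my %layout = (
    f2 => [ ',', '~', ';', '@' ],
);

sub split_record {
    my ($record) = @_;
    # Limit -1 keeps trailing empty fields.
    my ($head, @rest) = split /\|/, $record, -1;
    die "empty record\n" unless defined $head && length $head;

    my ($tag, $id) = split /:/, $head, 2;
    my $delims = $layout{$tag} or die "no layout for record type '$tag'\n";
    die "expected " . @$delims . " fields, got " . @rest . "\n"
        unless @rest == @$delims;

    my @parts = map {
        my $d = quotemeta $delims->[$_];    # treat the delimiter literally
        [ split /$d/, $rest[$_], -1 ];
    } 0 .. $#rest;

    return { tag => $tag, id => $id, parts => \@parts };
}

# Single quotes so Perl does not try to interpolate the @s.
my $rec = split_record('f2:1234|a,b|c~d|e;f|g@@h');
print "tag=$rec->{tag} id=$rec->{id}\n";
```

Unknown tags die early, which is probably what you want when the data turns out not to match the spec.
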
> Any suggestions for a good, maintainable way to break the data into its
> various pieces?
>
> I will be using Perl. That's not negotiable. Bash is an option, but
> wouldn't fit in with the rest of the code base.
>
> I'm certain this can be done better than how I'd do it, so I'm soliciting
> suggestions :).
>
> I'm certain the code will need to be adapted to future similar
> formats. Heck, I'm pretty certain we'll find the current format doesn't
> quite conform to the spec, so I'll have to adjust just to fit the actual
> data...
>
> The data has had delimiters turned into HTML entities, so I theoretically
> don't have to worry about a token being in the content, but I might have
> to de-entify each of the sub-sub-fields.
>
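On the de-entifying: the order matters, split first and decode last, otherwise a decoded delimiter turns back into a token. A minimal sketch with no CPAN dependencies follows; `de_entify` and the `%named` table are hypothetical, and which entities the upstream encoder actually emits is an assumption (the `&#124;` in the sample is my own made-up input).

```perl
use strict;
use warnings;

# Assumed set of named entities; extend to match what the data really contains.
my %named = (amp => '&', lt => '<', gt => '>', quot => '"');

sub de_entify {
    my ($text) = @_;
    # One pass only, so a decoded '&' cannot start a second round of decoding.
    # Unknown entity names are left untouched.
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z]+));}{
        defined $1 ? chr(hex $1)
      : defined $2 ? chr($2)
      : exists $named{$3} ? $named{$3}
      : '&' . $3 . ';'
    }ge;
    return $text;
}

# Split on the raw pipe first, then decode each piece.
my @fields = map { de_entify($_) } split /\|/, 'a&#124;b|c&amp;d', -1;
print join("\n", @fields), "\n";
```
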
> Bogus example of potential data:
>
> f2:1234|fred=uwe,stuff,$georg=foo,$bar=$georg,udo=$HOME/some/path,:r War_and_Peace.txt|some~tilde~seperated~text|v1_5678;but;why;stick;with;a,single,scheme,when,you,can;change;midstream;for;fun;and;profit|do@not@forget@special@characters@and@@@empty@fields
>
> The primary tokenizer is the pipe.
>
> The first field is broken up by a colon. The first character of the
> first sub-field indicates the record type ( f == fred, of course ) and the
> second character indicates the version, so we have a version 2 fred.
> The second sub-field is the record ID.
>
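For the outer layer plain `split` may be all you need. A rough sketch of the pipe split plus the header parse, using a shortened version of your example record; the assumption that the version is always a single digit is mine.

```perl
use strict;
use warnings;

# Single quotes keep Perl from interpolating the @s (shortened sample record).
my $record = 'f2:1234|fred=uwe,stuff|some~tilde~text|v1_5678;but;why|do@not@@forget';

# A limit of -1 keeps trailing empty fields instead of silently dropping them.
my @fields = split /\|/, $record, -1;

# First field: "<type char><version char>:<record id>".
my ($tag, $id) = split /:/, $fields[0], 2;
my ($type, $version) = $tag =~ /\A([A-Za-z])(\d)\z/
    or die "unrecognised record tag '$tag'\n";

printf "type=%s version=%s id=%s fields=%d\n",
    $type, $version, $id, scalar @fields;
```
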
> The second field is a comma-delimited set of data with unknown rules of
> what might be in the fields. I just need to break those up and display
> them. Well, they might ask for more, but that's all they're going to get.
>
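Breaking up and displaying the second field could look roughly like this. Treating `name=value` entries as pairs is an assumption on my part, since the rules for that field are unknown; entries without an `=` are shown as is. `describe_items` is a hypothetical helper name.

```perl
use strict;
use warnings;

sub describe_items {
    my ($field) = @_;
    my @lines;
    for my $item (split /,/, $field, -1) {
        # Split on the first '=' only, so values may contain '=' themselves.
        my ($name, $value) = split /=/, $item, 2;
        push @lines, defined $value ? "$name => $value" : $item;
    }
    return @lines;
}

# Single quotes: $georg, $bar and $HOME are data here, not Perl variables.
my $field = 'fred=uwe,stuff,$georg=foo,$bar=$georg,udo=$HOME/some/path,:r War_and_Peace.txt';
print "$_\n" for describe_items($field);
```
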
> The 3rd field uses tildes as the field separator.
>
> The 4th field also has a record type entry, but the record type might
> or might not actually break up differently than the primary record type
> entry. It'll definitely be a different type of record. No recursive
> records and I'm not going to bring that concept up to them :). The 4th
> field has semicolon separators, but one of the sub-fields is in turn a
> collection of comma-delimited entries.
>
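The nested field can be handled with two levels of `split`. In this sketch the header layout `<type char><version char>_<id>` is purely my guess from `v1_5678`; you said it may break up differently from the primary header, so adjust the pattern to whatever the spec says.

```perl
use strict;
use warnings;

my $field = 'v1_5678;but;why;stick;with;a,single,scheme,when,you,can;change;midstream;for;fun;and;profit';

my ($head, @subfields) = split /;/, $field, -1;

# Guessed header layout, see above.
my ($type, $version, $id) = $head =~ /\A([A-Za-z])(\d)_(.*)\z/s
    or die "unrecognised nested header '$head'\n";

# Split every sub-field on commas. Sub-fields without a comma become
# one-element lists, so callers can treat them all alike; an empty
# sub-field is kept as a list holding one empty string.
my @parsed = map { [ length($_) ? split(/,/, $_, -1) : '' ] } @subfields;

for my $i (0 .. $#parsed) {
    print "$i: ", join(' | ', @{ $parsed[$i] }), "\n";
}
```
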
> And finally using @s just to make things annoying for Perl syntax
> highlighting :), especially with empty fields leading to multiple @@@@
> combos.
>
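Two things to watch for with the @-separated field: keep the data in single quotes (or read it from a file) so Perl never tries to interpolate it, and pass a negative limit to `split`, since by default it drops trailing empty fields. A small sketch:

```perl
use strict;
use warnings;

# Single quotes: inside double quotes Perl would try to interpolate @not etc.
my $field = 'do@not@forget@special@characters@and@@@empty@fields';

# Empty fields in the middle are always kept.
my @parts = split /\@/, $field, -1;
print scalar(@parts), " parts\n";

# Trailing empty fields survive only with the -1 limit.
my @dropped = split /\@/, 'a@b@@';
my @kept    = split /\@/, 'a@b@@', -1;
print scalar(@dropped), " without limit, ", scalar(@kept), " with -1\n";
```
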
> [0] The gloppings with undefined data descriptions will get passed on as
> is, but I still need to be able to separate the parts at the tokens.
>
> ciao,
>
> der.hans
> --
> # http://www.LuftHans.com/ https://LOPSA.org/
> # But getting smart is a tricky business. The smartest people I've ever met
> # are the ones who knew exactly what they were ignorant of. -- Alan Alda
> # Southampton commencement speech, 2007May18