Help with Regular Expression

Alexander LeDonne plug-discuss@lists.plug.phoenix.az.us
Tue, 3 Dec 2002 18:48:56 -0800 (PST)


One quick note... you indicated that any of \n, \r, or \r\n should be
considered a newline, and that you wished to preserve any double
newlines. The two steps suggested will lose any \r\r doubles. Perhaps
three steps...

# convert \r\n to \n
# convert remaining \r to \n (that is, where \r is a newline on its own)
# remove isolated newlines (David's second step)

This assumes that in a single unit of text to be matched against, \r
and \n cannot both be standalone newlines (a reasonable assumption, I
think).
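
The three steps above can be sketched as follows. This is only a minimal
illustration, written in Python rather than Perl or PHP; the function name
unwrap_lines and its replacement parameter are made up for the example. Step 3
uses lookarounds so that the characters on either side of the newline are not
consumed. The same pattern should carry over to Perl or to PHP's preg_replace,
though that is untested here.

```python
import re


def unwrap_lines(text, replacement=" "):
    """Replace isolated newlines, keeping runs of two or more.

    Hypothetical helper: assumes \r and \n are not both used as
    standalone newlines within the same text.
    """
    # Step 1: convert \r\n to \n
    text = text.replace("\r\n", "\n")
    # Step 2: convert remaining \r to \n (\r used as a newline on its own)
    text = text.replace("\r", "\n")
    # Step 3: replace any newline that has no newline on either side
    return re.sub(r"(?<!\n)\n(?!\n)", replacement, text)


print(repr(unwrap_lines("one\r\ntwo\r\rthree")))  # 'one two\n\nthree'
```

Note that paragraph breaks come out normalized to \n\n whatever the original
line ending was.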

-Alex

PS - I couldn't resist the exercise. I think this works, at least in Perl
(I don't know about the PHP implementation) [ignore line wrapping...]

s/((?<!(\r\n))(\r\n)(?!(\r\n)))|((?<!(\r(?!\n)))(\r(?!\n))(?!(\r(?!\n))))|((?<!((?<!\r)\n))((?<!\r)\n)(?!((?<!\r)\n)))
//mg

 , but don't do that. Nested negative zero-width assertions are
amusing, but ugly and slow. :)
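
For comparison, a shorter single-pass pattern looks possible under the same
assumption about \r and \n: match one newline sequence that has no \r or \n
immediately before or after it. The sketch below is in Python, not Perl; the
names LONE_NEWLINE and strip_lone_newlines are invented for the example, and
it is an alternative to the regex above, not a transcription of it.

```python
import re

# One newline sequence (\r\n, \r, or \n) with no \r or \n directly
# before or after it. Hypothetical pattern, assuming \r and \n are not
# both used as standalone newlines in the same text.
LONE_NEWLINE = re.compile(r"(?<![\r\n])(?:\r\n|\r|\n)(?![\r\n])")


def strip_lone_newlines(text, replacement=""):
    """Remove (or replace) isolated newlines in a single pass."""
    return LONE_NEWLINE.sub(replacement, text)


print(repr(strip_lone_newlines("one\r\ntwo\r\rthree")))  # 'onetwo\r\rthree'
```

Unlike the multi-step approach, this leaves the doubled newlines in their
original form (\r\r stays \r\r).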

--- plug-discuss-request@lists.plug.phoenix.az.us wrote:

> Thanks David!!
> 
> I was trying to do it all in one shot and was getting some "amusing"
> results.
> Your method is much more straightforward and easier to understand.
> Peter
> 
> On 3 Dec 2002 at 11:21, David A. Sinck wrote:
> 
> > 
> > 
> > \_ SMTP quoth az_pete@cactusfamily.com on 12/3/2002 11:04 as having spake thusly:
> > \_
> > \_ Hi All, 
> > \_ 
> > \_ I seem to be having a lot of trouble with what seems should be a
> > \_ simple regex.
> > \_ 
> > \_ I have a database full of research paper abstracts and I would like
> > \_ to strip all newlines from them. This would include \n, \r, and
> > \_ \r\n characters.  However, if there are two consecutive newlines
> > \_ (i.e. new paragraph) I would like to keep those intact.
> > \_ 
> > \_ I have written the script in PHP to pull each field from the
> > \_ database, perform said regex and then update the field with the new
> > \_ data.  All I need is a regex that works.  I'm using the Perl
> > \_ compatible regex within PHP.
> > \_ 
> > \_ Any help would be appreciated.
> > 
> > I'd do two passes for ease of thought:
> > 
> > s/\r//g;  # lose all \r's, regardless
> > 
> > s/(?<!\n)\n(?!\n)/ /g;  # a newline with no newline on either side goes to space
> > 
> > YMMV.
> > 
> > Trying to do both in one could prove more amusing and is left as an
> > exercise for the reader.
> > 
> > Backups are your friend.
> > 
> > David

