Python help (finding duplicates)

Sat Aug 28 11:50:48 MST 2010

If you are just looking for a list of unique values, why not just do:

cat file1 file2 | sort | uniq > file3
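
If you would rather stay in Python, here is a rough sketch of the same idea. It assumes Python 3 and one entry per line; merge_unique is just a name I made up for illustration. It is not an exact clone of the pipeline: it strips whitespace, skips blank lines, and sorts by plain string comparison rather than by locale.

```python
def merge_unique(paths, out_path):
    """Write the sorted, de-duplicated lines of all input files to out_path."""
    unique = set()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                entry = line.strip()
                if entry:  # skip blank lines
                    unique.add(entry)

    ordered = sorted(unique)  # plain string ordering, not numeric
    with open(out_path, "w") as out:
        for entry in ordered:
            out.write(entry + "\n")
    return ordered
```

Calling merge_unique(['file1', 'file2'], 'file3') would then stand in for the command above. With lists of 852 and 3300 entries, holding everything in a set is not a problem.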

Obviously you could have reasons why this won't suffice for your needs, but
I have not seen any in your description yet.
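
If the reason is that the duplicates need to be marked rather than dropped, something along these lines might work. It is only a sketch, assuming Python 3 and one entry per line; mark_duplicates and the separate output file are my own inventions, not anything from the original script. It reads the first file into a set, then copies the second file to a new file with a '#' in front of every entry that also appears in the first.

```python
def mark_duplicates(primary_path, secondary_path, out_path):
    """Copy secondary_path to out_path, prefixing entries found in primary_path with '#'."""
    # Load the primary file once; set membership tests are fast.
    with open(primary_path) as primary:
        seen = {line.strip() for line in primary if line.strip()}

    marked = 0
    with open(secondary_path) as secondary, open(out_path, "w") as out:
        for line in secondary:
            entry = line.strip()
            if entry and entry in seen:
                out.write("#" + entry + "\n")  # flag the duplicate
                marked += 1
            else:
                out.write(line)  # pass everything else through unchanged
    return marked
```

Writing to a separate file sidesteps two things in the quoted script that I am sure of: secondaryfile is opened read-only, so write('#') cannot succeed, and a file object can only be iterated once, so the inner loop finds nothing on the second and later passes unless it is rewound with seek(0). Neither of those explains the inner loop never running at all, so it may also be worth checking that /tmp/extract is not empty.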

On Sat, Aug 28, 2010 at 9:48 AM, Kevin Faulkner <
kondor6c at encryptedforest.net> wrote:

> Sorry about the time issue.
> On Friday 27 August 2010 23:50:00 you wrote:
> > I hope these are small files, the algorithm you wrote is not going to run
> > well as file size gets large (over 10,000 entries). Have you checked the
> > space/tab situation?  Python uses indentation changes to indicate the end
> > of a block, so inconsistent use of tabs and spaces freaks it out. Here are
> > a couple questions:
> This is not a school project, so you won't be doing my homework or
> anything :)
> The space/tab issue is okay, but the script does not even get to the
> print(i). I even tried "for line in secondaryfile:" and the for loop still
> wouldn't be executed.
> > Are these always numbers?
> Yes, they are IPs from an Apache error log.
> > Do the files have to remain in their original order, or can you reorder
> > them during processing? How often does this have to run?
> They are not in order because one list is 852 entries and another list is
> 3300 entries. This script only needs to run once.
> > Do you have to "comment" the duplicate, or can you remove it?
> The plan is to remove it, but I wanted to see if my removal method would
> work, so I was trying to put a comment next to it.
> > Are there any other requirements not obvious from the description below?
> No real requirements. If anyone would like the original files I can give
> them to you; a lot of the entries are bots.
> Thank you :)
> -Kevin
> >
> > Kevin Faulkner wrote:
> > > I was trying to pull duplicates out of 2 different files. Needless to
> > > say there are duplicates; I would place a # next to the duplicate.
> > > Example files:
> > > file 1:     file 2:
> > > 433.3       947.3
> > > 543.1       749.0
> > > 741.1       859.2
> > > 238.5       433.3
> > > 839.2       229.1
> > > 583.6       990.1
> > > 863.4       741.1
> > > 859.2       101.8
> > >
> > > import string
> > > i=1
> > > primaryfile = open('/tmp/extract','r')
> > > secondaryfile = open('/tmp/unload')
> > >
> > > for line in primaryfile:
> > >    pcompare = line
> > >    print(pcompare)
> > >
> > >    for row in secondaryfile:
> > >      i = i + 1
> > >      print(i)
> > >      scompare = row
> > >
> > >      if pcompare == scompare:
> > >        print(scompare)
> > >        secondaryfile.write('#')
> > >
> > > With this code it should go through the files and find a duplicate and
> > > place a '#' next to it. But for some reason it doesn't even get to
> > > the second for statement. I don't know what else to do. Please offer
> > > some assistance. :)
> > > ---------------------------------------------------
> > > PLUG-discuss mailing list - PLUG-discuss at lists.plug.phoenix.az.us
> > > To subscribe, unsubscribe, or to change your mail settings:
> > > http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss

-- 
Dazed_75 a.k.a. Larry

The spirit of resistance to government is so valuable on certain occasions,
that I wish it always to be kept alive.
  - Thomas Jefferson