Python help (finding duplicates)

Kevin Faulkner kondor6c at encryptedforest.net
Sat Aug 28 16:12:23 MST 2010


On Saturday 28 August 2010 11:48:10 Joseph Sinclair wrote:
> OK,
>   I've attached a complete program that works, if you want to just get it
> done, but I've also described what went wrong in your first attempt below.
> 
I really appreciate what you have done, and I especially like the description
of what I did wrong. Using readlines() is a better approach, like you said: less
disk thrashing. I was using /usr/bin/python3, so print() is now a function. My
next step is to take the host list and identify where each IP is located using
pygeoip.
Thank you again. :)
-Kevin
> # the i value was just for debugging, so I dropped it
> primaryfile = open('/tmp/extract','r')
> # read the primary file into a list for speed and so you aren't reading
> # more than once
> primary_lines = primaryfile.readlines()
> # you didn't specify a mode for this, so it defaulted to read-only.  Be
> # explicit for clarity
> secondaryfile = open('/tmp/unload', 'r')
> # Open a separate file for output, otherwise you would have been writing
> # and reading the same file over and over again, which usually causes errors
> outputfile = open('/tmp/result-file', 'w')
> # read the second file into a list, then you can scan through it over and
> # over without hammering disk and re-reading a file you might have modified.
> secondary_lines = secondaryfile.readlines()
> # print is a statement, not a function.
> print 'opened files'
> # loop through the list, not the file
> for line in primary_lines:
>    pcompare = line
>    # print is a statement, use the formatting operator to print variable
>    # values
>    print 'primary line = %s' % (pcompare)
>    # loop through the list, not the file
>    for row in secondary_lines:
>      scompare = row
>      if pcompare == scompare:
>        # print as a statement, not a function
>        print 'secondary line = %s' % (scompare)
>        # you were writing random # characters in a file (most likely after
>        # the line read), this writes a comment to a new file, which is usually
>        # clearer.
>        # invert the test, and add the line to a set here then write out
>        # the set at the end to get an output of lines without duplication.
>        outputfile.write('#%s' % (scompare))
> print 'Done'
> 
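
For anyone else on Python 3, here is a rough sketch of the set-based variant
suggested in the comments above: collect the entries of one file in a set, then
test each entry of the other against it, which avoids rescanning one list for
every entry in the other. The function names find_duplicates and
mark_duplicates are made up for illustration, and stripping whitespace before
comparing is an added assumption (it keeps a missing final newline from hiding
a match). The paths in the usage comment are the ones from the program above.

```python
def find_duplicates(primary_lines, secondary_lines):
    """Return entries of primary_lines that also occur in secondary_lines.

    Hypothetical helper: entries are compared after stripping whitespace,
    and each duplicate is reported once, in primary-file order.
    """
    secondary = {row.strip() for row in secondary_lines if row.strip()}
    seen = set()
    duplicates = []
    for line in primary_lines:
        entry = line.strip()
        if entry and entry in secondary and entry not in seen:
            seen.add(entry)
            duplicates.append(entry)
    return duplicates


def mark_duplicates(primary_path, secondary_path, output_path):
    """Write each duplicate to output_path, prefixed with '#'."""
    with open(primary_path) as primary, open(secondary_path) as secondary:
        duplicates = find_duplicates(primary, secondary)
    with open(output_path, 'w') as output:
        for entry in duplicates:
            output.write('#%s\n' % entry)
    return duplicates


# The example data from the original post:
file1 = ['433.3\n', '543.1\n', '741.1\n', '238.5\n',
         '839.2\n', '583.6\n', '863.4\n', '859.2\n']
file2 = ['947.3\n', '749.0\n', '859.2\n', '433.3\n',
         '229.1\n', '990.1\n', '741.1\n', '101.8\n']
print(find_duplicates(file1, file2))  # ['433.3', '741.1', '859.2']

# On the real files this would be something like:
# mark_duplicates('/tmp/extract', '/tmp/unload', '/tmp/result-file')
```

To get the de-duplicated output instead, the same set can be used with the
test inverted (keep the entries that are not in it).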

> Kevin Faulkner wrote:
> > Sorry about the time issue.
> > 
> > On Friday 27 August 2010 23:50:00 you wrote:
> >> I hope these are small files, the algorithm you wrote is not going to
> >> run well as file size gets large (over 10,000 entries). Have you checked
> >> the space/tab situation?  Python uses indentation changes to indicate
> >> the end of a block, so inconsistent use of tabs and spaces freaks it
> >> out. Here are a couple questions:
> > 
> > This is not a school project, so you won't be doing my homework or
> > anything :) The space/tab issue is okay, but the script does not even
> > get to the print(i), I even tried for line in secondaryfile: and the for
> > loop still wouldn't be executed.
> > 
> >> Are these always numbers?
> > 
> > Yes, they are IPs from an Apache error log.
> > 
> >> Do the files have to remain in their original order, or can you reorder
> >> them during processing? How often does this have to run?
> > 
> > They are not in order because one list is 852 entries and another list is
> > 3300 entries. This script only needs to run once.
> > 
> >> Do you have to "comment" the duplicate, or can you remove it?
> > 
> > The plan is to remove it, but I wanted to see if my removal method would
> > work, so I was trying to put a comment next to it.
> > 
> >> Are there any other requirements not obvious from the description below?
> > 
> > No real requirements, if anyone would like the original files I can give
> > them to you, a lot of them are bots.
> > Thank you :)
> > -Kevin
> > 
> >> Kevin Faulkner wrote:
> >>> I was trying to pull duplicates out of 2 different files. Needless to
> >>> say there are duplicates; I would place a # next to each duplicate.
> >>> Example files:
> >>> file 1:	file 2:
> >>> 433.3	947.3
> >>> 543.1	749.0
> >>> 741.1	859.2
> >>> 238.5	433.3
> >>> 839.2	229.1
> >>> 583.6	990.1
> >>> 863.4	741.1
> >>> 859.2	101.8
> >>> 
> >>> import string
> >>> i=1
> >>> primaryfile = open('/tmp/extract','r')
> >>> secondaryfile = open('/tmp/unload')
> >>> 
> >>> for line in primaryfile:
> >>>    pcompare = line
> >>>    print(pcompare)
> >>>    
> >>>    for row in secondaryfile:
> >>>      i = i + 1
> >>>      print(i)
> >>>      scompare = row
> >>>      
> >>>      if pcompare == scompare:
> >>>        print(scompare)
> >>>        secondaryfile.write('#')
> >>> 
> >>> With this code it should go through the files and find a duplicate and
> >>> place a '#' next to it. But for some reason it doesn't even get to
> >>> the second for statement. I don't know what else to do. Please offer
> >>> some assistance. :)
> >>> ---------------------------------------------------
> >>> PLUG-discuss mailing list - PLUG-discuss at lists.plug.phoenix.az.us
> >>> To subscribe, unsubscribe, or to change your mail settings:
> >>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
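
A note on the original nested loops, independent of whatever kept the inner
loop from running at all: a file object is an iterator, so once the inner for
has read /tmp/unload to the end, later passes over the same object yield
nothing. A minimal sketch, using io.StringIO as a stand-in for the real file
(an assumption, so it runs without touching disk):

```python
import io

# io.StringIO stands in for the file opened from /tmp/unload (assumption).
secondaryfile = io.StringIO('947.3\n433.3\n741.1\n')

first_pass = [row for row in secondaryfile]
second_pass = [row for row in secondaryfile]  # already at end of file

print(len(first_pass))   # 3
print(len(second_pass))  # 0

# Rewinding makes another scan possible; reading everything into a list
# once and looping over the list avoids the problem altogether.
secondaryfile.seek(0)
third_pass = [row for row in secondaryfile]
print(len(third_pass))   # 3
```

Reading the file into a list once, as readlines() does, sidesteps this.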

