Python help (finding duplicates)

Joseph Sinclair plug-discussion at
Sat Aug 28 18:14:17 MST 2010

Glad I could help.
Python 3 is still uncommon in most distros, so I generally assume Python 2.6 or 2.7 unless you specify otherwise.
Look at the complete implementation as well; there are a couple of nifty tricks there that make the code simpler and a lot faster.

Kevin Faulkner wrote:
> On Saturday 28 August 2010 11:48:10 Joseph Sinclair wrote:
>> OK,
>>   I've attached a complete program that works, if you want to just get it
>> done, but I've also described what went wrong in your first attempt below.
> I really appreciate what you have done. I more so like the description of what 
> I did wrong. Using readlines() is a better approach like you said, less disk 
> thrashing. I was using /usr/bin/python3, so print() is now a function. My next 
> step is to take the host list and identify where the IP is using pygeoip.
> Thank you again. :)
> -Kevin
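
One more reason the list helps, beyond disk thrashing: a file object can only be looped over once before it hits end-of-file. As far as I can tell, that alone would have kept the inner loop in your original script from doing anything after the first primary line. Here is a small python3 sketch of that behavior. io.StringIO is only a stand-in for an open file so the example runs without /tmp/unload, and the sample values come from your example files.

```python
import io

# Stand-in for open('/tmp/unload', 'r'); an assumption so no real file is needed.
secondaryfile = io.StringIO("947.3\n749.0\n859.2\n")

# The first loop over the file object reads every line.
first_pass = [row for row in secondaryfile]

# The position is now at end-of-file, so a second loop gets nothing.
second_pass = [row for row in secondaryfile]

# Rewind, then read everything into a list that can be scanned repeatedly.
secondaryfile.seek(0)
secondary_lines = secondaryfile.readlines()

print(len(first_pass), len(second_pass), len(secondary_lines))  # prints: 3 0 3
```

Calling seek(0) before every inner loop would also work, but it re-reads the file each time, which is the disk thrashing the list avoids.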
>> # the i value was just for debugging, so I dropped it
>> primaryfile = open('/tmp/extract','r')
>> # read the primary file into a list for speed and so you aren't reading
>> # more than once
>> primary_lines = primaryfile.readlines()
>> # you didn't specify a mode for this, so it defaulted to read-only.  Be
>> # explicit for clarity
>> secondaryfile = open('/tmp/unload', 'r')
>> # Open a separate file for output, otherwise you would have been writing
>> # and reading the same file over and over again, which usually causes errors
>> outputfile = open('/tmp/result-file', 'w')
>> # read the second file into a list, then you can scan through it over and
>> # over without hammering disk and re-reading a file you might have modified.
>> secondary_lines = secondaryfile.readlines()
>> # print is a statement, not a function.
>> print 'opened files'
>> # loop through the list, not the file
>> for line in primary_lines:
>>    pcompare = line
>>    # print is a statement, use the formatting operator to print variable
>>    # values
>>    print 'primary line = %s' % (pcompare)
>>    # loop through the list, not the file
>>    for row in secondary_lines:
>>      scompare = row
>>      if pcompare == scompare:
>>        # print as a statement, not a function
>>        print 'secondary line = %s' % (scompare)
>>        # you were writing random # characters in a file (most likely after
>>        # the line read), this writes a comment to a new file, which is
>>        # usually clearer.
>>        # invert the test, and add the line to a set here then write out
>>        # the set at the end to get an output of lines without duplication.
>>        outputfile.write('#%s' % (scompare))
>> print 'Done'
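
The set suggestion in the last comment of that listing can be sketched roughly as follows. This is a minimal illustration, not the attached program. It assumes the goal is the list of secondary entries that do not appear in the primary file, and the function name and sample data (taken from the example files further down) are made up for the example. It runs under python3 as well as 2.7.

```python
def lines_not_in_primary(primary_lines, secondary_lines):
    """Return the secondary entries that never appear in primary_lines."""
    # Strip newlines and stray whitespace so '433.3\n' and '433.3' compare equal.
    primary = set(line.strip() for line in primary_lines)
    kept = set()
    for row in secondary_lines:
        entry = row.strip()
        # Inverted test: keep the entry only when it is NOT a duplicate.
        if entry and entry not in primary:
            kept.add(entry)
    # Sets are unordered, so sort (as strings) for repeatable output.
    return sorted(kept)


primary_lines = ['433.3\n', '543.1\n', '741.1\n', '238.5\n',
                 '839.2\n', '583.6\n', '863.4\n', '859.2\n']
secondary_lines = ['947.3\n', '749.0\n', '859.2\n', '433.3\n',
                   '229.1\n', '990.1\n', '741.1\n', '101.8\n']

result = lines_not_in_primary(primary_lines, secondary_lines)
print(result)  # ['101.8', '229.1', '749.0', '947.3', '990.1']
```

A set membership test takes roughly constant time, so this makes one pass over each list instead of comparing every primary line against every secondary line.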
>> Kevin Faulkner wrote:
>>> Sorry about the time issue.
>>> On Friday 27 August 2010 23:50:00 you wrote:
>>>> I hope these are small files, the algorithm you wrote is not going to
>>>> run well as file size gets large (over 10,000 entries) Have you checked
>>>> the space/tab situation?  Python uses indentation changes to indicate
>>>> the end of a block, so inconsistent use of tabs and spaces freaks it
>>>> out. Here are
>>>> a couple questions:
>>> This is not a school project, so you won't be doing my homework or
>>> anything :) The space/tab issue is okay, but the script does not even
>>> get to the print(i), I even tried for line in secondaryfile: and the for
>>> loop still wouldn't be executed.
>>>> Are these always numbers?
>>> Yes, they are IP's from an Apache error log.
>>>> Do the files have to remain in their original order, or can you reorder
>>>> them during processing? How often does this have to run?
>>> they are not in order because one list is 852 entries and another list is
>>> 3300 entries. This script only needs to run once.
>>>> Do you have to "comment" the duplicate, or can you remove it?
>>> The plan is to remove it, but I wanted to see if my removal method would
>>> work, so I was trying to put a comment next to it.
>>>> Are there any other requirements not obvious from the description below?
>>> No real requirements, if anyone would like the original files I can give
>>> them to you, a lot of them are bots.
>>> Thank you :)
>>> -Kevin
>>>> Kevin Faulkner wrote:
>>>>> I was trying to pull duplicates out of 2 different files. Needless to
>>>>> say there are duplicates I would place a # next to the duplicate.
>>>>> Example files:
>>>>> file 1:	file 2:
>>>>> 433.3	947.3
>>>>> 543.1	749.0
>>>>> 741.1	859.2
>>>>> 238.5	433.3
>>>>> 839.2	229.1
>>>>> 583.6	990.1
>>>>> 863.4	741.1
>>>>> 859.2	101.8
>>>>> import string
>>>>> i=1
>>>>> primaryfile = open('/tmp/extract','r')
>>>>> secondaryfile = open('/tmp/unload')
>>>>> for line in primaryfile:
>>>>>    pcompare = line
>>>>>    print(pcompare)
>>>>>    for row in secondaryfile:
>>>>>      i = i + 1
>>>>>      print(i)
>>>>>      scompare = row
>>>>>      if pcompare == scompare:
>>>>>        print(scompare)
>>>>>        secondaryfile.write('#')
>>>>> With this code it should go through the files and find a duplicate and
>>>>> place a '#' next to it. But for some reason it doesn't even get to
>>>>> the second for statement. I don't know what else to do. Please offer
>>>>> some assistance. :)
>>>>> ---------------------------------------------------
>>>>> PLUG-discuss mailing list - PLUG-discuss at
>>>>> To subscribe, unsubscribe, or to change your mail settings:
