Re: Python help (finding duplicates)

Author: Joseph Sinclair
Date:  
To: Main PLUG discussion list
Subject: Re: Python help (finding duplicates)
OK,
I've attached a complete program that works, if you just want to get it done. I've also described below what went wrong in your first attempt.

# the i value was just for debugging, so I dropped it
primaryfile = open('/tmp/extract','r')
# read the primary file into a list for speed and so you aren't reading more than once
primary_lines = primaryfile.readlines()
# you didn't specify a mode for this, so it defaulted to read-only.  Be explicit for clarity
secondaryfile = open('/tmp/unload', 'r')
# Open a separate file for output. The secondary file was opened read-only, so writing to it raises an error, and reading and writing the same file inside one loop is hard to get right anyway.
outputfile = open('/tmp/result-file', 'w')
# read the second file into a list, then you can scan through it over and over without hammering disk and re-reading a file you might have modified.
secondary_lines = secondaryfile.readlines()
# In Python 2, print is a statement, not a function.
print 'opened files'
# loop through the list, not the file
for line in primary_lines:
   pcompare = line
   # print is a statement, use the formatting operator to print variable values
   print 'primary line = %s' % (pcompare)
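   # loop through the list, not the file: a file object is used up after one pass, so the inner loop would find nothing on the second and later primary lines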
   for row in secondary_lines:
     scompare = row
     if pcompare == scompare:
       # print as a statement, not a function
       print 'secondary line = %s' % (scompare)
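       # you were writing stray # characters into the file you were reading (most likely landing after the line just read). This writes the matching line, prefixed with #, to a new file, which is usually clearer.
       # To get an output of lines without duplication instead, invert the test and add the line to a set here, then write out the set at the end.
       outputfile.write('#%s' % (scompare))
# close the files so the output is flushed to disk
primaryfile.close()
secondaryfile.close()
outputfile.close()
print 'Done'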
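
To show one reason the inner loop in the first attempt could not work, here is a minimal sketch. It uses io.StringIO as a stand-in for the open file (an assumption for the demo, so nothing is read from /tmp), with sample values borrowed from the example lists. A file object is consumed as you iterate over it, so a second pass yields nothing unless you rewind it or work from a list.

```python
import io

# Stand-in for open('/tmp/unload'); StringIO behaves like a text file here.
secondaryfile = io.StringIO(u"947.3\n749.0\n433.3\n")

first_pass = [row for row in secondaryfile]
second_pass = [row for row in secondaryfile]  # the iterator is already at end of file

# Reading into a list once lets you scan it as many times as you like.
secondaryfile.seek(0)
secondary_lines = secondaryfile.readlines()
scan_one = [row for row in secondary_lines]
scan_two = [row for row in secondary_lines]

print(len(first_pass))   # 3
print(len(second_pass))  # 0
```

This is why the version above calls readlines() on both files before looping.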


Kevin Faulkner wrote:
> Sorry about the time issue.
> On Friday 27 August 2010 23:50:00 you wrote:
>> I hope these are small files; the algorithm you wrote is not going to run
>> well as file size gets large (over 10,000 entries). Have you checked the
>> space/tab situation?  Python uses indentation changes to indicate the end
>> of a block, so inconsistent use of tabs and spaces freaks it out. Here are
>> a couple questions:
> This is not a school project, so you won't be doing my homework or anything :)
> The space/tab issue is okay, but the script does not even get to the print(i).
> I even tried "for line in secondaryfile:" and the for loop still wouldn't be
> executed.
>> Are these always numbers?
> Yes, they are IPs from an Apache error log.
>> Do the files have to remain in their original order, or can you reorder
>> them during processing? How often does this have to run?
> they are not in order because one list is 852 entries and another list is 3300 
> entries. This script only needs to run once.
>> Do you have to "comment" the duplicate, or can you remove it?
> The plan is to remove it, but I wanted to see if my removal method would work, 
> so I was trying to put a comment next to it.
>> Are there any other requirements not obvious from the description below?
> No real requirements, if anyone would like the original files I can give them 
> to you, a lot of them are bots.
> Thank you :)
> -Kevin
>> Kevin Faulkner wrote:
>>> I was trying to pull duplicates out of 2 different files. Needless to say
>>> there are duplicates; I would place a # next to the duplicate. Example
>>> files:
>>> file 1:  file 2:
>>> 433.3    947.3
>>> 543.1    749.0
>>> 741.1    859.2
>>> 238.5    433.3
>>> 839.2    229.1
>>> 583.6    990.1
>>> 863.4    741.1
>>> 859.2    101.8
>>>
>>> import string
>>> i=1
>>> primaryfile = open('/tmp/extract','r')
>>> secondaryfile = open('/tmp/unload')
>>>
>>> for line in primaryfile:
>>>    pcompare = line
>>>    print(pcompare)
>>>
>>>    for row in secondaryfile:
>>>      i = i + 1
>>>      print(i)
>>>      scompare = row
>>>
>>>      if pcompare == scompare:
>>>        print(scompare)
>>>        secondaryfile.write('#')
>>>
>>> With this code it should go through the files and find a duplicate and
>>> place a '#' next to it. But for some reason it doesn't even get to
>>> the second for statement. I don't know what else to do. Please offer
>>> some assistance. :)
>>> ---------------------------------------------------
>>> PLUG-discuss mailing list -
>>> To subscribe, unsubscribe, or to change your mail settings:
>>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss

def sort_and_compare_files(extract, unload, clean_extract, clean_unload, clean_combined):
  # start every handle as None so the finally block is safe even if an open() fails
  input_extract = input_unload = None
  output_extract = output_unload = output_combined = None
  try:
    input_extract = open(extract, 'r')
    input_unload = open(unload, 'r')
    output_extract = open(clean_extract, 'w')
    output_unload = open(clean_unload, 'w')
    output_combined = open(clean_combined, 'w')
    # reading each file into a set drops repeated lines within that file
    extract_set = set(input_extract)
    unload_set = set(input_unload)
    # lines found only in extract, only in unload, and in exactly one of the two
    extract_unique = extract_set.difference(unload_set)
    unload_unique = unload_set.difference(extract_set)
    combined_unique = extract_set.symmetric_difference(unload_set)
    output_extract.writelines(extract_unique)
    output_unload.writelines(unload_unique)
    output_combined.writelines(combined_unique)
  except IOError:
    print 'IO Error accessing files'
  finally:
    if input_extract is not None:
      input_extract.close()
    if input_unload is not None:
      input_unload.close()
    if output_extract is not None:
      output_extract.close()
    if output_unload is not None:
      output_unload.close()
    if output_combined is not None:
      output_combined.close()


# This code is for debugging and unit testing
if __name__ == '__main__':
  sort_and_compare_files('/tmp/extract', '/tmp/unload', '/tmp/clean-extract', '/tmp/clean-unload', '/tmp/combined')
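
As a quick check of what the three output files would hold, here is a sketch that uses the example values from the original question in place of the real files. The in-memory lists and the extra duplicates set are assumptions for illustration; the difference and symmetric_difference calls are the same ones the function uses. Sets have no defined order, so the sketch sorts only to make the result easy to read.

```python
# Example values from the original question, one entry per line,
# in the form readlines() would return them.
extract_lines = ['433.3\n', '543.1\n', '741.1\n', '238.5\n',
                 '839.2\n', '583.6\n', '863.4\n', '859.2\n']
unload_lines = ['947.3\n', '749.0\n', '859.2\n', '433.3\n',
                '229.1\n', '990.1\n', '741.1\n', '101.8\n']

extract_set = set(extract_lines)
unload_set = set(unload_lines)

duplicates = extract_set.intersection(unload_set)               # in both files
extract_unique = extract_set.difference(unload_set)             # only in the first file
unload_unique = unload_set.difference(extract_set)              # only in the second file
combined_unique = extract_set.symmetric_difference(unload_set)  # in exactly one file

print(sorted(line.strip() for line in duplicates))
# ['433.3', '741.1', '859.2']
```

Two caveats: the output files will not keep the original line order, and the comparison is on whole lines including the trailing newline, so a final line that lacks one will not match its twin in the other file. Calling line.strip() on each line before building the sets avoids the second problem.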
