OK,
I've attached a complete program that works, in case you just want to get it done. Below, I've also annotated your first attempt to show what went wrong.
# The i counter was only for debugging, so I dropped it.
primaryfile = open('/tmp/extract', 'r')
# Read the primary file into a list for speed, so it is only read from disk once.
primary_lines = primaryfile.readlines()
# You didn't specify a mode for this, so it defaulted to read-only and the write() call on it could not succeed. Be explicit for clarity.
secondaryfile = open('/tmp/unload', 'r')
# Open a separate file for output; otherwise you would be reading and writing the same file over and over, which usually causes errors.
outputfile = open('/tmp/result-file', 'w')
# Read the second file into a list as well, so you can scan through it over and over without hammering the disk or re-reading a file you might have modified.
secondary_lines = secondaryfile.readlines()
# print is a statement, not a function (in Python 2).
print 'opened files'
# Loop through the list, not the file.
for line in primary_lines:
    pcompare = line
    # print is a statement; use the formatting operator to print variable values.
    print 'primary line = %s' % (pcompare)
    # Loop through the list, not the file.
    for row in secondary_lines:
        scompare = row
        if pcompare == scompare:
            # print as a statement, not a function.
            print 'secondary line = %s' % (scompare)
            # You were writing stray # characters into the input file (most likely after the line just read). This writes the commented line to a new file instead, which is usually clearer.
            # To get the lines without duplicates, invert the test, add the line to a set here, and write out the set at the end.
            outputfile.write('#%s' % (scompare))
# Close the files so the output is flushed to disk.
primaryfile.close()
secondaryfile.close()
outputfile.close()
print 'Done'
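
In case it helps to see why "loop through the list, not the file" matters, here is a minimal sketch. A file object is used up after one full pass, so an inner loop over it only runs for the first line of the outer loop. The function name count_inner_passes and the temporary file are made up for this illustration; they are not part of your script.

```python
import os
import tempfile

def count_inner_passes(use_list):
    """Count how often the inner loop body runs over a 3-line file
    when the outer loop runs twice."""
    # Hypothetical scratch file standing in for /tmp/unload
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, 'w') as f:
        f.write('433.3\n741.1\n859.2\n')
    source = open(path, 'r')
    try:
        # Either iterate the file object directly, or a list read from it
        rows = source.readlines() if use_list else source
        passes = 0
        for _ in range(2):        # stands in for the outer loop
            for row in rows:      # the inner loop
                passes += 1
    finally:
        source.close()
        os.remove(path)
    return passes

file_passes = count_inner_passes(use_list=False)   # file object: second pass sees nothing
list_passes = count_inner_passes(use_list=True)    # list: every pass sees all lines
```

With the file object, the inner loop body runs only during the first outer pass; with the list it runs on every pass.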
Kevin Faulkner wrote:
> Sorry about the time issue.
> On Friday 27 August 2010 23:50:00 you wrote:
>> I hope these are small files, the algorithm you wrote is not going to run
>> well as file size gets large (over 10,000 entries) Have you checked the
>> space/tab situation? Python uses indentation changes to indicate the end
>> of a block, so inconsistent use of tabs and spaces freaks it out. Here are
>> a couple questions:
> This is not a school project, so you won't be doing my homework or anything :)
> The space/tab issue is okay, but the script does not even get to the print(i),
> I even tried for line in secondaryfile: and the for loop still wouldn't be
> executed.
>> Are these always numbers?
> Yes, they are IP's from an Apache error log.
>> Do the files have to remain in their original order, or can you reorder
>> them during processing? How often does this have to run?
> they are not in order because one list is 852 entries and another list is 3300
> entries. This script only needs to run once.
>> Do you have to "comment" the duplicate, or can you remove it?
> The plan is to remove it, but I wanted to see if my removal method would work,
> so I was trying to put a comment next to it.
>> Are there any other requirements not obvious from the description below?
> No real requirements, if anyone would like the original files I can give them
> to you, a lot of them are bots.
> Thank you :)
> -Kevin
>> Kevin Faulkner wrote:
>>> I was trying to pull duplicates out of 2 different files. Needless to say
>>> there are duplicates. I would place a # next to the duplicate. Example
>>> files:
>>> file 1:    file 2:
>>> 433.3      947.3
>>> 543.1      749.0
>>> 741.1      859.2
>>> 238.5      433.3
>>> 839.2      229.1
>>> 583.6      990.1
>>> 863.4      741.1
>>> 859.2      101.8
>>>
>>> import string
>>> i=1
>>> primaryfile = open('/tmp/extract','r')
>>> secondaryfile = open('/tmp/unload')
>>>
>>> for line in primaryfile:
>>>     pcompare = line
>>>     print(pcompare)
>>>
>>>     for row in secondaryfile:
>>>         i = i + 1
>>>         print(i)
>>>         scompare = row
>>>
>>>         if pcompare == scompare:
>>>             print(scompare)
>>>             secondaryfile.write('#')
>>>
>>> With this code it should go through the files and find a duplicate and
>>> place a '#' next to it. But for some reason it doesn't even get to
>>> the second for statement. I don't know what else to do. Please offer
>>> some assistance. :)
>>> ---------------------------------------------------
>>> PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
>>> To subscribe, unsubscribe, or to change your mail settings:
>>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
def sort_and_compare_files(extract, unload, clean_extract, clean_unload, clean_combined):
    # Start with None so the finally block can tell which files were actually opened
    input_extract = input_unload = None
    output_extract = output_unload = output_combined = None
    try:
        input_extract = open(extract, 'r')
        input_unload = open(unload, 'r')
        output_extract = open(clean_extract, 'w')
        output_unload = open(clean_unload, 'w')
        output_combined = open(clean_combined, 'w')
        # Each line (including its trailing newline) becomes one set element
        extract_set = set(input_extract)
        unload_set = set(input_unload)
        # Lines that appear in one file but not the other
        extract_unique = extract_set.difference(unload_set)
        unload_unique = unload_set.difference(extract_set)
        # Lines that appear in exactly one of the two files
        combined_unique = extract_set.symmetric_difference(unload_set)
        output_extract.writelines(extract_unique)
        output_unload.writelines(unload_unique)
        output_combined.writelines(combined_unique)
    except IOError:
        print 'IO Error accessing files'
    finally:
        # Close whatever was opened, even if an error occurred part way through
        for f in (input_extract, input_unload, output_extract, output_unload, output_combined):
            if f is not None:
                f.close()

# This code is for debugging and unit testing
if __name__ == '__main__':
    sort_and_compare_files('/tmp/extract', '/tmp/unload', '/tmp/clean-extract', '/tmp/clean-unload', '/tmp/combined')
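
To show what the set operations in the attached program do, here is a small sketch using the sample values from the example files in the thread. The variable names are only for illustration, and the strip() step is a variation I am assuming you might want, not something the attached program does: lines are compared with their trailing newline, so a final line without one would not match.

```python
# Sample values from the example files in the thread.
# The last entry of file1 deliberately has no trailing newline.
file1 = ['433.3\n', '543.1\n', '741.1\n', '238.5\n',
         '839.2\n', '583.6\n', '863.4\n', '859.2']
file2 = ['947.3\n', '749.0\n', '859.2\n', '433.3\n',
         '229.1\n', '990.1\n', '741.1\n', '101.8\n']

# Comparing raw lines: '859.2' and '859.2\n' count as different entries
raw_duplicates = set(file1).intersection(set(file2))

# Stripping whitespace first makes the comparison independent of newlines
extract_set = set(line.strip() for line in file1)
unload_set = set(line.strip() for line in file2)

duplicates = extract_set.intersection(unload_set)               # in both files
only_in_extract = extract_set.difference(unload_set)            # in file 1 only
only_in_either = extract_set.symmetric_difference(unload_set)   # in exactly one file

# Sets are unordered, so sort before writing if the order matters
duplicates_sorted = sorted(duplicates)
```

Keep in mind that sets do not preserve the original line order, which should be fine here since the order of the entries does not matter.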