Python help (finding duplicates)
Joseph Sinclair
plug-discussion at stcaz.net
Sat Aug 28 11:48:10 MST 2010
OK,
I've attached a complete program that works, if you just want to get it done, but I've also described below what went wrong in your first attempt.
# the i value was just for debugging, so I dropped it
primaryfile = open('/tmp/extract','r')
# read the primary file into a list for speed and so you aren't reading more than once
primary_lines = primaryfile.readlines()
# you didn't specify a mode for this, so it defaulted to read-only. Be explicit for clarity
secondaryfile = open('/tmp/unload', 'r')
# Open a separate file for output; otherwise you would have been writing and reading the same file over and over, which usually causes errors
outputfile = open('/tmp/result-file', 'w')
# read the second file into a list, then you can scan through it over and over without hammering disk and re-reading a file you might have modified.
secondary_lines = secondaryfile.readlines()
# print is a statement, not a function.
print 'opened files'
# loop through the list, not the file
for line in primary_lines:
    pcompare = line
    # print is a statement; use the formatting operator to print variable values
    print 'primary line = %s' % (pcompare)
    # loop through the list, not the file
    for row in secondary_lines:
        scompare = row
        if pcompare == scompare:
            # print is a statement, not a function
            print 'secondary line = %s' % (scompare)
            # you were writing stray # characters into the input file (most likely right after the line was read); this writes a commented copy of each duplicate to a new file instead, which is usually clearer.
            # invert the test, and add the line to a set here, then write the set out at the end, to get an output of the lines without duplication.
            outputfile.write('#%s' % (scompare))
# close the files so the output is flushed to disk
primaryfile.close()
secondaryfile.close()
outputfile.close()
print 'Done'
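If you want the de-duplicated output itself rather than a commented list of duplicates, the set-based variant mentioned in the comments would look roughly like this (a sketch only; it assumes the same lists-of-lines read with readlines() above, and a hypothetical helper name):

```python
# Sketch of the set-based approach: collect the secondary lines in a set,
# then keep only the primary lines that are not in it.
def unique_lines(primary_lines, secondary_lines):
    # set membership tests are O(1), so this scales far better than the
    # nested loops once the files get large
    duplicates = set(secondary_lines)
    return [line for line in primary_lines if line not in duplicates]

# usage against the real files would be something like:
# primary_lines = open('/tmp/extract', 'r').readlines()
# secondary_lines = open('/tmp/unload', 'r').readlines()
# open('/tmp/result-file', 'w').writelines(unique_lines(primary_lines, secondary_lines))
```

This preserves the original order of the primary file, which matters since the lists aren't sorted.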
Kevin Faulkner wrote:
> Sorry about the time issue.
> On Friday 27 August 2010 23:50:00 you wrote:
>> I hope these are small files; the algorithm you wrote is not going to run
>> well as file size gets large (over 10,000 entries). Have you checked the
>> space/tab situation? Python uses indentation changes to indicate the end
>> of a block, so inconsistent use of tabs and spaces freaks it out. Here are
>> a couple of questions:
> This is not a school project, so you won't be doing my homework or anything :)
> The space/tab issue is okay, but the script does not even get to the print(i).
> I even tried "for line in secondaryfile:" and the for loop still wouldn't be
> executed.
>> Are these always numbers?
> Yes, they are IP's from an Apache error log.
>> Do the files have to remain in their original order, or can you reorder
>> them during processing? How often does this have to run?
> They are not in order, because one list is 852 entries and the other is 3300
> entries. This script only needs to run once.
>> Do you have to "comment" the duplicate, or can you remove it?
> The plan is to remove it, but I wanted to see if my removal method would work,
> so I was trying to put a comment next to it.
>> Are there any other requirements not obvious from the description below?
> No real requirements, if anyone would like the original files I can give them
> to you, a lot of them are bots.
> Thank you :)
> -Kevin
>> Kevin Faulkner wrote:
>>> I was trying to pull duplicates out of 2 different files. Needless to say
>>> there are duplicates; I would place a # next to the duplicate. Example
>>> files:
>>>
>>> file 1:    file 2:
>>> 433.3      947.3
>>> 543.1      749.0
>>> 741.1      859.2
>>> 238.5      433.3
>>> 839.2      229.1
>>> 583.6      990.1
>>> 863.4      741.1
>>> 859.2      101.8
>>>
>>> import string
>>> i=1
>>> primaryfile = open('/tmp/extract','r')
>>> secondaryfile = open('/tmp/unload')
>>>
>>> for line in primaryfile:
>>>     pcompare = line
>>>     print(pcompare)
>>>
>>>     for row in secondaryfile:
>>>         i = i + 1
>>>         print(i)
>>>         scompare = row
>>>
>>>         if pcompare == scompare:
>>>             print(scompare)
>>>             secondaryfile.write('#')
>>>
>>> With this code it should go through the files and find a duplicate and
>>> place a '#' next to it. But for some reason it doesn't even get to
>>> the second for statement. I don't know what else to do. Please offer
>>> some assistance. :)
>>> ---------------------------------------------------
>>> PLUG-discuss mailing list - PLUG-discuss at lists.plug.phoenix.az.us
>>> To subscribe, unsubscribe, or to change your mail settings:
>>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
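For the record, one likely reason the inner loop appeared to never run after the first pass is that a file object is a one-shot iterator. A quick sketch shows the effect (io.StringIO stands in for a real open file here):

```python
import io

# A file object is a one-shot iterator: once a for loop has consumed it,
# later loops over the same object see nothing.
f = io.StringIO('433.3\n543.1\n')  # stands in for an open file

first_pass = [line for line in f]   # consumes the whole "file"
second_pass = [line for line in f]  # the iterator is exhausted: empty

# Reading into a list first (as the attached program does) avoids this,
# because a list can be scanned as many times as you like.
f.seek(0)
lines = f.readlines()
```

That's why the attached program reads both files into lists with readlines() before looping.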
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dedup_files.py
Type: text/x-python
Size: 1259 bytes
Desc: not available
URL: <http://lists.PLUG.phoenix.az.us/pipermail/plug-discuss/attachments/20100828/4c732585/attachment.py>