Re: Python help (finding duplicates)

Author: Kevin Faulkner
Date:
To: Joseph Sinclair, plug-discuss
Subject: Re: Python help (finding duplicates)

Sorry about the time issue.
On Friday 27 August 2010 23:50:00 you wrote:
> I hope these are small files; the algorithm you wrote is not going to run
> well as file size gets large (over 10,000 entries). Have you checked the
> space/tab situation? Python uses indentation changes to indicate the end
> of a block, so inconsistent use of tabs and spaces freaks it out. Here are
> a couple of questions:
This is not a school project, so you won't be doing my homework or anything :)
The space/tab issue is okay, but the script does not even get to the print(i).
I even tried "for line in secondaryfile:" and the for loop still wouldn't be
executed.
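
One Python behavior that can make a loop over a file appear to be skipped: a file object is consumed as it is iterated, so a second pass over the same object yields nothing until it is rewound. This may not be the cause here (an empty file or a wrong path would also produce a loop that never runs), so the following is only a sketch of that one behavior. It uses io.StringIO with made-up contents as a stand-in for /tmp/unload.

```python
import io

# Hypothetical stand-in for /tmp/unload; io.StringIO iterates
# line by line the same way an open text file does.
secondary = io.StringIO("947.3\n749.0\n859.2\n")

# The first pass reads every line and leaves the position at end of file.
first_pass = [row.strip() for row in secondary]

# A second pass over the same object finds nothing left to read,
# so the loop body never runs.
second_pass = [row.strip() for row in secondary]

# Rewinding with seek(0) makes the lines available again.
secondary.seek(0)
third_pass = [row.strip() for row in secondary]

print(first_pass)
print(second_pass)
print(third_pass)
```

In a nested loop this means the inner loop only does work for the first line of the outer file, unless the inner file is rewound or read into a list first.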
> Are these always numbers?
Yes, they are IPs from an Apache error log.
> Do the files have to remain in their original order, or can you reorder
> them during processing? How often does this have to run?
They are not in order, because one list is 852 entries and the other list is
3300 entries. This script only needs to run once.
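
With lists this small and no ordering requirement, one option is to read the first list into a set and test each entry of the second list against it. The sketch below is not the original script; the function name find_duplicates is hypothetical, and the sample values are the ones from the example files quoted further down.

```python
def find_duplicates(primary_lines, secondary_lines):
    """Return the entries of secondary_lines that also occur in primary_lines."""
    # Strip newlines and skip blank lines so "433.3\n" and "433.3" compare equal.
    primary_values = {line.strip() for line in primary_lines if line.strip()}
    # Set membership is a constant-time check on average, so each list
    # is walked only once.
    return [row.strip() for row in secondary_lines if row.strip() in primary_values]


# Sample data taken from the example files in the quoted message.
file1 = ["433.3\n", "543.1\n", "741.1\n", "238.5\n",
         "839.2\n", "583.6\n", "863.4\n", "859.2\n"]
file2 = ["947.3\n", "749.0\n", "859.2\n", "433.3\n",
         "229.1\n", "990.1\n", "741.1\n", "101.8\n"]

duplicates = find_duplicates(file1, file2)
print(duplicates)
```

The same function accepts open file objects in place of the lists, since it only iterates over each argument once.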
> Do you have to "comment" the duplicate, or can you remove it?
The plan is to remove it, but I wanted to see if my removal method would work,
so I was trying to put a comment next to it.
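
Marking and removing can share the same membership test, so it is possible to try the comment version first and switch to removal afterwards. The sketch below builds new lists instead of writing into the file being read; a file opened with the default read mode raises io.UnsupportedOperation on write(). The function names and the choice to append " #" after the value are assumptions made for illustration.

```python
def mark_duplicates(secondary_lines, primary_values):
    """Return the secondary entries, with ' #' appended to each duplicate."""
    marked = []
    for row in secondary_lines:
        value = row.strip()
        if value in primary_values:
            marked.append(value + " #")  # assumed marker placement
        else:
            marked.append(value)
    return marked


def remove_duplicates(secondary_lines, primary_values):
    """Return only the secondary entries that are not in primary_values."""
    return [row.strip() for row in secondary_lines
            if row.strip() not in primary_values]


# Small made-up inputs in the shape of the example files.
primary_values = {"433.3", "741.1", "859.2"}
secondary_lines = ["947.3\n", "859.2\n", "433.3\n"]

marked = mark_duplicates(secondary_lines, primary_values)
kept = remove_duplicates(secondary_lines, primary_values)
print(marked)
print(kept)
```

Either result can then be written to a separate output file opened in write mode, which leaves the original file untouched.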
> Are there any other requirements not obvious from the description below?
No real requirements. If anyone would like the original files, I can give them
to you. A lot of the entries are bots.
Thank you :)
-Kevin
>
> Kevin Faulkner wrote:
> > I was trying to pull duplicates out of 2 different files. Needless to say,
> > there are duplicates. I would place a # next to the duplicate. Example
> > files:
> >
> > file 1:   file 2:
> > 433.3     947.3
> > 543.1     749.0
> > 741.1     859.2
> > 238.5     433.3
> > 839.2     229.1
> > 583.6     990.1
> > 863.4     741.1
> > 859.2     101.8
> >
> > import string
> > i=1
> > primaryfile = open('/tmp/extract','r')
> > secondaryfile = open('/tmp/unload')
> >
> > for line in primaryfile:
> >     pcompare = line
> >     print(pcompare)
> >
> >     for row in secondaryfile:
> >         i = i + 1
> >         print(i)
> >         scompare = row
> >
> >         if pcompare == scompare:
> >             print(scompare)
> >             secondaryfile.write('#')
> >
> > With this code it should go through the files and find a duplicate and
> > place a '#' next to it. But for some reason it doesn't even get to
> > the second for statement. I don't know what else to do. Please offer
> > some assistance. :)
> > ---------------------------------------------------
> > PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
> > To subscribe, unsubscribe, or to change your mail settings:
> > http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss