If you are just looking for a list of unique values, why not just do:
cat file1 file2 | sort | uniq > file3
Obviously you could have reasons why this won't suffice for your needs, but
I haven't seen them in your description yet.
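
Since what you are actually after is the duplicates rather than the unique
values, note that sort file1 file2 | uniq -d prints only the lines that
appear more than once. If you'd rather do it in Python, here is a rough,
untested sketch using a set. It reuses the /tmp/extract and /tmp/unload
paths from your script; /tmp/extract.marked is just a name I made up for the
output file. Two things to watch in your original version: a file object is
an iterator, so the inner for loop is exhausted after the first pass of the
outer loop, and you can't write() to a file you opened for reading.

# Rough, untested sketch: load the second file into a set, then write a
# marked copy of the first file, flagging lines that appear in both.
secondaryfile = open('/tmp/unload')
seen = set(line.strip() for line in secondaryfile)
secondaryfile.close()

primaryfile = open('/tmp/extract')
out = open('/tmp/extract.marked', 'w')
for line in primaryfile:
    value = line.strip()           # drop the trailing newline so lines compare cleanly
    if value in seen:
        out.write(value + ' #\n')  # mark the duplicate
    else:
        out.write(value + '\n')
primaryfile.close()
out.close()

This avoids the nested loop entirely, so it will still be quick if the lists
grow well beyond a few thousand entries.
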
On Sat, Aug 28, 2010 at 9:48 AM, Kevin Faulkner <kondor6c@encryptedforest.net> wrote:
> Sorry about the time issue.
> On Friday 27 August 2010 23:50:00 you wrote:
> > I hope these are small files; the algorithm you wrote is not going to run
> > well as file size gets large (over 10,000 entries). Have you checked the
> > space/tab situation? Python uses indentation changes to indicate the end
> > of a block, so inconsistent use of tabs and spaces freaks it out. Here are
> > a couple of questions:
> This is not a school project, so you won't be doing my homework or anything :)
> The space/tab issue is okay, but the script does not even get to the
> print(i). I even tried for line in secondaryfile: and the for loop still
> wouldn't execute.
> > Are these always numbers?
> Yes, they are IPs from an Apache error log.
> > Do the files have to remain in their original order, or can you reorder
> > them during processing? How often does this have to run?
> They are not in order because one list is 852 entries and the other is
> 3300 entries. This script only needs to run once.
> > Do you have to "comment" the duplicate, or can you remove it?
> The plan is to remove it, but I wanted to see if my removal method would
> work, so I was trying to put a comment next to it.
> > Are there any other requirements not obvious from the description below?
> No real requirements. If anyone would like the original files I can give
> them to you; a lot of them are bots.
> Thank you :)
> -Kevin
> >
> > Kevin Faulkner wrote:
> > > I was trying to pull duplicates out of 2 different files. Needless to
> > > say there are duplicates; I would place a # next to each duplicate.
> > > Example files:
> > > file 1:  file 2:
> > > 433.3 947.3
> > > 543.1 749.0
> > > 741.1 859.2
> > > 238.5 433.3
> > > 839.2 229.1
> > > 583.6 990.1
> > > 863.4 741.1
> > > 859.2 101.8
> > >
> > > import string
> > > i=1
> > > primaryfile = open('/tmp/extract','r')
> > > secondaryfile = open('/tmp/unload')
> > >
> > > for line in primaryfile:
> > >     pcompare = line
> > >     print(pcompare)
> > >
> > >     for row in secondaryfile:
> > >         i = i + 1
> > >         print(i)
> > >         scompare = row
> > >
> > >         if pcompare == scompare:
> > >             print(scompare)
> > >             secondaryfile.write('#')
> > >
> > > With this code it should go through the files and find a duplicate and
> > > place a '#' next to it. But for some reason it doesn't even get to
> > > the second for statement. I don't know what else to do. Please offer
> > > some assistance. :)
--
Dazed_75 a.k.a. Larry
The spirit of resistance to government is so valuable on certain occasions,
that I wish it always to be kept alive.
- Thomas Jefferson
---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss