Glad I could help.
Python 3 is still uncommon in most distros, so I generally assume Python 2.6 or 2.7 unless you specify otherwise.
Look at the complete implementation as well; there are a couple of nifty tricks there that make the code simpler and a lot faster.
Kevin Faulkner wrote:
> On Saturday 28 August 2010 11:48:10 Joseph Sinclair wrote:
>> OK,
>> I've attached a complete program that works, if you want to just get it
>> done, but I've also described what went wrong in your first attempt below.
>>
> I really appreciate what you have done. I especially like the description of
> what I did wrong. Using readlines() is a better approach, like you said: less
> disk thrashing. I was using /usr/bin/python3, so print() is now a function. My
> next step is to take the host list and identify where each IP is located using
> pygeoip.
> Thank you again. :)
> -Kevin
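Since the interpreter here turned out to be Python 3, a minimal sketch of the print calls in a form that should behave the same under Python 2.6/2.7 and Python 3: a single parenthesized argument built with the formatting operator. The helper name format_line is made up for illustration and is not part of the attached program.

```python
def format_line(label, line):
    """Build the message text; rstrip drops the newline that readlines() keeps."""
    return '%s line = %s' % (label, line.rstrip('\n'))

# One parenthesized argument prints the same way as a Python 2 statement
# and as a Python 3 function call.
print('opened files')
print(format_line('primary', '433.3\n'))
```

With more than one comma-separated argument the two versions differ, so building the whole string first keeps the output identical.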
>> # the i value was just for debugging, so I dropped it
>> primaryfile = open('/tmp/extract','r')
>> # read the primary file into a list for speed and so you aren't reading
>> # more than once
>> primary_lines = primaryfile.readlines()
>> # you didn't specify a mode for this, so it defaulted to read-only. Be
>> # explicit for clarity
>> secondaryfile = open('/tmp/unload', 'r')
>> # Open a separate file for output, otherwise you would have been writing
>> # and reading the same file over and over again, which usually causes errors
>> outputfile = open('/tmp/result-file', 'w')
>> # read the second file into a list, then you can scan through it over and
>> # over without hammering disk and re-reading a file you might have modified.
>> secondary_lines = secondaryfile.readlines()
>> # print is a statement, not a function.
>> print 'opened files'
>> # loop through the list, not the file
>> for line in primary_lines:
>>     pcompare = line
>>     # print is a statement, use the formatting operator to print variable
>>     # values
>>     print 'primary line = %s' % (pcompare)
>>     # loop through the list, not the file
>>     for row in secondary_lines:
>>         scompare = row
>>         if pcompare == scompare:
>>             # print as a statement, not a function
>>             print 'secondary line = %s' % (scompare)
>>             # you were writing random # characters in a file (most likely
>>             # after the line read), this writes a comment to a new file,
>>             # which is usually clearer.
>>             # invert the test, and add the line to a set here then write out
>>             # the set at the end to get an output of lines without duplication.
>>             outputfile.write('#%s' % (scompare))
>> print 'Done'
>>
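A rough sketch of the set idea from the last comment above, assuming Python 3 and lists already read with readlines(). It is not the attached program; mark_duplicates and the sample values are made up for illustration. Looking each row up in a set replaces the inner loop, and strip() keeps a missing newline on the last line of a file from hiding a match.

```python
def mark_duplicates(primary_lines, secondary_lines):
    """Return secondary_lines with '#' prefixed to rows that also occur in primary_lines."""
    # Set membership is a hash lookup, so each row is checked without rescanning.
    seen = set(line.strip() for line in primary_lines)
    marked = []
    for row in secondary_lines:
        if row.strip() in seen:
            marked.append('#' + row)
        else:
            marked.append(row)
    return marked

primary = ['433.3\n', '543.1\n', '741.1\n', '859.2']
secondary = ['947.3\n', '433.3\n', '741.1\n', '101.8\n']
result = mark_duplicates(primary, secondary)
print(''.join(result))
```

The marked list can then be written to a separate output file with writelines(), the same way the loop above writes to /tmp/result-file.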
>
>> Kevin Faulkner wrote:
>>> Sorry about the time issue.
>>>
>>> On Friday 27 August 2010 23:50:00 you wrote:
>>>> I hope these are small files; the algorithm you wrote is not going to
>>>> run well as file size gets large (over 10,000 entries). Have you checked
>>>> the space/tab situation? Python uses indentation changes to indicate
>>>> the end of a block, so inconsistent use of tabs and spaces freaks it
>>>> out. Here are a couple of questions:
>>> This is not a school project, so you won't be doing my homework or
>>> anything :) The space/tab issue is okay, but the script does not even
>>> get to the print(i), I even tried for line in secondaryfile: and the for
>>> loop still wouldn't be executed.
>>>
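On the space/tab point: a small sketch, assuming Python 3, of how an inconsistent mix gets rejected before any code runs. The source string is a made-up example, not taken from the script in this thread.

```python
# Line 2 is indented with one tab, line 3 with eight spaces.
# They look aligned in many editors, but Python 3 refuses to guess.
source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(source, "<example>", "exec")
    outcome = "compiled"
except TabError:
    outcome = "TabError"

print(outcome)
```

Python 2 only complains about this when started with the -tt option, so a file that ran there can still fail under python3.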
>>>> Are these always numbers?
>>> Yes, they are IPs from an Apache error log.
>>>
>>>> Do the files have to remain in their original order, or can you reorder
>>>> them during processing? How often does this have to run?
>>> They are not in order because one list is 852 entries and another list is
>>> 3300 entries. This script only needs to run once.
>>>
>>>> Do you have to "comment" the duplicate, or can you remove it?
>>> The plan is to remove it, but I wanted to see if my removal method would
>>> work, so I was trying to put a comment next to it.
>>>
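For the removal case, a minimal sketch under the same assumptions (Python 3, both files already read into lists). The name remove_duplicates is hypothetical, and the sample data is the example from the original post. It keeps the secondary list in its original order, drops rows that appear in the primary list, and as a side effect also drops blank lines and repeats within the secondary list.

```python
def remove_duplicates(primary_lines, secondary_lines):
    """Return the rows of secondary_lines that do not appear in primary_lines."""
    known = set(line.strip() for line in primary_lines)
    kept = []
    for row in secondary_lines:
        key = row.strip()
        if key and key not in known:
            kept.append(row)
            known.add(key)  # a later repeat of this row is dropped too
    return kept

file1 = ['433.3\n', '543.1\n', '741.1\n', '238.5\n',
         '839.2\n', '583.6\n', '863.4\n', '859.2\n']
file2 = ['947.3\n', '749.0\n', '859.2\n', '433.3\n',
         '229.1\n', '990.1\n', '741.1\n', '101.8\n']
unique_rows = remove_duplicates(file1, file2)
print(''.join(unique_rows))
```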
>>>> Are there any other requirements not obvious from the description below?
>>> No real requirements, if anyone would like the original files I can give
>>> them to you, a lot of them are bots.
>>> Thank you :)
>>> -Kevin
>>>
>>>> Kevin Faulkner wrote:
>>>>> I was trying to pull duplicates out of 2 different files. Needless to
>>>>> say, there are duplicates; I would place a # next to each duplicate.
>>>>> Example files:
>>>>> file 1:    file 2:
>>>>> 433.3      947.3
>>>>> 543.1      749.0
>>>>> 741.1      859.2
>>>>> 238.5      433.3
>>>>> 839.2      229.1
>>>>> 583.6      990.1
>>>>> 863.4      741.1
>>>>> 859.2      101.8
>>>>>
>>>>> import string
>>>>> i=1
>>>>> primaryfile = open('/tmp/extract','r')
>>>>> secondaryfile = open('/tmp/unload')
>>>>>
>>>>> for line in primaryfile:
>>>>>     pcompare = line
>>>>>     print(pcompare)
>>>>>
>>>>>     for row in secondaryfile:
>>>>>         i = i + 1
>>>>>         print(i)
>>>>>         scompare = row
>>>>>
>>>>>         if pcompare == scompare:
>>>>>             print(scompare)
>>>>>             secondaryfile.write('#')
>>>>>
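One thing that may be worth seeing in isolation, assuming the second loop sits inside the first: a file object is an iterator, so the inner loop uses it up on the first pass and finds nothing on later passes. This is only a sketch under Python 3; io.StringIO stands in for the two open files and the values are made up.

```python
import io

# StringIO objects iterate line by line, like files opened for reading.
primary = io.StringIO("433.3\n741.1\n")
secondary = io.StringIO("741.1\n433.3\n")

matches = []
inner_runs = []
for line in primary:
    count = 0
    for row in secondary:  # exhausted after the first outer pass
        count += 1
        if line == row:
            matches.append(row)
    inner_runs.append(count)

print(inner_runs)
print(matches)
```

The second primary line never gets compared against anything, so its duplicate goes unnoticed. Reading the file into a list with readlines() first, or calling seek(0) before each inner loop, avoids that.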
>>>>> With this code it should go through the files, find a duplicate, and
>>>>> place a '#' next to it. But for some reason it doesn't even get to
>>>>> the second for statement. I don't know what else to do. Please offer
>>>>> some assistance. :)
>>>>> ---------------------------------------------------
>>>>> PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
>>>>> To subscribe, unsubscribe, or to change your mail settings:
>>>>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss