Glad I could help.
Python 3 is still uncommon in most distros, so I generally assume Python 2.6 or 2.7 unless you specify otherwise.
Look at the complete implementation as well; there are a couple of nifty tricks there that make the code simpler and a lot faster.

Kevin Faulkner wrote:
> On Saturday 28 August 2010 11:48:10 Joseph Sinclair wrote:
>> OK,
>> I've attached a complete program that works, if you want to just get it
>> done, but I've also described what went wrong in your first attempt below.
>>
> I really appreciate what you have done. I more so like the description of what
> I did wrong. Using readlines() is a better approach like you said, less disk
> thrashing. I was using /usr/bin/python3, so print() is now a function. My next
> step is to take the host list and identify where the IP is using pygeoip.
> Thank you again. :)
> -Kevin
>> # the i value was just for debugging, so I dropped it
>> primaryfile = open('/tmp/extract','r')
>> # read the primary file into a list for speed and so you aren't reading
>> # more than once
>> primary_lines = primaryfile.readlines()
>> # you didn't specify a mode for this, so it defaulted to read-only. Be
>> # explicit for clarity
>> secondaryfile = open('/tmp/unload', 'r')
>> # Open a separate file for output, otherwise you would have been writing
>> # and reading the same file over and over again, which usually causes errors
>> outputfile = open('/tmp/result-file', 'w')
>> # read the second file into a list, then you can scan through it over and
>> # over without hammering disk and re-reading a file you might have modified.
>> secondary_lines = secondaryfile.readlines()
>> # print is a statement, not a function.
>> print 'opened files'
>> # loop through the list, not the file
>> for line in primary_lines:
>>     pcompare = line
>>     # print is a statement, use the formatting operator to print variable
>>     # values
>>     print 'primary line = %s' % (pcompare)
>>     # loop through the list, not the file
>>     for row in secondary_lines:
>>         scompare = row
>>         if pcompare == scompare:
>>             # print as a statement, not a function
>>             print 'secondary line = %s' % (scompare)
>>             # you were writing random # characters in a file (most likely after
>>             # the line read), this writes a comment to a new file, which is usually
>>             # clearer.
>>             # invert the test, and add the line to a set here then write out
>>             # the set at the end to get an output of lines without duplication.
>>             outputfile.write('#%s' % (scompare))
>> print 'Done'
>>
>
>> Kevin Faulkner wrote:
>>> Sorry about the time issue.
>>>
>>> On Friday 27 August 2010 23:50:00 you wrote:
>>>> I hope these are small files, the algorithm you wrote is not going to
>>>> run well as file size gets large (over 10,000 entries)
>>>> Have you checked the space/tab situation? Python uses indentation changes
>>>> to indicate the end of a block, so inconsistent use of tabs and spaces
>>>> freaks it out.
>>>> Here are a couple questions:
>>> This is not a school project, so you won't be doing my homework or
>>> anything :) The space/tab issue is okay, but the script does not even
>>> get to the print(i), I even tried for line in secondaryfile: and the for
>>> loop still wouldn't be executed.
>>>
>>>> Are these always numbers?
>>> Yes, they are IP's from an Apache error log.
>>>
>>>> Do the files have to remain in their original order, or can you reorder
>>>> them during processing? How often does this have to run?
>>> they are not in order because one list is 852 entries and another list is
>>> 3300 entries. This script only needs to run once.
>>>
>>>> Do you have to "comment" the duplicate, or can you remove it?
>>> The plan is to remove it, but I wanted to see if my removal method would
>>> work, so I was trying to put a comment next to it.
>>>
>>>> Are there any other requirements not obvious from the description below?
>>> No real requirements, if anyone would like the original files I can give
>>> them to you, a lot of them are bots.
>>> Thank you :)
>>> -Kevin
>>>
>>>> Kevin Faulkner wrote:
>>>>> I was trying to pull duplicates out of 2 different files. Needless to
>>>>> say there are duplicates; I would place a # next to each duplicate.
>>>>> Example files:
>>>>> file 1:   file 2:
>>>>> 433.3     947.3
>>>>> 543.1     749.0
>>>>> 741.1     859.2
>>>>> 238.5     433.3
>>>>> 839.2     229.1
>>>>> 583.6     990.1
>>>>> 863.4     741.1
>>>>> 859.2     101.8
>>>>>
>>>>> import string
>>>>> i=1
>>>>> primaryfile = open('/tmp/extract','r')
>>>>> secondaryfile = open('/tmp/unload')
>>>>>
>>>>> for line in primaryfile:
>>>>>     pcompare = line
>>>>>     print(pcompare)
>>>>>
>>>>>     for row in secondaryfile:
>>>>>         i = i + 1
>>>>>         print(i)
>>>>>         scompare = row
>>>>>
>>>>>         if pcompare == scompare:
>>>>>             print(scompare)
>>>>>             secondaryfile.write('#')
>>>>>
>>>>> With this code it should go through the files and find a duplicate and
>>>>> place a '#' next to it. But for some reason it doesn't even get to
>>>>> the second for statement. I don't know what else to do. Please offer
>>>>> some assistance.
>>>>> :)
>>>>> ---------------------------------------------------
>>>>> PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
>>>>> To subscribe, unsubscribe, or to change your mail settings:
>>>>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
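
For reference, here is a minimal sketch of the set-based idea mentioned in the code comments earlier in the thread ("add the line to a set ... then write out the set at the end"). It is not the attached program; the function names find_duplicates and remove_duplicates are made up for illustration, and it assumes one entry per line with surrounding whitespace ignored. A set membership test avoids rescanning the whole second list for every line of the first, and the sketch should run unchanged on Python 2.6+ and Python 3.

```python
def find_duplicates(primary_lines, secondary_lines):
    """Return entries of secondary_lines that also appear in primary_lines.

    Hypothetical helper: both arguments are iterables of text lines,
    for example the lists returned by readlines().
    """
    # strip() removes the trailing newline so '433.3\n' matches '433.3'
    seen = set(line.strip() for line in primary_lines if line.strip())
    duplicates = []
    for row in secondary_lines:
        entry = row.strip()
        if entry and entry in seen:
            duplicates.append(entry)
    return duplicates


def remove_duplicates(primary_lines, secondary_lines):
    """Return secondary entries that are NOT in primary, each listed once."""
    seen = set(line.strip() for line in primary_lines if line.strip())
    unique = []
    for row in secondary_lines:
        entry = row.strip()
        if entry and entry not in seen:
            unique.append(entry)
            # remember it, so repeats inside the secondary list are dropped too
            seen.add(entry)
    return unique


if __name__ == '__main__':
    # Sample data from the original question, inlined so the sketch
    # runs without the /tmp files.
    file1 = ['433.3\n', '543.1\n', '741.1\n', '238.5\n',
             '839.2\n', '583.6\n', '863.4\n', '859.2\n']
    file2 = ['947.3\n', '749.0\n', '859.2\n', '433.3\n',
             '229.1\n', '990.1\n', '741.1\n', '101.8\n']
    print(find_duplicates(file1, file2))
    print(remove_duplicates(file1, file2))
```

To use it with the real files, pass in the two lists read with readlines() and write the returned entries to a separate output file, as in the commented program above.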