finding duplicates?

Author: Kevin Buettner
Date:  
Subject: finding duplicates?
On Feb 28, 10:33am, J.Francois wrote:

> On Thu, Feb 28, 2002 at 12:07:38PM -0500, Mike wrote:
> > I am looking for a command I can run on the command line (from cron)
> > which finds/searches (recursively) for duplicate files.
>
> http://www.google.com/search?hl=en&q=linux+find+duplicate+files
> http://www.perlmonks.org/index.pl?node_id=2712&lastnode_id=1747


The script below is similar to the solution on the perlmonks page,
but is perhaps somewhat simpler:

--- find-dups ---
#!/usr/bin/perl -w

use strict;
use File::Find;
use Digest::MD5 qw(md5_hex);

undef $/;                    # slurp entire files

my %h;

find(
    sub {
        # find() chdirs into each directory, so open the basename in
        # $_; $File::Find::name keeps the path relative to the start.
        if (-f && -r) {
            open my $fh, '<', $_ or return;
            binmode $fh;     # hash raw bytes, not translated text
            push @{$h{md5_hex(<$fh>)}}, $File::Find::name;
            close $fh;
        }
    },
    shift || "."
);

while (my ($k, $v) = each %h) {
    print join("\n  ", sort @$v), "\n" if @$v > 1;
}
--- end find-dups ---
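Since the original question asked for something runnable from cron, a crontab entry invoking the script might look like the following (the install path, scan directory, and report file are placeholders, not anything from the original message):

```
# m h dom mon dow  command
0 3 * * *  /home/mike/bin/find-dups /home/mike > /home/mike/dup-report 2>&1
```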


When I run it in my ptests directory (which is where I keep most of the
perl scripts that I write before deploying them to some bin directory),
I see the following:

    ocotillo:ptests$ ./find-dups 
    ./logfile
      ./logfile2
    ./flashcards.pl
      ./mathproblems.pl


I.e., logfile and logfile2 have identical contents, as do
flashcards.pl and mathproblems.pl. If I do the following...

ocotillo:ptests$ cp find-dups dup1
ocotillo:ptests$ cp find-dups dup2

and then run find-dups again, I see:

    ocotillo:ptests$ ./find-dups 
    ./logfile
      ./logfile2
    ./dup1
      ./dup2
      ./find-dups
    ./flashcards.pl
      ./mathproblems.pl
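
One caveat: `undef $/` slurps each file into memory whole, so the script's memory use grows with the largest file scanned. Digest::MD5 can also hash a filehandle incrementally via its addfile method. A sketch of that variant (the md5_groups sub name is mine, not from the original script):

```perl
#!/usr/bin/perl -w

use strict;
use File::Find;
use Digest::MD5;

# Return a hash mapping MD5 digest -> list of paths, streaming each
# file through addfile() instead of slurping it into memory.
sub md5_groups {
    my ($top) = @_;
    my %h;
    find(
        sub {
            return unless -f && -r;
            open my $fh, '<', $_ or return;   # find() has chdir'd here
            binmode $fh;
            push @{$h{Digest::MD5->new->addfile($fh)->hexdigest}},
                $File::Find::name;
            close $fh;
        },
        $top
    );
    return %h;
}

# When given a directory argument, behave like find-dups:
if (@ARGV) {
    my %h = md5_groups(shift @ARGV);
    while (my (undef, $v) = each %h) {
        print join("\n  ", sort @$v), "\n" if @$v > 1;
    }
}
```

The output format is the same as find-dups; only the hashing step changes, reading each file in chunks rather than all at once.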