finding duplicates?

Kevin Buettner plug-discuss@lists.plug.phoenix.az.us
Thu, 28 Feb 2002 12:39:38 -0700


On Feb 28, 10:33am, J.Francois wrote:

> On Thu, Feb 28, 2002 at 12:07:38PM -0500, Mike wrote:
> > I am looking for a command I can run on the command line (from cron)
> > which finds/searches (recursively) for duplicate files.
>
> http://www.google.com/search?hl=en&q=linux+find+duplicate+files
> http://www.perlmonks.org/index.pl?node_id=2712&lastnode_id=1747

The script below is similar to the solution on the perlmonks page,
but is perhaps somewhat simpler:

--- find-dups ---
#!/usr/bin/perl -w

use strict;
use File::Find;
use Digest::MD5 qw(md5_hex);

undef $/;					# slurp entire files

my %h;

find(
    sub { 
	if (! -d && -r) {
	    # find() chdirs into each directory it visits, so open the
	    # bare filename in $_; $File::Find::name (the path from the
	    # starting directory) would fail for files in subdirectories.
	    open F, $_				or return;
	    push @{$h{md5_hex(<F>)}}, $File::Find::name;
	    close F;
	}
    },
    shift || "."
);

while (my ($k, $v) = each %h) {
    print join("\n  ", sort(@$v)), "\n"		if @$v > 1;
}
--- end find-dups ---

When I run it in my ptests directory (which is where I keep most of the
perl scripts that I write before deploying them to some bin directory),
I see the following:

    ocotillo:ptests$ ./find-dups 
    ./logfile
      ./logfile2
    ./flashcards.pl
      ./mathproblems.pl

That is, logfile and logfile2 have identical contents, as do
flashcards.pl and mathproblems.pl.  If I do the following...

    ocotillo:ptests$ cp find-dups dup1
    ocotillo:ptests$ cp find-dups dup2

and then run find-dups again, I see:

    ocotillo:ptests$ ./find-dups 
    ./logfile
      ./logfile2
    ./dup1
      ./dup2
      ./find-dups
    ./flashcards.pl
      ./mathproblems.pl
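
For comparison, the same digest-and-group idea can be sketched from the
shell, assuming GNU coreutils are installed (the `-w` and
`--all-repeated` options to uniq are GNU extensions, not POSIX):

```shell
# Hash every regular file, sort so identical digests become adjacent
# lines, then have uniq compare only the first 32 characters (the MD5
# hex digest) and print every group that repeats.
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```

Groups come out separated by blank lines rather than by indentation,
but the information is the same as what find-dups prints.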