finding duplicates?
Kevin Buettner
plug-discuss@lists.plug.phoenix.az.us
Thu, 28 Feb 2002 12:39:38 -0700
On Feb 28, 10:33am, J.Francois wrote:
> On Thu, Feb 28, 2002 at 12:07:38PM -0500, Mike wrote:
> > I am looking for a command I can run on the command line (from cron)
> > which finds/searches (recursively) for duplicate files.
>
> http://www.google.com/search?hl=en&q=linux+find+duplicate+files
> http://www.perlmonks.org/index.pl?node_id=2712&lastnode_id=1747
The script below is similar to the solution on the perlmonks page,
but is perhaps somewhat simpler:
--- find-dups ---
#!/usr/bin/perl -w
use File::Find;
use Digest::MD5 qw(md5_hex);

undef $/;			# slurp entire files

my %h;
find(
    sub {
        if (! -d && -r) {
            # find() has chdir'd into the file's directory, so open by
            # basename ($_); $File::Find::name is only valid relative
            # to the starting directory.
            open F, $_ or return;
            push @{$h{md5_hex(<F>)}}, $File::Find::name;
            close F;
        }
    },
    shift || "."
);
while (my ($k, $v) = each %h) {
    print join("\n ", sort(@$v)), "\n" if @$v > 1;
}
--- end find-dups ---
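One caveat: because the script slurps each file whole (that's what undef $/
does), memory use grows with the largest file scanned. Here is a sketch of a
memory-friendlier variant using Digest::MD5's addfile(), which reads the file
in chunks; the find_dups name and the sub/loop structure are my own, not part
of the script above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Return one array ref per group of identical files under $dir.
sub find_dups {
    my ($dir) = @_;
    my %h;
    find({
        no_chdir => 1,    # keep $File::Find::name usable as an open() path
        wanted   => sub {
            return if -d $File::Find::name || ! -r _;
            open my $fh, '<', $File::Find::name or return;
            binmode $fh;  # digest the raw bytes, chunk by chunk
            push @{ $h{ Digest::MD5->new->addfile($fh)->hexdigest } },
                $File::Find::name;
            close $fh;
        },
    }, $dir);
    return grep { @$_ > 1 } map { [ sort @$_ ] } values %h;
}

# Same "first file, then space-indented twins" layout as find-dups:
if (@ARGV) {
    print join("\n ", @$_), "\n" for find_dups(shift);
}
```

Same output format as before, but a multi-gigabyte file no longer has to fit
in memory.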
When I run it in my ptests directory (which is where I keep most of the
perl scripts that I write before deploying them to some bin directory),
I see the following:
ocotillo:ptests$ ./find-dups
./logfile
 ./logfile2
./flashcards.pl
 ./mathproblems.pl
That is, logfile and logfile2 are identical, and flashcards.pl and
mathproblems.pl are identical. If I do the following...
ocotillo:ptests$ cp find-dups dup1
ocotillo:ptests$ cp find-dups dup2
and then run find-dups again, I see:
ocotillo:ptests$ ./find-dups
./logfile
 ./logfile2
./dup1
 ./dup2
 ./find-dups
./flashcards.pl
 ./mathproblems.pl
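Since the original question mentioned running this from cron, a crontab entry
along these lines would do it; the schedule and paths here are illustrative
assumptions, not something from this thread:

```
# Run nightly at 2:15am; cron mails any duplicate groups to the crontab owner.
15 2 * * * $HOME/bin/find-dups $HOME
```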