I needed to save some space, and my disk had Debian and Ubuntu mirrors. Most source packages are identical, so hardlinking the relevant files was a good way to save a few tens of gigabytes.

The fdupes program was the only tool I found for this that was packaged for Debian. Unfortunately, it can only list or delete duplicates, not hardlink them to each other.

I started writing a program to parse fdupes output and do the hardlinking, but it evolved into a program that did everything.
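The original idea, parsing fdupes output and hardlinking the results, can be sketched roughly as follows. This is an illustration, not the dupfiles code. It assumes the default fdupes output format (one file name per line, with a blank line between sets of duplicates) and file names without embedded newlines. The function names and the ".dupfiles-tmp" suffix are made up for the example.

```python
import os


def parse_fdupes_output(text):
    """Split fdupes-style output into lists of paths, one list per set."""
    groups, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line ends the current set of duplicates.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def hardlink_group(paths):
    """Replace every path after the first with a hardlink to the first."""
    keeper = paths[0]
    for path in paths[1:]:
        if os.path.samefile(keeper, path):
            continue  # already the same inode, nothing to do
        # Create the link under a temporary name (hypothetical suffix),
        # then rename it over the duplicate, so the duplicate is never
        # missing if something fails halfway.
        tmp = path + ".dupfiles-tmp"
        os.link(keeper, tmp)
        os.replace(tmp, path)
```

Hardlinks only work within one filesystem, and the sketch trusts that the listed files really are identical, so it is only as safe as the list it is fed.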

I named it dupfiles.

It has a test suite, but I don't know if it covers all cases. Probably not. It worked for me, but please be careful. Patches are most welcome.

See also:

Downloads:

Benchmark results

I wrote a script to let me run benchmarks semi-easily (speed-test in the source tree). I made a copy of my laptop home directory to a server (about 140 GiB of data), and compared all four tools I know of:

seconds  tool
   30.8  fdupes
   22.9  ./dupfiles
   20.5  finddup
   13.9  hardlink

Times are in seconds. hardlink is the clear winner in this case.

I'd be interested in results from other data sets.

Other tools

finddup is serving me well. It is in the perforate package.

Typical use:

finddup -l -i -d ~/a -d ~/b

-l to link, -i to ignore perms, -d to list several directories.

A tricky case is unifying inodes that already have multiple hardlinks. If you haven't found all the names of an inode and relink only the ones you did find, the old inode stays allocated under its remaining names, and you can actually end up using more space.
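One way to spot that situation, sketched here in Python, is to compare each inode's link count with the number of names the scan actually found for it. This is an illustration only, not how finddup or any of the other tools works, and the function names are invented for the example.

```python
import os
from collections import defaultdict


def group_by_inode(paths):
    """Map (device, inode) to the list of scanned paths naming that inode."""
    by_inode = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        by_inode[(st.st_dev, st.st_ino)].append(path)
    return dict(by_inode)


def inodes_with_unseen_links(paths):
    """Return inode keys that have more hardlinks than names we found.

    Relinking only the found names of such an inode would leave it
    allocated under the names the scan did not see.
    """
    unseen = []
    for key, names in group_by_inode(paths).items():
        if os.lstat(names[0]).st_nlink > len(names):
            unseen.append(key)
    return unseen
```

A cautious tool could skip the inodes this reports, or at least warn about them before relinking.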

Comment by Tobu [getopenid.com] Sun Jul 18 23:31:56 2010

Another program already packaged for Debian is fdupes.

I did a little benchmarking: fdupes is clearly fastest, dupfiles and finddup are about as fast, with dupfiles a bit faster. Most of the time dupfiles spends goes into actually reading the files (posix.read in Python), not comparing, so the way to speed it up would be to figure out ways to read less, but I haven't come up with anything yet.
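For illustration, one common way to read less is to never open files whose size is unique, and to read only a short prefix of the rest before committing to a full comparison. The sketch below is not taken from dupfiles, which may well do something like this already; the function name and the 4096-byte prefix length are assumptions.

```python
import os
from collections import defaultdict


def candidate_groups(paths, prefix_len=4096):
    """Group paths that share both size and first prefix_len bytes.

    The groups are only candidates: files in a group still need a full
    comparison before they can safely be hardlinked.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: this file is never opened
        by_prefix = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                by_prefix[f.read(prefix_len)].append(path)
        groups.extend(g for g in by_prefix.values() if len(g) > 1)
    return groups
```

This helps most when same-sized files differ early on. For true duplicates the whole file still has to be read at least once.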

Comment by Lars Wirzenius Sat Dec 4 09:35:38 2010

There is another tool besides fdupes in Debian, also written in Python: hardlink.

It is difficult to find if you search for obvious terms like "duplicate" or "dedup".

Comment by Sven Sat Jan 15 06:20:48 2011

Sven, thanks for pointing out hardlink! It looks like a very capable tool.

I am now curious about the relative speeds of these tools. Someone(TM) should make a benchmark for this. While I'm not going to do that right now, I wonder what would be a good one? When I developed dupfiles, I used copies of Debian and Ubuntu mirrors, which have many identical files (upstream tarballs). Is that a good benchmark?

The outline of the benchmark would be:

  • set up original data, drop caches, etc
  • run the program to be benchmarked
  • measure time and memory used
  • verify that results are correct
  • verify that each program produces the same result

The benchmark report would be a table listing the measurements for each tool, for easy comparison purposes.
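The measuring and reporting steps of that outline could look roughly like this in Python. It is a sketch under several assumptions: a Unix system (os.wait4 is not available elsewhere), Python 3.9 or later (for os.waitstatus_to_exitcode), and invented function names. Setting up the data, dropping caches, and verifying the results are left out.

```python
import os
import subprocess
import time


def measure(argv):
    """Run one command and return wall-clock time and peak memory use."""
    start = time.perf_counter()
    proc = subprocess.Popen(argv, stdout=subprocess.DEVNULL)
    # os.wait4 returns the resource usage of this particular child.
    _, status, usage = os.wait4(proc.pid, 0)
    elapsed = time.perf_counter() - start
    # Tell the Popen object that the child has already been reaped.
    proc.returncode = os.waitstatus_to_exitcode(status)
    return {
        "seconds": elapsed,
        "max_rss": usage.ru_maxrss,  # kilobytes on Linux, bytes on macOS
        "exit": proc.returncode,
    }


def format_report(results):
    """Format {tool name: measurement} as a table, fastest tool first."""
    lines = ["%-12s %10s %12s" % ("tool", "seconds", "max RSS")]
    ordered = sorted(results.items(), key=lambda item: item[1]["seconds"])
    for name, r in ordered:
        lines.append("%-12s %10.1f %12d" % (name, r["seconds"], r["max_rss"]))
    return "\n".join(lines)
```

Each tool would be measured on a fresh copy of the data, for example with results["fdupes"] = measure(["fdupes", "-r", "/path/to/copy"]), where the path is a placeholder.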

Comment by Lars Wirzenius Sat Jan 15 12:38:54 2011

dupes.py

http://bouncybouncy.net/blog/2008/02/21/how-my-dupe-finding-program-works/

It is a LOT faster than fdupes when you have many files of the same size, but I haven't tested it against newer implementations.

Comment by Justin Thu Jun 2 04:32:35 2011

Since this blog is collecting duplicate finding tools: I tried a few and chose fslint; it provides a GUI that lets you look at the dupes and decide whether to delete, symlink, or leave them alone.

Comment by Steve Sat Jan 21 18:33:47 2012