I needed to save some space, and my disk had Debian and Ubuntu mirrors. Most source packages are identical, so hardlinking the relevant files was a good way to save a few tens of gigabytes.
The fdupes program was the only tool I found for this that was packaged for Debian. Unfortunately, it can only list or delete duplicates, not hardlink them to each other.
I started writing a program to parse fdupes output and do the hardlinking, but it evolved into a program that did everything.
I named it dupfiles.
It has a test suite, but I don't know if it covers all cases. Probably not. Worked for me, but please be careful. Patches most welcome.
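The original idea mentioned above, parsing fdupes output and hardlinking the duplicates, can be sketched roughly as follows. This is not the dupfiles code. It assumes fdupes' default output format (one path per line, groups separated by a blank line) and file names without embedded newlines; the function names and the temporary file suffix are made up for illustration.

```python
import os


def parse_fdupes_output(text):
    """Split fdupes' default output into groups of duplicate paths.

    Assumes one path per line and a blank line between groups, so file
    names containing newlines are not handled.
    """
    groups, current = [], []
    for line in text.splitlines():
        if line:
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def hardlink_group(paths):
    """Replace every path after the first with a hard link to the first."""
    keeper = paths[0]
    for path in paths[1:]:
        if os.path.samefile(keeper, path):
            continue  # already the same inode, nothing to do
        tmp = path + ".dupfiles-tmp"  # hypothetical temporary name
        os.link(keeper, tmp)          # make the new link first...
        os.replace(tmp, path)         # ...then rename it over the duplicate
```

Linking to a temporary name and renaming it over the duplicate means the path never goes missing if the process is interrupted. Hard links only work within one filesystem, so os.link will fail for paths on different devices.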
See also:
Downloads:
- Using stuff on code.liw.fi
- Version control: bzr get http://code.liw.fi/dupfiles/bzr/trunk/
- Tarball and Debian packages: http://code.liw.fi/debian/pool/main/d/dupfiles/
Benchmark results
I wrote a script to let me run benchmarks semi-easily (speed-test in
the source tree). I made a copy of my laptop home directory to a server
(about 140 GiB of data), and compared all four tools I know of:
30.8 fdupes
22.9 ./dupfiles
20.5 finddup
13.9 hardlink
Times are in seconds. hardlink is the clear winner in this case.
I'd be interested in results from other data sets.
Other tools
- pmatch seems very versatile
finddup is serving me well. It is in the perforate package. Typical use:
-l to link, -i to ignore perms, -d to list several directories.
A tricky case is unifying inodes that already have multiple hardlinks. If you haven't found all the inodes and relink the file, you can actually end up using more space.
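One way to reason about the tricky case above is to group the known paths by inode and compare each inode's link count with the number of paths actually found. The data of an inode is only freed once its last link is gone, so relinking a subset of its paths saves nothing by itself. The sketch below is an illustration only, not how finddup or any other tool mentioned here does it, and the function names are hypothetical.

```python
import os
from collections import defaultdict


def group_by_inode(paths):
    """Map (device, inode) to the list of given paths that point at it."""
    by_inode = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        by_inode[(st.st_dev, st.st_ino)].append(path)
    return dict(by_inode)


def all_links_found(by_inode):
    """For each inode, report whether every one of its links is in our list.

    An inode whose st_nlink is larger than the number of paths we found
    still has links elsewhere, so relinking our paths will not free it.
    """
    return {
        key: os.lstat(paths[0]).st_nlink == len(paths)
        for key, paths in by_inode.items()
    }
```

A cautious tool could relink only the inodes for which all_links_found reports True, and leave the others alone or widen the search first.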
Another program already packaged for Debian is fdupes.
I did a little benchmarking: fdupes is clearly fastest, dupfiles and finddup are about as fast, with dupfiles a bit faster. Most of the time dupfiles spends goes into actually reading the files (posix.read in Python), not comparing, so the way to speed it up would be to figure out ways to read less, but I haven't come up with anything yet.
There is another tool besides fdupes in Debian, also written in Python: hardlink.
Difficult to find, if you search for obvious terms like "duplicate" or "dedup" etc.
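On the earlier point about reading less: a common general approach is to narrow the candidates in stages, so that whole files are read only when cheaper checks cannot tell them apart. The sketch below groups by size, then by a hash of the first few kilobytes, then by a hash of the full contents. It is an assumption about what might help, not something dupfiles is known to do, and the names and the 4096-byte head size are arbitrary.

```python
import hashlib
import os
from collections import defaultdict


def _digest(path, limit=None, chunk=1 << 16):
    """SHA-256 of the first `limit` bytes of a file (whole file if None)."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while True:
            n = chunk if remaining is None else min(chunk, remaining)
            if n == 0:
                break
            data = f.read(n)
            if not data:
                break
            h.update(data)
            if remaining is not None:
                remaining -= len(data)
    return h.digest()


def _split(paths, key):
    """Group paths by key(path) and keep only groups with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths, head=4096):
    """Return groups of paths whose contents hash identically."""
    result = []
    for same_size in _split(paths, os.path.getsize):        # no reads at all
        for same_head in _split(same_size, lambda p: _digest(p, head)):
            result.extend(_split(same_head, _digest))       # full read
    return result
```

Files with a unique size are never opened, and files that differ early are read only up to the head size. Equal hashes are treated as equal contents here; a byte-by-byte comparison before hardlinking would be the safer final step.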
Sven, thanks for pointing out hardlink! It looks like a very capable tool.
I am now curious about the relative speeds of these tools. Someone(TM) should make a benchmark for this. While I'm not going to do that right now, I wonder what would be a good one? When I developed dupfiles, I used copies of Debian and Ubuntu mirrors, which have many identical files (upstream tarballs). Is that a good benchmark?
The outline of the benchmark would be:
The benchmark report would be a table listing the measurements for each tool, for easy comparison.
dupes.py
http://bouncybouncy.net/blog/2008/02/21/how-my-dupe-finding-program-works/
It is a LOT faster than fdupes when you have many files of the same size, but I haven't tested it against newer implementations.