A duplicate file finder that may be faster in environments with millions of files and terabytes of data.

pydupes is yet another duplicate file finder, like rdfind, fdupes, and others, that may be faster in environments with millions of files and terabytes of data, or over high-latency filesystems (e.g. NFS).

The algorithm is similar to that of rdfind, but adds threading and consolidates the filtering logic instead of running it as separate passes:

  • traverse the input paths, collecting the inodes and file sizes
  • for each set of files with the same size:
    • further split by matching the first and last 4KB of each file
    • for each non-unique (by size, boundaries) candidate set, compute the SHA-256 hash and emit pairs whose hashes match
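
The steps above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration of the size, boundary, and SHA-256 filtering, not the pydupes implementation; the function names (`boundaries`, `sha256_of`, `find_duplicate_pairs`) and the choice of the first sorted path as the "source" of each pair are assumptions made for the example.

```python
import hashlib
import os
import stat
from collections import defaultdict

BOUNDARY = 4096  # 4KB, the boundary size named in the list above


def boundaries(path, size):
    """Return the first and last 4KB of a file (they overlap for small files)."""
    with open(path, "rb") as f:
        head = f.read(BOUNDARY)
        f.seek(max(0, size - BOUNDARY))
        tail = f.read(BOUNDARY)
    return head, tail


def sha256_of(path, chunk=1 << 20):
    """Hash a file in 1 MiB chunks so large files are never fully loaded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicate_pairs(roots):
    """Hypothetical helper: return (kept, duplicate) path pairs under roots."""
    by_size = defaultdict(list)
    seen_inodes = set()
    for root in roots:
        # os.walk does not descend into symlinked directories by default.
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symlinks and special files
                inode = (st.st_dev, st.st_ino)
                if inode in seen_inodes:
                    continue  # already hardlinked to a file we have seen
                seen_inodes.add(inode)
                by_size[st.st_size].append(path)

    pairs = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_boundary = defaultdict(list)
        for p in paths:
            by_boundary[boundaries(p, size)].append(p)
        for candidates in by_boundary.values():
            if len(candidates) < 2:
                continue
            by_hash = defaultdict(list)
            for p in candidates:
                by_hash[sha256_of(p)].append(p)
            for group in by_hash.values():
                first, *rest = sorted(group)
                pairs.extend((first, dup) for dup in rest)
    return pairs
```

The cheap filters run first so that the full-file hash, which is the expensive step, is only computed for files that already agree on size and on both 4KB boundaries.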

Constraints:

  • traversals do not span multiple devices
  • symlink following is not implemented
  • concurrent modification of a traversed directory could produce false duplicate pairs (modification after hash computation)
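
To make the first two constraints concrete, here is a hedged sketch of a traversal that stays on the starting device and never follows symlinks. It only illustrates the behavior described above; `walk_single_device` is a hypothetical name, and pydupes may implement its traversal differently.

```python
import os


def walk_single_device(root):
    """Yield regular-file paths under root without crossing devices or following symlinks."""
    root_dev = os.lstat(root).st_dev
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_symlink():
                    continue  # symlinks are skipped, never followed
                if entry.is_dir(follow_symlinks=False):
                    # Descend only if the directory lives on the same device as root.
                    if entry.stat(follow_symlinks=False).st_dev == root_dev:
                        stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

Comparing `st_dev` against the root's device is what keeps the walk from wandering into other mounted filesystems.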

Setup

# via pip
pip3 install --user --upgrade pydupes

# or, if pipx is installed, simply:
pipx run pydupes --help

Usage

# Collect counts and stage the duplicate files as null-delimited source-target pairs:
pydupes /path1 /path2 --progress --output dupes.txt

# Sanity check: print (without running) the ln commands that would hardlink all matches:
xargs -0 -n2 echo ln --force --verbose < dupes.txt
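
If the staged output needs to be inspected from Python rather than xargs, the null-delimited pairs can be read back as follows. This is a sketch that assumes the file is a flat sequence of NUL-separated paths alternating source and target, as the xargs -0 -n2 invocation above implies; `read_pairs` is a hypothetical helper, not part of pydupes.

```python
import os


def read_pairs(data: bytes):
    """Split NUL-delimited bytes into (source, target) path pairs."""
    fields = data.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty field left by a trailing NUL
    if len(fields) % 2:
        raise ValueError("expected an even number of paths")
    # os.fsdecode handles file names that are not valid UTF-8.
    return [
        (os.fsdecode(src), os.fsdecode(dst))
        for src, dst in zip(fields[::2], fields[1::2])
    ]
```

For example, `read_pairs(open("dupes.txt", "rb").read())` would return the list of pairs staged by the command above.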

Benchmarks

Hardware: a RAID5 array of 6 spinning disks formatted as ext4, 250GB of memory, Ubuntu 18.04. Peak memory and runtimes were measured via /usr/bin/time -v.

Dataset 1:

  • Directories: ~33k
  • Files: ~14 million, of which ~1 million are duplicates
  • Total size: ~11TB, of which ~300GB is duplicate content

pydupes

  • Elapsed (wall clock) time (h:mm:ss or m:ss): 39:04.73
  • Maximum resident set size (kbytes): 3356936 (~3GB)
INFO:pydupes:Traversing input paths: ['/raid/erik']
INFO:pydupes:Traversal time: 209.6s
INFO:pydupes:Cursory file count: 14416742 (10.9TiB), excluding symlinks and dupe inodes
INFO:pydupes:Directory count: 33376
INFO:pydupes:Number of candidate groups: 720263
INFO:pydupes:Size filter reduced file count to: 14114518 (7.3TiB)
INFO:pydupes:Comparison time: 2134.6s
INFO:pydupes:Total time elapsed: 2344.2s
INFO:pydupes:Number of duplicate files: 936948
INFO:pydupes:Size of duplicate content: 304.1GiB

rdfind

  • Elapsed (wall clock) time (h:mm:ss or m:ss): 1:57:20
  • Maximum resident set size (kbytes): 3636396 (~3GB)
Now scanning "/raid/erik", found 14419182 files.
Now have 14419182 files in total.
Removed 44 files due to nonunique device and inode.
Now removing files with zero size from list...removed 2396 files
Total size is 11961280180699 bytes or 11 TiB
Now sorting on size:removed 301978 files due to unique sizes from list.14114764 files left.
Now eliminating candidates based on first bytes:removed 8678999 files from list.5435765 files left.
Now eliminating candidates based on last bytes:removed 3633992 files from list.1801773 files left.
Now eliminating candidates based on md5 checksum:removed 158638 files from list.1643135 files left.
It seems like you have 1643135 files that are not unique
Totally, 304 GiB can be reduced.

fdupes

Note that this isn't a fair comparison, since fdupes additionally performs a byte-by-byte comparison on MD5 match. Invoked with "fdupes --size --summarize --recurse --quiet".

  • Elapsed (wall clock) time (h:mm:ss or m:ss): 2:58:32
  • Maximum resident set size (kbytes): 3649420 (~3GB)
939588 duplicate files (in 705943 sets), occupying 326547.7 megabytes
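
For a rough sense of scale, the wall-clock times reported above can be converted to seconds and compared. The snippet below only restates the three elapsed times from this section; `to_seconds` is a helper written for the example, and the fdupes ratio carries the fairness caveat already noted.

```python
def to_seconds(clock: str) -> float:
    """Convert an 'h:mm:ss' or 'm:ss' string from /usr/bin/time -v to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Elapsed (wall clock) times copied from the Dataset 1 results above.
elapsed = {"pydupes": "39:04.73", "rdfind": "1:57:20", "fdupes": "2:58:32"}

baseline = to_seconds(elapsed["pydupes"])
ratios = {tool: to_seconds(clock) / baseline for tool, clock in elapsed.items()}

for tool, ratio in ratios.items():
    print(f"{tool}: {ratio:.1f}x the pydupes runtime")
```

On this dataset, rdfind took roughly 3.0 times as long as pydupes and fdupes roughly 4.6 times as long.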
