Find identical files in subdirectories
duplicates
Scan for identical files (duplicates) in subdirectories.
Requirements
- Python >= 3.11
- POSIX (Linux, macOS); MS Windows is not supported.
Installation
$ uv tool install duplicates
Or, if you prefer pipx:
$ pipx install duplicates
Description
To find files with identical content, the given directories are scanned and files of the same size have their SHA-256 fingerprints compared. Two files with identical fingerprints are considered to have the same content. In principle two different files could share a fingerprint (a hash collision), but no SHA-256 collision is known and the probability is negligible in practice.
Large files (≥ 64 KiB) are first compared by a cheap "partial" SHA-256 over their first and last 4 KiB; only files that survive that prefilter are read in full. For collections of large near-duplicates (videos, archives) this avoids reading most of the data.
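The staged comparison described above can be sketched as follows. This is only an illustration of the idea under stated assumptions, not the package's actual implementation: the names find_duplicates, partial_hash, full_hash and the constants are hypothetical, and only the 64 KiB threshold and the 4 KiB head/tail sample are taken from the text.

```python
import hashlib
import os
from collections import defaultdict

PARTIAL_THRESHOLD = 64 * 1024  # files at least this large get the prefilter
CHUNK = 4 * 1024               # bytes sampled from each end of the file


def partial_hash(path, size):
    """SHA-256 over the first and last CHUNK bytes (size must be >= CHUNK)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))
        f.seek(size - CHUNK)
        h.update(f.read(CHUNK))
    return h.hexdigest()


def full_hash(path):
    """SHA-256 over the whole file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def _group(paths, key):
    """Bucket paths by key(path) and keep only buckets with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths):
    """Return lists of paths whose full SHA-256 fingerprints are equal."""
    result = []
    for same_size in _group(paths, os.path.getsize):
        size = os.path.getsize(same_size[0])
        candidates = [same_size]
        if size >= PARTIAL_THRESHOLD:
            # Cheap prefilter: only files that agree on head and tail survive.
            candidates = _group(same_size, lambda p: partial_hash(p, size))
        for cand in candidates:
            result.extend(_group(cand, full_hash))
    return result
```

Files with a unique size are never opened, and large files that differ in their first or last 4 KiB are rejected after reading 8 KiB each. Files that differ only in the middle pass the prefilter and are separated by the full hash.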
Symbolic links and hidden entries are ignored by default. This behavior can
be changed with the CLI options --follow / --hidden or the constructor
options ignore_symlinks / ignore_hidden.
CLI examples
Print a short command overview:
$ duplicates --help
Scan directories dirA, dirB and dirC and report identical files:
$ duplicates dirA dirB dirC
dirA/file01
	dirA/file01.bak
	dirB/file.bak
dirA/file02
	dirB/file02~
The oldest file is printed without indent; identical files are listed indented by a tab. The oldest file is treated as the original.
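If the plain-text report has to be post-processed, the layout described above (original unindented, each duplicate prefixed with one tab) is easy to regroup. The helper below is a hypothetical sketch, not part of the package, and the sample string assumes the grouping implied by the file names in the example output. For anything beyond a quick script, --json is likely the more robust choice.

```python
def parse_report(text):
    """Regroup report lines into (original, [duplicates]) pairs."""
    groups = []
    for line in text.split("\n"):
        if not line:
            continue
        if line.startswith("\t"):
            if not groups:
                raise ValueError("duplicate listed before any original")
            # Strip exactly one leading tab; the rest is the path.
            groups[-1][1].append(line[1:])
        else:
            groups.append((line, []))
    return groups


# Assumed sample in the tab-indented layout described above.
sample = (
    "dirA/file01\n"
    "\tdirA/file01.bak\n"
    "\tdirB/file.bak\n"
    "dirA/file02\n"
    "\tdirB/file02~\n"
)
for original, copies in parse_report(sample):
    print(original, "->", copies)
```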
If you are willing to take risks, you can delete all duplicates at once. I wouldn't dare, but you get the picture:
$ duplicates --dups-only dirA dirB | tr '\n' '\0' | xargs -0 rm --
With --dups-only, all duplicates for one original are printed on a single
line separated by \0 (ASCII NUL). tr turns the line breaks into NUL as well,
so xargs -0 receives every duplicate as a separate argument, including names
that contain spaces.
The pipeline uses no shell-specific syntax, so it works the same in bash and in the fish shell.
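A more cautious route is to list what would be removed before deleting anything. The sketch below splits --dups-only output into groups, assuming the format described above (one line per original, names separated by NUL). The helper name parse_dups_only is hypothetical and the sample bytes are made up for illustration. In a real script the bytes could come from something like subprocess.run(["duplicates", "--dups-only", "dirA", "dirB"], capture_output=True).stdout.

```python
import os


def parse_dups_only(data: bytes) -> list[list[str]]:
    """Split --dups-only output into one list of paths per original."""
    groups = []
    for line in data.split(b"\n"):
        if line:
            # Names on a line are NUL-separated; decode with the filesystem encoding.
            groups.append([os.fsdecode(name) for name in line.split(b"\0") if name])
    return groups


# Made-up sample standing in for the tool's output.
sample = b"dirA/file01.bak\0dirB/file.bak\ndirB/file02~\n"
for group in parse_dups_only(sample):
    for path in group:
        print("would remove:", path)
```

Replacing the print call with os.remove(path) would turn the dry run into an actual deletion.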
JSON output
For scripted consumption, --json emits the full result on stdout
including a statistics block with counts and the scan's elapsed time:
$ duplicates --json dirA dirB
{
  "scanned_paths": ["dirA", "dirB"],
  "duplicates": [
    {
      "hash": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "size": 1234,
      "files": [
        {"path": "dirA/file01", "age": 1700000000.0},
        {"path": "dirA/file01.bak", "age": 1710000000.0}
      ]
    }
  ],
  "statistics": {
    "total_files": 12,
    "unique_files": 10,
    "duplicate_groups": 1,
    "duplicate_copies": 1,
    "duplicate_bytes": 1234,
    "unreadable_files": 0,
    "elapsed_seconds": 0.0123
  }
}
--json is mutually exclusive with --dups-only and --summary.
Combine with --unique to also include the unique files in the output.
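As a sketch of scripted consumption, the snippet below reads a report shaped like the example above and lists, per group, the oldest file, its copies, and the bytes the copies occupy. The function name reclaimable is hypothetical, and the code assumes that "age" is a modification time in Unix seconds, so the smallest value marks the oldest file. The embedded report is a trimmed copy of the sample, not real tool output.

```python
import json

# Trimmed copy of the sample report shown above.
report_text = """
{
  "scanned_paths": ["dirA", "dirB"],
  "duplicates": [
    {
      "hash": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "size": 1234,
      "files": [
        {"path": "dirA/file01", "age": 1700000000.0},
        {"path": "dirA/file01.bak", "age": 1710000000.0}
      ]
    }
  ]
}
"""


def reclaimable(report):
    """Yield (original path, copy paths, bytes held by the copies) per group."""
    for group in report["duplicates"]:
        # Assumption: smaller "age" means older, so the first entry is the original.
        files = sorted(group["files"], key=lambda f: f["age"])
        original, copies = files[0], files[1:]
        yield original["path"], [f["path"] for f in copies], group["size"] * len(copies)


for original, copies, nbytes in reclaimable(json.loads(report_text)):
    print(f"{original}: {len(copies)} copy/copies, {nbytes} bytes")
```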
Progress on long scans
--verbose writes phase markers and per-file log lines to stderr. This is
useful when scanning a slow filesystem (an SMB share, a large library),
where the tool might otherwise look stuck:
INFO: Scanning 1 path(s)...
INFO: Scanned 1284 file(s) so far...
INFO: Discovered 4012 file(s) in 3987 size group(s)
INFO: Partial-hashing 12 file(s)...
INFO: Partial-hashing /films/movie.mp4 (5.2 GiB)
INFO: Full-hashing 4 file(s)...
INFO: Full-hashing /films/movie.mp4 (5.2 GiB)
--debug adds per-directory and per-ignored-entry messages on top.
Python API
from duplicates import DupFinder
uniq, dups, unreadable = DupFinder().scan(".")
uniq is a list of unique FileEntry objects. dups is a list of duplicate
groups, where each group is a list of FileEntry objects with identical
content. Use entry.age to identify the oldest file in a group. unreadable
collects files that could not be fingerprinted (permission denied, I/O error);
they cannot be classified and are returned separately instead of being
silently dropped.
A FileEntry is a dataclass with the following fields:
- path: a pathlib.Path
- size: file size in bytes
- age: modification time in seconds (Unix time)
- hash: the SHA-256 fingerprint (None for unique files where no hash was needed)
Progress messages are emitted via the logging module on the duplicates
logger; configure logging in your application to see them.
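One possible way to surface these messages is shown below. It assumes the logger is named exactly "duplicates" and that progress messages are logged at INFO level, as the INFO: prefix in the sample output suggests. The helper enable_progress is hypothetical.

```python
import logging
import sys


def enable_progress(level=logging.INFO):
    """Attach a stderr handler to the 'duplicates' logger (assumed name)."""
    logger = logging.getLogger("duplicates")
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


# Call once before scanning; pass logging.DEBUG for more detail.
enable_progress()
```

Applications that already call logging.basicConfig only need the setLevel call on the "duplicates" logger, since its records propagate to the root handlers.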
Development
$ uv sync
$ uv run pytest
$ uv run ruff check .
$ uv run basedpyright