A simple, safe, zero-dependency duplicate file finder for Python.
dupefinder
dupefinder is a small, zero-dependency Python library and CLI tool for finding duplicate files using content hashes.
Requires Python 3.10 or later. Detects exact duplicates (identical byte content) only.
Features
- Simple: one function for common use, a full report for advanced use.
- Safe by default: read-only. Never deletes, moves, or modifies files.
- Zero dependency: uses only the Python standard library.
- Modular: each responsibility lives in its own module.
- Memory-friendly: files are hashed in configurable chunks, not loaded fully into RAM.
- Fast: groups files by size before hashing, so only files that share a size with at least one other file are hashed.
- Typed: ships with inline type annotations.
- Observable: typed event system for progress callbacks and integrations.
- Cancellable: abort scans via a callback or timeout.
- Cached: optional SQLite hash cache for repeated scans.
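The size-first grouping and chunked hashing mentioned in the list above can be sketched with the standard library alone. This is an illustration of the general technique, not dupefinder's actual code; the names hash_file and group_duplicates are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so it is never fully loaded into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Group by size first, then hash only files that share a size."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)

    groups: list[list[Path]] = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in same_size:
            by_hash[hash_file(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Files with a unique size are never opened, which is where most of the saving comes from on directories with few duplicates.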
Installation
pip install dupefinder
For development:
git clone https://github.com/igors93/dupefinder.git
cd dupefinder
pip install -e ".[dev]"
Quick start
As a library
from dupefinder import find_duplicates

groups = find_duplicates("./Downloads")
for group in groups:
    print(f"{group.count} duplicate files, {group.size} bytes each")
    for path in group.files:
        print(f"  {path}")
Full report
from dupefinder import scan
from dupefinder.models import ScanOptions
from dupefinder.report import format_report

report = scan(
    "./Downloads",
    options=ScanOptions(
        min_size=1024,           # ignore files smaller than 1 KB
        ignore_hidden=True,      # skip dotfiles and dotfolders
        follow_symlinks=False,   # safe default
    ),
)
print(format_report(report))
print(f"Wasted space: {report.total_wasted_space} bytes")
JSON output
from dupefinder import scan
from dupefinder.report import report_to_json

report = scan("./Downloads")
print(report_to_json(report))
DupeFinder with events
from dupefinder import DupeFinder, ScanOptions

def on_event(event):
    if event.type == "file_discovered":
        print(f"\rFound {event.scanned_files} files...", end="", flush=True)
    elif event.type == "issue":
        print(f"\nWarning: {event.message}")
    elif event.type == "scan_completed":
        print(f"\nDone in {event.elapsed_seconds:.2f}s")

finder = DupeFinder(
    options=ScanOptions(min_size=1024),
    on_event=on_event,
)
report = finder.scan("./Downloads")
Progress callback
from dupefinder import DupeFinder, ScanOptions

def on_progress(progress):
    print(f"[{progress.phase}] {progress.scanned_files} files scanned, "
          f"{progress.hashed_files}/{progress.total_candidates} hashed")

finder = DupeFinder(
    options=ScanOptions(min_size=1024),
    on_progress=on_progress,
)
report = finder.scan("./Downloads")
print(f"Total bytes read: {report.total_bytes_read:,}")
Cancellation
import threading
from dupefinder import DupeFinder

cancel_flag = threading.Event()
threading.Timer(5.0, cancel_flag.set).start()  # cancel after 5 seconds

finder = DupeFinder(should_cancel=cancel_flag.is_set)
report = finder.scan("./Downloads")
if report.cancelled:
    print(f"Cancelled after {report.elapsed_seconds:.2f}s (partial results)")
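For readers wondering how a should_cancel callback and a timeout can stop a long-running loop, here is a minimal sketch of cooperative cancellation. It is not the engine's code: process_until_cancelled is a hypothetical helper, and checking between items (rather than mid-file) is an assumption.

```python
from __future__ import annotations

import time
from collections.abc import Callable, Iterable


def process_until_cancelled(
    items: Iterable[str],
    should_cancel: Callable[[], bool] | None = None,
    timeout_seconds: float | None = None,
) -> tuple[list[str], bool]:
    """Handle items one by one, checking for cancellation before each item.

    Returns the items handled so far and a flag telling whether the loop
    stopped early.
    """
    started = time.monotonic()
    done: list[str] = []
    for item in items:
        if should_cancel is not None and should_cancel():
            return done, True
        if timeout_seconds is not None and time.monotonic() - started >= timeout_seconds:
            return done, True
        done.append(item)  # stand-in for the real work, e.g. hashing a file
    return done, False
```

Because the callback is polled rather than interrupting the loop, whatever was finished before the stop can still be returned as a partial result.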
SQLite cache
from dupefinder import DupeFinder
from dupefinder.cache import SQLiteHashCache

with SQLiteHashCache(".dupefinder-cache.sqlite") as cache:
    finder = DupeFinder(cache=cache)
    report = finder.scan("./media")  # repeated runs can reuse cached hashes
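To show why a hash cache speeds up repeated scans, the sketch below stores digests in SQLite and reuses them while a file's size and modification time are unchanged. The class TinyHashCache, its table layout, and the (path, size, mtime) key are all assumptions made for illustration; the README does not describe how SQLiteHashCache stores or validates entries.

```python
import hashlib
import sqlite3
from pathlib import Path


class TinyHashCache:
    """Hypothetical cache: reuse a digest while size and mtime are unchanged."""

    def __init__(self, db_path: str) -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS hashes ("
            "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, digest TEXT)"
        )

    def get_or_compute(self, path: Path) -> str:
        stat = path.stat()
        row = self._conn.execute(
            "SELECT digest FROM hashes WHERE path = ? AND size = ? AND mtime_ns = ?",
            (str(path), stat.st_size, stat.st_mtime_ns),
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: the file content is not read again
        # Whole-file read for brevity; a real tool would hash in chunks.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        self._conn.execute(
            "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
            (str(path), stat.st_size, stat.st_mtime_ns, digest),
        )
        self._conn.commit()
        return digest

    def close(self) -> None:
        self._conn.close()
```

A changed size or modification time makes the lookup miss, so the stale row is replaced on the next call.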
Scan limits
from dupefinder import DupeFinder, ScanOptions

finder = DupeFinder(options=ScanOptions(
    max_files=1000,         # stop after 1000 files
    max_depth=3,            # scan at most 3 levels deep
    timeout_seconds=30.0,   # stop after 30 seconds
))
report = finder.scan("./data")
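As a rough picture of how limits like max_files and max_depth can be enforced during file discovery, here is a small os.scandir walk. It is a sketch under stated assumptions, not the library's scanner: walk_limited is a hypothetical name, and treating the root directory as depth 0 is an assumption about how depth is counted.

```python
from __future__ import annotations

import os
from collections.abc import Iterator


def walk_limited(
    root: str,
    max_depth: int | None = None,
    max_files: int | None = None,
    ignore_hidden: bool = True,
) -> Iterator[str]:
    """Yield regular-file paths under root without following symlinks."""
    found = 0
    stack = [(root, 0)]  # (directory, depth); the root counts as depth 0 here
    while stack:
        directory, depth = stack.pop()
        try:
            with os.scandir(directory) as it:
                entries = sorted(it, key=lambda entry: entry.name)
        except OSError:
            continue  # unreadable directory: skip it and keep going
        for entry in entries:
            if ignore_hidden and entry.name.startswith("."):
                continue
            if entry.is_file(follow_symlinks=False):
                if max_files is not None and found >= max_files:
                    return  # file limit reached: stop discovery
                found += 1
                yield entry.path
            elif entry.is_dir(follow_symlinks=False):
                if max_depth is None or depth + 1 <= max_depth:
                    stack.append((entry.path, depth + 1))
```

Symlinked files and directories are skipped entirely in this sketch, which also sidesteps symlink loops.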
CLI
# Basic scan
dupefinder ./Downloads
# JSON output
dupefinder ./Downloads --json
# Ignore files smaller than 1 MB
dupefinder ./Downloads --min-size 1MB
# Only scan images
dupefinder ./Pictures --include-ext .jpg,.jpeg,.png,.webp
# Ignore temp and log files
dupefinder . --ignore-ext .tmp,.log
# Exit with code 2 if any duplicates are found (useful in scripts/CI)
dupefinder . --fail-on-duplicates
# Strict mode: raise errors instead of skipping inaccessible files
dupefinder . --strict
# Follow symbolic links (disabled by default)
dupefinder . --follow-symlinks
Run dupefinder --help to see all options.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Scan completed. Non-fatal issues may still be present in the report. |
| 1 | Scan failed because of an invalid option, invalid path, cache error, or strict-mode error. |
| 2 | Scan completed and duplicates were found while --fail-on-duplicates was enabled. |
| 3 | Scan was cancelled or stopped by the configured timeout. |
Notes:
- --strict turns otherwise non-fatal file access errors into exit code 1.
- JSON output can contain an issues list even when the exit code is 0.
- Exit code 3 takes priority over --fail-on-duplicates.
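In a script or CI job, the exit codes above can be handled from Python with subprocess. The sketch below uses only flags listed in this README; the helper names are hypothetical, and it assumes the JSON report is written to standard output.

```python
import subprocess

# Meanings taken from the exit code table above.
EXIT_MEANINGS = {
    0: "completed",
    1: "failed",
    2: "duplicates found",
    3: "cancelled or timed out",
}


def describe_exit_code(code: int) -> str:
    """Map a dupefinder exit code to a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_dupefinder(path: str) -> tuple[int, str]:
    """Run the CLI with --json and --fail-on-duplicates; return (code, stdout)."""
    result = subprocess.run(
        ["dupefinder", path, "--json", "--fail-on-duplicates"],
        capture_output=True,
        text=True,
        check=False,  # non-zero codes are expected, so do not raise on them
    )
    return result.returncode, result.stdout
```

Calling describe_exit_code on the returned code lets a pipeline distinguish "duplicates found" (2) from a genuine failure (1).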
CLI reference
| Flag | Description |
|---|---|
| path | File or directory to scan |
| --algorithm | Hash algorithm (default: sha256) |
| --chunk-size | Read chunk size, e.g. 1MB (default: 1 MiB) |
| --min-size | Skip files smaller than this, e.g. 10KB |
| --max-size | Skip files larger than this, e.g. 5GB |
| --include-ext | Only scan these extensions, e.g. .jpg,.png |
| --ignore-ext | Skip these extensions, e.g. .tmp,.log |
| --no-ignore-hidden | Do not skip hidden dotfiles and dotfolders |
| --follow-symlinks | Follow symbolic links |
| --max-files N | Stop after discovering N files |
| --max-depth N | Maximum directory depth to scan |
| --timeout SECONDS | Stop scan after this many seconds |
| --cache PATH | SQLite cache file for file hashes |
| --progress | Print progress to stderr |
| --strict | Raise errors instead of skipping bad files |
| --json | Print JSON output |
| --fail-on-duplicates | Exit with code 2 when duplicates are found |
| --version | Show version and exit |
API summary
| Symbol | Description |
|---|---|
| find_duplicates(path, options) | Return a tuple of DuplicateGroup |
| scan(path, options) | Return a full ScanReport |
| DupeFinder | Engine with events, progress, cache, and cancellation |
| ScanEvent | Typed event emitted during scanning |
| ScanProgress | Simplified progress snapshot for the on_progress callback |
| ScanOptions | Frozen dataclass with all scan settings |
| ScanReport | Result of a scan: groups, counts, issues, bytes read |
| DuplicateGroup | One group of files with identical content |
| FileInfo | Path and size of a single file |
| ScanIssue | A non-fatal error recorded during a scan |
| SQLiteHashCache | SQLite-backed hash cache (import from dupefinder.cache) |
JSON schema
All JSON output includes a schema_version field (currently "1.1") for forward compatibility.
See docs/api.md for the full reference.
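A consumer of the JSON output might check schema_version before reading anything else. The following is a minimal sketch: check_schema_version is a hypothetical helper, and both the top-level placement of the field and the "major.minor" reading of "1.1" are assumptions rather than documented guarantees.

```python
import json


def check_schema_version(raw_json: str, supported_major: int = 1) -> str:
    """Return schema_version if its major part is supported, else raise."""
    data = json.loads(raw_json)
    version = data.get("schema_version")
    if not isinstance(version, str):
        raise ValueError("missing schema_version")
    # Assumption: the part before the first dot is a major version number.
    major = int(version.split(".")[0])
    if major != supported_major:
        raise ValueError(f"unsupported schema_version {version!r}")
    return version
```

Failing early on an unknown major version avoids silently misreading fields whose meaning may have changed.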
Project structure
src/dupefinder/
├── api.py public functions: scan, find_duplicates
├── cli.py terminal command
├── engine.py DupeFinder class: events, cancellation, cache
├── events.py ScanEvent dataclass
├── cache.py HashCache protocol and SQLiteHashCache
├── scanner.py file discovery (os.scandir, loop detection, max_depth)
├── hashing.py chunked file hashing with optional cache
├── grouping.py group by size then by hash
├── filters.py ignore/include rules
├── models.py frozen dataclasses
├── report.py text and JSON output
├── safety.py path/options validation, helpers
├── constants.py default values
└── errors.py custom exceptions
Safety and design constraints
dupefinder is intentionally read-only:
- Does not delete, move, or rename files.
- Does not connect to the internet.
- Does not follow symbolic links by default. Pass --follow-symlinks or ScanOptions(follow_symlinks=True) to opt in.
- Reads files in chunks, so no large allocations are needed.
- Permission errors are recorded and skipped by default.
- SQLite cache writes occur only when the user explicitly passes --cache PATH or constructs SQLiteHashCache. The cache file is written to the path chosen by the user.
- Detects exact duplicates only: files with identical byte content. Near-duplicates, similar images, or renamed files are not detected.
See SECURITY.md for more details.
Running tests
pytest
With coverage:
pytest --cov=dupefinder --cov-report=term-missing
Contributing
Contributions are welcome. Please open an issue first to discuss what you want to change.
- Fork the repository.
- Create a branch: git checkout -b feature/your-feature.
- Make your changes and add tests.
- Run pytest and make sure all tests pass.
- Open a pull request.
License
MIT — Igor Souza