A simple, safe, zero-dependency duplicate file finder for Python.

dupefinder

dupefinder is a small, zero-dependency Python library and CLI tool for finding duplicate files using content hashes.

Requires Python 3.10 or later. Detects exact duplicates (identical byte content) only.

Features

  • Simple: one function for common use, a full report for advanced use.
  • Safe by default: read-only. Never deletes, moves, or modifies files.
  • Zero dependencies: uses only the Python standard library.
  • Modular: each responsibility lives in its own module.
  • Memory-friendly: files are hashed in configurable chunks, not loaded fully into RAM.
  • Fast: groups files by size before hashing, so only files that share a size with at least one other file are hashed.
  • Typed: ships with inline type annotations.
  • Observable: typed event system for progress callbacks and integrations.
  • Cancellable: abort scans via a callback or timeout.
  • Cached: optional SQLite hash cache for repeated scans.
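
The size-first grouping and chunked hashing listed above can be illustrated with a short standard-library sketch. This is not dupefinder's own code: the names hash_file and group_duplicates are hypothetical, and the SHA-256 algorithm and 1 MiB chunk size are chosen only to mirror the documented defaults.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so it is never fully loaded into RAM."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Group paths by size first, then hash only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[hash_file(path)].append(path)
        # keep only hashes that are shared by more than one file
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Because a file with a unique size is never opened, the cost of a scan in this sketch is dominated by the files that actually have same-size candidates.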

Installation

pip install dupefinder

For development:

git clone https://github.com/igors93/dupefinder.git
cd dupefinder
pip install -e ".[dev]"

Quick start

As a library

from dupefinder import find_duplicates

groups = find_duplicates("./Downloads")

for group in groups:
    print(f"{group.count} duplicate files — {group.size} bytes each")
    for path in group.files:
        print(f"  {path}")

Full report

from dupefinder import scan
from dupefinder.models import ScanOptions
from dupefinder.report import format_report

report = scan(
    "./Downloads",
    options=ScanOptions(
        min_size=1024,          # ignore files smaller than 1 KB
        ignore_hidden=True,     # skip dotfiles and dotfolders
        follow_symlinks=False,  # safe default
    ),
)

print(format_report(report))
print(f"Wasted space: {report.total_wasted_space} bytes")

JSON output

from dupefinder import scan
from dupefinder.report import report_to_json

report = scan("./Downloads")
print(report_to_json(report))

DupeFinder with events

from dupefinder import DupeFinder, ScanOptions

def on_event(event):
    if event.type == "file_discovered":
        print(f"\rFound {event.scanned_files} files...", end="", flush=True)
    elif event.type == "issue":
        print(f"\nWarning: {event.message}")
    elif event.type == "scan_completed":
        print(f"\nDone in {event.elapsed_seconds:.2f}s")

finder = DupeFinder(
    options=ScanOptions(min_size=1024),
    on_event=on_event,
)
report = finder.scan("./Downloads")

Progress callback

from dupefinder import DupeFinder, ScanOptions

def on_progress(progress):
    print(f"[{progress.phase}] {progress.scanned_files} files scanned, "
          f"{progress.hashed_files}/{progress.total_candidates} hashed")

finder = DupeFinder(
    options=ScanOptions(min_size=1024),
    on_progress=on_progress,
)
report = finder.scan("./Downloads")
print(f"Total bytes read: {report.total_bytes_read:,}")

Cancellation

import threading
from dupefinder import DupeFinder

cancel_flag = threading.Event()
threading.Timer(5.0, cancel_flag.set).start()  # cancel after 5 seconds

finder = DupeFinder(should_cancel=cancel_flag.is_set)
report = finder.scan("./Downloads")

if report.cancelled:
    print(f"Cancelled after {report.elapsed_seconds:.2f}s — partial results")

SQLite cache

from dupefinder import DupeFinder
from dupefinder.cache import SQLiteHashCache

with SQLiteHashCache(".dupefinder-cache.sqlite") as cache:
    finder = DupeFinder(cache=cache)
    report = finder.scan("./media")  # repeat scans can reuse cached hashes
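
To show why a hash cache can speed up repeated scans, here is a minimal sketch of one possible design. It is an assumption, not the schema used by SQLiteHashCache: the class name TinyHashCache is hypothetical, and keying entries on path, size, and modification time is just one common way to detect that a file has not changed.

```python
import sqlite3


class TinyHashCache:
    """Hypothetical cache: a stored hash is reused only if size and mtime match."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS hashes ("
            "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, digest TEXT)"
        )

    def get(self, path, size, mtime_ns):
        """Return the cached digest, or None if the file looks changed or unknown."""
        row = self.conn.execute(
            "SELECT digest FROM hashes WHERE path = ? AND size = ? AND mtime_ns = ?",
            (path, size, mtime_ns),
        ).fetchone()
        return row[0] if row else None

    def put(self, path, size, mtime_ns, digest):
        """Store or overwrite the digest recorded for a path."""
        self.conn.execute(
            "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
            (path, size, mtime_ns, digest),
        )
        self.conn.commit()
```

A cache miss (None) means the file must be hashed again; a hit means the file content does not need to be read at all.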

Scan limits

from dupefinder import DupeFinder, ScanOptions

finder = DupeFinder(options=ScanOptions(
    max_files=1000,       # stop after 1000 files
    max_depth=3,          # scan at most 3 levels deep
    timeout_seconds=30.0, # stop after 30 seconds
))
report = finder.scan("./data")

CLI

# Basic scan
dupefinder ./Downloads

# JSON output
dupefinder ./Downloads --json

# Ignore files smaller than 1 MB
dupefinder ./Downloads --min-size 1MB

# Only scan images
dupefinder ./Pictures --include-ext .jpg,.jpeg,.png,.webp

# Ignore temp and log files
dupefinder . --ignore-ext .tmp,.log

# Exit with code 2 if any duplicates are found (useful in scripts/CI)
dupefinder . --fail-on-duplicates

# Strict mode: raise errors instead of skipping inaccessible files
dupefinder . --strict

# Follow symbolic links (disabled by default)
dupefinder . --follow-symlinks

Run dupefinder --help to see all options.

Exit codes

Code  Meaning
0     Scan completed. Non-fatal issues may still be present in the report.
1     Scan failed because of an invalid option, invalid path, cache error, or strict-mode error.
2     Scan completed and duplicates were found while --fail-on-duplicates was enabled.
3     Scan was cancelled or stopped by the configured timeout.

Notes:

  • --strict turns otherwise non-fatal file access errors into exit code 1.
  • JSON output can contain an issues list even when the exit code is 0.
  • Exit code 3 takes priority over --fail-on-duplicates.
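
For scripts that wrap the CLI, the documented exit codes can be mapped to labels with a small helper. This is a sketch only: the function name describe_exit_code and the label strings are hypothetical, while the numeric codes come from the exit-code list in this section.

```python
EXIT_MEANINGS = {
    0: "completed",
    1: "failed",
    2: "duplicates found",
    3: "cancelled or timed out",
}


def describe_exit_code(code):
    """Map a documented dupefinder exit code to a short label."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")
```

The input would typically be the returncode attribute of subprocess.run(["dupefinder", ".", "--fail-on-duplicates"]).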

CLI reference

Flag                  Description
path                  File or directory to scan
--algorithm           Hash algorithm (default: sha256)
--chunk-size          Read chunk size, e.g. 1MB (default: 1 MiB)
--min-size            Skip files smaller than this, e.g. 10KB
--max-size            Skip files larger than this, e.g. 5GB
--include-ext         Only scan these extensions, e.g. .jpg,.png
--ignore-ext          Skip these extensions, e.g. .tmp,.log
--no-ignore-hidden    Do not skip hidden dotfiles and dotfolders
--follow-symlinks     Follow symbolic links
--max-files N         Stop after discovering N files
--max-depth N         Maximum directory depth to scan
--timeout SECONDS     Stop scan after this many seconds
--cache PATH          SQLite cache file for file hashes
--progress            Print progress to stderr
--strict              Raise errors instead of skipping bad files
--json                Print JSON output
--fail-on-duplicates  Exit with code 2 when duplicates are found
--version             Show version and exit

API summary

Symbol                          Description
find_duplicates(path, options)  Returns a tuple of DuplicateGroup objects
scan(path, options)             Returns a full ScanReport
DupeFinder                      Engine with events, progress, cache, and cancellation
ScanEvent                       Typed event emitted during scanning
ScanProgress                    Simplified progress snapshot for the on_progress callback
ScanOptions                     Frozen dataclass with all scan settings
ScanReport                      Result of a scan: groups, counts, issues, bytes read
DuplicateGroup                  One group of files with identical content
FileInfo                        Path and size of a single file
ScanIssue                       A non-fatal error recorded during a scan
SQLiteHashCache                 SQLite-backed hash cache (import from dupefinder.cache)

JSON schema

All JSON output includes a schema_version field (currently "1.1") for forward compatibility.
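
A consumer of the JSON output might guard against incompatible schema changes with a check like the one below. This is a sketch under assumptions: the helper name check_schema_version is hypothetical, and treating the part before the dot as a major version that must match is a convention assumed here, not something the project states. Only the schema_version field itself is taken from the documentation.

```python
import json


def check_schema_version(raw_json, supported_major=1):
    """Return the parsed report if its schema_version major matches, else raise."""
    data = json.loads(raw_json)
    version = str(data.get("schema_version", ""))
    major = version.split(".")[0]  # "1.1" -> "1" (assumed major.minor layout)
    if major != str(supported_major):
        raise ValueError(f"unsupported schema_version: {version!r}")
    return data
```

The string passed in could be the value returned by report_to_json(report) or the captured output of the CLI run with --json.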

See docs/api.md for the full reference.

Project structure

src/dupefinder/
├── api.py        public functions: scan, find_duplicates
├── cli.py        terminal command
├── engine.py     DupeFinder class: events, cancellation, cache
├── events.py     ScanEvent dataclass
├── cache.py      HashCache protocol and SQLiteHashCache
├── scanner.py    file discovery (os.scandir, loop detection, max_depth)
├── hashing.py    chunked file hashing with optional cache
├── grouping.py   group by size then by hash
├── filters.py    ignore/include rules
├── models.py     frozen dataclasses
├── report.py     text and JSON output
├── safety.py     path/options validation, helpers
├── constants.py  default values
└── errors.py     custom exceptions

Safety and design constraints

dupefinder is intentionally read-only:

  • Does not delete, move, or rename files.
  • Does not connect to the internet.
  • Does not follow symbolic links by default. Pass --follow-symlinks or ScanOptions(follow_symlinks=True) to opt in.
  • Reads files in chunks — no large allocations.
  • Permission errors are recorded and skipped by default.
  • SQLite cache writes occur only when the user explicitly passes --cache PATH or constructs SQLiteHashCache. The cache file is written to the path chosen by the user.
  • Detects exact duplicates only: files with identical byte content. Near-duplicates and visually similar images are not detected. File names are not compared, so matching is based on content alone.
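
The read-only traversal rules above (no symlink following, errors recorded and skipped, optional depth limit) can be sketched with os.scandir. This is an illustration, not the code in scanner.py: the function name walk_files, the issues list of (path, message) tuples, and the depth convention (the root is depth 0) are all assumptions.

```python
import os


def walk_files(root, max_depth=None, issues=None):
    """Yield regular file paths under root without following symlinks."""
    issues = issues if issues is not None else []
    stack = [(root, 0)]  # (directory, depth); the root is depth 0
    while stack:
        directory, depth = stack.pop()
        try:
            with os.scandir(directory) as entries:
                for entry in entries:
                    if entry.is_symlink():
                        continue  # never follow links in this sketch
                    if entry.is_file(follow_symlinks=False):
                        yield entry.path
                    elif entry.is_dir(follow_symlinks=False):
                        if max_depth is None or depth < max_depth:
                            stack.append((entry.path, depth + 1))
        except OSError as exc:  # includes PermissionError
            issues.append((directory, str(exc)))  # record and keep going
```

Skipping symlinks entirely also avoids directory loops, which is one reason not following links is a safe default.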

See SECURITY.md for more details.

Running tests

pytest

With coverage:

pytest --cov=dupefinder --cov-report=term-missing

Contributing

Contributions are welcome. Please open an issue first to discuss what you want to change.

  1. Fork the repository.
  2. Create a branch: git checkout -b feature/your-feature.
  3. Make your changes and add tests.
  4. Run pytest and make sure all tests pass.
  5. Open a pull request.

License

MIT — Igor Souza
