A simple, safe, zero-dependency duplicate file finder for Python.

dupefinder

dupefinder is a small, zero-dependency Python library and CLI tool for finding duplicate files using content hashes.

Requires Python 3.10 or later. Detects exact duplicates (identical byte content) only.

Features

Simple: one function for common use, a full report for advanced use.
Safe by default: read-only. Never deletes, moves, or modifies files.
Zero dependency: uses only the Python standard library.
Modular: each responsibility lives in its own module.
Memory-friendly: files are hashed in configurable chunks, not loaded fully into RAM.
Fast: groups files by size before hashing, so only files that share a size with at least one other file are hashed.
Typed: ships with inline type annotations.
Observable: typed event system for progress callbacks and integrations.
Cancellable: abort scans via a callback or timeout.
Cached: optional SQLite hash cache for repeated scans.
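
The size-first grouping and chunked hashing described above can be sketched with the standard library alone. This is a minimal illustration of the technique, not dupefinder's actual code: the function names `hash_file` and `group_duplicates` are hypothetical, and SHA-256 with a 1 MiB chunk size is used only because those are the defaults listed in the CLI reference.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so it is never fully loaded into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Group by size first, then hash only files that share a size."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)

    groups: list[list[Path]] = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in same_size:
            by_hash[hash_file(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Files with a unique size are never opened, which is where most of the saving comes from on directories with few duplicates.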

Installation

pip install dupefinder

For development:

git clone https://github.com/igors93/dupefinder.git
cd dupefinder
pip install -e ".[dev]"

Quick start

As a library

from dupefinder import find_duplicates

groups = find_duplicates("./Downloads")

for group in groups:
    print(f"{group.count} duplicate files — {group.size} bytes each")
    for path in group.files:
        print(f"  {path}")

Full report

from dupefinder import scan
from dupefinder.models import ScanOptions
from dupefinder.report import format_report

report = scan(
    "./Downloads",
    options=ScanOptions(
        min_size=1024,          # ignore files smaller than 1 KB
        ignore_hidden=True,     # skip dotfiles and dotfolders
        follow_symlinks=False,  # safe default
    ),
)

print(format_report(report))
print(f"Wasted space: {report.total_wasted_space} bytes")
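
For readers who want a feel for what a wasted-space figure represents, here is a rough sketch. It assumes that one copy per group is kept and every additional copy counts as wasted; whether `report.total_wasted_space` is computed exactly this way is not confirmed here. `GroupSummary` is a hypothetical stand-in that mirrors only the `count` and `size` fields used in the quick start, and `human_size` is an illustrative helper, not part of dupefinder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSummary:
    """Hypothetical stand-in exposing the two fields used in the quick start."""

    count: int  # number of files with identical content
    size: int  # size of each file in bytes


def wasted_bytes(groups: list[GroupSummary]) -> int:
    """Assume one copy per group is kept; every other copy counts as wasted."""
    return sum((group.count - 1) * group.size for group in groups)


def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB"):
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} GiB"
```

With three 1000-byte copies and two 500-byte copies, `wasted_bytes` gives 2500 under this assumption.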

JSON output

from dupefinder import scan
from dupefinder.report import report_to_json

report = scan("./Downloads")
print(report_to_json(report))

DupeFinder with events

from dupefinder import DupeFinder, ScanOptions

def on_event(event):
    if event.type == "file_discovered":
        print(f"\rFound {event.scanned_files} files...", end="", flush=True)
    elif event.type == "issue":
        print(f"\nWarning: {event.message}")
    elif event.type == "scan_completed":
        print(f"\nDone in {event.elapsed_seconds:.2f}s")

finder = DupeFinder(
    options=ScanOptions(min_size=1024),
    on_event=on_event,
)
report = finder.scan("./Downloads")

Progress callback

from dupefinder import DupeFinder, ScanOptions

def on_progress(progress):
    print(f"[{progress.phase}] {progress.scanned_files} files scanned, "
          f"{progress.hashed_files}/{progress.total_candidates} hashed")

finder = DupeFinder(
    options=ScanOptions(min_size=1024),
    on_progress=on_progress,
)
report = finder.scan("./Downloads")
print(f"Total bytes read: {report.total_bytes_read:,}")

Cancellation

import threading
from dupefinder import DupeFinder

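cancel_flag = threading.Event()
timer = threading.Timer(5.0, cancel_flag.set)  # request cancellation after 5 seconds
timer.start()

finder = DupeFinder(should_cancel=cancel_flag.is_set)
try:
    report = finder.scan("./Downloads")
finally:
    timer.cancel()  # stop the timer if the scan finishes first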

if report.cancelled:
    print(f"Cancelled after {report.elapsed_seconds:.2f}s — partial results")

SQLite cache

from dupefinder import DupeFinder
from dupefinder.cache import SQLiteHashCache

with SQLiteHashCache(".dupefinder-cache.sqlite") as cache:
    finder = DupeFinder(cache=cache)
    report = finder.scan("./media")  # later runs can reuse hashes stored in the cache
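
To illustrate why a hash cache helps on repeated scans, the sketch below stores digests in SQLite and reuses them while a file's size and modification time are unchanged. Everything here is an assumption made for illustration: `TinyHashCache`, its table layout, and the size-plus-mtime invalidation rule are hypothetical and are not taken from `SQLiteHashCache`.

```python
import hashlib
import os
import sqlite3


class TinyHashCache:
    """Hypothetical cache keyed by path, size, and modification time."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS hashes ("
            "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, digest TEXT)"
        )
        self.misses = 0  # counts how often a file actually had to be hashed

    def digest_for(self, path: str) -> str:
        stat = os.stat(path)
        row = self.conn.execute(
            "SELECT digest FROM hashes WHERE path = ? AND size = ? AND mtime_ns = ?",
            (path, stat.st_size, stat.st_mtime_ns),
        ).fetchone()
        if row is not None:
            return row[0]  # file looks unchanged: reuse the stored hash
        self.misses += 1
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            while chunk := handle.read(1024 * 1024):
                digest.update(chunk)
        hexdigest = digest.hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
            (path, stat.st_size, stat.st_mtime_ns, hexdigest),
        )
        self.conn.commit()
        return hexdigest
```

The trade-off in this sketch is that a file edited without changing its size or modification time would return a stale digest, which is the usual compromise for metadata-based caches.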

Scan limits

from dupefinder import DupeFinder, ScanOptions

finder = DupeFinder(options=ScanOptions(
    max_files=1000,       # stop after 1000 files
    max_depth=3,          # scan at most 3 levels deep
    timeout_seconds=30.0, # stop after 30 seconds
))
report = finder.scan("./data")

CLI

# Basic scan
dupefinder ./Downloads

# JSON output
dupefinder ./Downloads --json

# Ignore files smaller than 1 MB
dupefinder ./Downloads --min-size 1MB

# Only scan images
dupefinder ./Pictures --include-ext .jpg,.jpeg,.png,.webp

# Ignore temp and log files
dupefinder . --ignore-ext .tmp,.log

# Exit with code 2 if any duplicates are found (useful in scripts/CI)
dupefinder . --fail-on-duplicates

# Strict mode: raise errors instead of skipping inaccessible files
dupefinder . --strict

# Follow symbolic links (disabled by default)
dupefinder . --follow-symlinks

Run dupefinder --help to see all options.

Exit codes

Code	Meaning
`0`	Scan completed. Non-fatal issues may still be present in the report.
`1`	Scan failed because of an invalid option, invalid path, cache error, or strict-mode error.
`2`	Scan completed and duplicates were found while `--fail-on-duplicates` was enabled.
`3`	Scan was cancelled or stopped by the configured timeout.

Notes:

--strict turns otherwise non-fatal file access errors into exit code 1.
JSON output can contain an issues list even when the exit code is 0.
Exit code 3 takes priority over --fail-on-duplicates.
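
A script or CI job could interpret these exit codes roughly as follows. This is a sketch that assumes the `dupefinder` command is on `PATH`; the helper names `describe_exit_code` and `run_check` are hypothetical, and the short labels are paraphrased from the table above.

```python
import subprocess

# Short labels paraphrased from the exit-code table.
EXIT_MEANINGS = {
    0: "completed",
    1: "failed",
    2: "duplicates found",
    3: "cancelled or timed out",
}


def describe_exit_code(code: int) -> str:
    """Translate a dupefinder exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_check(path: str) -> str:
    """Run the CLI with --fail-on-duplicates and describe the result."""
    result = subprocess.run(
        ["dupefinder", path, "--fail-on-duplicates"],
        capture_output=True,
        text=True,
        check=False,  # non-zero codes are expected outcomes here, not errors
    )
    return describe_exit_code(result.returncode)
```

Passing `check=False` keeps codes 2 and 3 from being raised as exceptions, so the caller can decide which outcomes should fail the build.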

CLI reference

Flag	Description
`path`	File or directory to scan
`--algorithm`	Hash algorithm (default: `sha256`)
`--chunk-size`	Read chunk size, e.g. `1MB` (default: 1 MiB)
`--min-size`	Skip files smaller than this, e.g. `10KB`
`--max-size`	Skip files larger than this, e.g. `5GB`
`--include-ext`	Only scan these extensions, e.g. `.jpg,.png`
`--ignore-ext`	Skip these extensions, e.g. `.tmp,.log`
`--no-ignore-hidden`	Do not skip hidden dotfiles and dotfolders
`--follow-symlinks`	Follow symbolic links
`--max-files N`	Stop after discovering N files
`--max-depth N`	Maximum directory depth to scan
`--timeout SECONDS`	Stop scan after this many seconds
`--cache PATH`	SQLite cache file for file hashes
`--progress`	Print progress to stderr
`--strict`	Raise errors instead of skipping bad files
`--json`	Print JSON output
`--fail-on-duplicates`	Exit with code `2` when duplicates are found
`--version`	Show version and exit

API summary

Symbol	Description
`find_duplicates(path, options)`	Return a tuple of `DuplicateGroup`
`scan(path, options)`	Return a full `ScanReport`
`DupeFinder`	Engine with events, progress, cache, and cancellation
`ScanEvent`	Typed event emitted during scanning
`ScanProgress`	Simplified progress snapshot for the `on_progress` callback
`ScanOptions`	Frozen dataclass with all scan settings
`ScanReport`	Result of a scan — groups, counts, issues, bytes read
`DuplicateGroup`	One group of files with identical content
`FileInfo`	Path and size of a single file
`ScanIssue`	A non-fatal error recorded during a scan
`SQLiteHashCache`	SQLite-backed hash cache — import from `dupefinder.cache`

JSON schema

All JSON output includes a schema_version field (currently "1.1") for forward compatibility.

See docs/api.md for the full reference.
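
A consumer of the JSON output might guard on `schema_version` before reading anything else. The sketch below assumes that a change in the first number of the version marks a breaking change; that versioning policy is an assumption, and `load_report` and `SUPPORTED_MAJOR` are hypothetical names. Only the `schema_version` field is taken from the text above.

```python
import json

SUPPORTED_MAJOR = 1  # assumption: a change in the first number is a breaking change


def load_report(raw: str) -> dict:
    """Parse JSON output and reject schema versions this script was not written for."""
    data = json.loads(raw)
    version = str(data.get("schema_version", ""))
    major = version.split(".", 1)[0]
    if not major.isdigit() or int(major) != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return data
```

Under this assumption, output tagged "1.1" or a later "1.x" is accepted, while a missing field or a "2.0" report raises `ValueError`.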

Project structure

src/dupefinder/
├── api.py        public functions: scan, find_duplicates
├── cli.py        terminal command
├── engine.py     DupeFinder class: events, cancellation, cache
├── events.py     ScanEvent dataclass
├── cache.py      HashCache protocol and SQLiteHashCache
├── scanner.py    file discovery (os.scandir, loop detection, max_depth)
├── hashing.py    chunked file hashing with optional cache
├── grouping.py   group by size then by hash
├── filters.py    ignore/include rules
├── models.py     frozen dataclasses
├── report.py     text and JSON output
├── safety.py     path/options validation, helpers
├── constants.py  default values
└── errors.py     custom exceptions

Safety and design constraints

dupefinder is intentionally read-only:

Does not delete, move, or rename files.
Does not connect to the internet.
Does not follow symbolic links by default. Pass --follow-symlinks or ScanOptions(follow_symlinks=True) to opt in.
Reads files in chunks — no large allocations.
Permission errors are recorded and skipped by default.
SQLite cache writes occur only when the user explicitly passes --cache PATH or constructs SQLiteHashCache. The cache file is written to the path chosen by the user.
Detects exact duplicates only — files with identical byte content. Near-duplicates, similar images, or renamed files are not detected.

See SECURITY.md for more details.

Running tests

pytest

With coverage:

pytest --cov=dupefinder --cov-report=term-missing

Contributing

Contributions are welcome. Please open an issue first to discuss what you want to change.

Fork the repository.
Create a branch: git checkout -b feature/your-feature.
Make your changes and add tests.
Run pytest and make sure all tests pass.
Open a pull request.

License

MIT — Igor Souza

Project details

Maintainers

igors93

This version: 0.4.0 (Jun 5, 2026)

Download files

Source Distribution

dupefinder-0.4.0.tar.gz (36.7 kB)

Uploaded Jun 5, 2026 Source

Built Distribution

dupefinder-0.4.0-py3-none-any.whl (23.4 kB)

Uploaded Jun 5, 2026 Python 3

File details

Details for the file dupefinder-0.4.0.tar.gz.

File metadata

Download URL: dupefinder-0.4.0.tar.gz
Upload date: Jun 5, 2026
Size: 36.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dupefinder-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`044b3e35267f4c082665b5a6513dc04f922ecdaa803557f155a8d1beeb585995`
MD5	`0d15cf32b66e37967c9af27cd95f4c3a`
BLAKE2b-256	`9a93bc9f6a9bd05a18aa05bb0bb2c3e89a4e958e9847821516f0d8e188b77825`

Provenance

The following attestation bundles were made for dupefinder-0.4.0.tar.gz:

Publisher: publish.yml on igors93/dupefinder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dupefinder-0.4.0.tar.gz
- Subject digest: 044b3e35267f4c082665b5a6513dc04f922ecdaa803557f155a8d1beeb585995
- Sigstore transparency entry: 1736456787
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: igors93/dupefinder@5fe287c416ed89b7df11f2d0aa7eb04295b10cf9
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/igors93
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5fe287c416ed89b7df11f2d0aa7eb04295b10cf9
- Trigger Event: release

File details

Details for the file dupefinder-0.4.0-py3-none-any.whl.

File metadata

Download URL: dupefinder-0.4.0-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 23.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dupefinder-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d8bdc75a28f770881d8973c3e842d512a7ba5ae3c1d46d692db3adda7830d032`
MD5	`98ef63b85f0fc7b1be99808540646da2`
BLAKE2b-256	`6885857560818a6367441f161ed293c13759d85753f2e535bfad231866df478b`

Provenance

The following attestation bundles were made for dupefinder-0.4.0-py3-none-any.whl:

Publisher: publish.yml on igors93/dupefinder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dupefinder-0.4.0-py3-none-any.whl
- Subject digest: d8bdc75a28f770881d8973c3e842d512a7ba5ae3c1d46d692db3adda7830d032
- Sigstore transparency entry: 1736457007
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: igors93/dupefinder@5fe287c416ed89b7df11f2d0aa7eb04295b10cf9
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/igors93
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5fe287c416ed89b7df11f2d0aa7eb04295b10cf9
- Trigger Event: release
