philiprehberger-duplicate-finder

Content-hash duplicate file detection with two-pass efficiency.

Installation

pip install philiprehberger-duplicate-finder

Usage

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
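The group attributes used above (count, wasted_bytes) are enough to build a short summary of a scan. The following is a minimal sketch: summarize is a hypothetical helper, not part of the package, and the sample objects are stand-ins with made-up values so the snippet runs without scanning a real directory.

```python
from types import SimpleNamespace


def summarize(groups):
    """Return (number of groups, redundant copies, total wasted bytes)."""
    groups = list(groups)
    redundant = sum(g.count - 1 for g in groups)
    wasted = sum(g.wasted_bytes for g in groups)
    return len(groups), redundant, wasted


# Stand-in objects with made-up values (hypothetical data).
# In real use, pass the result of find_duplicates() instead.
sample = [
    SimpleNamespace(size=2048, count=3, wasted_bytes=4096,
                    paths=["a.pdf", "b.pdf", "c.pdf"]),
    SimpleNamespace(size=100, count=2, wasted_bytes=100,
                    paths=["x.txt", "y.txt"]),
]

n_groups, redundant, wasted = summarize(sample)
print(f"{n_groups} groups, {redundant} redundant copies, {wasted} bytes reclaimable")
```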

How it works

Two-pass approach for efficiency:

  1. Groups files by size (fast — eliminates most files immediately)
  2. Hashes only size-matched files (uses partial hashing for large files first)

Hard links to the same file are automatically detected and excluded from duplicate results.
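One common way to recognize hard links is to compare the device and inode numbers reported by os.stat, since every hard link to a file shares them. The sketch below shows that idea; drop_hard_links is a hypothetical helper, and whether the package uses this exact check is an assumption.

```python
import os


def drop_hard_links(paths):
    """Keep one path per underlying file, identified by (device, inode)."""
    seen = set()
    unique = []
    for path in paths:
        info = os.stat(path)
        identity = (info.st_dev, info.st_ino)
        if identity in seen:
            continue  # another name for a file that is already listed
        seen.add(identity)
        unique.append(path)
    return unique
```

Two hard links occupy the same disk blocks, so reporting them as duplicates would overstate the space that deleting one of them frees.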

API

| Function / Class | Description |
| --- | --- |
| find_duplicates(paths, *, min_size, max_size, extensions, exclude_patterns, algorithm, recursive, follow_symlinks, on_progress) | Scan directories for duplicate files and return groups |

| Option | Default | Description |
| --- | --- | --- |
| min_size | 1 | Minimum file size in bytes |
| max_size | None | Maximum file size in bytes |
| extensions | None | Filter by extensions |
| exclude_patterns | None | Directory/file patterns to skip (e.g., [".git", "node_modules"]) |
| algorithm | "sha256" | Hash algorithm (sha256, md5, sha1) |
| recursive | True | Scan subdirectories |
| follow_symlinks | False | Follow symbolic links |
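To make the filter options concrete, here is a rough sketch of how such filters could be applied to a single file. passes_filters is a hypothetical helper, and its matching rules are assumptions rather than documented behavior: extensions are compared case-insensitively, and an exclude pattern is matched against each path component with fnmatch.

```python
import fnmatch
from pathlib import PurePath


def passes_filters(path, size, *, min_size=1, max_size=None,
                   extensions=None, exclude_patterns=None):
    """Return True if a file of `size` bytes at `path` should be scanned."""
    p = PurePath(path)
    if size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    if extensions is not None:
        # Assumption: extension matching ignores case.
        if p.suffix.lower() not in {ext.lower() for ext in extensions}:
            return False
    if exclude_patterns:
        # Assumption: a pattern excludes a file when it matches any path component.
        for part in p.parts:
            if any(fnmatch.fnmatch(part, pat) for pat in exclude_patterns):
                return False
    return True
```

With the default min_size of 1, empty files are skipped, which avoids reporting every zero-byte file as a duplicate of every other.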

Development

pip install -e .
python -m pytest tests/ -v

License

MIT

Download files

Source Distribution

philiprehberger_duplicate_finder-0.2.2.tar.gz (6.0 kB)

Built Distribution

File details

Details for the file philiprehberger_duplicate_finder-0.2.2.tar.gz.

File hashes

Hashes for philiprehberger_duplicate_finder-0.2.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 70cf3c90c537cf7ee5d62a4f2dfe4db7ec0dac6fa6a7d97aca5cb72f00ad6138 |
| MD5 | e12ee0a9da6cc8830aa3d9d7e899055e |
| BLAKE2b-256 | 935c0bd23bf143d946adcf2b8ffbdfb76b0d0cd740e9b232d5401af58ee5c490 |

File details

Details for the file philiprehberger_duplicate_finder-0.2.2-py3-none-any.whl.

File hashes

Hashes for philiprehberger_duplicate_finder-0.2.2-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 2074a3bd0198f3105371959b3b2c588d5920f08688cb14ee232accec878c4d3a |
| MD5 | b681b524017538e4e4a912dcf9913a22 |
| BLAKE2b-256 | fcb3af675b9103a39ad6f5125f6cbb8b9ccaa7b36b0dc8fee0f07c943a5bce1b |
