philiprehberger-duplicate-finder

Content-hash duplicate file detection with two-pass efficiency.

Install

pip install philiprehberger-duplicate-finder

Usage

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
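
The group attributes used above (`size`, `count`, `wasted_bytes`, `paths`) are enough to build a small summary report. The following is a minimal sketch and not part of the library: `summarize` is a hypothetical helper, and `FakeGroup` is a stand-in class that exists only so the snippet runs without scanning a disk. The stand-in assumes that every copy beyond the first counts as wasted space.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGroup:
    """Stand-in for a result group; NOT the library's own class."""
    size: int
    paths: list = field(default_factory=list)

    @property
    def count(self):
        return len(self.paths)

    @property
    def wasted_bytes(self):
        # Assumption: every copy beyond the first is wasted space.
        return self.size * (self.count - 1)


def summarize(groups):
    """Return (number of groups, redundant copies, total wasted bytes)."""
    groups = list(groups)
    redundant = sum(g.count - 1 for g in groups)
    wasted = sum(g.wasted_bytes for g in groups)
    return len(groups), redundant, wasted


demo = [
    FakeGroup(size=2048, paths=["a.pdf", "b.pdf", "c.pdf"]),
    FakeGroup(size=100, paths=["x.jpg", "y.jpg"]),
]
print(summarize(demo))  # (2, 3, 4196)
```

In real use, the list returned by `find_duplicates` would presumably be passed to `summarize` in place of `demo`.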

How It Works

Two-pass approach for efficiency:

  1. Groups files by size. This is fast and eliminates most files immediately, because a file with a unique size cannot have a duplicate.
  2. Hashes only the size-matched files, using partial hashing first for large files.
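
The two passes can be illustrated with the standard library alone. This is a sketch of the general technique under stated assumptions, not the package's actual implementation: `sketch_find_duplicates`, `_hash_file`, `_split`, and the `PARTIAL_BYTES` prefix size are all hypothetical names and values. For simplicity the sketch applies the partial hash to every size-matched file, not only to large ones; for files shorter than the prefix the partial hash simply equals the full hash.

```python
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 4096  # hypothetical prefix size, not taken from the library


def _hash_file(path, algorithm="sha256", limit=None):
    """Hash a whole file, or only its first `limit` bytes when limit is set."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        if limit is not None:
            digest.update(handle.read(limit))
        else:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()


def _split(paths, key):
    """Bucket paths by key(path) and keep only buckets with 2+ members."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]


def sketch_find_duplicates(paths, algorithm="sha256"):
    """Illustrative two-pass duplicate search over an iterable of file paths."""
    duplicates = []
    # Pass 1: group by size; a file with a unique size cannot have a duplicate.
    for same_size in _split(paths, os.path.getsize):
        # Pass 2a: cheap partial hash of the first PARTIAL_BYTES bytes.
        partial = lambda p: _hash_file(p, algorithm, PARTIAL_BYTES)
        for same_prefix in _split(same_size, partial):
            # Pass 2b: full hash only for files whose prefixes match.
            full = lambda p: _hash_file(p, algorithm)
            duplicates.extend(_split(same_prefix, full))
    return duplicates
```

Each stage only narrows the candidates left by the previous one, so the expensive full read is reserved for files that already agree on both size and prefix.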

Options

| Option | Default | Description |
| --- | --- | --- |
| min_size | 1 | Minimum file size in bytes |
| max_size | None | Maximum file size in bytes |
| extensions | None | Filter by extensions |
| algorithm | "sha256" | Hash algorithm (sha256, md5, sha1) |
| recursive | True | Scan subdirectories |
| follow_symlinks | False | Follow symbolic links |
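
To make the filtering options concrete, here is one way such filters could be applied while walking a directory tree. This is a sketch, not the library's code: `iter_candidates` is a hypothetical function whose parameter names and defaults mirror the options listed above. Treating the size bounds as inclusive, matching extensions case-insensitively, and expanding `~` in the root path are assumptions made for the example.

```python
import os


def iter_candidates(root, min_size=1, max_size=None, extensions=None,
                    recursive=True, follow_symlinks=False):
    """Yield file paths under root that pass the size and extension filters."""
    root = os.path.expanduser(root)  # assumption: "~" should be expanded
    # Assumption: extensions are compared case-insensitively.
    wanted = {ext.lower() for ext in extensions} if extensions else None
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        if not recursive:
            dirnames.clear()  # prune in place so os.walk does not descend
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not follow_symlinks and os.path.islink(path):
                continue
            if wanted is not None and os.path.splitext(name)[1].lower() not in wanted:
                continue
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            # Assumption: both size bounds are inclusive.
            if size < min_size:
                continue
            if max_size is not None and size > max_size:
                continue
            yield path
```

With the default `min_size` of 1, empty files are skipped, which avoids reporting every zero-byte file as a duplicate of every other.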

Development

pip install -e .
python -m pytest tests/ -v

License

MIT

Download files

Source Distribution

philiprehberger_duplicate_finder-0.1.2.tar.gz (4.2 kB)

Uploaded Source

Built Distribution

philiprehberger_duplicate_finder-0.1.2-py3-none-any.whl

File details

Details for the file philiprehberger_duplicate_finder-0.1.2.tar.gz.

File hashes

Hashes for philiprehberger_duplicate_finder-0.1.2.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 46cd4e39bea5ace69fb1ca6e7022134661576e3ae091c68d36ae880fa8c66dc3 |
| MD5 | 45a142ad66f5ffb73361e9f65a90dd09 |
| BLAKE2b-256 | 635aeb31cb40bea84e086fcc3c2d40b4220bf88129ab65b5cfcd74df95ca3e61 |

File details

Details for the file philiprehberger_duplicate_finder-0.1.2-py3-none-any.whl.

File hashes

Hashes for philiprehberger_duplicate_finder-0.1.2-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 466091edc1fdddf4054adbe658346179a73b0ab137ae503b8e8ab88f1302c4c9 |
| MD5 | f8454d9d637e629345a9711d80a57052 |
| BLAKE2b-256 | 1fdcc1308e4075ce4aeaa6424a48fcc0474a99fd28cac5655eedd11261032db9 |
