philiprehberger-duplicate-finder

Content-hash duplicate file detection with two-pass efficiency.

Install

pip install philiprehberger-duplicate-finder

Usage

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
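
To total the reclaimable space across all groups, the attributes used above can be summed. The following is a minimal sketch, not the library's code: `Group` is a hypothetical stand-in class so the snippet runs without the package, and it assumes `wasted_bytes` counts every copy beyond the first. The helpers `total_wasted` and `largest_first` are illustrative names as well.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in mirroring the attributes used in the examples above."""
    size: int
    paths: list[str] = field(default_factory=list)

    @property
    def count(self) -> int:
        return len(self.paths)

    @property
    def wasted_bytes(self) -> int:
        # Assumption: every copy beyond the first is considered wasted.
        return self.size * (self.count - 1)


def total_wasted(groups) -> int:
    """Sum the reclaimable bytes over all duplicate groups."""
    return sum(g.wasted_bytes for g in groups)


def largest_first(groups):
    """Order groups so the biggest potential savings come first."""
    return sorted(groups, key=lambda g: g.wasted_bytes, reverse=True)
```

Both helpers rely only on the `wasted_bytes` attribute, so they should work on the objects returned by `find_duplicates` as long as that attribute behaves as shown in the usage example.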

How It Works

Two-pass approach for efficiency:

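  1. Groups files by size. This is fast and eliminates most files immediately, because a file with a unique size cannot have a duplicate.
  2. Hashes only the size-matched files. Large files are partially hashed first, before any full hash is computed.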
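
The two passes can be sketched with the standard library alone. This is an illustration of the general technique, not the package's actual implementation: the function names, the 4096-byte prefix length, and the choice to prefix-hash every file (rather than only large ones) are assumptions made to keep the example short.

```python
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 4096  # assumed prefix length; not taken from the library


def _hash_file(path, algorithm="sha256", limit=None):
    """Hash a whole file, or only its first `limit` bytes."""
    h = hashlib.new(algorithm)
    remaining = limit
    with open(path, "rb") as f:
        while True:
            chunk_size = 65536 if remaining is None else min(65536, remaining)
            if chunk_size == 0:
                break
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def _group_by(paths, key):
    """Bucket paths by key(path) and keep only buckets with two or more entries."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]


def find_duplicates_sketch(paths, algorithm="sha256"):
    """Return lists of paths whose contents are identical."""
    result = []
    # Pass 1: a file with a unique size cannot have a duplicate.
    for same_size in _group_by(paths, os.path.getsize):
        # Pass 2a: a cheap prefix hash splits each size bucket further.
        for same_prefix in _group_by(
            same_size, lambda p: _hash_file(p, algorithm, PARTIAL_BYTES)
        ):
            # Pass 2b: a full hash confirms the remaining candidates.
            result.extend(
                _group_by(same_prefix, lambda p: _hash_file(p, algorithm))
            )
    return result
```

Files that differ in size never get opened, and files that differ within the prefix are read only up to `PARTIAL_BYTES`, which is where the savings come from on large directories.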

Options

| Option | Default | Description |
| --- | --- | --- |
| min_size | 1 | Minimum file size in bytes |
| max_size | None | Maximum file size in bytes |
| extensions | None | Filter by extensions |
| exclude_patterns | None | Directory/file patterns to skip (e.g., [".git", "node_modules"]) |
| algorithm | "sha256" | Hash algorithm (sha256, md5, sha1) |
| recursive | True | Scan subdirectories |
| follow_symlinks | False | Follow symbolic links |

Hard links to the same file are automatically detected and excluded from duplicate results.
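
One common way to recognize hard links is to compare the device and inode numbers reported by `os.stat`: two paths with the same pair refer to the same underlying file. Whether the library uses exactly this check is an assumption; the sketch below, with the hypothetical helper name `drop_hard_links`, only illustrates the idea.

```python
import os


def drop_hard_links(paths):
    """Keep one path per underlying file, identified by (device, inode)."""
    seen = set()
    unique = []
    for p in paths:
        st = os.stat(p)
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            # First time this physical file is seen: keep this path.
            seen.add(key)
            unique.append(p)
    return unique
```

Filtering this way before hashing avoids reporting two names for one file as a duplicate, since deleting either name would free no space.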

License

MIT
