philiprehberger-duplicate-finder

Content-hash duplicate file detection with two-pass efficiency.

Install

pip install philiprehberger-duplicate-finder

Usage

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
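
The group attributes used above (`size`, `count`, `wasted_bytes`) can be rolled up into a single summary. This is a minimal sketch: `summarize` is a hypothetical helper, not part of the package, and the stand-in objects only mimic the attributes shown in the usage example so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def summarize(groups):
    """Return (group_count, redundant_file_count, total_wasted_bytes).

    Assumes each group exposes `count` and `wasted_bytes`, as in the
    usage example. Hypothetical helper, not provided by the package.
    """
    groups = list(groups)
    redundant = sum(g.count - 1 for g in groups)  # copies beyond the first
    wasted = sum(g.wasted_bytes for g in groups)
    return len(groups), redundant, wasted


# Stand-in groups for illustration; real groups would come from find_duplicates().
demo = [
    SimpleNamespace(size=2048, count=3, wasted_bytes=4096, paths=["a", "b", "c"]),
    SimpleNamespace(size=100, count=2, wasted_bytes=100, paths=["d", "e"]),
]

print(summarize(demo))  # (2, 3, 4196)
```

With the library installed, `summarize(find_duplicates("~/Documents"))` should work the same way, provided the returned groups carry those attributes.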

How It Works

Two-pass approach for efficiency:

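  1. Groups files by size. This is fast and eliminates most files immediately, because files of different sizes cannot be identical.
  2. Hashes only the files that share a size with at least one other file. Large files are partially hashed first, so a full hash is computed only when the partial hashes match.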
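
The two passes above can be sketched with the standard library alone. This is an illustration of the general technique, not the package's actual implementation: the function names, the `PARTIAL_BYTES` threshold, and the choice to skip symlinks are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 4096  # assumed prefix length for the partial hash


def hash_file(path, algorithm="sha256", limit=None):
    """Hash the whole file, or only its first `limit` bytes if given."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        if limit is not None:
            digest.update(handle.read(limit))
        else:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()


def group_by(paths, key):
    """Bucket paths by key(path) and keep only buckets with 2+ entries."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [bucket for bucket in buckets.values() if len(bucket) > 1]


def find_duplicates_sketch(root, algorithm="sha256", min_size=1):
    """Return lists of paths under `root` whose contents are identical."""
    files = [
        os.path.join(folder, name)
        for folder, _, names in os.walk(root)
        for name in names
    ]
    # Skip symlinks and files below min_size (illustrative defaults).
    files = [
        f for f in files
        if os.path.isfile(f) and not os.path.islink(f)
        and os.path.getsize(f) >= min_size
    ]

    duplicates = []
    # Pass 1: group by size; files with a unique size are dropped here.
    for same_size in group_by(files, os.path.getsize):
        # Pass 2a: cheap partial hash over the first PARTIAL_BYTES bytes.
        partial = lambda p: hash_file(p, algorithm, PARTIAL_BYTES)
        for same_prefix in group_by(same_size, partial):
            # Pass 2b: full hash only for files that still collide.
            full = lambda p: hash_file(p, algorithm)
            duplicates.extend(group_by(same_prefix, full))
    return duplicates
```

Grouping by size costs only a `stat` call per file, so the expensive reads are reserved for files that could plausibly match.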

Options

Option           Default    Description
min_size         1          Minimum file size in bytes
max_size         None       Maximum file size in bytes
extensions       None       Filter by extensions
algorithm        "sha256"   Hash algorithm (sha256, md5, sha1)
recursive        True       Scan subdirectories
follow_symlinks  False      Follow symbolic links
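
To make the size and extension options concrete, here is one plausible way such filters could combine. `passes_filters` is a hypothetical function written for this example. Whether the library treats the size bounds as inclusive, or matches extensions case-insensitively, is not stated above, so both behaviours here are assumptions.

```python
import os


def passes_filters(path, size, min_size=1, max_size=None, extensions=None):
    """Decide whether a file would be considered under the given options.

    Hypothetical sketch: inclusive bounds and case-insensitive extension
    matching are assumptions, not documented library behaviour.
    """
    if size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    if extensions is not None:
        suffix = os.path.splitext(path)[1].lower()
        if suffix not in {ext.lower() for ext in extensions}:
            return False
    return True
```

Under these assumptions, the default `min_size=1` excludes empty files, which would otherwise all be reported as duplicates of one another.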

License

MIT

Download files

Source Distribution

philiprehberger_duplicate_finder-0.1.1.tar.gz (4.2 kB)

File details

Details for the file philiprehberger_duplicate_finder-0.1.1.tar.gz.

File hashes

Hashes for philiprehberger_duplicate_finder-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       95181a94d0b08bb2b6cb5d2cef31587e17103b96185b905f7afb465f82bf4aee
MD5          aba7ff28c6765fe3732c653a0d83d4d3
BLAKE2b-256  743108e4ec37d56f1a73ef25f04bdc3074ba4948dca8170b4974da7667c023eb

File details

Details for the file philiprehberger_duplicate_finder-0.1.1-py3-none-any.whl.

File hashes

Hashes for philiprehberger_duplicate_finder-0.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       c62a3eb2db6e3b3455bc1ed2692dc266b8b51da9d4140efce33ee5ce202ad129
MD5          f46a4f05cb60abff8b3acf4ad58c420b
BLAKE2b-256  82778e7f5edf44de306d976179b6f9b5868f48c6dba8b5ca966bcd6fac9f8471
