Content-hash duplicate file detection with two-pass efficiency

philiprehberger-duplicate-finder

Install

pip install philiprehberger-duplicate-finder

Usage

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
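The returned groups can be summarized to see how much space deduplication would reclaim. The sketch below is illustrative only: `Group` is a hypothetical stand-in with the same attribute names used above (`size`, `count`, `paths`, `wasted_bytes`), and it assumes `wasted_bytes` means the size of every copy except one. The helpers `total_wasted` and `largest_first` are not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for the objects find_duplicates() returns."""
    size: int
    paths: list = field(default_factory=list)

    @property
    def count(self):
        return len(self.paths)

    @property
    def wasted_bytes(self):
        # Assumption: one copy is kept, every other copy is redundant.
        return self.size * (self.count - 1)


def total_wasted(groups):
    """Sum the reclaimable bytes across all duplicate groups."""
    return sum(group.wasted_bytes for group in groups)


def largest_first(groups):
    """Order groups so the biggest space savings come first."""
    return sorted(groups, key=lambda group: group.wasted_bytes, reverse=True)
```

With real results, `total_wasted(find_duplicates("~/Documents"))` should work the same way, provided the returned objects expose `wasted_bytes` as shown in the usage example.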

How It Works

Two-pass approach for efficiency:

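  1. Groups files by size (fast: a file whose size is unique cannot have a duplicate, so most files are eliminated without being read)
  2. Hashes only the size-matched files (large files are hashed partially first, so a full hash is computed only when the partial hashes match)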

Options

Option           Default    Description
min_size         1          Minimum file size in bytes
max_size         None       Maximum file size in bytes
extensions       None       Filter by extensions
algorithm        "sha256"   Hash algorithm (sha256, md5, sha1)
recursive        True       Scan subdirectories
follow_symlinks  False      Follow symbolic links
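To make the filter options concrete, here is a rough sketch of how a directory scan could apply them before any hashing happens. It reuses the option names and defaults from the table, but `iter_candidates` itself is a hypothetical helper, and details such as case-insensitive extension matching and silently skipping unreadable files are assumptions rather than documented behavior.

```python
import os


def iter_candidates(root, min_size=1, max_size=None, extensions=None,
                    recursive=True, follow_symlinks=False):
    """Yield file paths under root that pass the size and extension filters."""
    # Assumption: extensions are compared case-insensitively.
    wanted = {ext.lower() for ext in extensions} if extensions else None
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        if not recursive:
            dirnames.clear()  # stops os.walk from descending into subdirectories
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not follow_symlinks and os.path.islink(path):
                continue
            if wanted is not None and os.path.splitext(name)[1].lower() not in wanted:
                continue
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            if size < min_size or (max_size is not None and size > max_size):
                continue
            yield path
```

Under this reading, the default `min_size` of 1 excludes empty files, which would otherwise all count as duplicates of one another.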

License

MIT
