philiprehberger-duplicate-finder

Content-hash duplicate file detection with two-pass efficiency.

Install

pip install philiprehberger-duplicate-finder

Usage

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)

How It Works

Two-pass approach for efficiency:

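  1. Groups files by size. This is fast and eliminates most files immediately, because files of different sizes cannot be identical.
  2. Hashes only the files that share a size with at least one other file. For large files, a partial hash is computed first.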
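
The two passes above can be sketched with the standard library alone. This is a minimal illustration of the technique, not the package's actual implementation: the function names, the 64 KiB prefix length, and the choice to partially hash every candidate (not only large files) are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed prefix length for the partial hash; the package's real value is not documented here.
PARTIAL_BYTES = 64 * 1024


def _digest(path, algorithm, limit=None):
    """Hash a whole file in chunks, or only its first `limit` bytes if given."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        if limit is not None:
            h.update(f.read(limit))
        else:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
    return h.hexdigest()


def _split(paths, key):
    """Bucket paths by key(path) and keep only buckets with two or more members."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]


def sketch_find_duplicates(root, algorithm="sha256"):
    """Hypothetical stand-in for the two-pass idea; returns lists of identical files."""
    files = [p for p in Path(root).rglob("*") if p.is_file() and not p.is_symlink()]
    groups = []
    # Pass 1: files with a unique size cannot have a duplicate, so drop them.
    for same_size in _split(files, lambda p: p.stat().st_size):
        # Pass 2a: a cheap hash of the first bytes separates most remaining files.
        for same_prefix in _split(same_size, lambda p: _digest(p, algorithm, PARTIAL_BYTES)):
            # Pass 2b: a full-content hash confirms the survivors.
            groups.extend(_split(same_prefix, lambda p: _digest(p, algorithm)))
    return groups
```

Files no larger than the prefix length are hashed twice in this sketch; a real implementation would likely skip the partial step for them.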

Options

Option           Default    Description
min_size         1          Minimum file size in bytes
max_size         None       Maximum file size in bytes
extensions       None       Filter by extensions
algorithm        "sha256"   Hash algorithm (sha256, md5, sha1)
recursive        True       Scan subdirectories
follow_symlinks  False      Follow symbolic links
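
To make the size and extension options concrete, here is one way such filters could be applied to a single candidate file. The helper name `passes_filters` is hypothetical, and two behaviours are assumptions rather than documented facts about the package: `max_size` is treated as inclusive, and extensions are matched case-insensitively with the leading dot.

```python
from pathlib import Path


def passes_filters(path, min_size=1, max_size=None, extensions=None):
    """Hypothetical helper: True if `path` survives the size and extension filters."""
    path = Path(path)
    size = path.stat().st_size
    if size < min_size:
        return False
    # Assumption: max_size is an inclusive upper bound.
    if max_size is not None and size > max_size:
        return False
    if extensions is not None:
        # Assumption: extensions carry a leading dot and match case-insensitively.
        wanted = {ext.lower() for ext in extensions}
        if path.suffix.lower() not in wanted:
            return False
    return True
```

With the default `min_size=1`, empty files are skipped, which matches the default shown in the table.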

Development

pip install -e .
python -m pytest tests/ -v

License

MIT

Download files

Source Distribution

philiprehberger_duplicate_finder-0.1.3.tar.gz (4.3 kB)

Built Distribution

philiprehberger_duplicate_finder-0.1.3-py3-none-any.whl

File details

Details for the file philiprehberger_duplicate_finder-0.1.3.tar.gz.

File hashes

Hashes for philiprehberger_duplicate_finder-0.1.3.tar.gz
Algorithm Hash digest
SHA256 8b6e97428cf5bcefaad5fecaa5214920d26ab8a54d331ebdf4815dd51aaf4834
MD5 6ed4198b9cc01ed936195456cf8459a8
BLAKE2b-256 6b9712a7019ebd9696535968090f89d58387a77245819f5793c2a7d98a51ad8b

File details

Details for the file philiprehberger_duplicate_finder-0.1.3-py3-none-any.whl.

File hashes

Hashes for philiprehberger_duplicate_finder-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 5879a5c89d194da38ae3d35f956a1b6865fa49cad12e3239e4399d5dc123f592
MD5 7b57810c9b18936215501ff3aeeed4a9
BLAKE2b-256 5b7daf33187f68bc6bbf0e30afed4778a6d40fb38e70e0974902a4bd907a1b3b
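
A downloaded file can be checked against the published SHA256 digests with Python's `hashlib`. The sketch below is illustrative only; the function names are hypothetical and it assumes the file has already been downloaded to a local path.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, expected_hex):
    """Compare a local file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published("philiprehberger_duplicate_finder-0.1.3.tar.gz", "<SHA256 from the listing above>")` should return True for an intact source distribution.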
