Content-hash duplicate file detection with two-pass efficiency

philiprehberger-duplicate-finder

Installation

pip install philiprehberger-duplicate-finder

Usage

Finding Duplicates

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")

Filtering Options

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
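
The min_size and extensions options above narrow the set of files considered. As a rough illustration of what such a filter could look like, here is a standard-library sketch. The helper name passes_filters is hypothetical and not part of this package, and the case-insensitive suffix match and inclusive size threshold are assumptions, not documented behavior.

```python
from pathlib import Path


def passes_filters(path, min_size=0, extensions=None):
    """Hypothetical predicate: True if the file is large enough and has an allowed suffix."""
    path = Path(path)
    if extensions is not None:
        # Assumption: suffixes are compared case-insensitively (".JPG" matches ".jpg").
        allowed = {ext.lower() for ext in extensions}
        if path.suffix.lower() not in allowed:
            return False
    # Assumption: min_size is an inclusive lower bound in bytes.
    return path.stat().st_size >= min_size
```

Skipping small or irrelevant files before any hashing happens is usually where most of the time is saved on large directory trees.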

Smart Keep/Delete Suggestions

for group in groups:
    # Keep the most recently modified file
    keep = group.keep_newest()
    print(f"Keep: {keep}")

    # Or keep the file whose path string is shortest (often the shallowest)
    keep = group.keep_shortest_path()
    print(f"Keep: {keep}")

    # Get the list of files safe to delete
    to_delete = group.deletable(strategy="newest")
    for path in to_delete:
        print(f"  Delete: {path}")
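
To make the two strategies concrete, the following standard-library sketch shows one way they could be expressed over a plain list of paths. The function names pick_newest, pick_shortest_path, and deletable are illustrative stand-ins, not the package's internals, and how the real methods break ties is not specified here.

```python
import os


def pick_newest(paths):
    """Return the path with the latest modification time."""
    return max(paths, key=lambda p: os.stat(p).st_mtime)


def pick_shortest_path(paths):
    """Return the path whose string form is shortest."""
    return min(paths, key=lambda p: len(str(p)))


def deletable(paths, strategy="newest"):
    """Return every path except the one the chosen strategy keeps."""
    pickers = {"newest": pick_newest, "shortest_path": pick_shortest_path}
    if strategy not in pickers:
        raise ValueError(f"unknown strategy: {strategy!r}")
    keep = pickers[strategy](paths)
    return [p for p in paths if p != keep]
```

Neither strategy removes anything by itself; both only return paths, so the caller can review the list before deleting.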

API

Function / Class                      Description
find_duplicates(paths, ...)           Find duplicate files using a two-pass size-then-hash approach
DuplicateGroup                        A group of duplicate files with paths, size, hash, count, and wasted_bytes
DuplicateGroup.keep_newest()          Return the path with the most recent modification time
DuplicateGroup.keep_shortest_path()   Return the path with the shortest string length
DuplicateGroup.deletable(strategy)    Return all paths except the one to keep ("newest" or "shortest_path")
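
The two-pass size-then-hash approach named above can be sketched with the standard library alone. This is a minimal illustration of the general technique, not the package's actual implementation: sketch_find_duplicates and hash_file are hypothetical names, and skipping symlinks and reading in 64 KiB chunks are assumptions.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Hash a file's contents in chunks so large files are not loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sketch_find_duplicates(root, algorithm="sha256"):
    """Hypothetical helper: return lists of paths whose contents are identical."""
    # Pass 1: bucket files by size; a file with a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Assumption: symlinks are skipped so a link is not reported as a copy.
            if path.is_file() and not path.is_symlink():
                by_size[path.stat().st_size].append(path)

    # Pass 2: hash only the files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[(size, hash_file(path, algorithm))].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

The size pass costs only a stat call per file, so the expensive content hashing is limited to files that could plausibly be duplicates.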

Development

pip install -e .
python -m pytest tests/ -v

Support

If you find this package useful, consider giving it a star on GitHub — it helps motivate continued maintenance and development.

License

MIT
