philiprehberger-duplicate-finder

Content-hash duplicate file detection with two-pass efficiency.

Installation

pip install philiprehberger-duplicate-finder

Usage

Finding Duplicates

from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")
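To get one overall figure instead of per-group output, the group attributes shown above can be summed. The following is a minimal sketch: `summarize` is a hypothetical helper, not part of the package, and the demo uses stand-in objects that only mimic the `size`, `count`, and `wasted_bytes` attributes. The demo values assume `wasted_bytes` equals `size * (count - 1)`, which the README does not state.

```python
from types import SimpleNamespace


def summarize(groups):
    """Return (group count, redundant file count, total wasted bytes).

    Hypothetical helper; it only reads the `count` and `wasted_bytes`
    attributes that the README shows on each group.
    """
    groups = list(groups)
    redundant = sum(g.count - 1 for g in groups)  # every copy beyond the first
    wasted = sum(g.wasted_bytes for g in groups)
    return len(groups), redundant, wasted


# Stand-ins for DuplicateGroup objects (assumed values, for illustration only)
demo = [
    SimpleNamespace(size=2048, count=3, wasted_bytes=4096),
    SimpleNamespace(size=100, count=2, wasted_bytes=100),
]

n_groups, n_redundant, n_wasted = summarize(demo)
print(f"{n_groups} groups, {n_redundant} redundant files, {n_wasted} bytes reclaimable")
```

With real results, the same helper would be called as `summarize(find_duplicates("~/Documents"))`.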

Filtering Options

# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)

# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
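Printing on every callback can flood the terminal on large directories. A possible workaround is a throttled reporter, sketched below. `make_progress_reporter` is a hypothetical helper, not part of the package; it relies only on the `(current, total)` callback signature shown above and assumes that `current` eventually reaches `total`.

```python
def make_progress_reporter(every=100, sink=print):
    """Build a callback that reports every `every` files and at the end.

    Hypothetical helper. `sink` receives the formatted string and
    defaults to print.
    """
    def report(current, total):
        if current % every == 0 or current == total:
            sink(f"{current}/{total}")
    return report


# Simulate five callback invocations without touching the file system
lines = []
reporter = make_progress_reporter(every=2, sink=lines.append)
for i in range(1, 6):
    reporter(i, 5)
print(lines)
```

The reporter would then be passed as `on_progress=make_progress_reporter(every=500)`.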

Smart Keep/Delete Suggestions

for group in groups:
    # Keep the most recently modified file
    keep = group.keep_newest()
    print(f"Keep: {keep}")

    # Or keep the file with the shortest path (shallowest)
    keep = group.keep_shortest_path()
    print(f"Keep: {keep}")

    # Get the list of files safe to delete
    to_delete = group.deletable(strategy="newest")
    for path in to_delete:
        print(f"  Delete: {path}")
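For readers curious how such strategies could work, here is a rough standalone sketch of the two selection rules. It is not the package's implementation: the function names are hypothetical, and it assumes "newest" means the largest modification time and "shortest path" means the fewest characters in the path string.

```python
import os


def pick_newest(paths):
    """Return the path with the most recent modification time."""
    return max(paths, key=os.path.getmtime)


def pick_shortest_path(paths):
    """Return the path whose string form is shortest."""
    return min(paths, key=lambda p: len(str(p)))


def deletable(paths, strategy="newest"):
    """Return every path except the one the chosen strategy keeps."""
    pickers = {"newest": pick_newest, "shortest_path": pick_shortest_path}
    keep = pickers[strategy](paths)
    return [p for p in paths if p != keep]


# The shortest-path rule needs no file access, so it can be tried directly
candidates = ["/home/user/backup/2020/report.pdf", "/home/user/report.pdf"]
print(deletable(candidates, strategy="shortest_path"))
```

Whatever strategy is used, it is worth reviewing the resulting list before deleting anything.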

API

| Function / Class | Description |
| --- | --- |
| find_duplicates(paths, ...) | Find duplicate files using a two-pass size-then-hash approach |
| DuplicateGroup | A group of duplicate files with paths, size, hash, count, and wasted_bytes |
| DuplicateGroup.keep_newest() | Return the path with the most recent modification time |
| DuplicateGroup.keep_shortest_path() | Return the path with the shortest string length |
| DuplicateGroup.deletable(strategy) | Return all paths except the one to keep ("newest" or "shortest_path") |
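The two-pass size-then-hash idea can be illustrated with the standard library alone. The sketch below is an assumption about the general technique, not the package's actual code: `file_digest` and `two_pass_duplicates` are hypothetical names. Pass one groups files by size, which is cheap; pass two hashes only files that share a size with at least one other file.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def two_pass_duplicates(root, algorithm="sha256"):
    """Return lists of paths under `root` that have identical content."""
    # Pass 1: bucket regular files by size (no file contents are read)
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files whose size matches another file's size
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path, algorithm)].append(path)
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Because most files in a typical tree have a unique size, the second pass usually reads only a small fraction of the data.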

Development

pip install -e .
python -m pytest tests/ -v

Support

If you find this project useful:

Star the repo

🐛 Report issues

💡 Suggest features

❤️ Sponsor development

🌐 All Open Source Projects

💻 GitHub Profile

🔗 LinkedIn Profile

License

MIT
