Duperemover

A Python utility for efficiently deduplicating files in directories.

Install

pip install duperemover

Usage

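from duperemover import Deduplicator

# Configure a deduplication run over one directory tree.
dedup = Deduplicator(
    directory="/path/to/directory",
    hash_algorithm="xxhash",      # hash file contents with xxhash
    replace_strategy="hardlink",  # replace duplicates with hard links
    progress=True,                # show a progress bar
)
dedup.deduplicate()  # scan the directory and handle duplicates
dedup.print_stats()  # print deduplication statistics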

CLI

duperemover --help

Command Syntax

duperemover <directory> [options]

Arguments:
  <directory>                    Directory to scan for duplicates.

Options:
  --hash-file <file>             File to store hashes (default: .hashes.db).
  --buffer-size <size>           Buffer size for hashing, in bytes (default: 65536, i.e. 64 KiB).
  --hash-algorithm <alg>         Hashing algorithm (choices: "xxhash", "blake3", "sha256"; default: "xxhash" if available).
  --replace-strategy <strategy>  Strategy for handling duplicates (choices: "hardlink", "delete", "rename"; default: "hardlink").
  --max-threads <num>            Number of threads to use for processing (default: 4).
  --sync-interval <num>          Sync interval for hashes to disk (default: 100).
  --progress                     Show a progress bar while processing files.
  --dry-run                      Simulate the deduplication process without making any changes.
  --use-bloom-filter             Use a Bloom filter to speed up duplicate checking.
  --exclude <patterns>           Exclude files matching these patterns.
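
To illustrate what the buffer size controls, here is a minimal sketch of chunked file hashing. It is not duperemover's own code: the function name file_digest is hypothetical, and it uses hashlib's sha256 from the standard library rather than xxhash or blake3, which need third-party packages. The point is only that a file is read in pieces of at most buffer_size bytes, so memory use stays bounded regardless of file size, and the digest does not depend on the chunk size.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, buffer_size: int = 65536, algorithm: str = "sha256") -> str:
    """Hash a file in fixed-size chunks (hypothetical helper, not the library's API)."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read at most buffer_size bytes per iteration; an empty read means end of file.
        while chunk := handle.read(buffer_size):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demonstration on a temporary file: the chunk size changes speed and memory
# use, but never the resulting digest.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as handle:
        handle.write(b"x" * 200_000)
    small = file_digest(sample, buffer_size=4096)
    large = file_digest(sample, buffer_size=65536)

print(small == large)
```

A larger buffer usually means fewer read calls per file, at the cost of more memory per worker thread.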

Examples

# Basic deduplication (using default hashing algorithm)
duperemover /path/to/directory

# Using SHA256 as the hashing algorithm
duperemover /path/to/directory --hash-algorithm sha256

# Simulate deduplication (dry run)
duperemover /path/to/directory --dry-run

# Create hard links for duplicates, use Bloom filter, and show progress
duperemover /path/to/directory --replace-strategy hardlink --use-bloom-filter --progress

Features

  • Hash Algorithms: Choose among xxhash, blake3, and sha256 for calculating file hashes.
  • Duplicate Handling Strategies:
    • hardlink: Replace duplicates with hard links.
    • delete: Delete duplicate files.
    • rename: Rename duplicate files by appending .duplicate to their names.
  • Multi-threading: Process files in parallel to speed up deduplication.
  • Bloom Filter: Optionally enable a Bloom filter to speed up duplicate checks by avoiding re-hashing files.
  • Exclusion Patterns: Exclude files matching specific patterns from the deduplication process.
  • Progress Bar: Optionally display a progress bar for better visibility during the deduplication process.
  • Dry Run: Run the deduplication process without making any actual changes (useful for testing).
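
For readers unfamiliar with Bloom filters, the following is a generic sketch of the data structure, not duperemover's implementation. The class name, the default sizes, and the use of salted blake2b hashes are all assumptions made for illustration. A Bloom filter answers "definitely not seen" or "possibly seen", so a tool can skip a more expensive exact lookup whenever the filter reports that a key has not been seen.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive num_hashes bit positions by salting one hash function.
        for index in range(self.num_hashes):
            salt = index.to_bytes(8, "little")
            digest = hashlib.blake2b(item, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # Every position must be set; a single clear bit proves absence.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add(b"digest-of-file-a")
print(b"digest-of-file-a" in seen)  # True: added items are always reported
```

Because false positives are possible, a "possibly seen" answer still has to be confirmed against the exact hash store; only the "definitely not seen" answer can be trusted on its own.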

API

Deduplicator

from duperemover import Deduplicator

Constructor

Deduplicator(
    directory: str,
    hash_file: str = ".hashes.db",
    buffer_size: int = 65536,
    hash_algorithm: str = "xxhash",
    replace_strategy: str = "hardlink",
    max_threads: int = 4,
    sync_interval: int = 100,
    progress: bool = False,
    dry_run: bool = False,
    exclude_patterns: list[str] | None = None,
    use_bloom_filter: bool = False,
)

Methods

  • deduplicate(): Scan the directory for duplicates and handle each one according to replace_strategy.
  • print_stats(): Print deduplication statistics.
  • count_files(directory): Count the number of files in a directory.
  • get_file_hash(file_path): Calculate and return the hash of a file.
  • are_same_file(file1, file2): Check if two files are the same based on their inodes.
  • create_hard_link(source, target): Create a hard link from the source file to the target file.
  • delete_duplicate(file_path): Delete a duplicate file.
  • rename_duplicate(file_path): Rename a duplicate file by appending .duplicate.
  • is_excluded(file_path): Check if a file matches any exclusion pattern.
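
The ideas behind several of these methods can be sketched with the standard library alone. The helpers below (same_inode, replace_with_hard_link, matches_any) are hypothetical names and are not the library's implementation; the ".linktmp" suffix and the choice to match patterns against the file name only are likewise assumptions. The sketch shows an inode comparison, a hard-link replacement that never leaves the duplicate path missing, and shell-style exclusion matching.

```python
import fnmatch
import os
import tempfile


def same_inode(path_a: str, path_b: str) -> bool:
    """True when both paths refer to the same file (same device and inode)."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def replace_with_hard_link(original: str, duplicate: str) -> None:
    """Replace duplicate with a hard link to original."""
    if same_inode(original, duplicate):
        return  # already linked, nothing to do
    temp_path = duplicate + ".linktmp"  # hypothetical temporary name
    os.link(original, temp_path)        # new link to the original's data
    os.replace(temp_path, duplicate)    # rename over the duplicate in one step


def matches_any(path: str, patterns: list[str]) -> bool:
    """True when the file name matches any shell-style pattern."""
    name = os.path.basename(path)
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


# Demonstration: two files with equal content become one inode.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    for path in (first, second):
        with open(path, "w") as handle:
            handle.write("same content")
    before = same_inode(first, second)
    replace_with_hard_link(first, second)
    after = same_inode(first, second)

print(before, after)
print(matches_any("/data/cache.tmp", ["*.tmp", "*.log"]))
```

Hard links only work within a single filesystem, so a sketch like this would need a fallback, or an error, when the two paths live on different devices.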

Development

git clone https://github.com/daedalus/duperemover.git
cd duperemover
pip install -e ".[test]"

# run tests
pytest

# format
ruff format src/ tests/

# lint
ruff check src/ tests/

# type check
mypy src/
