Duperemover
A Python utility for efficiently deduplicating files in directories.
Install
pip install duperemover
Usage
from duperemover import Deduplicator
dedup = Deduplicator(
directory="/path/to/directory",
hash_algorithm="xxhash",
replace_strategy="hardlink",
progress=True,
)
dedup.deduplicate()
dedup.print_stats()
CLI
duperemover --help
Command Syntax
duperemover <directory> [options]
Arguments:
<directory> Directory to scan for duplicates.
--hash-file <file> File to store hashes (default: .hashes.db).
--buffer-size <size> Buffer size in bytes for hashing (default: 65536, i.e. 64 KiB).
--hash-algorithm <alg> Hashing algorithm (choices: "xxhash", "blake3", "sha256", default: "xxhash" if available).
--replace-strategy <strategy> Strategy for handling duplicates (choices: "hardlink", "delete", "rename", default: "hardlink").
--max-threads <num> Number of threads to use for processing (default: 4).
--sync-interval <num> Sync interval for hashes to disk (default: 100).
--progress Show a progress bar while processing files.
--dry-run Simulate the deduplication process without making any changes.
--use-bloom-filter Use Bloom filter to speed up duplicate checking.
--exclude PATTERNS Exclude files matching these patterns.
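To make the role of --buffer-size concrete, here is a minimal sketch of chunked file hashing. It is not duperemover's own code: the function name file_digest is hypothetical, and it uses hashlib's sha256 from the standard library in place of xxhash or blake3 so that it runs without extra packages.

```python
import hashlib


def file_digest(path: str, buffer_size: int = 65536, algorithm: str = "sha256") -> str:
    """Hash a file in fixed-size chunks so memory use stays bounded.

    Hypothetical helper for illustration; not part of the duperemover API.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := handle.read(buffer_size):
            hasher.update(chunk)
    return hasher.hexdigest()
```

The buffer size does not change the resulting digest. It only controls how many bytes are read per call, so a larger value means fewer reads at the cost of more memory per worker thread.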
Examples
# Basic deduplication (using default hashing algorithm)
duperemover /path/to/directory
# Using SHA256 as the hashing algorithm
duperemover /path/to/directory --hash-algorithm sha256
# Simulate deduplication (dry run)
duperemover /path/to/directory --dry-run
# Create hard links for duplicates, use Bloom filter, and show progress
duperemover /path/to/directory --replace-strategy hardlink --use-bloom-filter --progress
Features
- Hash Algorithms: Choose between xxhash, blake3, and sha256 for calculating file hashes.
- Duplicate Handling Strategies:
  - hardlink: Replace duplicates with hard links.
  - delete: Delete duplicate files.
  - rename: Rename duplicate files by appending .duplicate to their names.
- Multi-threading: Process files in parallel to speed up deduplication.
- Bloom Filter: Optionally, enable the Bloom filter to speed up duplicate checks by avoiding re-hashing files.
- Exclusion Patterns: Exclude files matching specific patterns from the deduplication process.
- Progress Bar: Optionally display a progress bar for better visibility during the deduplication process.
- Dry Run: Run the deduplication process without making any actual changes (useful for testing).
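The Bloom filter feature can be pictured with a small standalone sketch. How duperemover wires its filter internally is not described here, so treat everything below as an assumption: the BloomFilter class and find_repeats function are hypothetical names, and a plain Python set stands in for the persistent hash store. The idea shown is that a filter miss is definitive, so the slower exact lookup is needed only on a filter hit.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive the bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def find_repeats(digests: list[str]) -> list[str]:
    """Return digests that occur again after their first appearance."""
    seen_filter = BloomFilter()
    seen_exact: set[str] = set()  # stand-in for the on-disk hash store (assumption)
    repeats: list[str] = []
    for digest in digests:
        # The exact lookup runs only when the filter reports a possible hit.
        if digest in seen_filter and digest in seen_exact:
            repeats.append(digest)
        else:
            seen_filter.add(digest)
            seen_exact.add(digest)
    return repeats
```

Because a false positive is always confirmed against the exact store, the filter affects speed only, never which files are reported as duplicates.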
API
Deduplicator
from duperemover import Deduplicator
Constructor
Deduplicator(
directory: str,
hash_file: str = ".hashes.db",
buffer_size: int = 65536,
hash_algorithm: str = "xxhash",
replace_strategy: str = "hardlink",
max_threads: int = 4,
sync_interval: int = 100,
progress: bool = False,
dry_run: bool = False,
exclude_patterns: list[str] | None = None,
use_bloom_filter: bool = False,
)
Methods
- deduplicate(): Scan the directory for duplicates and process each file.
- print_stats(): Print deduplication statistics.
- count_files(directory): Count the number of files in a directory.
- get_file_hash(file_path): Calculate and return the hash of a file.
- are_same_file(file1, file2): Check if two files are the same based on their inodes.
- create_hard_link(source, target): Create a hard link from the source file to the target file.
- delete_duplicate(file_path): Delete a duplicate file.
- rename_duplicate(file_path): Rename a duplicate file by appending .duplicate.
- is_excluded(file_path): Check if a file matches any exclusion pattern.
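The inode check, hard-link replacement, and exclusion test described above can be sketched with the standard library alone. These are standalone illustrations, not the package's implementation. The function replace_with_hard_link, its dry_run parameter, the temporary ".tmp-link" suffix, and the choice of glob-style matching on the file name are all assumptions.

```python
import fnmatch
import os


def are_same_file(path_a: str, path_b: str) -> bool:
    """True when both paths refer to the same inode on the same device."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def replace_with_hard_link(source: str, target: str, dry_run: bool = False) -> bool:
    """Replace target with a hard link to source.

    Returns True if a change was made (or would be made in a dry run).
    """
    if are_same_file(source, target):
        return False  # already hard-linked, nothing to do
    if dry_run:
        return True
    temp_link = target + ".tmp-link"  # hypothetical temporary name
    os.link(source, temp_link)
    # Swap the link into place so target never goes missing in between.
    os.replace(temp_link, target)
    return True


def is_excluded(path: str, patterns: list[str]) -> bool:
    """Glob-style match against the file name (matching rule is an assumption)."""
    name = os.path.basename(path)
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
```

Comparing the device number together with the inode matters because inode numbers are unique only within one filesystem. Hard links also cannot cross filesystems, so os.link raises OSError when source and target live on different mounts.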
Development
git clone https://github.com/daedalus/duperemover.git
cd duperemover
pip install -e ".[test]"
# run tests
pytest
# format
ruff format src/ tests/
# lint
ruff check src/ tests/
# type check
mypy src/