Duplicate File Finder
A Python library to find and manage duplicate files. It scans directories, identifies duplicate files using hash algorithms, stores the information in a SQLite database, and provides tools to manage and delete duplicates.
Features
- 🔍 Fast duplicate detection using SHA256 or MD5 hashing
- 💾 SQLite database for storing file information
- 🗑️ Safe deletion with dry-run mode and confirmation prompts
- 🖥️ Command-line interface for easy automation
- 📊 Statistics about scanned files and duplicates
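The hash-based detection listed above can be illustrated with a minimal standalone sketch. It is not the library's actual implementation: the helper names `hash_file` and `group_duplicates` and the 64 KiB chunk size are assumptions chosen for illustration. It reads each file in chunks so large files do not need to fit in memory, then groups paths by digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(root, algorithm="sha256"):
    """Map each digest to the sorted list of files under root that share it."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[hash_file(path, algorithm)].append(str(path))
    # Only digests seen more than once describe duplicate content.
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}
```

Passing `algorithm="md5"` selects the faster but weaker hash; SHA256 is the safer choice when accidental collisions must be ruled out.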
Documentation
Documentation is hosted on readthedocs.org
Installation
Install from PyPI:
pip install dup-file-finder
Or install from source:
git clone https://github.com/barrust/dup-file-finder.git
cd dup-file-finder
pip install -e .
Quick Start
Using the CLI
1. Scan a directory
dupFileFinder scan /path/to/directory
2. Find duplicates
dupFileFinder find --show-all
3. View statistics
dupFileFinder stats
4. Delete duplicates (dry run)
dupFileFinder delete --dry-run
5. Delete duplicates (for real)
dupFileFinder delete --confirm
Using as a Library
from dup_file_finder import DuplicateFileFinder
# Initialize the finder
finder = DuplicateFileFinder(db_path="my_duplicates.db")
# Scan a directory
count = finder.scan_directory("/path/to/directory", recursive=True)
print(f"Scanned {count} files")
# Find duplicates
duplicates = finder.find_duplicates()
for hash_val, files in duplicates.items():
    print(f"Duplicate group: {files}")
# Get statistics
stats = finder.get_statistics()
print(f"Total files: {stats['total_files']}")
print(f"Duplicate files: {stats['duplicate_files']}")
# Get statistics by file extension
ext_stats = finder.get_statistics_by_extension()
for ext, data in ext_stats.items():
    print(f"{ext}: {data['count']} files, {data['total_size_bytes']} bytes")
# Delete duplicates (dry run first!)
deleted = finder.delete_duplicates(keep_first=True, dry_run=True)
print(f"Would delete: {deleted}")
# Actually delete
deleted = finder.delete_duplicates(keep_first=True, dry_run=False)
print(f"Deleted: {deleted}")
CLI Commands
scan
Scan a directory for files and store them in the database.
dupFileFinder scan /path/to/directory [--no-recursive]
Options:
--no-recursive: Don't scan subdirectories
find
Find and display duplicate files.
dupFileFinder find [--show-all]
Options:
--show-all: Display all duplicate files (default: show summary)
delete
Delete duplicate files.
dupFileFinder delete [--keep-first|--keep-last] [--dry-run|--confirm]
Options:
--keep-first: Keep the first file alphabetically (default)
--keep-last: Keep the last file alphabetically
--dry-run: Show what would be deleted without deleting (default)
--confirm: Actually delete files
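To make the keep-first and keep-last behaviour concrete, here is a small sketch of how one group of duplicate paths could be split into the copy to keep and the copies to delete. The function `split_keep_and_delete` is hypothetical, and plain `sorted()` ordering (case-sensitive, by code point) is an assumption; the tool's exact notion of "alphabetically" may differ.

```python
def split_keep_and_delete(paths, keep="first"):
    """Return (kept_path, paths_to_delete) for one group of duplicate paths."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    ordered = sorted(paths)
    if not ordered:
        raise ValueError("paths must contain at least one entry")
    if keep == "first":
        # Keep the alphabetically first path, delete the rest.
        return ordered[0], ordered[1:]
    # Keep the alphabetically last path, delete the rest.
    return ordered[-1], ordered[:-1]


kept, to_delete = split_keep_and_delete(["/b/x.txt", "/a/x.txt", "/c/x.txt"])
```

Either policy leaves exactly one copy per group; the choice only decides which location survives.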
stats
Display statistics about scanned files.
dupFileFinder stats [--by-extension]
Options:
--by-extension: Show statistics grouped by file extension
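As a rough illustration of what grouping by extension involves, the sketch below computes a per-extension file count and total size directly from the filesystem. The function `statistics_by_extension` is a hypothetical standalone helper, not the library method; it only reuses the `count` and `total_size_bytes` keys shown in the library example above. Lower-casing the extension and the `(none)` label for files without one are assumptions.

```python
from collections import defaultdict
from pathlib import Path


def statistics_by_extension(paths):
    """Return {extension: {"count": n, "total_size_bytes": size}} for paths."""
    stats = defaultdict(lambda: {"count": 0, "total_size_bytes": 0})
    for raw in paths:
        path = Path(raw)
        # Normalise case so ".TXT" and ".txt" land in the same bucket.
        ext = path.suffix.lower() or "(none)"
        stats[ext]["count"] += 1
        stats[ext]["total_size_bytes"] += path.stat().st_size
    return dict(stats)
```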
clear
Clear all data from the database.
dupFileFinder clear --confirm
Database
By default, dupFileFinder uses a SQLite database file named deduper.db in the current directory. You can specify a custom database path:
dupFileFinder --db /path/to/custom.db scan /directory
The database stores:
- File paths (absolute paths)
- File hashes (SHA256 by default)
- File sizes
- File extensions (for filtering and statistics)
- Scan timestamps
Note: If you have an existing database from an earlier version without the extension column, you'll need to rebuild it by clearing and rescanning your files.
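The stored fields listed above could map onto a single SQLite table along the following lines. This is a sketch under stated assumptions: the table name `files`, the column names, and the helpers `has_column` and `duplicate_hashes` are all hypothetical and may not match the schema the library actually creates. It also shows one way to check whether an older database lacks the extension column before deciding to rebuild it.

```python
import sqlite3

# Hypothetical schema; the real table and column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    size INTEGER NOT NULL,
    extension TEXT,
    scanned_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def has_column(conn, table, column):
    """Return True if the table has the column (table must be a trusted name)."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return any(row[1] == column for row in rows)


def duplicate_hashes(conn):
    """Return (hash, copies) for every hash stored more than once."""
    query = """
        SELECT hash, COUNT(*) AS copies
        FROM files
        GROUP BY hash
        HAVING COUNT(*) > 1
        ORDER BY hash
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
with conn:  # commits on success, rolls back if an insert fails
    conn.executemany(
        "INSERT INTO files (path, hash, size, extension) VALUES (?, ?, ?, ?)",
        [
            ("/a/x.txt", "h1", 3, ".txt"),
            ("/b/x.txt", "h1", 3, ".txt"),
            ("/c/y.md", "h2", 5, ".md"),
        ],
    )
```

Grouping on the hash column lets the database find duplicates without re-reading any files.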
Safety Features
- Dry run mode by default for deletions
- Confirmation prompts for destructive operations
- Keeps one copy of each duplicate file
- Error handling for inaccessible files
- Database transactions for data integrity
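The dry-run and error-handling ideas above can be sketched in a few lines. The function `delete_files` below is a hypothetical illustration, not the library's deletion routine: with `dry_run=True` (the default here, mirroring the tool's default) it only reports what would be removed, and in real mode it records files that could not be deleted instead of aborting.

```python
import os


def delete_files(paths, dry_run=True):
    """Return (handled, errors); with dry_run=True nothing is removed."""
    handled, errors = [], []
    for path in paths:
        if dry_run:
            # Report the path without touching the filesystem.
            handled.append(path)
            continue
        try:
            os.remove(path)
            handled.append(path)
        except OSError as exc:
            # Missing or inaccessible files are collected, not fatal.
            errors.append((path, str(exc)))
    return handled, errors
```

Running the dry run first and comparing its list with expectations is a cheap guard against deleting the wrong copies.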
Example Usage
See example.py for a complete working example. Run it with:
python example.py
License
MIT License - see LICENSE file for details.
Author
Tyler Barrus