A Python library to find and manage duplicate files

Deduper

A Python library to find and manage duplicate files. Deduper scans directories, identifies duplicate files using hash algorithms, stores the information in a SQLite database, and provides tools to manage and delete duplicates.

Features

🔍 Fast duplicate detection using SHA256 or MD5 hashing
💾 SQLite database for storing file information
🗑️ Safe deletion with dry-run mode and confirmation prompts
🖥️ Command-line interface for easy automation
📊 Statistics about scanned files and duplicates
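The hashing step behind duplicate detection can be sketched with the standard library alone. This is a minimal illustration, not Deduper's actual implementation: the function name `hash_file` and the `chunk_size` parameter are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    hash_file is a hypothetical helper, not part of Deduper's API.
    """
    hasher = hashlib.new(algorithm)  # e.g. "sha256" or "md5"
    with Path(path).open("rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Two files with identical content produce the same digest, which is what makes the digest usable as a grouping key.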

Installation

pip install dup-file-finder

Or install from source:

git clone https://github.com/barrust/deduper.git
cd deduper
pip install -e .

Quick Start

Using the CLI

1. Scan a directory

deduper scan /path/to/directory

2. Find duplicates

deduper find --show-all

3. View statistics

deduper stats

4. Delete duplicates (dry run)

deduper delete --dry-run

5. Delete duplicates (for real)

deduper delete --confirm

Using as a Library

from deduper import DuplicateFileFinder

# Initialize the finder
finder = DuplicateFileFinder(db_path="my_duplicates.db")

# Scan a directory
count = finder.scan_directory("/path/to/directory", recursive=True)
print(f"Scanned {count} files")

# Find duplicates
duplicates = finder.find_duplicates()
for hash_val, files in duplicates.items():
    print(f"Duplicate group: {files}")

# Get statistics
stats = finder.get_statistics()
print(f"Total files: {stats['total_files']}")
print(f"Duplicate files: {stats['duplicate_files']}")

# Get statistics by file extension
ext_stats = finder.get_statistics_by_extension()
for ext, data in ext_stats.items():
    print(f"{ext}: {data['count']} files, {data['total_size_bytes']} bytes")

# Delete duplicates (dry run first!)
deleted = finder.delete_duplicates(keep_first=True, dry_run=True)
print(f"Would delete: {deleted}")

# Actually delete
deleted = finder.delete_duplicates(keep_first=True, dry_run=False)
print(f"Deleted: {deleted}")

CLI Commands

`scan`

Scan a directory for files and store them in the database.

deduper scan /path/to/directory [--no-recursive]

Options:

--no-recursive: Don't scan subdirectories
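The difference between a recursive and a non-recursive scan can be illustrated with `pathlib`. This is only a sketch under the assumption that a scan collects regular files by absolute path; `list_files` is a hypothetical name and does not come from Deduper.

```python
from pathlib import Path


def list_files(directory, recursive=True):
    """Return sorted absolute paths of regular files under a directory.

    list_files is a hypothetical helper, not part of Deduper's API.
    """
    root = Path(directory).resolve()
    # rglob("*") descends into subdirectories; glob("*") stays at the top level.
    entries = root.rglob("*") if recursive else root.glob("*")
    return sorted(str(entry) for entry in entries if entry.is_file())
```

Calling `list_files(path, recursive=False)` corresponds to what `--no-recursive` is described as doing: only files directly inside the given directory are returned.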

`find`

Find and display duplicate files.

deduper find [--show-all]

Options:

--show-all: Display all duplicate files (default: show summary)
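Conceptually, finding duplicates means grouping paths by their hash and keeping only groups with more than one member. The sketch below assumes the input is a list of `(path, digest)` pairs; `group_duplicates` is a hypothetical name, and the returned shape (hash mapped to a list of paths) mirrors how the library example iterates over `duplicates.items()`.

```python
from collections import defaultdict


def group_duplicates(records):
    """Map each digest to its sorted paths, keeping only groups of two or more.

    group_duplicates is a hypothetical helper, not part of Deduper's API.
    """
    groups = defaultdict(list)
    for path, digest in records:
        groups[digest].append(path)
    # A digest seen only once has no duplicates, so it is dropped.
    return {
        digest: sorted(paths)
        for digest, paths in groups.items()
        if len(paths) > 1
    }
```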

`delete`

Delete duplicate files.

deduper delete [--keep-first|--keep-last] [--dry-run|--confirm]

Options:

--keep-first: Keep the first file alphabetically (default)
--keep-last: Keep the last file alphabetically
--dry-run: Show what would be deleted without deleting (default)
--confirm: Actually delete files
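How the keep and dry-run options interact can be sketched as two steps: first decide which paths to remove, then remove them only when not in dry-run mode. The names `plan_deletions` and `delete_files` are hypothetical, and sorting by full path is an assumption about what "alphabetically" means here.

```python
from pathlib import Path


def plan_deletions(groups, keep="first"):
    """Return the paths to delete, keeping one path per duplicate group.

    Hypothetical helper; groups maps a hash to a list of file paths.
    """
    to_delete = []
    for paths in groups.values():
        ordered = sorted(paths)
        # Keep the alphabetically first or last path; the rest are candidates.
        to_delete.extend(ordered[1:] if keep == "first" else ordered[:-1])
    return to_delete


def delete_files(paths, dry_run=True):
    """Delete the given paths unless dry_run is set; return the paths acted on."""
    for path in paths:
        if not dry_run:
            Path(path).unlink()
    return list(paths)
```

Defaulting `dry_run` to `True` means a caller has to opt in explicitly before anything is removed from disk.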

`stats`

Display statistics about scanned files.

deduper stats [--by-extension]

Options:

--by-extension: Show statistics grouped by file extension
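Per-extension statistics amount to a simple aggregation. The sketch below assumes `(path, size_in_bytes)` pairs as input and reuses the `count` and `total_size_bytes` keys from the library example; the function name, the lowercase normalization, and keeping the leading dot are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import Path


def statistics_by_extension(files):
    """Aggregate (path, size_in_bytes) pairs into per-extension totals.

    Hypothetical helper, not part of Deduper's API.
    """
    stats = defaultdict(lambda: {"count": 0, "total_size_bytes": 0})
    for path, size in files:
        # Path.suffix is "" for files without an extension.
        ext = Path(path).suffix.lower()
        stats[ext]["count"] += 1
        stats[ext]["total_size_bytes"] += size
    return dict(stats)
```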

`clear`

Clear all data from the database.

deduper clear --confirm

Database

By default, deduper uses a SQLite database file named deduper.db in the current directory. You can specify a custom database path:

deduper --db /path/to/custom.db scan /directory

The database stores:

File paths (absolute paths)
File hashes (SHA256 by default)
File sizes
File extensions (for filtering and statistics)
Scan timestamps

Note: If you have an existing database from an earlier version without the extension column, you'll need to rebuild it by clearing the database (deduper clear --confirm) and rescanning your files.
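The stored fields listed above could map onto a single SQLite table. The schema below is a sketch only: the table name `files`, the column names, and the helper functions are assumptions, and Deduper's real schema may differ. It also shows how duplicates can be found with a `GROUP BY` query and how the presence of the extension column can be checked.

```python
import sqlite3

# Hypothetical schema; the real table and column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    size INTEGER NOT NULL,
    extension TEXT,
    scanned_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

# Hashes that occur more than once identify groups of duplicates.
DUPLICATES_QUERY = """
SELECT hash, COUNT(*) AS copies
FROM files
GROUP BY hash
HAVING COUNT(*) > 1
"""


def open_database(db_path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def has_extension_column(conn):
    """Check whether the files table has an extension column."""
    # Each row of PRAGMA table_info describes one column; index 1 is its name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(files)")]
    return "extension" in columns
```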

Safety Features

Dry run mode by default for deletions
Confirmation prompts for destructive operations
Keeps one copy from each group of duplicate files
Error handling for inaccessible files
Database transactions for data integrity
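The transaction guarantee in the last item can be illustrated with Python's `sqlite3` module, where using the connection as a context manager commits on success and rolls back on error. This is a sketch of the general technique, not Deduper's code; `record_files` and the two-column `files` table are hypothetical.

```python
import sqlite3


def record_files(conn, rows):
    """Insert (path, digest) rows atomically: all of them or none.

    Hypothetical helper; assumes a files table with path and hash columns.
    """
    # The connection context manager commits if the block succeeds
    # and rolls back if any statement raises.
    with conn:
        conn.executemany("INSERT INTO files (path, hash) VALUES (?, ?)", rows)
```

If one row in a batch violates a constraint, the exception propagates to the caller and none of the rows from that batch remain in the table.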

Example Usage

See example.py for a complete working example. Run it with:

python example.py

License

MIT License - see LICENSE file for details.

Author

Tyler Barrus

Project details

Maintainers

barrust

Release history

0.0.3: Jan 6, 2026
0.0.2: Dec 30, 2025
0.0.1: Dec 30, 2025 (this version)

Download files

Source Distribution

dup_file_finder-0.0.1.tar.gz (11.8 kB)

Uploaded Dec 30, 2025 Source

Built Distribution

dup_file_finder-0.0.1-py3-none-any.whl (9.3 kB)

Uploaded Dec 30, 2025 Python 3

File details

Details for the file dup_file_finder-0.0.1.tar.gz.

File metadata

Download URL: dup_file_finder-0.0.1.tar.gz
Upload date: Dec 30, 2025
Size: 11.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dup_file_finder-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`46a25932a4d594ea133f64c656c26614a5128a1c9b2c24df483f8a903d4b90a5`
MD5	`6d7dbee8ab91d31e17fb25e2dcb27858`
BLAKE2b-256	`5a6de47718137283ba18ddd40956e5f578d99d8bdc18e42a77dbe2ddf0a709b1`

Provenance

The following attestation bundles were made for dup_file_finder-0.0.1.tar.gz:

Publisher: publish.yml on barrust/dup-file-finder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dup_file_finder-0.0.1.tar.gz
- Subject digest: 46a25932a4d594ea133f64c656c26614a5128a1c9b2c24df483f8a903d4b90a5
- Sigstore transparency entry: 782313810
- Sigstore integration time: Dec 30, 2025
Source repository:
- Permalink: barrust/dup-file-finder@2440730958a2265fcb3bf2b749ac6051c90b5faa
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/barrust
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2440730958a2265fcb3bf2b749ac6051c90b5faa
- Trigger Event: release

File details

Details for the file dup_file_finder-0.0.1-py3-none-any.whl.

File metadata

Download URL: dup_file_finder-0.0.1-py3-none-any.whl
Upload date: Dec 30, 2025
Size: 9.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dup_file_finder-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a94dc7ccc6c44c3c0eebca94c2cb43f9d78df40b14662fc75b798ed418230114`
MD5	`80ffb6bc32cc5fe01daacda997115659`
BLAKE2b-256	`0e27bd8bf7fd91ad874bb067d8e4eccd828f8462b8aec64d5be9b721fc2858a1`

Provenance

The following attestation bundles were made for dup_file_finder-0.0.1-py3-none-any.whl:

Publisher: publish.yml on barrust/dup-file-finder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dup_file_finder-0.0.1-py3-none-any.whl
- Subject digest: a94dc7ccc6c44c3c0eebca94c2cb43f9d78df40b14662fc75b798ed418230114
- Sigstore transparency entry: 782313814
- Sigstore integration time: Dec 30, 2025
Source repository:
- Permalink: barrust/dup-file-finder@2440730958a2265fcb3bf2b749ac6051c90b5faa
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/barrust
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2440730958a2265fcb3bf2b749ac6051c90b5faa
- Trigger Event: release
