Deduper

A Python library to find and manage duplicate files. Deduper scans directories, identifies duplicate files using hash algorithms, stores the information in a SQLite database, and provides tools to manage and delete duplicates.
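
The hash-based detection described above can be sketched with the standard library alone. This is a minimal illustration, not Deduper's actual implementation; the helper names hash_file and group_by_hash, and the chunk size, are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths, algorithm="sha256"):
    """Map each digest to the sorted paths sharing it, keeping groups of two or more."""
    groups = defaultdict(list)
    for path in paths:
        groups[hash_file(path, algorithm)].append(str(path))
    return {d: sorted(files) for d, files in groups.items() if len(files) > 1}
```

Files whose contents are byte-for-byte identical produce the same digest, so any group with more than one path is a set of duplicates.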

Features

  • 🔍 Fast duplicate detection using SHA256 or MD5 hashing
  • 💾 SQLite database for storing file information
  • 🗑️ Safe deletion with dry-run mode and confirmation prompts
  • 🖥️ Command-line interface for easy automation
  • 📊 Statistics about scanned files and duplicates

Installation

pip install dup-file-finder

Or install from source:

git clone https://github.com/barrust/deduper.git
cd deduper
pip install -e .

Quick Start

Using the CLI

1. Scan a directory

deduper scan /path/to/directory

2. Find duplicates

deduper find --show-all

3. View statistics

deduper stats

4. Delete duplicates (dry run)

deduper delete --dry-run

5. Delete duplicates (for real)

deduper delete --confirm

Using as a Library

from deduper import DuplicateFileFinder

# Initialize the finder
finder = DuplicateFileFinder(db_path="my_duplicates.db")

# Scan a directory
count = finder.scan_directory("/path/to/directory", recursive=True)
print(f"Scanned {count} files")

# Find duplicates
duplicates = finder.find_duplicates()
for hash_val, files in duplicates.items():
    print(f"Duplicate group: {files}")

# Get statistics
stats = finder.get_statistics()
print(f"Total files: {stats['total_files']}")
print(f"Duplicate files: {stats['duplicate_files']}")

# Get statistics by file extension
ext_stats = finder.get_statistics_by_extension()
for ext, data in ext_stats.items():
    print(f"{ext}: {data['count']} files, {data['total_size_bytes']} bytes")

# Delete duplicates (dry run first!)
deleted = finder.delete_duplicates(keep_first=True, dry_run=True)
print(f"Would delete: {deleted}")

# Actually delete
deleted = finder.delete_duplicates(keep_first=True, dry_run=False)
print(f"Deleted: {deleted}")

CLI Commands

scan

Scan a directory for files and store them in the database.

deduper scan /path/to/directory [--no-recursive]

Options:

  • --no-recursive: Don't scan subdirectories
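
The difference between a recursive and a non-recursive scan can be illustrated with pathlib. This is a sketch under the assumption that only regular files are of interest; list_files is a hypothetical helper, and Deduper's own traversal may differ.

```python
from pathlib import Path


def list_files(directory, recursive=True):
    """Return absolute paths of regular files, optionally descending into subdirectories."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(str(p.resolve()) for p in candidates if p.is_file())
```

With recursive=False only the files directly inside the directory are returned, which corresponds to what --no-recursive is described as doing.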

find

Find and display duplicate files.

deduper find [--show-all]

Options:

  • --show-all: Display all duplicate files (default: show summary)

delete

Delete duplicate files.

deduper delete [--keep-first|--keep-last] [--dry-run|--confirm]

Options:

  • --keep-first: Keep the first file alphabetically (default)
  • --keep-last: Keep the last file alphabetically
  • --dry-run: Show what would be deleted without deleting (default)
  • --confirm: Actually delete files
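
The keep-first and keep-last rules, combined with a dry run, might look like the following for a single duplicate group. This is a hedged sketch: plan_deletions and delete_group are hypothetical names, and the alphabetical ordering is taken from the option descriptions above.

```python
import os


def plan_deletions(files, keep="first"):
    """Split one duplicate group into (kept_path, paths_to_delete) by alphabetical order."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    ordered = sorted(files)
    if keep == "first":
        return ordered[0], ordered[1:]
    return ordered[-1], ordered[:-1]


def delete_group(files, keep="first", dry_run=True):
    """Return the paths removed (or that would be removed when dry_run is True)."""
    _, doomed = plan_deletions(files, keep)
    removed = []
    for path in doomed:
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                continue  # skip files that cannot be removed
        removed.append(path)
    return removed
```

Defaulting dry_run to True mirrors the CLI, where nothing is deleted unless --confirm is passed.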

stats

Display statistics about scanned files.

deduper stats [--by-extension]

Options:

  • --by-extension: Show statistics grouped by file extension
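
Grouping by extension can be sketched as a simple aggregation. The count and total_size_bytes keys follow the library example earlier in this document; the function name statistics_by_extension, the lowercasing of extensions, and the "(none)" label for files without an extension are assumptions.

```python
import os
from collections import defaultdict


def statistics_by_extension(paths):
    """Aggregate file count and total size in bytes per (lowercased) extension."""
    stats = defaultdict(lambda: {"count": 0, "total_size_bytes": 0})
    for path in paths:
        ext = os.path.splitext(path)[1].lower() or "(none)"
        stats[ext]["count"] += 1
        stats[ext]["total_size_bytes"] += os.path.getsize(path)
    return dict(stats)
```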

clear

Clear all data from the database.

deduper clear --confirm

Database

By default, deduper uses a SQLite database file named deduper.db in the current directory. You can specify a custom database path:

deduper --db /path/to/custom.db scan /directory

The database stores:

  • File paths (absolute paths)
  • File hashes (SHA256 by default)
  • File sizes
  • File extensions (for filtering and statistics)
  • Scan timestamps
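
A table holding the fields listed above, and a query that finds duplicate groups, could look roughly like this. The table name files, the column names, and the helper find_duplicate_groups are hypothetical; the actual schema used by deduper is not documented here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path TEXT PRIMARY KEY,                        -- absolute path
    hash TEXT NOT NULL,                           -- hex digest (SHA256 by default)
    size INTEGER NOT NULL,                        -- size in bytes
    extension TEXT,                               -- e.g. ".txt"
    scanned_at TEXT DEFAULT CURRENT_TIMESTAMP     -- scan timestamp
)
"""


def find_duplicate_groups(conn):
    """Return {hash: [paths]} for every hash stored more than once."""
    query = """
        SELECT hash, path FROM files
        WHERE hash IN (SELECT hash FROM files GROUP BY hash HAVING COUNT(*) > 1)
        ORDER BY hash, path
    """
    groups = {}
    for digest, path in conn.execute(query):
        groups.setdefault(digest, []).append(path)
    return groups


conn = sqlite3.connect(":memory:")  # in-memory database for illustration only
conn.execute(SCHEMA)
with conn:
    conn.executemany(
        "INSERT INTO files (path, hash, size, extension) VALUES (?, ?, ?, ?)",
        [("/a/x.txt", "abc", 3, ".txt"), ("/b/x.txt", "abc", 3, ".txt"), ("/a/y.md", "def", 5, ".md")],
    )
duplicates = find_duplicate_groups(conn)
```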

Note: If you have an existing database from an earlier version without the extension column, you'll need to rebuild it by clearing and rescanning your files.

Safety Features

  • Dry run mode by default for deletions
  • Confirmation prompts for destructive operations
  • Keeps one copy of each duplicate file
  • Error handling for inaccessible files
  • Database transactions for data integrity
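
The transaction point can be illustrated with the standard sqlite3 module, whose connection object acts as a context manager that commits on success and rolls back on error. This is a sketch of the general technique only; store_batch and the two-column table are hypothetical and not taken from Deduper.

```python
import sqlite3


def store_batch(conn, rows):
    """Insert all rows in one transaction; on error, roll back and return False."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.executemany("INSERT INTO files (path, hash) VALUES (?, ?)", rows)
    except sqlite3.Error:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, hash TEXT NOT NULL)")
ok = store_batch(conn, [("/a", "h1"), ("/b", "h2")])
# The second batch repeats "/c", so the whole batch is rolled back.
failed = store_batch(conn, [("/c", "h3"), ("/c", "h4")])
```

Because the failing batch is rolled back as a unit, a partially written scan never ends up in the database.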

Example Usage

See example.py for a complete working example. Run it with:

python example.py

License

MIT License - see LICENSE file for details.

Author

Tyler Barrus
