Duplicate File Finder

A Python library to find and manage duplicate files. It scans directories, identifies duplicate files using hash algorithms, stores the information in a SQLite database, and provides tools to manage and delete duplicates.

Features

  • 🔍 Fast duplicate detection using SHA256 or MD5 hashing
  • 💾 SQLite database for storing file information
  • 🗑️ Safe deletion with dry-run mode and confirmation prompts
  • 🖥️ Command-line interface for easy automation
  • 📊 Statistics about scanned files and duplicates
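
As a rough illustration of hash-based duplicate detection, the sketch below hashes files in chunks with the standard library and groups paths that share a digest. It does not show this library's internals; the function names `file_digest` and `group_by_digest` and the 1 MiB chunk size are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.new(algorithm)  # e.g. "sha256" or "md5"
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_by_digest(paths, algorithm="sha256"):
    """Map digest -> sorted paths, keeping only digests shared by 2+ files."""
    groups = {}
    for path in paths:
        groups.setdefault(file_digest(path, algorithm), []).append(str(path))
    return {digest: sorted(found) for digest, found in groups.items() if len(found) > 1}
```

Reading in chunks keeps memory use flat for large files, which is why the sketch avoids calling `read()` on the whole file at once.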

Documentation

Documentation is hosted on Read the Docs (readthedocs.org).

Installation

Install from PyPI:

pip install dup-file-finder

Or install from source:

git clone https://github.com/barrust/dup-file-finder.git
cd dup-file-finder
pip install -e .

Quick Start

Using the CLI

1. Scan a directory

dupFileFinder scan /path/to/directory

2. Find duplicates

dupFileFinder find --show-all

3. View statistics

dupFileFinder stats

4. Delete duplicates (dry run)

dupFileFinder delete --dry-run

5. Delete duplicates (for real)

dupFileFinder delete --confirm

Using as a Library

from dup_file_finder import DuplicateFileFinder

# Initialize the finder
finder = DuplicateFileFinder(db_path="my_duplicates.db")

# Scan a directory
count = finder.scan_directory("/path/to/directory", recursive=True)
print(f"Scanned {count} files")

# Find duplicates
duplicates = finder.find_duplicates()
for hash_val, files in duplicates.items():
    print(f"Duplicate group: {files}")

# Get statistics
stats = finder.get_statistics()
print(f"Total files: {stats['total_files']}")
print(f"Duplicate files: {stats['duplicate_files']}")

# Get statistics by file extension
ext_stats = finder.get_statistics_by_extension()
for ext, data in ext_stats.items():
    print(f"{ext}: {data['count']} files, {data['total_size_bytes']} bytes")

# Delete duplicates (dry run first!)
deleted = finder.delete_duplicates(keep_first=True, dry_run=True)
print(f"Would delete: {deleted}")

# Actually delete
deleted = finder.delete_duplicates(keep_first=True, dry_run=False)
print(f"Deleted: {deleted}")

CLI Commands

scan

Scan a directory for files and store them in the database.

dupFileFinder scan /path/to/directory [--no-recursive]

Options:

  • --no-recursive: Don't scan subdirectories
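
The difference between a recursive and a non-recursive scan can be sketched with the standard library alone. This is only an illustration of the idea, not the library's scanning code; `iter_files` is a hypothetical helper name.

```python
import os


def iter_files(root, recursive=True):
    """Yield file paths under root, optionally descending into subdirectories."""
    if recursive:
        # os.walk visits root and every directory below it.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                yield os.path.join(dirpath, name)
    else:
        # os.scandir lists only the entries directly inside root.
        with os.scandir(root) as entries:
            for entry in entries:
                if entry.is_file():
                    yield entry.path
```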

find

Find and display duplicate files.

dupFileFinder find [--show-all]

Options:

  • --show-all: Display all duplicate files (default: show summary)

delete

Delete duplicate files.

dupFileFinder delete [--keep-first|--keep-last] [--dry-run|--confirm]

Options:

  • --keep-first: Keep the alphabetically first file in each duplicate group (default)
  • --keep-last: Keep the alphabetically last file in each duplicate group
  • --dry-run: Show what would be deleted without deleting (default)
  • --confirm: Actually delete files
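
The keep-first, keep-last, and dry-run behaviour described above can be pictured with a small sketch. It assumes "alphabetically" means a plain string sort of the paths; the helper name `remove_duplicates` and its `keep` parameter are hypothetical and are not part of the library's API.

```python
import os


def remove_duplicates(group, keep="first", dry_run=True):
    """Return the paths chosen for deletion from one group of duplicates."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    ordered = sorted(group)  # plain string sort of the paths
    # Keep exactly one file: either the first or the last after sorting.
    to_delete = ordered[1:] if keep == "first" else ordered[:-1]
    if not dry_run:
        for path in to_delete:
            os.remove(path)
    return to_delete
```

With `dry_run=True` the function only reports what it would remove, so the result can be reviewed before anything is deleted.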

stats

Display statistics about scanned files.

dupFileFinder stats [--by-extension]

Options:

  • --by-extension: Show statistics grouped by file extension
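
For a sense of what per-extension statistics involve, here is a minimal sketch that tallies file counts and total sizes directly from the file system. The `count` and `total_size_bytes` keys mirror the library example earlier in this README; the function name `stats_by_extension`, the lowercasing of extensions, and the "(none)" label for files without an extension are assumptions.

```python
import os
from collections import defaultdict
from pathlib import Path


def stats_by_extension(paths):
    """Return {extension: {"count": int, "total_size_bytes": int}}."""
    stats = defaultdict(lambda: {"count": 0, "total_size_bytes": 0})
    for path in paths:
        # Path.suffix is "" for names without an extension.
        ext = Path(path).suffix.lower() or "(none)"
        stats[ext]["count"] += 1
        stats[ext]["total_size_bytes"] += os.path.getsize(path)
    return dict(stats)
```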

clear

Clear all data from the database.

dupFileFinder clear --confirm

Database

By default, dupFileFinder uses a SQLite database file named deduper.db in the current directory. To use a different database, pass the --db option before the subcommand:

dupFileFinder --db /path/to/custom.db scan /directory

The database stores:

  • File paths (absolute paths)
  • File hashes (SHA256 by default)
  • File sizes
  • File extensions (for filtering and statistics)
  • Scan timestamps

Note: If you have an existing database from an earlier version without the extension column, you'll need to rebuild it by clearing and rescanning your files.
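
The stored fields listed above could be modelled roughly as follows. This is a hedged sketch, not the library's actual schema: the table name `files`, every column name, and the index are hypothetical, chosen only to show how a GROUP BY query can surface duplicate hashes.

```python
import sqlite3

# Hypothetical schema covering path, hash, size, extension, and scan time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path       TEXT PRIMARY KEY,
    hash       TEXT NOT NULL,
    size       INTEGER NOT NULL,
    extension  TEXT,
    scanned_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_files_hash ON files (hash);
"""

# Hashes that appear more than once identify groups of duplicates.
DUPLICATES_QUERY = """
SELECT hash, COUNT(*) AS copies, SUM(size) AS total_bytes
FROM files
GROUP BY hash
HAVING COUNT(*) > 1
ORDER BY hash
"""


def open_db(db_path=":memory:"):
    """Open (or create) the database and make sure the schema exists."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def duplicate_hashes(conn):
    """Return (hash, copies, total_bytes) rows for hashes seen 2+ times."""
    return conn.execute(DUPLICATES_QUERY).fetchall()
```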

Safety Features

  • Dry run mode by default for deletions
  • Confirmation prompts for destructive operations
  • Keeps one copy from each group of duplicate files
  • Error handling for inaccessible files
  • Database transactions for data integrity
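
Two of the points above, error handling for inaccessible files and database transactions, can be illustrated with the standard library. The sketch below is an assumption about how such safeguards might look, not the library's code; the `sizes` table and the helper names `readable_size` and `store_sizes` are hypothetical.

```python
import os
import sqlite3


def readable_size(path):
    """Return the file size, or None if the file cannot be accessed."""
    try:
        return os.path.getsize(path)
    except OSError:  # missing file, permission denied, and similar
        return None


def store_sizes(conn, paths):
    """Insert all readable files in one transaction; return skipped paths."""
    skipped = []
    # Using the connection as a context manager commits on success and
    # rolls back if an exception escapes the block.
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sizes (path TEXT PRIMARY KEY, size INTEGER)"
        )
        for path in paths:
            size = readable_size(path)
            if size is None:
                skipped.append(path)
                continue
            conn.execute("INSERT OR REPLACE INTO sizes VALUES (?, ?)", (path, size))
    return skipped
```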

Example Usage

See example.py for a complete working example. Run it with:

python example.py

License

MIT License - see LICENSE file for details.

Author

Tyler Barrus
