A Python library to find and manage duplicate files

Duplicate File Finder

A Python library to find and manage duplicate files. It scans directories, identifies duplicate files using hash algorithms, stores the information in a SQLite database, and provides tools to manage and delete duplicates.

Features

  • 🔍 Fast duplicate detection using SHA256 or MD5 hashing
  • 💾 SQLite database for storing file information
  • 🗑️ Safe deletion with dry-run mode and confirmation prompts
  • 🖥️ Command-line interface for easy automation
  • 📊 Statistics about scanned files and duplicates
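
As an illustration of hash-based detection, the sketch below hashes files in chunks and groups paths that share a digest. It is not the library's own code: `file_digest` and `group_by_digest` are hypothetical helper names, and the chunk size is an arbitrary assumption.

```python
import hashlib
from collections import defaultdict


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files are never loaded whole."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_by_digest(paths, algorithm="sha256"):
    """Return {digest: [paths]} for digests shared by more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path, algorithm)].append(str(path))
    return {
        digest: sorted(files)
        for digest, files in groups.items()
        if len(files) > 1
    }
```

Files with identical content always produce the same digest, so any group with two or more paths is a set of duplicates.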

Installation

Install from PyPI:

pip install dup-file-finder

Or install from source:

git clone https://github.com/barrust/dup-file-finder.git
cd dup-file-finder
pip install -e .

Quick Start

Using the CLI

1. Scan a directory

dupFileFinder scan /path/to/directory

2. Find duplicates

dupFileFinder find --show-all

3. View statistics

dupFileFinder stats

4. Delete duplicates (dry run)

dupFileFinder delete --dry-run

5. Delete duplicates (for real)

dupFileFinder delete --confirm

Using as a Library

from dup_file_finder import DuplicateFileFinder

# Initialize the finder
finder = DuplicateFileFinder(db_path="my_duplicates.db")

# Scan a directory
count = finder.scan_directory("/path/to/directory", recursive=True)
print(f"Scanned {count} files")

# Find duplicates
duplicates = finder.find_duplicates()
for hash_val, files in duplicates.items():
    print(f"Duplicate group: {files}")

# Get statistics
stats = finder.get_statistics()
print(f"Total files: {stats['total_files']}")
print(f"Duplicate files: {stats['duplicate_files']}")

# Get statistics by file extension
ext_stats = finder.get_statistics_by_extension()
for ext, data in ext_stats.items():
    print(f"{ext}: {data['count']} files, {data['total_size_bytes']} bytes")

# Delete duplicates (dry run first!)
deleted = finder.delete_duplicates(keep_first=True, dry_run=True)
print(f"Would delete: {deleted}")

# Actually delete
deleted = finder.delete_duplicates(keep_first=True, dry_run=False)
print(f"Deleted: {deleted}")

CLI Commands

scan

Scan a directory and record each file's path, hash, size, and extension in the database.

dupFileFinder scan /path/to/directory [--no-recursive]

Options:

  • --no-recursive: Don't scan subdirectories
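
The difference between a recursive and a non-recursive scan can be sketched with the standard library alone. This is a minimal illustration, not the tool's implementation; `iter_files` is a hypothetical name, and resolving to absolute paths is an assumption based on the database storing absolute paths.

```python
from pathlib import Path


def iter_files(root, recursive=True):
    """Yield regular files under root, descending only if recursive is set."""
    base = Path(root)
    # rglob("*") walks the whole tree; iterdir() lists only the top level.
    candidates = base.rglob("*") if recursive else base.iterdir()
    for entry in candidates:
        if entry.is_file():
            yield entry.resolve()
```

With `recursive=False`, files inside subdirectories are skipped, which mirrors what `--no-recursive` is described as doing.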

find

Find and display duplicate files.

dupFileFinder find [--show-all]

Options:

  • --show-all: Display all duplicate files (default: show summary)

delete

Delete duplicate files.

dupFileFinder delete [--keep-first|--keep-last] [--dry-run|--confirm]

Options:

  • --keep-first: Keep the first file alphabetically (default)
  • --keep-last: Keep the last file alphabetically
  • --dry-run: Show what would be deleted without deleting (default)
  • --confirm: Actually delete files
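
To make the keep-first, keep-last, and dry-run options concrete, here is a rough sketch of how such a deletion plan could be built. The names `plan_deletions` and `apply_plan` are hypothetical, and sorting by the full path string is an assumption about what "alphabetically" means here.

```python
import os


def plan_deletions(duplicates, keep_first=True):
    """Given {hash: [paths]}, return paths to remove, keeping one per group."""
    to_delete = []
    for files in duplicates.values():
        ordered = sorted(files)
        keeper = ordered[0] if keep_first else ordered[-1]
        to_delete.extend(path for path in ordered if path != keeper)
    return to_delete


def apply_plan(paths, dry_run=True):
    """Return the paths removed, or that would be removed when dry_run is set."""
    handled = []
    for path in paths:
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                continue  # skip files that vanished or cannot be removed
        handled.append(path)
    return handled
```

Running `apply_plan(plan_deletions(groups))` with the default `dry_run=True` only reports candidates, so the output can be reviewed before anything is removed.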

stats

Display statistics about scanned files.

dupFileFinder stats [--by-extension]

Options:

  • --by-extension: Show statistics grouped by file extension
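
Grouping by extension amounts to a simple aggregation. The sketch below assumes the input is a list of `(path, size_bytes)` pairs and reuses the `count` and `total_size_bytes` keys shown in the library example; lower-casing extensions and the `"(none)"` label for files without one are assumptions, and `stats_by_extension` is a hypothetical name.

```python
from collections import defaultdict
from pathlib import Path


def stats_by_extension(files):
    """Aggregate (path, size_bytes) pairs into per-extension totals."""
    stats = defaultdict(lambda: {"count": 0, "total_size_bytes": 0})
    for path, size in files:
        # Normalise case so ".TXT" and ".txt" land in the same bucket.
        ext = Path(path).suffix.lower() or "(none)"
        stats[ext]["count"] += 1
        stats[ext]["total_size_bytes"] += size
    return dict(stats)
```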

clear

Clear all data from the database.

dupFileFinder clear --confirm

Database

By default, dupFileFinder uses a SQLite database file named deduper.db in the current directory. You can specify a custom database path:

dupFileFinder --db /path/to/custom.db scan /directory

The database stores:

  • File paths (absolute paths)
  • File hashes (SHA256 by default)
  • File sizes
  • File extensions (for filtering and statistics)
  • Scan timestamps
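
For orientation, a table holding these fields might look like the sketch below. The actual schema is not documented here, so the table name `files`, the column names, and the index are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema covering the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path       TEXT PRIMARY KEY,
    hash       TEXT NOT NULL,
    size       INTEGER NOT NULL,
    extension  TEXT,
    scanned_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_files_hash ON files (hash);
"""


def open_db(db_path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def duplicate_hashes(conn):
    """Return hashes that appear on more than one row."""
    rows = conn.execute(
        "SELECT hash FROM files GROUP BY hash HAVING COUNT(*) > 1 ORDER BY hash"
    )
    return [row[0] for row in rows]
```

Indexing the hash column keeps the `GROUP BY` lookup cheap as the number of scanned files grows.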

Note: If you have an existing database from an earlier version without the extension column, you'll need to rebuild it by clearing and rescanning your files.

Safety Features

  • Dry run mode by default for deletions
  • Confirmation prompts for destructive operations
  • Keeps one copy of each duplicate file
  • Error handling for inaccessible files
  • Database transactions for data integrity
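
As an example of what transactional writes provide, the sketch below uses the `sqlite3` connection as a context manager, which commits on success and rolls back if a statement fails. The table layout and the `make_db` and `record_files` names are hypothetical; the library's own write path may differ.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a minimal, hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, hash TEXT NOT NULL)")
    return conn


def record_files(conn, rows):
    """Insert all rows or none; return True on success, False on failure."""
    try:
        # Commits when the block exits cleanly, rolls back on an exception.
        with conn:
            conn.executemany("INSERT INTO files (path, hash) VALUES (?, ?)", rows)
    except sqlite3.Error:
        return False
    return True
```

If one row in a batch violates a constraint, none of the batch is stored, so an interrupted or failed scan does not leave partial results behind.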

Example Usage

See example.py for a complete working example. Run it with:

python example.py

License

MIT License - see LICENSE file for details.

Author

Tyler Barrus
