Image Duplicates Detective (imgduptective)

Find near-duplicate and exact-duplicate images in your photo collections using perceptual hashing.

How it works

imgduptective uses a gradient-based horizontal difference hash (dhash) to create a perceptual fingerprint of each image. Visually similar images produce similar hashes, even if the files differ in format, resolution, or compression. A Hamming distance threshold sets the maximum number of differing hash bits at which two images still count as duplicates.
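As a rough illustration of the idea, here is a minimal pure-Python sketch of a horizontal difference hash and a Hamming distance check. It is not imgduptective's actual code: the function names, the 8x9 grid size, and the bit convention (1 when the left pixel is brighter) are assumptions. In practice the grayscale grid would typically come from shrinking the image with Pillow first, for example via `Image.convert("L")` and `Image.resize((9, 8))`.

```python
from typing import Sequence


def dhash_bits(gray: Sequence[Sequence[int]]) -> int:
    """Horizontal difference hash of a small grayscale grid.

    `gray` is a list of rows; each row has one more value than the number
    of bits it contributes (8 rows of 9 values give a 64-bit hash).
    """
    value = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            # One bit per adjacent pair: 1 if brightness drops left to right.
            value = (value << 1) | (1 if left > right else 0)
    return value


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_duplicate(a: int, b: int, threshold: int) -> bool:
    """Two images match when their hashes differ in at most `threshold` bits."""
    return hamming(a, b) <= threshold


# Hypothetical data: a left-to-right brightness ramp and a slightly altered copy.
ramp = [[col * 10 for col in range(9)] for _ in range(8)]
altered = [row[:] for row in ramp]
altered[0][0] = 15  # flips exactly one comparison in the first row

print(hamming(dhash_bits(ramp), dhash_bits(altered)))
```

Because only the sign of each neighbor-to-neighbor brightness change is kept, uniform changes in brightness, scale, or compression quality tend to leave most bits untouched, which is why a small threshold such as 5 still catches re-encoded copies.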

Results are cached in a local SQLite database (~/.config/imgduptective/), so subsequent runs are fast: only new or modified files are processed.

Installation

pip install .

Or for development:

pip install -e .

Requires Python 3.10+ and Pillow.

Usage

# Find near-duplicates with a Hamming distance threshold of 5
imgduptective 5

# Find exact duplicates only (identical file content)
imgduptective --exact

# Add files to the database without comparing
imgduptective --add

# Check what duplicates would be found if the current directory were added
imgduptective --check 5

# Show per-directory statistics
imgduptective --stats 5

# Open the built-in viewer to inspect and delete duplicates
imgduptective --view 5

Options

Flag             Description
threshold        Maximum Hamming distance to consider a match (0 = identical perceptual hash)
--view           Open the tkinter viewer to browse and manage duplicate groups
--stats          Show per-directory duplicate statistics
--check          Preview what duplicates would be found without modifying the database
--add            Scan and hash files into the database without comparing
--photos         Only process common photo formats (jpg, png, heic, webp, tiff, bmp, gif)
--exact          Find exact file matches (same content) instead of perceptually similar
--no-scan        Skip file scanning/hashing entirely, use the database cache only
--full-hash      Use full-file SHA-1 instead of the default fast 64 KB partial hash
--project NAME   Use a named project database (e.g., work, personal, holidays)
--list-projects  List available project databases with file counts

Projects

Organize separate photo collections into named projects. Each project has its own database:

# Scan work photos
cd ~/Photos/Work
imgduptective --project work --add

# Scan holiday photos
cd ~/Photos/Holidays
imgduptective --project holidays --add

# Find duplicates within holidays
imgduptective --project holidays 5

# List all projects
imgduptective --list-projects

Without --project, the default database is used.

Performance

The tool uses several strategies to minimize scan time:

  • Partial hashing (default): For change detection, only the first 64 KB of each file is hashed, together with the file size. This is sufficient to distinguish different images while being 10-100x faster than full-file hashing on large files.
  • Stat-based caching: On repeat scans, files whose size and modification time haven't changed skip hashing entirely (a single stat() call per file).
  • --no-scan: For re-running comparisons with different thresholds without any file I/O.
  • --full-hash: Forces a full SHA-1 of the entire file contents when exact integrity verification is needed.
  • Multiprocessing: File hashing, image hash computation, and pair comparison all run in parallel.
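The first two strategies can be sketched as follows. This is an illustrative sketch only: the function names, the in-memory dict standing in for the database, the use of SHA-1 for the partial hash, and the way the file size is mixed into the digest are all assumptions, not the tool's confirmed implementation.

```python
import hashlib
import os

PARTIAL_BYTES = 64 * 1024  # first 64 KB of the file


def partial_file_hash(path: str) -> str:
    """Hash the file size plus the first 64 KB of content (assumed layout)."""
    size = os.path.getsize(path)
    digest = hashlib.sha1(str(size).encode())
    with open(path, "rb") as fh:
        digest.update(fh.read(PARTIAL_BYTES))
    return digest.hexdigest()


def file_hash_cached(path: str, cache: dict) -> str:
    """Reuse the stored hash when size and modification time are unchanged."""
    st = os.stat(path)
    key = (st.st_size, st.st_mtime_ns)
    entry = cache.get(path)
    if entry is not None and entry[0] == key:
        return entry[1]  # unchanged: one stat() call, no file read
    file_hash = partial_file_hash(path)
    cache[path] = (key, file_hash)
    return file_hash
```

Including the size in the digest means two files that share their first 64 KB but differ in length still get different hashes. Files of equal size that differ only after the first 64 KB would collide, which is the case --full-hash exists for.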

Viewer

The built-in tkinter viewer (--view) displays duplicate groups side by side:

  • ←/→ or n/p/space: Navigate between groups
  • Click: Select/deselect images for deletion
  • d or Delete: Compress selected files with gzip and remove originals
  • q or Escape: Quit
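The delete action described above (compress with gzip, then remove the original) could look roughly like this. The function name and the choice to write the archive next to the original as `<path>.gz` are assumptions made for illustration.

```python
import gzip
import os
import shutil


def compress_and_remove(path: str) -> str:
    """Write a gzip copy at path + '.gz', then delete the original file."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Only reached if the compressed copy was written without raising.
    os.remove(path)
    return gz_path
```

Keeping a compressed copy instead of unlinking the file outright makes an accidental deletion recoverable with `gunzip`.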

Database

Hashes are stored in ~/.config/imgduptective/:

  • imgduptective.db: default project
  • imgduptective-{name}.db: named projects
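The naming rule above maps a project name to a database file. A small sketch of that mapping, with a hypothetical helper name and constant:

```python
from pathlib import Path
from typing import Optional

# Directory named in this README; the constant name is an assumption.
CONFIG_DIR = Path.home() / ".config" / "imgduptective"


def database_path(project: Optional[str] = None) -> Path:
    """Return the database file for a named project, or the default one."""
    if project is None:
        name = "imgduptective.db"
    else:
        name = f"imgduptective-{project}.db"
    return CONFIG_DIR / name


print(database_path("holidays").name)
```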

The database has two tables:

  • HashValueTable: Content-addressed cache mapping file hashes to image perceptual hashes
  • FileTable: Maps file paths to their file hash, image hash, size, and modification time

Files that no longer exist are automatically pruned from the database on each scan.
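To make the layout concrete, here is a hypothetical schema for the two tables along with a pruning pass. Only the table names come from the description above. The column names, types, and helper functions are assumptions and may not match the real database.

```python
import os
import sqlite3

# Assumed columns; the README only names the tables and what they map.
SCHEMA = """
CREATE TABLE IF NOT EXISTS HashValueTable (
    file_hash  TEXT PRIMARY KEY,
    image_hash TEXT
);
CREATE TABLE IF NOT EXISTS FileTable (
    path       TEXT PRIMARY KEY,
    file_hash  TEXT NOT NULL,
    image_hash TEXT,
    size       INTEGER NOT NULL,
    mtime      REAL NOT NULL
);
"""


def open_db(db_path: str) -> sqlite3.Connection:
    """Open the database and create both tables if they are missing."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def prune_missing(conn: sqlite3.Connection) -> int:
    """Delete FileTable rows whose path no longer exists on disk."""
    paths = [row[0] for row in conn.execute("SELECT path FROM FileTable")]
    missing = [(p,) for p in paths if not os.path.exists(p)]
    conn.executemany("DELETE FROM FileTable WHERE path = ?", missing)
    conn.commit()
    return len(missing)
```

In this sketch pruning touches only FileTable. Because HashValueTable is keyed by content hash rather than by path, its entries can still be reused if the same file reappears under a different name.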

License

MIT
