# Doppix 🔍
Modular, production-grade image deduplication via perceptual hashing.
Doppix finds exact and near-duplicate images in large collections — fast, with zero ML inference overhead. It clusters images by perceptual similarity and gives you full control over what to do with duplicates: visualise, move, or delete them.
## Features

| Feature | Description |
|---|---|
| 🔀 4 hash algorithms | ahash, phash, dhash, whash — swap at any time |
| ⚡ Parallel hashing | Thread-pool I/O for large datasets |
| 📊 Rich result object | ClusterSet with stats, per-cluster metadata, summary |
| 🖼️ Contact sheets | One PNG per cluster for easy visual review |
| 📦 Safe IO | Move duplicates, delete them, or just dry_run first |
| 🔧 CLI + Python API | Use as a script or import into your own pipeline |
| 🧩 Modular | Every layer is independently importable and replaceable |
## Installation

```bash
pip install doppix

# or from source:
git clone https://github.com/tannousgeagea/doppix.git
cd doppix && pip install -e .
```
## Quick start

### Python API

```python
from doppix import Doppix

dp = Doppix(threshold=5, hasher="phash", num_workers=8)

result = dp.run(
    "./my_photos",
    recursive=True,
    visualize=True,                 # save cluster contact sheets
    transfer=True,                  # move duplicates to archive/
    destination_folder="./archive",
)

print(result.summary())
```
One-liner shortcut (no IO side-effects):
```python
from doppix import Doppix

result = Doppix.find_duplicates("./my_photos", threshold=3)
print(f"Found {result.num_duplicates} duplicates in {result.num_clusters} clusters")
```
### CLI

```bash
# Basic — print summary only
doppix ./my_photos

# Use perceptual hash, scan recursively, save contact sheets
doppix ./my_photos --hasher phash --recursive --visualize

# Move duplicates to a separate folder (preview first)
doppix ./my_photos --transfer --destination ./archive --dry-run
doppix ./my_photos --transfer --destination ./archive

# Permanently delete duplicates (use with care!)
doppix ./my_photos --delete

# All options
doppix --help
```
## Concepts

### Threshold
The threshold controls how different two images can be (in Hamming distance) and still be grouped together.
| Threshold | Catches |
|---|---|
| 0 | Byte-identical images only |
| 1–3 | Near-identical (minor compression artefacts) |
| 5–8 | Visually similar (slight crops, brightness changes) |
| 10+ | Broadly similar (same subject, different conditions) |
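The threshold is compared against the Hamming distance between two hashes: the number of bit positions where they differ. The snippet below is an illustrative stdlib-only sketch of that comparison on 64-bit hash values (the hash values are made up, and doppix's internal comparison may differ):

```python
# Hamming distance between two 64-bit perceptual hashes: the popcount
# of their XOR. Two images are grouped when this is <= the threshold.
def hamming(h1: int, h2: int) -> int:
    return bin(h1 ^ h2).count("1")

a = 0xF0F0F0F0F0F0F0F0
b = a ^ 0b111             # flip 3 bits -> distance 3

print(hamming(a, a))      # 0 -> only threshold 0 needed
print(hamming(a, b))      # 3 -> grouped at threshold >= 3
```

At threshold 0 only byte-identical hashes match; raising it widens each group, which is why high thresholds start to merge images that are merely similar.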
### Hash algorithms

| Name | Speed | Robustness | Best for |
|---|---|---|---|
| `ahash` | ⚡⚡⚡ | Medium | Large datasets, good default |
| `phash` | ⚡⚡ | High | Minor edits, JPEG re-compression |
| `dhash` | ⚡⚡⚡ | Medium | Cropped/shifted images |
| `whash` | ⚡ | High | Perceptually accurate matching |
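To make the idea concrete, here is a minimal average-hash (aHash) sketch over an already-prepared 8×8 grayscale grid. Real implementations (including the `imagehash` ones doppix wraps) first resize and grayscale the image; this toy version only shows the core rule — each bit is 1 iff the pixel is above the mean:

```python
# Minimal aHash sketch: bit i is 1 iff pixel i is brighter than the
# mean of the 8x8 grid. Illustrative only, not doppix's implementation.
def ahash_bits(grid):
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

# Toy "image": left half dark, right half bright.
grid = [[10, 10, 10, 10, 200, 200, 200, 200] for _ in range(8)]
h = ahash_bits(grid)
print(f"{h:016x}")   # 0f0f0f0f0f0f0f0f
```

Because the hash depends only on each pixel's relation to the mean, uniform brightness or contrast changes tend to leave it unchanged — the source of aHash's speed and its "medium" robustness.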
### ClusterSet

```python
result.num_clusters      # int — number of distinct groups
result.num_duplicates    # int — images that are not representatives
result.duplicate_ratio   # float — fraction of total that are duplicates
result.summary()         # pretty-printed stats string

for cluster in result:
    print(cluster.representative)  # path to the kept image
    print(cluster.duplicates)      # list of paths to deduplicate
    print(cluster.size)            # total images in this group
```
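JSON export is still on the roadmap, but the documented attributes are enough to roll your own. The sketch below uses stdlib `json` and hypothetical stand-in objects (`SimpleNamespace`) in place of real `Cluster` instances, which expose the same `representative`, `duplicates`, and `size` attributes:

```python
import json
from types import SimpleNamespace

# Hypothetical stand-ins for doppix Cluster objects; a real ClusterSet
# yields objects with these same documented attributes.
clusters = [
    SimpleNamespace(representative="a.jpg", duplicates=["a_copy.jpg"], size=2),
    SimpleNamespace(representative="b.jpg", duplicates=[], size=1),
]

def clusters_to_json(clusters) -> str:
    # Serialize each cluster's documented attributes to a JSON array.
    return json.dumps(
        [
            {
                "representative": c.representative,
                "duplicates": c.duplicates,
                "size": c.size,
            }
            for c in clusters
        ],
        indent=2,
    )

print(clusters_to_json(clusters))
```

Swap the stand-ins for `for cluster in result:` and the same function works on a real run.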
## Architecture

```text
doppix/
├── hashing/
│   └── hashers.py       # BaseHasher + AverageHasher, PHasher, DHasher, WHasher
├── clustering/
│   ├── models.py        # Cluster, ClusterSet data classes
│   └── cluster.py       # cluster_images() — parallel hash + greedy scan
├── io/
│   ├── loader.py        # collect_images() — recursive directory scan
│   └── transfer.py      # transfer_images(), delete_duplicates()
├── visualization/
│   └── plot.py          # visualize_clusters() — matplotlib contact sheets
├── pipeline/
│   └── runner.py        # Doppix — high-level orchestrator
└── __main__.py          # CLI entry point
```
Each layer is independently importable:
```python
from doppix.hashing import get_hasher
from doppix.clustering import cluster_images
from doppix.io import collect_images, transfer_images
from doppix.visualization import visualize_clusters
```
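The architecture tree describes `cluster.py` as a "parallel hash + greedy scan". A greedy scan over precomputed hashes can be sketched in a few lines of stdlib Python — this is an assumption about the approach for illustration, not doppix's actual code:

```python
# Greedy scan sketch: the first unassigned image starts a cluster and
# absorbs every later image within `threshold` Hamming distance of its
# hash. (Doppix additionally computes the hashes in a thread pool.)
def greedy_clusters(hashes, threshold):
    clusters, used = [], set()
    for i, (path_i, h_i) in enumerate(hashes):
        if path_i in used:
            continue
        members = [path_i]          # first member acts as representative
        used.add(path_i)
        for path_j, h_j in hashes[i + 1:]:
            if path_j not in used and bin(h_i ^ h_j).count("1") <= threshold:
                members.append(path_j)
                used.add(path_j)
        clusters.append(members)
    return clusters

hashes = [("a.jpg", 0b1111), ("b.jpg", 0b1110), ("c.jpg", 0b0000)]
print(greedy_clusters(hashes, threshold=2))
# [['a.jpg', 'b.jpg'], ['c.jpg']]
```

The scan is O(n²) in the worst case, which is why the roadmap mentions ANN-backed clustering (FAISS) for 1M+ image datasets.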
## Advanced usage

### Custom hasher
```python
import imagehash
from PIL import Image

from doppix.hashing.hashers import BaseHasher
from doppix.clustering import cluster_images


class MyHasher(BaseHasher):
    name = "myhash"

    def compute(self, path):
        with Image.open(path) as img:
            return imagehash.colorhash(img)


result = cluster_images(image_paths, hasher=MyHasher(), threshold=3)
```
### Pipeline without the runner
```python
from doppix.io import collect_images, transfer_images
from doppix.clustering import cluster_images

images = collect_images("/data/raw", recursive=True)
result = cluster_images(images, threshold=5, hasher_name="phash", num_workers=16)
transfer_images(result, "/data/archive", dry_run=False)
```
### Multiple folders
```python
from doppix import Doppix

folders = ["amk_bunker_impurity", "site_b_waste", "site_c_material"]

dp = Doppix(threshold=3, hasher="phash")
for folder in folders:
    dp.run(
        f"/data/{folder}",
        visualize=True,
        transfer=True,
        output_dir=f"/data/{folder}/clusters",
        destination_folder=f"/data/{folder}/archive",
    )
```
## Running tests

```bash
pip install -e ".[dev]"
pytest
```
## Roadmap
- ANN-backed clustering (FAISS) for 1M+ image datasets
- GPU hashing support
- JSON/CSV export of cluster results
- Web UI for cluster review
- Docker image
## License
MIT © Tannous Geagea
## File details: doppix-0.1.0.tar.gz

- Download URL: doppix-0.1.0.tar.gz
- Upload date:
- Size: 21.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `dd59749932e2ed9b89d5aa0b5ff806f066f570ac5ba6039f3e019af3d7732d61` |
| MD5 | `87a3a0531105d012b870c0ff8702fec4` |
| BLAKE2b-256 | `d6a4866f8b60f6f4764fbaa8b1e49b223fa2c879e5dfe5a7bf16257214c4a3e4` |
## File details: doppix-0.1.0-py3-none-any.whl

- Download URL: doppix-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3f2068188c2938df933f665a4aecdf9f37753d6afcdba90a542b21a821829bae` |
| MD5 | `a2c181cf08256733f9a6258e747bea2c` |
| BLAKE2b-256 | `b3d57e1c4fd338ad1c1f9a0b5e352c8574e4f6f7ab2b8725f30b704fdd1a19a9` |