Modular perceptual-hash based image deduplication library.

Project description

Doppix 🔍

Modular, production-grade image deduplication via perceptual hashing.

Doppix finds exact and near-duplicate images in large collections — fast, with zero ML inference overhead. It clusters images by perceptual similarity and gives you full control over what to do with duplicates: visualise, move, or delete them.


Features

  • 🔀 4 hash algorithms — ahash, phash, dhash, whash; swap at any time
  • Parallel hashing — thread-pool I/O for large datasets
  • 📊 Rich result object — ClusterSet with stats, per-cluster metadata, and a summary
  • 🖼️ Contact sheets — one PNG per cluster for easy visual review
  • 📦 Safe IO — move duplicates, delete them, or just dry_run first
  • 🔧 CLI + Python API — use as a script or import into your own pipeline
  • 🧩 Modular — every layer is independently importable and replaceable

Installation

pip install doppix
# or from source:
git clone https://github.com/tannousgeagea/doppix.git
cd doppix && pip install -e .

Quick start

Python API

from doppix import Doppix

dp = Doppix(threshold=5, hasher="phash", num_workers=8)
result = dp.run(
    "./my_photos",
    recursive=True,
    visualize=True,          # save cluster contact sheets
    transfer=True,           # move duplicates to archive/
    destination_folder="./archive",
)
print(result.summary())

One-liner shortcut (no IO side-effects):

from doppix import Doppix

result = Doppix.find_duplicates("./my_photos", threshold=3)
print(f"Found {result.num_duplicates} duplicates in {result.num_clusters} clusters")

CLI

# Basic — print summary only
doppix ./my_photos

# Use perceptual hash, scan recursively, save contact sheets
doppix ./my_photos --hasher phash --recursive --visualize

# Move duplicates to a separate folder (preview first)
doppix ./my_photos --transfer --destination ./archive --dry-run
doppix ./my_photos --transfer --destination ./archive

# Permanently delete duplicates (use with care!)
doppix ./my_photos --delete

# All options
doppix --help

Concepts

Threshold

The threshold controls how far apart two image hashes may be, measured in Hamming distance, while still being grouped into the same cluster.

Threshold   Catches
0           Identical hashes only (exact duplicates)
1–3         Near-identical (minor compression artefacts)
5–8         Visually similar (slight crops, brightness changes)
10+         Broadly similar (same subject, different conditions)
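
The comparison behind the threshold can be shown in a few lines of plain Python. This is an illustrative sketch, not doppix internals; with imagehash objects, `hash1 - hash2` computes the same quantity:

```python
# Perceptual hashes are 64-bit values; the distance between two images
# is the number of bit positions in which their hashes differ.
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

h1 = 0xFAD0123456789ABC
h2 = h1 ^ 0b1011  # flip three bits

hamming(h1, h1)  # 0 -> grouped at any threshold
hamming(h1, h2)  # 3 -> grouped at threshold >= 3, separate below that
```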

Hash algorithms

Name    Speed   Robustness   Best for
ahash   ⚡⚡⚡     Medium       Large datasets, good default
phash   ⚡⚡      High         Minor edits, JPEG re-compression
dhash   ⚡⚡⚡     Medium       Cropped/shifted images
whash           High         Perceptually accurate matching
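
To make the table concrete, here is the core idea behind ahash: shrink the image to 8×8 grayscale, then turn each pixel brighter than the mean into a 1 bit. A simplified sketch operating on an already-reduced 8×8 grid — not doppix's implementation, which builds on perceptual-hash primitives:

```python
def average_hash_bits(pixels):
    """64-bit average hash of an 8x8 grayscale grid (list of lists).

    Real implementations first resize the image to 8x8 and convert it
    to grayscale; each pixel brighter than the mean becomes a 1 bit.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (p > mean)
    return bits

# A grid whose top half is dark and bottom half is bright:
grid = [[0] * 8] * 4 + [[255] * 8] * 4
hex(average_hash_bits(grid))  # '0xffffffff' — the bottom 32 bits are set
```

Because only the relationship to the mean survives, uniform brightness or contrast changes barely move the hash — which is why small edits land within a low threshold.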

ClusterSet

result.num_clusters      # int — number of distinct groups
result.num_duplicates    # int — images that are not representatives
result.duplicate_ratio   # float — fraction of total that are duplicates
result.summary()         # pretty-printed stats string

for cluster in result:
    print(cluster.representative)  # path to the kept image
    print(cluster.duplicates)      # list of paths to deduplicate
    print(cluster.size)            # total images in this group

Architecture

doppix/
├── hashing/
│   └── hashers.py       # BaseHasher + AverageHasher, PHasher, DHasher, WHasher
├── clustering/
│   ├── models.py        # Cluster, ClusterSet data classes
│   └── cluster.py       # cluster_images() — parallel hash + greedy scan
├── io/
│   ├── loader.py        # collect_images() — recursive directory scan
│   └── transfer.py      # transfer_images(), delete_duplicates()
├── visualization/
│   └── plot.py          # visualize_clusters() — matplotlib contact sheets
├── pipeline/
│   └── runner.py        # Doppix — high-level orchestrator
└── __main__.py          # CLI entry point
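
The "parallel hash + greedy scan" in cluster.py suggests a single pass that assigns each image to the first cluster whose representative is within the threshold. A simplified sequential sketch of that greedy scan (assumed structure, not the library's actual code):

```python
def greedy_cluster(hashes, threshold):
    """Greedy single-pass clustering by Hamming distance.

    `hashes` maps image path -> integer hash. The first image seen in
    each group becomes its representative; later images join the first
    cluster whose representative is within `threshold` bits.
    """
    clusters = []  # list of (representative_hash, [member paths])
    for path, h in hashes.items():
        for rep, members in clusters:
            if bin(rep ^ h).count("1") <= threshold:
                members.append(path)
                break
        else:  # no close representative found: start a new cluster
            clusters.append((h, [path]))
    return clusters

demo = {"a.jpg": 0b0000, "b.jpg": 0b0001, "c.jpg": 0b1111}
greedy_cluster(demo, threshold=1)  # two clusters: {a, b} and {c}
```

Note that a greedy scan is order-dependent: an image close to two representatives joins whichever cluster was created first.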

Each layer is independently importable:

from doppix.hashing import get_hasher
from doppix.clustering import cluster_images
from doppix.io import collect_images, transfer_images
from doppix.visualization import visualize_clusters
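
For instance, collect_images presumably amounts to an extension-filtered directory scan. A pathlib sketch of that behaviour — the function name suffix `_sketch`, the extension set, and the exact semantics are assumptions, not the library's API:

```python
from pathlib import Path

# Assumed extension whitelist; the library's actual set may differ.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}

def collect_images_sketch(root, recursive=True):
    """Return sorted image paths under `root`, filtered by extension."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in Path(root).glob(pattern)
        if p.suffix.lower() in IMAGE_EXTS
    )
```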

Advanced usage

Custom hasher

from doppix.hashing.hashers import BaseHasher
import imagehash
from PIL import Image

class MyHasher(BaseHasher):
    name = "myhash"
    def compute(self, path):
        with Image.open(path) as img:
            return imagehash.colorhash(img)

from doppix.clustering import cluster_images
result = cluster_images(image_paths, hasher=MyHasher(), threshold=3)

Pipeline without the runner

from doppix.io import collect_images, transfer_images
from doppix.clustering import cluster_images

images = collect_images("/data/raw", recursive=True)
result = cluster_images(images, threshold=5, hasher_name="phash", num_workers=16)
transfer_images(result, "/data/archive", dry_run=False)
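
The num_workers parameter implies the hashing stage is fanned out over a thread pool — image decoding is I/O-heavy, so threads help despite the GIL. A generic stdlib sketch of that pattern, not doppix's code:

```python
from concurrent.futures import ThreadPoolExecutor

def hash_all(paths, compute_hash, num_workers=8):
    """Apply `compute_hash` to every path concurrently.

    Returns {path: hash}; `pool.map` preserves input order, so the
    zip pairs each path with its own result.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return dict(zip(paths, pool.map(compute_hash, paths)))

# Demo with `len` standing in for a real hash function:
hash_all(["a.jpg", "bb.png"], len, num_workers=2)
# {'a.jpg': 5, 'bb.png': 6}
```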

Multiple folders

from doppix import Doppix

folders = ["amk_bunker_impurity", "site_b_waste", "site_c_material"]
dp = Doppix(threshold=3, hasher="phash")

for folder in folders:
    dp.run(
        f"/data/{folder}",
        visualize=True,
        transfer=True,
        output_dir=f"/data/{folder}/clusters",
        destination_folder=f"/data/{folder}/archive",
    )

Running tests

pip install -e ".[dev]"
pytest

Roadmap

  • ANN-backed clustering (FAISS) for 1M+ image datasets
  • GPU hashing support
  • JSON/CSV export of cluster results
  • Web UI for cluster review
  • Docker image

License

MIT © Tannous Geagea

Project details


Download files

Download the file for your platform.

Source Distribution

doppix-0.1.0.tar.gz (21.1 kB)

Uploaded Source

Built Distribution


doppix-0.1.0-py3-none-any.whl (22.0 kB)

Uploaded Python 3

File details

Details for the file doppix-0.1.0.tar.gz.

File metadata

  • Download URL: doppix-0.1.0.tar.gz
  • Upload date:
  • Size: 21.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for doppix-0.1.0.tar.gz
Algorithm Hash digest
SHA256 dd59749932e2ed9b89d5aa0b5ff806f066f570ac5ba6039f3e019af3d7732d61
MD5 87a3a0531105d012b870c0ff8702fec4
BLAKE2b-256 d6a4866f8b60f6f4764fbaa8b1e49b223fa2c879e5dfe5a7bf16257214c4a3e4


File details

Details for the file doppix-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: doppix-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 22.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for doppix-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3f2068188c2938df933f665a4aecdf9f37753d6afcdba90a542b21a821829bae
MD5 a2c181cf08256733f9a6258e747bea2c
BLAKE2b-256 b3d57e1c4fd338ad1c1f9a0b5e352c8574e4f6f7ab2b8725f30b704fdd1a19a9

