imgraft
Semantic image dataset curation via CLIP + HDBSCAN.
imgraft helps you clean, balance, and curate any messy image dataset — no labels, no manual annotation needed. Give it a folder of images and it returns a semantically balanced, deduplicated subset with full visualizations.
How it works
Five stages, all automatic:
- Embed: every image gets a CLIP vector (e.g. ViT-B/32 or ViT-L/14)
- Cluster: UMAP first reduces the embedding dimensionality for speed, then HDBSCAN finds semantic groups
- Curate: pick a strategy (centroid, diversity, text-query, or drop-noise)
- Inspect: one PNG contact sheet per cluster for visual verification
- Export: structured core/, diverse/, and noise/ folders, ready for training
Why imgraft?
Raw image datasets are messy. Web scrapes contain near-duplicates. Crawled sets are class-imbalanced. Annotators mislabel. Training on noisy data silently destroys model performance.
imgraft solves this using semantic similarity — it embeds every image with CLIP, discovers visual groups with HDBSCAN, and selects the best representatives per cluster. No domain assumptions. Works on any image type.
Real-world impact: On a production OCR project, an automated CLIP + HDBSCAN curation pipeline recovered model accuracy from 2% (after discovering 100K+ mislabeled images) to 90% on first retrain, reaching 95% through iteration.
Install
pip install imgraft
For PNG visualization support:
pip install imgraft[vis]
Quick start
Python API
from imgraft import Curator
curator = Curator(model="ViT-B/32") # or "ViT-L/14" for higher accuracy
result = curator.run(
    image_dir="./raw_images/",
    keep_ratio=0.25,        # keep 25% of dataset
    strategy="diversity",   # centroid | diversity | text-query | drop-noise
    drop_noise=True,        # remove OOD/noise images
)
# ── verify clusters visually ───────────────────────────────────────────────────
result.inspect(
    output_dir="./cluster_grids/",  # one PNG contact sheet per cluster
    n_per_side=5,                   # 5 centroid + 5 random thumbnails side by side
)
# open cluster_000.png, cluster_001.png etc. — visually verify before committing
# ── export structured dataset ──────────────────────────────────────────────────
result.export_clusters(
    output_dir="./dataset/",
    n_core=50,      # N most representative images per cluster
    n_diverse=50,   # N most diverse images per cluster
)
# ── interactive UMAP explorer ──────────────────────────────────────────────────
result.plot("clusters.html") # hover any point to see the image
print(result.stats())
# {'total': 8211, 'kept': 1642, 'clusters': 47, 'noise': 312, ...}
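How `keep_ratio` is divided among clusters is not documented here. As a rough sketch of one way a balanced budget could be allocated, slots can be handed out round-robin so that small clusters are not starved by large ones. The `allocate_quota` helper is hypothetical and not part of imgraft; it also assumes noise points (label `-1`, the HDBSCAN convention) are excluded from the budget.

```python
import numpy as np


def allocate_quota(labels, keep_ratio, noise_label=-1):
    """Split a global keep budget across clusters as evenly as sizes allow.

    Hypothetical helper for illustration; imgraft may allocate differently.
    """
    labels = np.asarray(labels)
    clustered = labels[labels != noise_label]
    ids, sizes = np.unique(clustered, return_counts=True)

    budget = int(round(keep_ratio * clustered.size))
    quota = dict.fromkeys(ids.tolist(), 0)
    remaining = dict(zip(ids.tolist(), sizes.tolist()))

    # Hand out one slot per cluster per pass, skipping exhausted clusters.
    while budget > 0 and any(remaining.values()):
        for cid in quota:
            if budget == 0:
                break
            if remaining[cid] > 0:
                quota[cid] += 1
                remaining[cid] -= 1
                budget -= 1
    return quota


if __name__ == "__main__":
    labels = [0] * 10 + [1] * 2 + [-1] * 3
    print(allocate_quota(labels, keep_ratio=0.5))
```

With this scheme a cluster of 2 images and a cluster of 10 images sharing a budget of 6 end up with 2 and 4 slots, rather than 1 and 5 under purely proportional sampling.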
CLI
# basic — keep 25% using diversity sampling
imgraft run ./images/ --keep 0.25 --out ./curated/
# with interactive visualization
imgraft run ./images/ --keep 0.25 --visualize --out ./curated/
# filter by text query (zero-shot, no labels needed)
imgraft run ./images/ \
  --strategy text-query \
  --query "a clear front-facing product photo on white background" \
  --out ./curated/
# drop noise/OOD images only, keep everything else
imgraft run ./images/ --strategy drop-noise --out ./cleaned/
# inspect dataset structure before curating
imgraft info ./images/
Cluster inspection
Before exporting your dataset, visually verify what each cluster contains:
result.inspect(output_dir="./cluster_grids/", n_per_side=5)
Each PNG contact sheet shows two sides:
| Left | Right |
|---|---|
| Core — closest to cluster centroid | Random sample from the cluster |
This lets you spot bad clusters at a glance: if cluster_003 is all blurry images or mislabeled examples, you know to drop it before training.
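To make the two-sided layout concrete, here is a minimal sketch of how such a sheet could be assembled from already-resized thumbnails held as NumPy arrays. The `tile` and `contact_sheet` helpers, the white gap between the two sides, and the single-row layout are all assumptions for illustration, not imgraft's actual rendering code.

```python
import numpy as np


def tile(thumbs, n_cols):
    """Arrange equally sized HxWx3 uint8 thumbnails into a grid, row by row.

    Assumes a non-empty list of thumbnails that all share one shape.
    """
    thumbs = list(thumbs)
    h, w, c = thumbs[0].shape
    n_rows = -(-len(thumbs) // n_cols)  # ceiling division
    grid = np.full((n_rows * h, n_cols * w, c), 255, dtype=np.uint8)
    for i, thumb in enumerate(thumbs):
        row, col = divmod(i, n_cols)
        grid[row * h:(row + 1) * h, col * w:(col + 1) * w] = thumb
    return grid


def contact_sheet(core, random_sample, n_cols=5, gap=8):
    """Place the core grid on the left and the random grid on the right."""
    left = tile(core, n_cols)
    right = tile(random_sample, n_cols)
    height = max(left.shape[0], right.shape[0])
    width = left.shape[1] + gap + right.shape[1]
    sheet = np.full((height, width, 3), 255, dtype=np.uint8)
    sheet[:left.shape[0], :left.shape[1]] = left
    sheet[:right.shape[0], left.shape[1] + gap:] = right
    return sheet
```

The resulting array could then be written to disk with an imaging library, for example `PIL.Image.fromarray(sheet).save("cluster_000.png")` if Pillow is installed.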
Structured export
result.export_clusters(
    output_dir="./dataset/",
    n_core=50,
    n_diverse=50,
)
Output layout:
dataset/
  cluster_000/
    core/       ← centroid-closest (most representative)
    diverse/    ← furthest-point sampled (max variety)
  cluster_001/
    core/
    diverse/
  ...
  noise/        ← all HDBSCAN outliers, isolated
  export_summary.json
core/ is best for training classifiers. diverse/ maximises variety and removes near-duplicates. noise/ gives a clean view of OOD images to review or discard.
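The difference between the two folders can be illustrated on the embeddings of a single cluster. The sketch below is not imgraft's implementation: `select_core` and `select_diverse` are hypothetical names, cosine similarity on L2-normalized vectors is an assumed metric, and seeding the furthest-point search at the most central image is one choice among several.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def select_core(emb, n):
    """Indices of the n embeddings most similar to the cluster centroid."""
    emb = l2_normalize(np.asarray(emb, dtype=np.float64))
    sims = emb @ emb.mean(axis=0)
    return np.argsort(-sims)[:n].tolist()


def select_diverse(emb, n):
    """Greedy furthest-point sampling, seeded at the most central image."""
    emb = l2_normalize(np.asarray(emb, dtype=np.float64))
    n = min(n, len(emb))
    if n <= 0:
        return []
    first = int(np.argmax(emb @ emb.mean(axis=0)))
    chosen = [first]
    # min_dist[i] is the cosine distance from i to its nearest chosen point.
    min_dist = 1.0 - emb @ emb[first]
    min_dist[first] = -np.inf  # never pick the same image twice
    while len(chosen) < n:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - emb @ emb[nxt])
        min_dist[nxt] = -np.inf
    return chosen


if __name__ == "__main__":
    # Rows 0 and 1 are near-duplicates; row 3 sits closest to the centroid.
    cluster = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
    print("core:   ", select_core(cluster, 2))
    print("diverse:", select_diverse(cluster, 3))
```

Because each new pick maximizes distance to everything already chosen, a near-duplicate of an already selected image is the last candidate to be picked, which is why furthest-point sampling tends to thin out duplicates.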
Curation strategies
| Strategy | Best for | How it works |
|---|---|---|
| diversity | Removing near-duplicates, max variety | Greedy furthest-point sampling per cluster |
| centroid | Balanced, representative subsets | Keeps images closest to each cluster center |
| text-query | Domain-specific filtering, no labels | CLIP zero-shot similarity to a text prompt |
| drop-noise | Quick clean without size reduction | Removes HDBSCAN outliers only |
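The text-query and drop-noise rows reduce to very little code once embeddings and cluster labels exist. The following sketch assumes the image and text vectors were already produced by the same CLIP model and that noise carries the label `-1`, as HDBSCAN implementations conventionally assign. `rank_by_text` and `drop_noise` are hypothetical helper names, not imgraft functions.

```python
import numpy as np


def rank_by_text(image_emb, text_emb, keep_ratio):
    """Keep the top fraction of images by cosine similarity to a text vector."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    scores = image_emb @ text_emb
    n_keep = max(1, int(round(keep_ratio * len(image_emb))))
    return np.argsort(-scores)[:n_keep].tolist()


def drop_noise(labels, noise_label=-1):
    """Indices of all images that were assigned to a cluster."""
    labels = np.asarray(labels)
    return np.flatnonzero(labels != noise_label).tolist()


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    query = [1.0, 0.1]
    print(rank_by_text(images, query, keep_ratio=0.5))
    print(drop_noise([0, -1, 1, -1, 0]))
```

A fixed similarity threshold would be an alternative to a keep ratio; which one imgraft uses for text-query is not stated here.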
Works on any domain
- Web-scraped datasets — deduplicate crawled images, remove OOD noise
- Medical imaging — balance X-ray / pathology / dermoscopy class distributions
- Satellite / aerial — curate geospatial image sets by region and content type
- E-commerce / products — deduplicate product catalogs by visual similarity
- Industrial / manufacturing — balance defect vs. normal in inspection datasets
- Document / form images — group by layout type, sample representative subset
- General ML training prep — quality control before sending to annotation
Backbone options
| Model | Embedding dim | Speed | Quality |
|---|---|---|---|
| ViT-B/32 | 512 | ⚡ Fast | Good |
| ViT-B/16 | 512 | Medium | Better |
| ViT-L/14 | 768 | Slower | High |
| ViT-H/14 | 1024 | Slow | Highest |
Switch with --model ViT-L/14 or Curator(model="ViT-L/14").
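The embedding dimension also determines how much memory and disk the vectors take. A back-of-the-envelope estimate, assuming embeddings are stored as 32-bit floats (an assumption; the storage format is not specified here), can be computed as follows. `embedding_megabytes` is a hypothetical helper.

```python
# Embedding dimensions taken from the backbone table above.
EMBED_DIM = {
    "ViT-B/32": 512,
    "ViT-B/16": 512,
    "ViT-L/14": 768,
    "ViT-H/14": 1024,
}


def embedding_megabytes(n_images, model, bytes_per_value=4):
    """Approximate size of all embeddings in MB (1 MB = 10**6 bytes)."""
    return n_images * EMBED_DIM[model] * bytes_per_value / 1e6


if __name__ == "__main__":
    for name in EMBED_DIM:
        print(f"{name}: {embedding_megabytes(8211, name):.1f} MB")
```

For the 8,211-image example from the quick start, this works out to roughly 17 MB with ViT-B/32 and about 25 MB with ViT-L/14.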
Embedding cache
Re-embedding large folders is slow. Cache embeddings to disk so reruns skip it:
imgraft run ./images/ --cache ./.imgraft_cache/ --keep 0.25
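The on-disk format of the cache is not described here. As an illustration of the general idea only, a content-addressed cache could key each vector by a hash of the image bytes plus the model name, so that renamed files still hit the cache and switching backbones does not return stale vectors. `cached_embedding` and its `embed_fn` argument are hypothetical.

```python
import hashlib
from pathlib import Path

import numpy as np


def cached_embedding(image_path, cache_dir, embed_fn, model="ViT-B/32"):
    """Return the embedding for image_path, computing it only on a cache miss.

    embed_fn is any callable mapping a path to a 1-D vector; in practice it
    would wrap a CLIP image encoder.
    """
    image_path, cache_dir = Path(image_path), Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key on model name + file contents, not on the file name.
    key = hashlib.sha256(model.encode() + b"\0" + image_path.read_bytes())
    cache_file = cache_dir / f"{key.hexdigest()}.npy"

    if cache_file.exists():
        return np.load(cache_file)

    vec = np.asarray(embed_fn(image_path), dtype=np.float32)
    np.save(cache_file, vec)
    return vec
```

Hashing file contents costs one read per image, which is typically far cheaper than a forward pass through the vision encoder.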
Visualizations
Interactive HTML (result.plot("clusters.html")):
- UMAP scatter colored by cluster
- Hover any point → see the image thumbnail
- Kept images at full opacity, dropped faded
Cluster grids (result.inspect("./grids/")):
- One PNG per cluster — centroid sample vs random sample side by side
- Visual verification before committing to training
License
MIT
File details
Details for the file imgraft-0.1.0.tar.gz.
File metadata
- Download URL: imgraft-0.1.0.tar.gz
- Upload date:
- Size: 9.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c13a32abb0c81a809c54aba3e4a8293d9c487638dd3e4ac4d46c33886928c265 |
| MD5 | dd476d1dfd777910a9c10a85b00e339e |
| BLAKE2b-256 | 34a0ab776594142af3486fc24505861fb1bf8383e29d195fae699d4837260c1d |
File details
Details for the file imgraft-0.1.0-py3-none-any.whl.
File metadata
- Download URL: imgraft-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 859dcbef8c52077bc10825e72bebe5679bd5b311e24751f18f746af65a620dca |
| MD5 | 818060aa50e06dc4cc0ba55dc587d16c |
| BLAKE2b-256 | 8e767ebf0dbd9a4cb5dcaeb14a566525a84c33306f2287f6a97490a79fb03661 |