imgraft

Semantic image dataset curation via CLIP + HDBSCAN.

imgraft helps you clean, balance, and curate any messy image dataset — no labels, no manual annotation needed. Give it a folder of images and it returns a semantically balanced, deduplicated subset with full visualizations.

How it works

imgraft pipeline

Five stages, all automatic:

Embed: every image gets a CLIP vector (ViT-B/32, ViT-B/16, ViT-L/14, or ViT-H/14)
Cluster: UMAP pre-reduces the embeddings for speed, then HDBSCAN finds semantic groups
Curate: choose a strategy (centroid, diversity, text-query, or drop-noise)
Inspect: contact sheet PNGs per cluster for visual verification
Export: structured core/ + diverse/ + noise/ folders, ready for training

Why imgraft?

Raw image datasets are messy. Web scrapes contain near-duplicates. Crawled sets are class-imbalanced. Annotators mislabel. Training on noisy data silently destroys model performance.

imgraft solves this using semantic similarity — it embeds every image with CLIP, discovers visual groups with HDBSCAN, and selects the best representatives per cluster. No domain assumptions. Works on any image type.

Real-world impact: On a production OCR project, an automated CLIP + HDBSCAN curation pipeline recovered model accuracy from 2% (after discovering 100K+ mislabeled images) to 90% on first retrain, reaching 95% through iteration.

Install

pip install imgraft

For PNG visualization support:

pip install "imgraft[vis]"

Quick start

Python API

from imgraft import Curator

curator = Curator(model="ViT-B/32")   # or "ViT-L/14" for higher accuracy

result = curator.run(
    image_dir="./raw_images/",
    keep_ratio=0.25,           # keep 25% of dataset
    strategy="diversity",      # centroid | diversity | text-query | drop-noise
    drop_noise=True,           # remove OOD/noise images
)

# ── verify clusters visually ───────────────────────────────────────────────────
result.inspect(
    output_dir="./cluster_grids/",   # one PNG contact sheet per cluster
    n_per_side=5,                    # 5 centroid + 5 random thumbnails side by side
)
# open cluster_000.png, cluster_001.png etc. — visually verify before committing

# ── export structured dataset ──────────────────────────────────────────────────
result.export_clusters(
    output_dir="./dataset/",
    n_core=50,      # N most representative images per cluster
    n_diverse=50,   # N most diverse images per cluster
)

# ── interactive UMAP explorer ──────────────────────────────────────────────────
result.plot("clusters.html")   # hover any point to see the image

print(result.stats())
# {'total': 8211, 'kept': 1642, 'clusters': 47, 'noise': 312, ...}

CLI

# basic — keep 25% using diversity sampling
imgraft run ./images/ --keep 0.25 --out ./curated/

# with interactive visualization
imgraft run ./images/ --keep 0.25 --visualize --out ./curated/

# filter by text query (zero-shot, no labels needed)
imgraft run ./images/ \
  --strategy text-query \
  --query "a clear front-facing product photo on white background" \
  --out ./curated/

# drop noise/OOD images only, keep everything else
imgraft run ./images/ --strategy drop-noise --out ./cleaned/

# inspect dataset structure before curating
imgraft info ./images/
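
The text-query strategy ranks images by CLIP similarity to a prompt. A minimal sketch of that ranking step, assuming the image and text embeddings are already computed. `text_query_filter`, the way `keep_ratio` is applied, and the toy vectors are hypothetical and not taken from imgraft's API.

```python
import numpy as np


def text_query_filter(image_embeddings, text_embedding, keep_ratio=0.25):
    """Return indices of the images most similar to the text embedding."""
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embedding / np.linalg.norm(text_embedding)
    scores = img @ txt  # cosine similarity, one score per image
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    order = np.argsort(-scores)  # highest similarity first
    return order[:n_keep], scores


# Toy vectors standing in for real CLIP embeddings (assumption).
image_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
text_embedding = np.array([1.0, 0.0, 0.0])

kept, scores = text_query_filter(image_embeddings, text_embedding, keep_ratio=0.5)
print(kept)  # indices of the two images closest to the prompt
```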

Cluster inspection

Before exporting your dataset, visually verify what each cluster contains:

result.inspect(output_dir="./cluster_grids/", n_per_side=5)

Each PNG contact sheet shows two sides:

| Left | Right |
|---|---|
| Core: closest to cluster centroid | Random sample from the cluster |

This lets you spot bad clusters at a glance: if cluster_003 is all blurry images or mislabeled examples, you know to drop it before training.
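
How the two sides of a contact sheet could be chosen, as a rough sketch: the core side takes the members nearest the cluster centroid and the other side takes a uniform random sample. `core_and_random` is a hypothetical helper, and Euclidean distance to the mean embedding is an assumption about how "closest to centroid" is measured.

```python
import numpy as np


def core_and_random(embeddings, member_idx, n_per_side=5, seed=0):
    """Pick the members nearest the cluster centroid plus a random sample."""
    members = embeddings[member_idx]
    centroid = members.mean(axis=0)
    dist = np.linalg.norm(members - centroid, axis=1)
    core = member_idx[np.argsort(dist)[:n_per_side]]
    rng = np.random.default_rng(seed)
    k = min(n_per_side, len(member_idx))
    random_pick = rng.choice(member_idx, size=k, replace=False)
    return core, random_pick


# Five 1-D "embeddings"; the value 10.0 is an outlier inside the cluster.
embeddings = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
member_idx = np.arange(5)

core, random_pick = core_and_random(embeddings, member_idx, n_per_side=2)
print(core)         # the two members nearest the centroid (3.2)
print(random_pick)  # two members drawn without replacement
```

Comparing the two sides is what makes the check useful: a cluster whose core looks clean but whose random sample looks mixed is probably too broad.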

Structured export

result.export_clusters(
    output_dir="./dataset/",
    n_core=50,
    n_diverse=50,
)

Output layout:

dataset/
  cluster_000/
    core/           ← centroid-closest (most representative)
    diverse/        ← furthest-point sampled (max variety)
  cluster_001/
    core/
    diverse/
  ...
  noise/            ← all HDBSCAN outliers, isolated
  export_summary.json

core/ is best for training classifiers. diverse/ maximises variety and removes near-duplicates. noise/ gives a clean view of OOD images to review or discard.
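
The layout above can be reproduced with the standard library. This sketch assumes the per-cluster selections are already known; `export_layout` is a hypothetical helper, and the JSON schema it writes is invented for illustration because the real contents of export_summary.json are not documented here.

```python
import json
import shutil
import tempfile
from pathlib import Path


def export_layout(output_dir, clusters, noise):
    """Copy files into cluster_XXX/core, cluster_XXX/diverse and noise/.

    clusters: {cluster_id: {"core": [paths], "diverse": [paths]}}
    noise:    [paths]
    """
    out = Path(output_dir)
    summary = {"clusters": {}, "noise": len(noise)}
    for cid, parts in sorted(clusters.items()):
        name = f"cluster_{cid:03d}"
        for part in ("core", "diverse"):
            dest = out / name / part
            dest.mkdir(parents=True, exist_ok=True)
            for src in parts[part]:
                shutil.copy2(src, dest / Path(src).name)
        summary["clusters"][name] = {p: len(parts[p]) for p in ("core", "diverse")}
    noise_dir = out / "noise"
    noise_dir.mkdir(parents=True, exist_ok=True)
    for src in noise:
        shutil.copy2(src, noise_dir / Path(src).name)
    # Hypothetical schema: the real export_summary.json format is not documented.
    (out / "export_summary.json").write_text(json.dumps(summary, indent=2))
    return summary


# Demo with empty placeholder files instead of real images.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "raw"
    src_dir.mkdir()
    files = []
    for i in range(4):
        path = src_dir / f"img_{i}.jpg"
        path.write_bytes(b"")
        files.append(path)
    dataset = Path(tmp) / "dataset"
    summary = export_layout(
        dataset,
        clusters={0: {"core": files[:1], "diverse": files[1:2]}},
        noise=files[2:],
    )
    exported = sorted(
        p.relative_to(dataset).as_posix() for p in dataset.rglob("*") if p.is_file()
    )

print(exported)
```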

Curation strategies

| Strategy | Best for | How it works |
|---|---|---|
| `diversity` | Removing near-duplicates, max variety | Greedy furthest-point sampling per cluster |
| `centroid` | Balanced, representative subsets | Keeps images closest to each cluster center |
| `text-query` | Domain-specific filtering, no labels | CLIP zero-shot similarity to a text prompt |
| `drop-noise` | Quick clean without size reduction | Removes HDBSCAN outliers only |
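
Greedy furthest-point sampling, the technique named for the diversity strategy, can be sketched in a few lines of NumPy. `furthest_point_sample` is a hypothetical helper; starting from index 0 and using Euclidean distance are assumptions, and imgraft may seed and measure differently.

```python
import numpy as np


def furthest_point_sample(points, k, start=0):
    """Greedy furthest-point sampling: each pick is the point furthest from all picks so far."""
    if k <= 0 or len(points) == 0:
        return []
    k = min(k, len(points))
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return chosen


# Index 1 is a near-duplicate of index 0.
points = np.array([[0.0, 0.0], [0.01, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
picked = furthest_point_sample(points, k=3)
print(picked)  # [0, 4, 2]
```

Because each pick maximises the distance to what is already selected, a near-duplicate such as index 1 is only chosen once everything else has been taken, which is why this strategy suits deduplication.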

Works on any domain

Web-scraped datasets — deduplicate crawled images, remove OOD noise
Medical imaging — balance X-ray / pathology / dermoscopy class distributions
Satellite / aerial — curate geospatial image sets by region and content type
E-commerce / products — deduplicate product catalogs by visual similarity
Industrial / manufacturing — balance defect vs. normal in inspection datasets
Document / form images — group by layout type, sample representative subset
General ML training prep — quality control before sending to annotation

Backbone options

| Model | Embedding dim | Speed | Quality |
|---|---|---|---|
| `ViT-B/32` | 512 | ⚡ Fast | Good |
| `ViT-B/16` | 512 | Medium | Better |
| `ViT-L/14` | 768 | Slower | High |
| `ViT-H/14` | 1024 | Slow | Highest |

Switch with --model ViT-L/14 or Curator(model="ViT-L/14").

Embedding cache

Re-embedding large folders is slow. Cache embeddings to disk so reruns skip it:

imgraft run ./images/ --cache ./.imgraft_cache/ --keep 0.25
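
One way such a cache can work, as a sketch under stated assumptions: embeddings are stored as .npy files keyed by a hash of the image bytes, so a rerun only embeds files it has not seen. `cached_embedding` and `fake_embed` are hypothetical; imgraft's actual cache format and key are not documented here.

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np


def cached_embedding(image_path, cache_dir, embed_fn):
    """Return the embedding for image_path, computing it only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key on file contents so renamed or moved files still hit the cache.
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.npy"
    if cache_file.exists():
        return np.load(cache_file)
    vector = np.asarray(embed_fn(image_path), dtype=np.float32)
    np.save(cache_file, vector)
    return vector


calls = []


def fake_embed(path):
    """Stand-in for a CLIP forward pass; records each call."""
    calls.append(str(path))
    return np.arange(4)


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "a.jpg"
    image.write_bytes(b"not a real image")
    first = cached_embedding(image, Path(tmp) / "cache", fake_embed)
    second = cached_embedding(image, Path(tmp) / "cache", fake_embed)

print(len(calls))  # the embed function runs once; the second call reads from disk
```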

Visualizations

Interactive HTML (result.plot("clusters.html")):

UMAP scatter colored by cluster
Hover any point → see the image thumbnail
Kept images at full opacity, dropped faded

Cluster grids (result.inspect("./grids/")):

One PNG per cluster — centroid sample vs random sample side by side
Visual verification before committing to training
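
A contact sheet of this kind can be assembled with Pillow. The sketch below assumes one column of thumbnails per side with a white gap between them; the real grid geometry that imgraft renders is not specified here, and `contact_sheet` is a hypothetical helper.

```python
from PIL import Image


def contact_sheet(left, right, thumb=64, gap=8):
    """Place two columns of square thumbnails side by side on a white canvas."""
    rows = max(len(left), len(right))
    sheet = Image.new("RGB", (2 * thumb + gap, rows * thumb), "white")
    for col, images in enumerate((left, right)):
        x = col * (thumb + gap)
        for row, img in enumerate(images):
            tile = img.convert("RGB").resize((thumb, thumb))
            sheet.paste(tile, (x, row * thumb))
    return sheet


# Solid-colour placeholders instead of real cluster images.
core = [Image.new("RGB", (100, 80), "red") for _ in range(3)]
random_pick = [Image.new("RGB", (50, 50), "blue") for _ in range(3)]

sheet = contact_sheet(core, random_pick)
print(sheet.size)  # (136, 192)
```

Calling `sheet.save("cluster_000.png")` would write the grid to disk in the same naming style as the files mentioned earlier.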

License

MIT

Version 0.1.0, released May 11, 2026
