Semantic image dataset curation using CLIP + HDBSCAN

imgraft

Semantic image dataset curation via CLIP + HDBSCAN.

License: MIT | Python 3.9+

imgraft helps you clean, balance, and curate any messy image dataset — no labels, no manual annotation needed. Give it a folder of images and it returns a semantically balanced, deduplicated subset with full visualizations.


How it works

[Figure: imgraft pipeline]

Five stages, all automatic:

  1. Embed: every image gets a CLIP vector (e.g. ViT-B/32 or ViT-L/14)
  2. Cluster: UMAP first reduces dimensionality for speed, then HDBSCAN finds semantic groups
  3. Curate: choose a strategy (centroid, diversity, text-query, or drop-noise)
  4. Inspect: contact-sheet PNGs per cluster for visual verification
  5. Export: structured core/ + diverse/ + noise/ folders, ready for training

Why imgraft?

Raw image datasets are messy. Web scrapes contain near-duplicates. Crawled sets are class-imbalanced. Annotators mislabel. Training on noisy data silently destroys model performance.

imgraft solves this using semantic similarity — it embeds every image with CLIP, discovers visual groups with HDBSCAN, and selects the best representatives per cluster. No domain assumptions. Works on any image type.

Real-world impact: On a production OCR project, an automated CLIP + HDBSCAN curation pipeline recovered model accuracy from 2% (after discovering 100K+ mislabeled images) to 90% on first retrain, reaching 95% through iteration.


Install

pip install imgraft

For PNG visualization support:

pip install "imgraft[vis]"   # quotes keep shells such as zsh from globbing the brackets

Quick start

Python API

from imgraft import Curator

curator = Curator(model="ViT-B/32")   # or "ViT-L/14" for higher accuracy

result = curator.run(
    image_dir="./raw_images/",
    keep_ratio=0.25,           # keep 25% of dataset
    strategy="diversity",      # centroid | diversity | text-query | drop-noise
    drop_noise=True,           # remove OOD/noise images
)

# ── verify clusters visually ───────────────────────────────────────────────────
result.inspect(
    output_dir="./cluster_grids/",   # one PNG contact sheet per cluster
    n_per_side=5,                    # 5 centroid + 5 random thumbnails side by side
)
# open cluster_000.png, cluster_001.png etc. — visually verify before committing

# ── export structured dataset ──────────────────────────────────────────────────
result.export_clusters(
    output_dir="./dataset/",
    n_core=50,      # N most representative images per cluster
    n_diverse=50,   # N most diverse images per cluster
)

# ── interactive UMAP explorer ──────────────────────────────────────────────────
result.plot("clusters.html")   # hover any point to see the image

print(result.stats())
# {'total': 8211, 'kept': 1642, 'clusters': 47, 'noise': 312, ...}

CLI

# basic — keep 25% using diversity sampling
imgraft run ./images/ --keep 0.25 --out ./curated/

# with interactive visualization
imgraft run ./images/ --keep 0.25 --visualize --out ./curated/

# filter by text query (zero-shot, no labels needed)
imgraft run ./images/ \
  --strategy text-query \
  --query "a clear front-facing product photo on white background" \
  --out ./curated/

# drop noise/OOD images only, keep everything else
imgraft run ./images/ --strategy drop-noise --out ./cleaned/

# inspect dataset structure before curating
imgraft info ./images/

Cluster inspection

Before exporting your dataset, visually verify what each cluster contains:

result.inspect(output_dir="./cluster_grids/", n_per_side=5)

Each PNG contact sheet shows two sides:

| Left | Right |
| --- | --- |
| Core: closest to cluster centroid | Random sample from the cluster |

This lets you spot bad clusters at a glance: if cluster_003 is all blurry or mislabeled images, you know to drop it before training.
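To make the two sides of a contact sheet concrete, here is a minimal sketch of how the images for one cluster could be chosen. It uses toy 2-D vectors in place of CLIP embeddings; the function names, the Euclidean distance, and the seeded random draw are illustrative assumptions, not imgraft's actual implementation.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def contact_sheet_indices(vectors, n_per_side=5, seed=0):
    """Return (core, sample) index lists for one cluster.

    core   : the n_per_side members nearest the cluster centroid
    sample : up to n_per_side members drawn uniformly at random
    """
    c = centroid(vectors)
    # Sort member indices by Euclidean distance to the centroid (assumed metric).
    by_distance = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], c))
    core = by_distance[:n_per_side]
    rng = random.Random(seed)  # seeded so the sheet is reproducible
    sample = rng.sample(range(len(vectors)), min(n_per_side, len(vectors)))
    return core, sample


# Toy cluster: five tightly grouped points plus one stray member at (5, 5).
cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (5, 5)]
core, sample = contact_sheet_indices(cluster, n_per_side=2)
print("core:", core, "random:", sample)
```

The stray point never appears on the core side, but it can show up in the random side, which is why comparing the two halves helps reveal clusters that are less clean than their centroid suggests.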


Structured export

result.export_clusters(
    output_dir="./dataset/",
    n_core=50,
    n_diverse=50,
)

Output layout:

dataset/
  cluster_000/
    core/           ← centroid-closest (most representative)
    diverse/        ← furthest-point sampled (max variety)
  cluster_001/
    core/
    diverse/
  ...
  noise/            ← all HDBSCAN outliers, isolated
  export_summary.json

core/ is best for training classifiers. diverse/ maximises variety and removes near-duplicates. noise/ gives a clean view of OOD images to review or discard.
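The diverse/ folder is described as furthest-point sampled. The sketch below shows greedy furthest-point sampling on toy 2-D vectors. The function name, the Euclidean distance, and the fixed starting index are assumptions made for illustration; imgraft's own selection code may differ.

```python
import math


def farthest_point_sample(vectors, k, start=0):
    """Greedy furthest-point sampling.

    Repeatedly picks the point whose distance to its nearest
    already-chosen point is largest. Returns the chosen indices in order.
    """
    k = min(k, len(vectors))
    if k == 0:
        return []
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen point.
    min_dist = [math.dist(v, vectors[start]) for v in vectors]
    min_dist[start] = float("-inf")  # never pick the same point twice
    while len(chosen) < k:
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(v, vectors[nxt]))
        min_dist[nxt] = float("-inf")
    return chosen


# Three near-duplicates around the origin, two around (10, 0), one at (0, 10).
points = [(0, 0), (0.1, 0), (0, 0.1), (10, 0), (10, 0.1), (0, 10)]
picked = farthest_point_sample(points, k=3)
print(picked)
```

With k=3 the sample takes one point from each of the three regions and skips the near-duplicates, which is the behaviour that makes this kind of sampling useful for deduplication.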


Curation strategies

| Strategy | Best for | How it works |
| --- | --- | --- |
| diversity | Removing near-duplicates, max variety | Greedy furthest-point sampling per cluster |
| centroid | Balanced, representative subsets | Keeps images closest to each cluster center |
| text-query | Domain-specific filtering, no labels | CLIP zero-shot similarity to a text prompt |
| drop-noise | Quick clean without size reduction | Removes HDBSCAN outliers only |
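As a rough illustration of the text-query row, the sketch below ranks images by cosine similarity between each image vector and a query vector. In practice both would come from CLIP's image and text encoders; here they are hand-written toy vectors. The function names and the idea of keeping a top fraction via keep_ratio are assumptions, not a description of imgraft's internals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_query(image_vecs, query_vec, keep_ratio):
    """Return indices of the top keep_ratio fraction of images, most similar first."""
    scores = [cosine(v, query_vec) for v in image_vecs]
    order = sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)
    n_keep = max(1, round(keep_ratio * len(image_vecs)))
    return order[:n_keep]


# Toy stand-ins: images 0 and 1 point roughly the same way as the query.
image_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.0]
kept = rank_by_query(image_vecs, query_vec, keep_ratio=0.5)
print(kept)
```

Cosine similarity ignores vector length, so only the direction of each embedding matters when comparing an image with the prompt.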

Works on any domain

  • Web-scraped datasets — deduplicate crawled images, remove OOD noise
  • Medical imaging — balance X-ray / pathology / dermoscopy class distributions
  • Satellite / aerial — curate geospatial image sets by region and content type
  • E-commerce / products — deduplicate product catalogs by visual similarity
  • Industrial / manufacturing — balance defect vs. normal in inspection datasets
  • Document / form images — group by layout type, sample representative subset
  • General ML training prep — quality control before sending to annotation

Backbone options

| Model | Embedding dim | Speed | Quality |
| --- | --- | --- | --- |
| ViT-B/32 | 512 | ⚡ Fast | Good |
| ViT-B/16 | 512 | Medium | Better |
| ViT-L/14 | 768 | Slower | High |
| ViT-H/14 | 1024 | Slow | Highest |

Switch with --model ViT-L/14 or Curator(model="ViT-L/14").
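The embedding dimension also sets how much memory or cache space the embeddings need. The short calculation below estimates the size of the embedding matrix for each backbone, using the 8211-image figure from the stats example. Storing each value as a 4-byte float32 is an assumption; the helper name is hypothetical.

```python
# Embedding dimensions as listed in the backbone table.
EMBED_DIM = {"ViT-B/32": 512, "ViT-B/16": 512, "ViT-L/14": 768, "ViT-H/14": 1024}


def embedding_bytes(n_images, model, bytes_per_value=4):
    """Bytes needed for an n_images x dim matrix (float32 storage is assumed)."""
    return n_images * EMBED_DIM[model] * bytes_per_value


for model in EMBED_DIM:
    mib = embedding_bytes(8211, model) / 2**20
    print(f"{model}: {mib:.1f} MiB")
```

Even the largest backbone stays in the tens of MiB at this dataset size, so the cost of a bigger model is mostly embedding time rather than storage.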


Embedding cache

Re-embedding large folders is slow. Cache embeddings to disk so reruns skip it:

imgraft run ./images/ --cache ./.imgraft_cache/ --keep 0.25
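One way such a cache could work is sketched below: each embedding is stored under a key derived from the model name and the file contents, so a rerun only calls the embedder on a miss. The JSON file-per-image layout, the key scheme, and every function name here are assumptions for illustration; the actual format of imgraft's cache directory is not documented in this README.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(path, model):
    """SHA-256 over the model name and file bytes.

    Hashing the model name avoids mixing vectors from different backbones
    and keeps the "/" in names like "ViT-B/32" out of file names.
    """
    h = hashlib.sha256(model.encode("utf-8"))
    h.update(Path(path).read_bytes())
    return h.hexdigest()


def cached_embed(path, model, cache_dir, embed_fn):
    """Return the embedding for path, calling embed_fn only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{cache_key(path, model)}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    vector = embed_fn(path)
    entry.write_text(json.dumps(vector))
    return vector


# Demo with a stand-in embedder that records how often it is called.
calls = []


def fake_embed(path):
    calls.append(str(path))
    return [float(len(Path(path).read_bytes())), 0.0]


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "a.png"
    image.write_bytes(b"not a real image")
    cache = Path(tmp) / "cache"
    first = cached_embed(image, "ViT-B/32", cache, fake_embed)
    second = cached_embed(image, "ViT-B/32", cache, fake_embed)  # served from disk

print(len(calls), first == second)
```

Keying on file contents rather than file names means renamed or moved copies of the same image still hit the cache, while an edited image gets a fresh embedding.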

Visualizations

Interactive HTML (result.plot("clusters.html")):

  • UMAP scatter colored by cluster
  • Hover any point → see the image thumbnail
  • Kept images at full opacity, dropped faded

Cluster grids (result.inspect("./grids/")):

  • One PNG per cluster — centroid sample vs random sample side by side
  • Visual verification before committing to training

License

MIT
