Unsupervised image clustering with modern deep embeddings (timm, CLIP), PCA and K-Means.

Tasnif

Unsupervised image clustering with modern deep embeddings.

tasnif turns a folder of images into clusters you can browse on disk — no labels required. It uses modern pretrained vision backbones (timm by default, optionally CLIP via open_clip), reduces the embedding with PCA, and runs K-Means on top.
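The embed, reduce, cluster flow described above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not tasnif's implementation: the random array stands in for backbone embeddings, and the sizes (200 images, 512 dimensions, 16 PCA components, 5 clusters) are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for backbone output: 200 "images" with 512-dim embeddings,
# drawn around 5 synthetic centers (assumption: real embeddings would
# come from a pretrained model instead).
centers = rng.normal(size=(5, 512)) * 5.0
embeddings = np.concatenate([c + rng.normal(size=(40, 512)) for c in centers])

# Reduce to 16 dimensions, then cluster the reduced vectors.
reduced = PCA(n_components=16, random_state=0).fit_transform(embeddings)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)

# Number of images assigned to each cluster.
counts = np.bincount(labels, minlength=5)
print(reduced.shape, counts)
```

Fixing `random_state` in both steps is what makes a run like this repeatable.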

Highlights

  • Modern backbones — any timm model (ResNet, ConvNeXt, ViT, DINOv2, ...) or CLIP via the [clip] extra.
  • GPU / MPS / CPU with automatic device detection.
  • scikit-learn-style API: fit, predict, fit_predict, transform.
  • Rich export: per-cluster folders, CSV manifest, JSON summary, preview grids, raw embeddings.
  • Multiple materialization modes: copy, symlink, move, or none (metadata only).
  • First-class CLI powered by Typer.
  • Pluggable embedders — register your own backend.
  • Deterministic with a random_state seed.
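To make the materialization modes concrete, here is a minimal standard-library sketch of what copy, symlink, move and none could mean for a single file. The `materialize` helper and its signature are hypothetical and are not part of tasnif's API.

```python
import shutil
from pathlib import Path


def materialize(src: Path, dst: Path, mode: str = "copy") -> None:
    """Place src at dst according to mode (hypothetical helper, not tasnif code)."""
    if mode == "none":
        return  # metadata only: nothing is written to disk
    dst.parent.mkdir(parents=True, exist_ok=True)
    if mode == "copy":
        shutil.copy2(src, dst)          # duplicate the file, keep the original
    elif mode == "symlink":
        dst.symlink_to(src.resolve())   # link back to the original file
    elif mode == "move":
        shutil.move(str(src), str(dst)) # original location is emptied
    else:
        raise ValueError(f"unknown mode: {mode!r}")
```

Of the four, only move changes the source folder, so it is the one to use with care.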

Installation

tasnif is built and packaged with uv, but plain pip works too.

pip install tasnif                # core
pip install "tasnif[cli]"         # core + CLI
pip install "tasnif[clip]"        # core + CLIP backend
pip install "tasnif[all]"         # everything

Development setup:

git clone https://github.com/cobanov/tasnif
cd tasnif
uv sync --extra dev
uv run pytest

Quickstart — Python API

from tasnif import TasnifClusterer

clf = TasnifClusterer(n_clusters=5, embedder="timm", pca_dim=16)
clf.fit("photos/")
result = clf.result_

# Inspect without exporting
print(result.counts)            # e.g. [42, 31, 28, 19, 7]
print(result.silhouette)        # None unless compute_silhouette=True
mapping = result.as_dict()      # {Path('photos/a.jpg'): 2, ...}
buckets = result.by_cluster()   # {0: [Path(...), ...], 1: [...]}

# Export to disk
clf.export("output/", mode="copy")
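The two views shown in the comments above, a path-to-label mapping and per-cluster buckets, are related by a simple grouping step. The sketch below reproduces that relationship with plain Python; the paths and labels are made-up sample data, and this is not how tasnif necessarily builds them internally.

```python
from collections import defaultdict
from pathlib import Path

# Made-up sample data: three image paths and their cluster labels.
paths = [Path("photos/a.jpg"), Path("photos/b.jpg"), Path("photos/c.jpg")]
labels = [2, 0, 2]

# Path -> label, the shape suggested by as_dict().
mapping = dict(zip(paths, labels))

# Label -> list of paths, the shape suggested by by_cluster().
grouped = defaultdict(list)
for path, label in mapping.items():
    grouped[label].append(path)
buckets = dict(sorted(grouped.items()))
```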

One-shot helper

from tasnif import cluster_directory

cluster_directory("photos/", "output/", n_clusters=5, mode="symlink")

CLI

# Cluster a directory with the default ResNet-50 (timm) backbone
tasnif cluster photos/ -k 5 -o output/

# Use CLIP and materialize clusters as symlinks (fast, non-destructive)
tasnif cluster photos/ -k 8 --embedder clip --model ViT-B-32 --mode symlink

# Just compute embeddings, save .npy + .json
tasnif embed photos/ --embedder timm --model convnext_base -o embeddings.npy

# List available backends
tasnif backends

Run tasnif --help to see all options.

Use a different model

Default is timm:resnet50. Pass any timm model name:

clf = TasnifClusterer(
    n_clusters=8,
    embedder="timm",
    embedder_kwargs={"model": "vit_base_patch14_dinov2.lvd142m", "device": "cuda"},
)

CLIP:

clf = TasnifClusterer(
    n_clusters=8,
    embedder="clip",
    embedder_kwargs={"model": "ViT-L-14", "pretrained": "laion2b_s32b_b82k"},
)

Custom backend

Anything implementing the Embedder protocol works:

import numpy as np
from tasnif import TasnifClusterer, register_embedder

class MyEncoder:
    name = "my-encoder"
    @property
    def dim(self): return 128
    @property
    def device(self): return "cpu"
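    def embed(self, images, *, batch_size=32, show_progress=True):
        # extract() stands in for your own per-image feature function;
        # it should return a 1-D array of length `dim` (128 here).
        return np.stack([extract(img) for img in images])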

register_embedder("my-encoder", lambda: MyEncoder())

clf = TasnifClusterer(n_clusters=5, embedder="my-encoder")
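The `extract` call in the example above is left to the reader. One possible stand-in is a per-channel color histogram, sketched below. It assumes each image is a PIL image or an array-like of shape (H, W, 3) with 8-bit values, and it yields 24 features by default, so `dim` in a real encoder would have to match that rather than 128.

```python
import numpy as np


def extract(img, bins: int = 8) -> np.ndarray:
    """Per-channel intensity histogram, normalised to sum to 1.

    Illustrative only: returns a vector of length 3 * bins.
    """
    arr = np.asarray(img, dtype=np.uint8)
    if arr.ndim != 3 or arr.shape[2] != 3:
        raise ValueError("expected an RGB image of shape (H, W, 3)")
    # One histogram per channel over the full 0..255 range.
    hists = [
        np.histogram(arr[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    feat = np.concatenate(hists).astype(np.float32)
    return feat / feat.sum()
```

A hand-written feature like this is far weaker than a pretrained backbone, but it is enough to check that a custom backend is wired up correctly.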

Building blocks

All pieces are independently usable:

from PIL import Image

from tasnif import (
    discover_images, create_embedder,
    reduce_pca, fit_kmeans, export_clusters, ClusterResult,
)

paths = discover_images("photos/")
embedder = create_embedder("timm", model="resnet50")
# Load each image as RGB before embedding
embeddings = embedder.embed([Image.open(p).convert("RGB") for p in paths])
reduced = reduce_pca(embeddings, n_components=16)
fit = fit_kmeans(reduced, n_clusters=5, compute_silhouette=True)

result = ClusterResult(
    labels=fit.labels, centroids=fit.centroids, counts=fit.counts,
    paths=tuple(paths), n_clusters=5, inertia=fit.inertia, silhouette=fit.silhouette,
    embedder=embedder.name, pca_dim=16,
)
export_clusters(result, "output/")
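Because the silhouette score is available per fit, one common use is comparing several values of k on the same reduced embeddings. The sketch below does that with scikit-learn directly on synthetic data; the array, the range of k, and the variable names are assumptions for illustration, and it does not call tasnif.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for PCA-reduced embeddings: 3 groups of 50 points in 16-D.
centers = rng.normal(size=(3, 16)) * 10.0
reduced = np.concatenate([c + rng.normal(size=(50, 16)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    scores[k] = float(silhouette_score(reduced, labels))

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score is a heuristic, so the k it favors is worth checking against the preview grids rather than trusting blindly.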

Migrating from 0.1.x

The 0.1.x API (Tasnif().read().calculate().export()) was removed in 0.2.0. Replace with:

- from tasnif import Tasnif
- c = Tasnif(num_classes=5, pca_dim=16, use_gpu=False)
- c.read("photos/")
- c.calculate()
- c.export("output/")
+ from tasnif import TasnifClusterer
+ clf = TasnifClusterer(n_clusters=5, pca_dim=16, embedder_kwargs={"device": "auto"})
+ clf.fit("photos/")
+ clf.export("output/")

See CHANGELOG.md for the full list of breaking changes.

Contributing

Issues and PRs welcome. Before submitting, run:

uv run ruff check . && uv run ruff format --check .
uv run mypy
uv run pytest

Pre-commit is configured — install hooks with uv run pre-commit install.

License

MIT — see LICENSE.
