Skip to main content

Adaptive Topic and Cluster Number Determination via structured search

Project description

ATCND

Adaptive Topic and Cluster Number Determination via Structured Search over Sliding Ranges

PyPI Python License: GPL v3 Tests

ATCND is a model-agnostic framework that determines the optimal number of topics (for LDA/NMF) or clusters (for K-Means) by treating K selection as a structured search problem over a user-specified integer range. Prior to ATCND, no model-agnostic method with correct K* achieved sub-linear evaluation count: exhaustive grid search (the SOTA baseline in the model-agnostic + exact-K class) requires O(K_max − K_min) model evaluations. ATCND applies eight search strategies to achieve O(log(K_max − K_min)) evaluations — 59–79% fewer than grid search while matching its K* accuracy.

Key Features

  • 8 search strategies: grid, binary, golden section, ternary, Fibonacci, interpolation, exponential, predictive
  • 15 adapters spanning NumPy, SciPy, scikit-learn, PyTorch, Gensim, and Pandas
  • 3 model families: LDA, NMF, K-Means (plus GMM, PCA, DBSCAN, etc.)
  • 5 quality metrics: silhouette, coherence (c_v), perplexity, reconstruction error, combined
  • Ranked candidate set: returns multiple optimal K values (handles plateaus, ties, multi-optima)
  • First O(log N) method in the model-agnostic + exact-K class: 59–79% fewer evaluations than grid search
  • Predictive search with PCA hot-start: 76% reduction (7 evaluations vs 29 for grid)
  • CLI and Python API

Installation

pip install atcnd

For PyTorch adapters:

pip install atcnd[torch]

From source:

git clone https://github.com/CodeOfMe/ATCND.git
cd ATCND
pip install -e .

Quick Start

Python API

from atcnd import ATCNDConfig, atcnd_search

# K-Means on numeric data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, n_features=50, centers=8, random_state=42)

config = ATCNDConfig(
    k_min=2, k_max=30,
    model_type="kmeans",
    search_strategy="binary",
    metric="silhouette",
    n_candidates=3,
)
result = atcnd_search(X=X, config=config)

print(f"Optimal K: {result.optimal_k}")
print(f"Best score: {result.optimal_score:.4f}")
print(f"Top candidates: {result.candidate_ks}")
print(f"Evaluations: {len(result.search_history)}")

Low-level API (any callable)

from atcnd import search
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def f(k):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels)

result = search(f, k_min=2, k_max=30, strategy="binary")
print(f"K* = {result.optimal_k}, evals = {len(result.search_history)}")

Adapter API (scikit-learn, PyTorch, Gensim, ...)

from atcnd import search_model, search_gmm_components, search_nmf_topics
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# K-Means
r = search_model(KMeans, X, param_name="n_clusters", k_min=2, k_max=30, strategy="binary")

# GMM with BIC
r = search_gmm_components(X, k_min=2, k_max=15, strategy="binary")

# NMF topics with coherence
r = search_nmf_topics(texts, k_min=2, k_max=20, strategy="binary", metric="coherence")

# Predictive search with PCA hot-start
from atcnd import estimate_k_n_clusters
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
r = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Text data (LDA/NMF)

from atcnd import ATCNDConfig, atcnd_search

texts = ["your document texts here"] * 100

# NMF with coherence metric
config = ATCNDConfig(
    k_min=2, k_max=20,
    model_type="nmf",
    search_strategy="binary",
    metric="coherence",
)
result = atcnd_search(texts=texts, config=config)

CLI

# Search with K-Means on synthetic data
atcnd search --model kmeans --strategy binary --k-min 2 --k-max 30

# Search with NMF
atcnd search --model nmf --strategy golden_section --metric silhouette

# JSON output
atcnd search --model kmeans --json

# Run benchmarks
atcnd benchmark --dataset blobs --k-min 2 --k-max 30

# All 8 strategies comparison
atcnd benchmark --dataset blobs --k-min 2 --k-max 30 --all-strategies

Search Strategies

Strategy Complexity Best for Evals (K∈[2,30]) vs Grid
Grid O(N) Baseline comparison 29
Binary O(log N) Unimodal objectives 9 69%
Golden Section O(log_φ N) General objectives 12 59%
Ternary O(log_1.5 N) Multi-step objectives 14 52%
Fibonacci O(log_φ N) Discrete unimodal (optimal) 10 66%
Interpolation O(log log N)* Smooth objectives 11 62%
Exponential O(log K*) Unknown K range 10 66%
Predictive O(1)+O(log Δ) With data-driven hot-start 7 76%

*Worst case degrades to O(N) for highly non-uniform objectives.

Benchmark: K-Means on SyntheticBlobs (K_true=8, range [2,30]). All strategies recover K*=8.

Predictive Search

Predictive search uses PCA-based hot-start to estimate K* before any model evaluation, then applies parabolic peak fitting:

from atcnd import estimate_k_n_clusters, search

# PCA hot-start estimates K from eigenvalue elbow
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)

# Predictive search uses hot_start to reduce evaluations
result = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Adapters

15 adapters wrap common model/parameter pairs with appropriate quality metrics:

Library Adapter Parameter Metric
sklearn search_model n_clusters silhouette
sklearn search_neighbors n_neighbors CV accuracy
NumPy search_bins bins AIC
sklearn search_components n_components variance
SciPy search_knots internal knots MSE
signal search_window window BIC
any search_param any user-defined
PyTorch search_hidden hidden_dim −CE
PyTorch search_layers n_layers −CE
sklearn search_trees n_estimators CV accuracy
sklearn search_dbscan_eps eps silhouette
sklearn search_gmm_components n_components BIC
Pandas search_dataframe_bins bins AIC
Pandas search_rolling_window window BIC
Gensim search_nmf_topics n_topics c_v coherence

Run all demos with figures (SVG + PDF + PNG):

python examples/demo_all.py
# Output: examples/figures/*.{svg,pdf,png}

Quality Metrics

Metric LDA NMF K-Means Description
Silhouette Yes Yes Yes Inter-cluster separation vs intra-cluster cohesion
Coherence (c_v) Yes Yes No Semantic coherence of top topic words
Perplexity Yes No No Negative log-likelihood per word
Reconstruction No Yes Yes Frobenius norm / inertia
Combined Yes Yes No 0.5 × silhouette + 0.5 × coherence

Multiple Optima

K is a discrete integer parameter. Multiple values of K may achieve the same or nearly the same quality score. ATCND returns candidate_ks (ranked list) alongside optimal_k, handling plateaus, ties, and multiple local maxima.

Comparison with Baselines

ATCND is the first method in the model-agnostic + exact-K class to achieve O(log N) evaluations:

Method Model-agnostic? Exact K*? Evals
ATCND (all) Yes Yes O(log N)
Grid Search Yes Yes O(N)
HDP No (LDA only) No 1 (costly)
Top2Vec No No 1
Black-box Opt. No (LDA only) Yes Variable

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests (75 tests)
python -m pytest tests/ -v

# Format code
black src/atcnd/ tests/

# Lint code
ruff check src/atcnd/ tests/

License

GNU General Public License v3.0 (GPLv3)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

atcnd-0.4.0.tar.gz (45.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

atcnd-0.4.0-py3-none-any.whl (41.3 kB view details)

Uploaded Python 3

File details

Details for the file atcnd-0.4.0.tar.gz.

File metadata

  • Download URL: atcnd-0.4.0.tar.gz
  • Upload date:
  • Size: 45.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.4.0.tar.gz
Algorithm Hash digest
SHA256 499a1f737979969f700e5437c83e1b9b45887281dc9643d5d7d91fe5a08b4504
MD5 0657182d5c1688b4022b0a44b382b2a9
BLAKE2b-256 1fad3a122100fa2a4b08da8cff8c081280b066b7c0822249b83af0f5230d6c88

See more details on using hashes here.

File details

Details for the file atcnd-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: atcnd-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 41.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fc55eec23bbffeb0bf58eb3227604064bd346bef400b20bcd336d8fa9caa6cf5
MD5 a1c8ef1373216f3498adb6827aef213b
BLAKE2b-256 8db352cd77b90cc1d6e46a1a61805b481f10a8fc5fce7a3e0e69b9ed49879c16

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page