Skip to main content

Adaptive Topic and Cluster Number Determination via structured search

Project description

ATCND

Adaptive Topic and Cluster Number Determination via Structured Search over Sliding Ranges

PyPI Python License: GPL v3 Tests

ATCND is a model-agnostic framework that determines the optimal number of topics (for LDA/NMF) or clusters (for K-Means) by treating K selection as a structured search problem over a user-specified integer range. Prior to ATCND, no model-agnostic method with correct K* achieved sub-linear evaluation count: exhaustive grid search (the SOTA baseline in the model-agnostic + exact-K class) requires O(K_max − K_min) model evaluations. ATCND applies eight search strategies to achieve O(log(K_max − K_min)) evaluations — 59–79% fewer than grid search while matching its K* accuracy.

8-Strategy Comparison

ATCND: All 8 search strategies on K-Means (K_true=8)

Benchmark on K-Means (SyntheticBlobs, K_true=8, range [2,30]):

Strategy Complexity Evals vs Grid
Grid O(N) 29
Binary O(log N) 9 69%
Golden Section O(log_φ N) 12 59%
Ternary O(log_{1.5} N) 14 52%
Fibonacci O(log_φ N) 10 66%
Interpolation O(log log N)* 11 62%
Exponential O(log K*) 10 66%
Predictive O(1)+O(log Δ) 7 76%

All strategies recover K*=8 with identical silhouette score. Predictive search with PCA hot-start achieves the fewest evaluations.

Consistency Across Datasets

ATCND never catastrophically fails. Compare absolute errors |K* − K_true| across five benchmarks:

Method Iris Wine Breast Digits Blobs Avg Max
ATCND-sil 1 1 0 1 0 0.6 1
Kneedle 2 2 2 1 0 1.4 2
X-Means 0 0 0 8 6 2.8 8
Gap Stat. 12 12 8 10 0 8.4 12
G-Means 12 4 8 10 22 11.2 22

Consistency analysis: ATCND never catastrophically fails

Real-World Demos

K-Means on Iris (3D)

K-Means on Iris (K*=2) Iris K-Means search curve

GMM on Wine (3D)

GMM on Wine (K*=2) Wine GMM search curve

PCA on Digits (3D)

PCA on Digits (K*=31 for 95% variance) Digits PCA cumulative variance

DBSCAN on Two-Moons

DBSCAN on Two-Moons DBSCAN eps search curve

Optimal Histogram Bins (AIC)

ATCND optimal bins vs Sturges rule

Smoothing Spline (SciPy)

Smoothing spline with ATCND-selected parameter

Rolling Window (Pandas)

Optimal rolling window on time series

NMF Topic Count (Gensim)

NMF topic search with c_v coherence

PyTorch Hidden Layer Size

PyTorch hidden layer size search

Run all demos with figures (SVG + PDF + PNG):

python examples/demo_all.py
# Output: examples/figures/*.{svg,pdf,png}

Key Features

  • 8 search strategies: grid, binary, golden section, ternary, Fibonacci, interpolation, exponential, predictive
  • 15 adapters spanning NumPy, SciPy, scikit-learn, PyTorch, Gensim, and Pandas
  • 3 model families: LDA, NMF, K-Means (plus GMM, PCA, DBSCAN, etc.)
  • 5+5 quality metrics: general (silhouette, coherence c_v, perplexity, reconstruction, combined) + K-Means-specific (silhouette_knee, bic, combined, silhouette_drop)
  • Adaptive strategy selection: automatically recommends best strategy + metric based on data profile (dimensionality, sparsity, separation)
  • Multi-objective optimization: simultaneously optimize multiple metrics (e.g., silhouette + BIC) with Pareto frontier
  • Ranked candidate set: returns multiple optimal K values (handles plateaus, ties, multi-optima)
  • First O(log N) method in the model-agnostic + exact-K class: 59–79% fewer evaluations than grid search
  • Predictive search with PCA hot-start: 76% reduction (7 evaluations vs 29 for grid)
  • CLI and Python API

Installation

pip install atcnd

For PyTorch adapters:

pip install atcnd[torch]

From source:

git clone https://github.com/CodeOfMe/ATCND.git
cd ATCND
pip install -e .

Quick Start

Python API

from atcnd import ATCNDConfig, atcnd_search

# K-Means on numeric data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, n_features=50, centers=8, random_state=42)

config = ATCNDConfig(
    k_min=2, k_max=30,
    model_type="kmeans",
    search_strategy="binary",
    metric="silhouette",
    n_candidates=3,
)
result = atcnd_search(X=X, config=config)

print(f"Optimal K: {result.optimal_k}")
print(f"Best score: {result.optimal_score:.4f}")
print(f"Top candidates: {result.candidate_ks}")
print(f"Evaluations: {len(result.search_history)}")

Low-level API (any callable)

from atcnd import search
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def f(k):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels)

result = search(f, k_min=2, k_max=30, strategy="binary")
print(f"K* = {result.optimal_k}, evals = {len(result.search_history)}")

Adapter API (scikit-learn, PyTorch, Gensim, ...)

from atcnd import search_model, search_gmm_components, search_nmf_topics
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# K-Means
r = search_model(KMeans, X, param_name="n_clusters", k_min=2, k_max=30, strategy="binary")

# GMM with BIC
r = search_gmm_components(X, k_min=2, k_max=15, strategy="binary")

# NMF topics with coherence
r = search_nmf_topics(texts, k_min=2, k_max=20, strategy="binary", metric="coherence")

# Predictive search with PCA hot-start
from atcnd import estimate_k_n_clusters
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
r = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Text data (LDA/NMF)

from atcnd import ATCNDConfig, atcnd_search

texts = ["your document texts here"] * 100

# NMF with coherence metric
config = ATCNDConfig(
    k_min=2, k_max=20,
    model_type="nmf",
    search_strategy="binary",
    metric="coherence",
)
result = atcnd_search(texts=texts, config=config)

Adaptive Strategy Selection

ATCND automatically recommends the best strategy + metric based on data characteristics (dimensionality, sparsity, cluster separation, intrinsic dimension):

from atcnd import adaptive_select, adaptive_search

# Get recommendation without running search
rec = adaptive_select(X, k_min=2, k_max=30)
print(f"Recommended: {rec.strategy} + {rec.metric} (confidence: {rec.confidence:.2f})")
print(f"Top 5: {[(r['strategy'], r['metric'], r['score']) for r in rec.all_recommendations[:5]}")

# Or run search directly with auto-selected strategy
result = adaptive_search(X, k_min=2, k_max=30)
print(f"K* = {result.optimal_k}, strategy = {result.strategy}")

CLI:

atcnd adaptive --k-min 2 --k-max 30
atcnd adaptive --k-min 2 --k-max 30 --run --json

Multi-Objective Optimization

Optimize multiple metrics simultaneously and find Pareto-optimal K values:

from atcnd import multi_objective_search
from sklearn.cluster import KMeans

# Silhouette + BIC with equal weights
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"], k_min=2, k_max=30)
print(f"Combined best: K* = {mo.optimal_k}")
print(f"Pareto frontier: {mo.pareto_ks}")  # Non-dominated K values

# Custom weights (prioritize silhouette)
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"],
                              weights={"silhouette": 0.7, "bic": 0.3}, k_min=2, k_max=30)

# Three metrics
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic", "silhouette_drop"],
                              k_min=2, k_max=30)

CLI:

atcnd multi --metrics silhouette bic --k-min 2 --k-max 30
atcnd multi --metrics silhouette bic --weights 0.7 0.3 --json

CLI

# Search with K-Means on synthetic data
atcnd search --model kmeans --strategy binary --k-min 2 --k-max 30

# Search with NMF
atcnd search --model nmf --strategy golden_section --metric silhouette

# JSON output
atcnd search --model kmeans --json

# Run benchmarks
atcnd benchmark --dataset blobs --k-min 2 --k-max 30

# All 8 strategies comparison
atcnd benchmark --dataset blobs --k-min 2 --k-max 30 --all-strategies

Search Strategies Detail

Predictive Search

Predictive search uses PCA-based hot-start to estimate K* before any model evaluation, then applies parabolic peak fitting:

  1. PCA hot-start: Eigenvalue elbow method estimates K* from data structure
  2. Probing: Evaluate f(K̂−1), f(K̂), f(K̂+1)
  3. Parabolic peak fitting: Fit a parabola through the three best points and jump to the predicted peak
  4. Binary refinement: Narrow the search around the predicted peak
from atcnd import estimate_k_n_clusters, search

# PCA hot-start estimates K from eigenvalue elbow
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)

# Predictive search uses hot_start to reduce evaluations
result = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Fibonacci Search

Classical optimal algorithm for discrete unimodal search (Kiefer 1953). Achieves minimum worst-case evaluations among all derivative-free methods on unimodal functions.

Exponential Search

Doubles the probe point (1, 2, 4, 8, ...) until f(K) decreases, then refines via binary search. Best when K* is near K_min.

Adapters

15 adapters wrap common model/parameter pairs with appropriate quality metrics:

Library Adapter Parameter Metric
sklearn search_model n_clusters silhouette
sklearn search_neighbors n_neighbors CV accuracy
NumPy search_bins bins AIC
sklearn search_components n_components variance
SciPy search_knots internal knots MSE
signal search_window window BIC
any search_param any user-defined
PyTorch search_hidden hidden_dim −CE
PyTorch search_layers n_layers −CE
sklearn search_trees n_estimators CV accuracy
sklearn search_dbscan_eps eps silhouette
sklearn search_gmm_components n_components BIC
Pandas search_dataframe_bins bins AIC
Pandas search_rolling_window window BIC
Gensim search_nmf_topics n_topics c_v coherence

Quality Metrics

Metric LDA NMF K-Means Description
Silhouette Yes Yes Yes Inter-cluster separation vs intra-cluster cohesion
Coherence (c_v) Yes Yes No Semantic coherence of top topic words
Perplexity Yes No No Negative log-likelihood per word
Reconstruction No Yes Yes Frobenius norm / inertia
Combined Yes Yes No 0.5 × silhouette + 0.5 × coherence

Multiple Optima

K is a discrete integer parameter. Multiple values of K may achieve the same or nearly the same quality score. ATCND returns candidate_ks (ranked list) alongside optimal_k, handling plateaus, ties, and multiple local maxima.

Comparison with Baselines

ATCND is the first method in the model-agnostic + exact-K class to achieve O(log N) evaluations:

Method Model-agnostic? Exact K*? Evals
ATCND (all) Yes Yes O(log N)
Grid Search Yes Yes O(N)
HDP No (LDA only) No 1 (costly)
Top2Vec No No 1
Black-box Opt. No (LDA only) Yes Variable

Real-World Results

Dataset Adapter K* Strategy Evals Reduction
Iris search_model 2 binary 5 64%
Wine search_gmm 2 binary 6 57%
Digits search_components 31 binary 10 84%
Moons search_dbscan 3 binary 10 64%
Wine search_trees 300 binary 37 87%
Iris search_neighbors 12 binary 9 70%

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests (75 tests)
python -m pytest tests/ -v

# Format code
black src/atcnd/ tests/

# Lint code
ruff check src/atcnd/ tests/

License

GNU General Public License v3.0 (GPLv3)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

atcnd-0.5.1.tar.gz (55.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

atcnd-0.5.1-py3-none-any.whl (49.7 kB view details)

Uploaded Python 3

File details

Details for the file atcnd-0.5.1.tar.gz.

File metadata

  • Download URL: atcnd-0.5.1.tar.gz
  • Upload date:
  • Size: 55.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.5.1.tar.gz
Algorithm Hash digest
SHA256 77ab0fc78b3d5e9cb38015292952663027f988e1b42fa4825a7127f02dc0a709
MD5 e42ec3a46000906946ab91a622b1acad
BLAKE2b-256 e56ba0c47a8230c903f93b2ed5acbcaf6cc6d9387354036b5a9d4de45c9404ee

See more details on using hashes here.

File details

Details for the file atcnd-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: atcnd-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 49.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c807311c1b2d47af0c84034cfb0f52096bb02949c674284616ffcaba5a198f95
MD5 05bbba7eaea3eca7f885686708ada338
BLAKE2b-256 8bcdac3bd4603b3e31ee0e99ce1da5c03958544d7ffccc4680542ef703c4df6e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page