Skip to main content

Adaptive Topic and Cluster Number Determination via structured search

Project description

ATCND

Adaptive Topic and Cluster Number Determination via Structured Search over Sliding Ranges

PyPI Python License: GPL v3 Tests

ATCND is a model-agnostic framework that determines the optimal number of topics (for LDA/NMF) or clusters (for K-Means) by treating K selection as a structured search problem over a user-specified integer range. Prior to ATCND, no model-agnostic method with correct K* achieved sub-linear evaluation count: exhaustive grid search (the SOTA baseline in the model-agnostic + exact-K class) requires O(K_max − K_min) model evaluations. ATCND applies eight search strategies to achieve O(log(K_max − K_min)) evaluations — 59–79% fewer than grid search while matching its K* accuracy.

8-Strategy Comparison

ATCND: All 8 search strategies on K-Means (K_true=8)

Benchmark on K-Means (SyntheticBlobs, K_true=8, range [2,30]):

Strategy Complexity Evals vs Grid
Grid O(N) 29
Binary O(log N) 9 69%
Golden Section O(log_φ N) 12 59%
Ternary O(log_{1.5} N) 14 52%
Fibonacci O(log_φ N) 10 66%
Interpolation O(log log N)* 11 62%
Exponential O(log K*) 10 66%
Predictive O(1)+O(log Δ) 7 76%

All strategies recover K*=8 with identical silhouette score. Predictive search with PCA hot-start achieves the fewest evaluations.

Consistency Across Datasets

ATCND never catastrophically fails. Compare absolute errors |K* − K_true| across five benchmarks:

Method Iris Wine Breast Digits Blobs Avg Max
ATCND-sil 1 1 0 1 0 0.6 1
Kneedle 2 2 2 1 0 1.4 2
X-Means 0 0 0 8 6 2.8 8
Gap Stat. 12 12 8 10 0 8.4 12
G-Means 12 4 8 10 22 11.2 22

Consistency analysis: ATCND never catastrophically fails

Real-World Demos

K-Means on Iris (3D)

K-Means on Iris (K*=2) Iris K-Means search curve

GMM on Wine (3D)

GMM on Wine (K*=2) Wine GMM search curve

PCA on Digits (3D)

PCA on Digits (K*=31 for 95% variance) Digits PCA cumulative variance

DBSCAN on Two-Moons

DBSCAN on Two-Moons DBSCAN eps search curve

Optimal Histogram Bins (AIC)

ATCND optimal bins vs Sturges rule

Smoothing Spline (SciPy)

Smoothing spline with ATCND-selected parameter

Rolling Window (Pandas)

Optimal rolling window on time series

NMF Topic Count (Gensim)

NMF topic search with c_v coherence

PyTorch Hidden Layer Size

PyTorch hidden layer size search

Run all demos with figures (SVG + PDF + PNG):

python examples/demo_all.py
# Output: examples/figures/*.{svg,pdf,png}

Key Features

  • 8 search strategies: grid, binary, golden section, ternary, Fibonacci, interpolation, exponential, predictive
  • 15 adapters spanning NumPy, SciPy, scikit-learn, PyTorch, Gensim, and Pandas
  • 3 model families: LDA, NMF, K-Means (plus GMM, PCA, DBSCAN, etc.)
  • 5+5 quality metrics: general (silhouette, coherence c_v, perplexity, reconstruction, combined) + K-Means-specific (silhouette_knee, bic, combined, silhouette_drop)
  • Adaptive strategy selection: automatically recommends best strategy + metric based on data profile (dimensionality, sparsity, separation)
  • Multi-objective optimization: simultaneously optimize multiple metrics (e.g., silhouette + BIC) with Pareto frontier
  • Ranked candidate set: returns multiple optimal K values (handles plateaus, ties, multi-optima)
  • First O(log N) method in the model-agnostic + exact-K class: 59–79% fewer evaluations than grid search
  • Predictive search with PCA hot-start: 76% reduction (7 evaluations vs 29 for grid)
  • Automatic K range estimation: suggest_k_range() recommends search bounds from PCA, √N, and data sparsity heuristics
  • CLI and Python API

Installation

pip install atcnd

For PyTorch adapters:

pip install atcnd[torch]

From source:

git clone https://github.com/CodeOfMe/ATCND.git
cd ATCND
pip install -e .

Quick Start

Python API

from atcnd import ATCNDConfig, atcnd_search

# K-Means on numeric data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, n_features=50, centers=8, random_state=42)

config = ATCNDConfig(
    k_min=2, k_max=30,
    model_type="kmeans",
    search_strategy="binary",
    metric="silhouette",
    n_candidates=3,
)
result = atcnd_search(X=X, config=config)

print(f"Optimal K: {result.optimal_k}")
print(f"Best score: {result.optimal_score:.4f}")
print(f"Top candidates: {result.candidate_ks}")
print(f"Evaluations: {len(result.search_history)}")

Low-level API (any callable)

from atcnd import search
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def f(k):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels)

result = search(f, k_min=2, k_max=30, strategy="binary")
print(f"K* = {result.optimal_k}, evals = {len(result.search_history)}")

Adapter API (scikit-learn, PyTorch, Gensim, ...)

from atcnd import search_model, search_gmm_components, search_nmf_topics
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# K-Means
r = search_model(KMeans, X, param_name="n_clusters", k_min=2, k_max=30, strategy="binary")

# GMM with BIC
r = search_gmm_components(X, k_min=2, k_max=15, strategy="binary")

# NMF topics with coherence
r = search_nmf_topics(texts, k_min=2, k_max=20, strategy="binary", metric="coherence")

# Predictive search with PCA hot-start
from atcnd import estimate_k_n_clusters
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
r = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Text data (LDA/NMF)

from atcnd import ATCNDConfig, atcnd_search

texts = ["your document texts here"] * 100

# NMF with coherence metric
config = ATCNDConfig(
    k_min=2, k_max=20,
    model_type="nmf",
    search_strategy="binary",
    metric="coherence",
)
result = atcnd_search(texts=texts, config=config)

Adaptive Strategy Selection

ATCND automatically recommends the best strategy + metric based on data characteristics (dimensionality, sparsity, cluster separation, intrinsic dimension):

from atcnd import adaptive_select, adaptive_search

# Get recommendation without running search
rec = adaptive_select(X, k_min=2, k_max=30)
print(f"Recommended: {rec.strategy} + {rec.metric} (confidence: {rec.confidence:.2f})")
print(f"Top 5: {[(r['strategy'], r['metric'], r['score']) for r in rec.all_recommendations[:5]}")

# Or run search directly with auto-selected strategy
result = adaptive_search(X, k_min=2, k_max=30)
print(f"K* = {result.optimal_k}, strategy = {result.strategy}")

CLI:

atcnd adaptive --k-min 2 --k-max 30
atcnd adaptive --k-min 2 --k-max 30 --run --json

Automatic K Range Estimation

Instead of manually choosing k_min and k_max, ATCND can suggest a search range from data characteristics:

from atcnd import suggest_k_range

rec = suggest_k_range(X, model_type="kmeans")
print(f"Suggested range: [{rec.k_min}, {rec.k_max}]")
print(f"Heuristics: {rec.rationale}")
# e.g. {pca_intrinsic_dim: 14, sqrt_n: 22, pca_elbow_doubled: 14}

The returned k_max is the harmonic mean of three upper-bound heuristics, scaled by a 1.2× safety factor:

  • pca_intrinsic_dim: 2 × (PCA dimensions for 95% cumulative variance)
  • sqrt_n: √N rule (Mardia 1979)
  • pca_elbow_doubled (clustering) or topic_estimate_doubled (topic models)

CLI:

atcnd range --model kmeans --json
atcnd range --model lda

Multi-Objective Optimization

Optimize multiple metrics simultaneously and find Pareto-optimal K values:

from atcnd import multi_objective_search
from sklearn.cluster import KMeans

# Silhouette + BIC with equal weights
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"], k_min=2, k_max=30)
print(f"Combined best: K* = {mo.optimal_k}")
print(f"Pareto frontier: {mo.pareto_ks}")  # Non-dominated K values

# Custom weights (prioritize silhouette)
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"],
                              weights={"silhouette": 0.7, "bic": 0.3}, k_min=2, k_max=30)

# Three metrics
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic", "silhouette_drop"],
                              k_min=2, k_max=30)

CLI:

atcnd multi --metrics silhouette bic --k-min 2 --k-max 30
atcnd multi --metrics silhouette bic --weights 0.7 0.3 --json

CLI

# Search with K-Means on synthetic data
atcnd search --model kmeans --strategy binary --k-min 2 --k-max 30

# Search with NMF
atcnd search --model nmf --strategy golden_section --metric silhouette

# JSON output
atcnd search --model kmeans --json

# Suggest K range from data
atcnd range --model kmeans --json

# Run benchmarks
atcnd benchmark --dataset blobs --k-min 2 --k-max 30

# All 8 strategies comparison
atcnd benchmark --dataset blobs --k-min 2 --k-max 30 --all-strategies

Search Strategies Detail

Predictive Search

Predictive search uses PCA-based hot-start to estimate K* before any model evaluation, then applies parabolic peak fitting:

  1. PCA hot-start: Eigenvalue elbow method estimates K* from data structure
  2. Probing: Evaluate f(K̂−1), f(K̂), f(K̂+1)
  3. Parabolic peak fitting: Fit a parabola through the three best points and jump to the predicted peak
  4. Binary refinement: Narrow the search around the predicted peak
from atcnd import estimate_k_n_clusters, search

# PCA hot-start estimates K from eigenvalue elbow
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)

# Predictive search uses hot_start to reduce evaluations
result = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Fibonacci Search

Classical optimal algorithm for discrete unimodal search (Kiefer 1953). Achieves minimum worst-case evaluations among all derivative-free methods on unimodal functions.

Exponential Search

Doubles the probe point (1, 2, 4, 8, ...) until f(K) decreases, then refines via binary search. Best when K* is near K_min.

Adapters

15 adapters wrap common model/parameter pairs with appropriate quality metrics:

Library Adapter Parameter Metric
sklearn search_model n_clusters silhouette
sklearn search_neighbors n_neighbors CV accuracy
NumPy search_bins bins AIC
sklearn search_components n_components variance
SciPy search_knots internal knots MSE
signal search_window window BIC
any search_param any user-defined
PyTorch search_hidden hidden_dim −CE
PyTorch search_layers n_layers −CE
sklearn search_trees n_estimators CV accuracy
sklearn search_dbscan_eps eps silhouette
sklearn search_gmm_components n_components BIC
Pandas search_dataframe_bins bins AIC
Pandas search_rolling_window window BIC
Gensim search_nmf_topics n_topics c_v coherence

Quality Metrics

Metric LDA NMF K-Means Description
Silhouette Yes Yes Yes Inter-cluster separation vs intra-cluster cohesion
Coherence (c_v) Yes Yes No Semantic coherence of top topic words
Perplexity Yes No No Negative log-likelihood per word
Reconstruction No Yes Yes Frobenius norm / inertia
Combined Yes Yes No 0.5 × silhouette + 0.5 × coherence

Multiple Optima

K is a discrete integer parameter. Multiple values of K may achieve the same or nearly the same quality score. ATCND returns candidate_ks (ranked list) alongside optimal_k, handling plateaus, ties, and multiple local maxima.

Comparison with Baselines

ATCND is the first method in the model-agnostic + exact-K class to achieve O(log N) evaluations:

Method Model-agnostic? Exact K*? Evals
ATCND (all) Yes Yes O(log N)
Grid Search Yes Yes O(N)
HDP No (LDA only) No 1 (costly)
Top2Vec No No 1
Black-box Opt. No (LDA only) Yes Variable

Real-World Results

Dataset Adapter K* Strategy Evals Reduction
Iris search_model 2 binary 5 64%
Wine search_gmm 2 binary 6 57%
Digits search_components 31 binary 10 84%
Moons search_dbscan 3 binary 10 64%
Wine search_trees 300 binary 37 87%
Iris search_neighbors 12 binary 9 70%

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests (75 tests)
python -m pytest tests/ -v

# Format code
black src/atcnd/ tests/

# Lint code
ruff check src/atcnd/ tests/

License

GNU General Public License v3.0 (GPLv3)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

atcnd-0.5.3.tar.gz (57.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

atcnd-0.5.3-py3-none-any.whl (50.7 kB view details)

Uploaded Python 3

File details

Details for the file atcnd-0.5.3.tar.gz.

File metadata

  • Download URL: atcnd-0.5.3.tar.gz
  • Upload date:
  • Size: 57.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.5.3.tar.gz
Algorithm Hash digest
SHA256 19707f676fa809399b97937706f7052632b9697b0bcd2b09ba05b00ee101f402
MD5 fcc833c01e6481d72060fa6a0f16db85
BLAKE2b-256 c29e60732af05e961b6106ceb7925d0375cf2a22f22a37f12f4d56d24b295bcc

See more details on using hashes here.

File details

Details for the file atcnd-0.5.3-py3-none-any.whl.

File metadata

  • Download URL: atcnd-0.5.3-py3-none-any.whl
  • Upload date:
  • Size: 50.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.5.3-py3-none-any.whl
Algorithm Hash digest
SHA256 bcc6152afb6343dd973c4765d5ddad8f5c3ff57420889e1aa4e246b30198ef82
MD5 6231bdc981ecbbf98774556944cf01b1
BLAKE2b-256 be73edcd177080fd99959ce524709c4defcc05c98f20432811d649de428e103a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page