Adaptive Topic and Cluster Number Determination via structured search
Project description
ATCND
Adaptive Topic and Cluster Number Determination via Structured Search over Sliding Ranges
ATCND is a model-agnostic framework that determines the optimal number of topics (for LDA/NMF) or clusters (for K-Means) by treating K selection as a structured search problem over a user-specified integer range. Prior to ATCND, no model-agnostic method with correct K* achieved sub-linear evaluation count: exhaustive grid search (the SOTA baseline in the model-agnostic + exact-K class) requires O(K_max − K_min) model evaluations. ATCND applies eight search strategies to achieve O(log(K_max − K_min)) evaluations — 59–79% fewer than grid search while matching its K* accuracy.
8-Strategy Comparison
Benchmark on K-Means (SyntheticBlobs, K_true=8, range [2,30]):
| Strategy | Complexity | Evals | vs Grid |
|---|---|---|---|
| Grid | O(N) | 29 | — |
| Binary | O(log N) | 9 | 69% |
| Golden Section | O(log_φ N) | 12 | 59% |
| Ternary | O(log_{1.5} N) | 14 | 52% |
| Fibonacci | O(log_φ N) | 10 | 66% |
| Interpolation | O(log log N)* | 11 | 62% |
| Exponential | O(log K*) | 10 | 66% |
| Predictive | O(1)+O(log Δ) | 7 | 76% |
All strategies recover K*=8 with identical silhouette score. Predictive search with PCA hot-start achieves the fewest evaluations.
Consistency Across Datasets
ATCND never catastrophically fails. Compare absolute errors |K* − K_true| across five benchmarks:
| Method | Iris | Wine | Breast | Digits | Blobs | Avg | Max |
|---|---|---|---|---|---|---|---|
| ATCND-sil | 1 | 1 | 0 | 1 | 0 | 0.6 | 1 |
| Kneedle | 2 | 2 | 2 | 1 | 0 | 1.4 | 2 |
| X-Means | 0 | 0 | 0 | 8 | 6 | 2.8 | 8 |
| Gap Stat. | 12 | 12 | 8 | 10 | 0 | 8.4 | 12 |
| G-Means | 12 | 4 | 8 | 10 | 22 | 11.2 | 22 |
Real-World Demos
K-Means on Iris (3D)
GMM on Wine (3D)
PCA on Digits (3D)
DBSCAN on Two-Moons
Optimal Histogram Bins (AIC)
Smoothing Spline (SciPy)
Rolling Window (Pandas)
NMF Topic Count (Gensim)
PyTorch Hidden Layer Size
Run all demos with figures (SVG + PDF + PNG):
python examples/demo_all.py
# Output: examples/figures/*.{svg,pdf,png}
Key Features
- 8 search strategies: grid, binary, golden section, ternary, Fibonacci, interpolation, exponential, predictive
- 15 adapters spanning NumPy, SciPy, scikit-learn, PyTorch, Gensim, and Pandas
- 3 model families: LDA, NMF, K-Means (plus GMM, PCA, DBSCAN, etc.)
- 5+5 quality metrics: general (silhouette, coherence c_v, perplexity, reconstruction, combined) + K-Means-specific (silhouette_knee, bic, combined, silhouette_drop)
- Adaptive strategy selection: automatically recommends best strategy + metric based on data profile (dimensionality, sparsity, separation)
- Multi-objective optimization: simultaneously optimize multiple metrics (e.g., silhouette + BIC) with Pareto frontier
- Ranked candidate set: returns multiple optimal K values (handles plateaus, ties, multi-optima)
- First O(log N) method in the model-agnostic + exact-K class: 59–79% fewer evaluations than grid search
- Predictive search with PCA hot-start: 76% reduction (7 evaluations vs 29 for grid)
- CLI and Python API
Installation
pip install atcnd
For PyTorch adapters:
pip install atcnd[torch]
From source:
git clone https://github.com/CodeOfMe/ATCND.git
cd ATCND
pip install -e .
Quick Start
Python API
from atcnd import ATCNDConfig, atcnd_search
# K-Means on numeric data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, n_features=50, centers=8, random_state=42)
config = ATCNDConfig(
k_min=2, k_max=30,
model_type="kmeans",
search_strategy="binary",
metric="silhouette",
n_candidates=3,
)
result = atcnd_search(X=X, config=config)
print(f"Optimal K: {result.optimal_k}")
print(f"Best score: {result.optimal_score:.4f}")
print(f"Top candidates: {result.candidate_ks}")
print(f"Evaluations: {len(result.search_history)}")
Low-level API (any callable)
from atcnd import search
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
def f(k):
km = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = km.fit_predict(X)
return silhouette_score(X, labels)
result = search(f, k_min=2, k_max=30, strategy="binary")
print(f"K* = {result.optimal_k}, evals = {len(result.search_history)}")
Adapter API (scikit-learn, PyTorch, Gensim, ...)
from atcnd import search_model, search_gmm_components, search_nmf_topics
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
# K-Means
r = search_model(KMeans, X, param_name="n_clusters", k_min=2, k_max=30, strategy="binary")
# GMM with BIC
r = search_gmm_components(X, k_min=2, k_max=15, strategy="binary")
# NMF topics with coherence
r = search_nmf_topics(texts, k_min=2, k_max=20, strategy="binary", metric="coherence")
# Predictive search with PCA hot-start
from atcnd import estimate_k_n_clusters
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
r = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)
Text data (LDA/NMF)
from atcnd import ATCNDConfig, atcnd_search
texts = ["your document texts here"] * 100
# NMF with coherence metric
config = ATCNDConfig(
k_min=2, k_max=20,
model_type="nmf",
search_strategy="binary",
metric="coherence",
)
result = atcnd_search(texts=texts, config=config)
Adaptive Strategy Selection
ATCND automatically recommends the best strategy + metric based on data characteristics (dimensionality, sparsity, cluster separation, intrinsic dimension):
from atcnd import adaptive_select, adaptive_search
# Get recommendation without running search
rec = adaptive_select(X, k_min=2, k_max=30)
print(f"Recommended: {rec.strategy} + {rec.metric} (confidence: {rec.confidence:.2f})")
print(f"Top 5: {[(r['strategy'], r['metric'], r['score']) for r in rec.all_recommendations[:5]}")
# Or run search directly with auto-selected strategy
result = adaptive_search(X, k_min=2, k_max=30)
print(f"K* = {result.optimal_k}, strategy = {result.strategy}")
CLI:
atcnd adaptive --k-min 2 --k-max 30
atcnd adaptive --k-min 2 --k-max 30 --run --json
Multi-Objective Optimization
Optimize multiple metrics simultaneously and find Pareto-optimal K values:
from atcnd import multi_objective_search
from sklearn.cluster import KMeans
# Silhouette + BIC with equal weights
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"], k_min=2, k_max=30)
print(f"Combined best: K* = {mo.optimal_k}")
print(f"Pareto frontier: {mo.pareto_ks}") # Non-dominated K values
# Custom weights (prioritize silhouette)
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"],
weights={"silhouette": 0.7, "bic": 0.3}, k_min=2, k_max=30)
# Three metrics
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic", "silhouette_drop"],
k_min=2, k_max=30)
CLI:
atcnd multi --metrics silhouette bic --k-min 2 --k-max 30
atcnd multi --metrics silhouette bic --weights 0.7 0.3 --json
CLI
# Search with K-Means on synthetic data
atcnd search --model kmeans --strategy binary --k-min 2 --k-max 30
# Search with NMF
atcnd search --model nmf --strategy golden_section --metric silhouette
# JSON output
atcnd search --model kmeans --json
# Run benchmarks
atcnd benchmark --dataset blobs --k-min 2 --k-max 30
# All 8 strategies comparison
atcnd benchmark --dataset blobs --k-min 2 --k-max 30 --all-strategies
Search Strategies Detail
Predictive Search
Predictive search uses PCA-based hot-start to estimate K* before any model evaluation, then applies parabolic peak fitting:
- PCA hot-start: Eigenvalue elbow method estimates K* from data structure
- Probing: Evaluate f(K̂−1), f(K̂), f(K̂+1)
- Parabolic peak fitting: Fit a parabola through the three best points and jump to the predicted peak
- Binary refinement: Narrow the search around the predicted peak
from atcnd import estimate_k_n_clusters, search
# PCA hot-start estimates K from eigenvalue elbow
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
# Predictive search uses hot_start to reduce evaluations
result = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)
Fibonacci Search
Classical optimal algorithm for discrete unimodal search (Kiefer 1953). Achieves minimum worst-case evaluations among all derivative-free methods on unimodal functions.
Exponential Search
Doubles the probe point (1, 2, 4, 8, ...) until f(K) decreases, then refines via binary search. Best when K* is near K_min.
Adapters
15 adapters wrap common model/parameter pairs with appropriate quality metrics:
| Library | Adapter | Parameter | Metric |
|---|---|---|---|
| sklearn | search_model |
n_clusters | silhouette |
| sklearn | search_neighbors |
n_neighbors | CV accuracy |
| NumPy | search_bins |
bins | AIC |
| sklearn | search_components |
n_components | variance |
| SciPy | search_knots |
internal knots | MSE |
| signal | search_window |
window | BIC |
| any | search_param |
any | user-defined |
| PyTorch | search_hidden |
hidden_dim | −CE |
| PyTorch | search_layers |
n_layers | −CE |
| sklearn | search_trees |
n_estimators | CV accuracy |
| sklearn | search_dbscan_eps |
eps | silhouette |
| sklearn | search_gmm_components |
n_components | BIC |
| Pandas | search_dataframe_bins |
bins | AIC |
| Pandas | search_rolling_window |
window | BIC |
| Gensim | search_nmf_topics |
n_topics | c_v coherence |
Quality Metrics
| Metric | LDA | NMF | K-Means | Description |
|---|---|---|---|---|
| Silhouette | Yes | Yes | Yes | Inter-cluster separation vs intra-cluster cohesion |
| Coherence (c_v) | Yes | Yes | No | Semantic coherence of top topic words |
| Perplexity | Yes | No | No | Negative log-likelihood per word |
| Reconstruction | No | Yes | Yes | Frobenius norm / inertia |
| Combined | Yes | Yes | No | 0.5 × silhouette + 0.5 × coherence |
Multiple Optima
K is a discrete integer parameter. Multiple values of K may achieve the same or nearly the same quality score. ATCND returns candidate_ks (ranked list) alongside optimal_k, handling plateaus, ties, and multiple local maxima.
Comparison with Baselines
ATCND is the first method in the model-agnostic + exact-K class to achieve O(log N) evaluations:
| Method | Model-agnostic? | Exact K*? | Evals |
|---|---|---|---|
| ATCND (all) | Yes | Yes | O(log N) |
| Grid Search | Yes | Yes | O(N) |
| HDP | No (LDA only) | No | 1 (costly) |
| Top2Vec | No | No | 1 |
| Black-box Opt. | No (LDA only) | Yes | Variable |
Real-World Results
| Dataset | Adapter | K* | Strategy | Evals | Reduction |
|---|---|---|---|---|---|
| Iris | search_model | 2 | binary | 5 | 64% |
| Wine | search_gmm | 2 | binary | 6 | 57% |
| Digits | search_components | 31 | binary | 10 | 84% |
| Moons | search_dbscan | 3 | binary | 10 | 64% |
| Wine | search_trees | 300 | binary | 37 | 87% |
| Iris | search_neighbors | 12 | binary | 9 | 70% |
Development
# Install in development mode
pip install -e ".[dev]"
# Run tests (75 tests)
python -m pytest tests/ -v
# Format code
black src/atcnd/ tests/
# Lint code
ruff check src/atcnd/ tests/
License
GNU General Public License v3.0 (GPLv3)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file atcnd-0.5.1.tar.gz.
File metadata
- Download URL: atcnd-0.5.1.tar.gz
- Upload date:
- Size: 55.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
77ab0fc78b3d5e9cb38015292952663027f988e1b42fa4825a7127f02dc0a709
|
|
| MD5 |
e42ec3a46000906946ab91a622b1acad
|
|
| BLAKE2b-256 |
e56ba0c47a8230c903f93b2ed5acbcaf6cc6d9387354036b5a9d4de45c9404ee
|
File details
Details for the file atcnd-0.5.1-py3-none-any.whl.
File metadata
- Download URL: atcnd-0.5.1-py3-none-any.whl
- Upload date:
- Size: 49.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c807311c1b2d47af0c84034cfb0f52096bb02949c674284616ffcaba5a198f95
|
|
| MD5 |
05bbba7eaea3eca7f885686708ada338
|
|
| BLAKE2b-256 |
8bcdac3bd4603b3e31ee0e99ce1da5c03958544d7ffccc4680542ef703c4df6e
|