Adaptive Topic and Cluster Number Determination via structured search

These details have not been verified by PyPI

Project links

Project description

ATCND

Adaptive Topic and Cluster Number Determination via Structured Search over Sliding Ranges

ATCND is a model-agnostic framework that determines the optimal number of topics (for LDA/NMF) or clusters (for K-Means) by treating K selection as a structured search problem over a user-specified integer range. Prior to ATCND, no model-agnostic method with correct K* achieved sub-linear evaluation count: exhaustive grid search (the SOTA baseline in the model-agnostic + exact-K class) requires O(K_max − K_min) model evaluations. ATCND applies eight search strategies to achieve O(log(K_max − K_min)) evaluations — 59–79% fewer than grid search while matching its K* accuracy.

8-Strategy Comparison

ATCND: All 8 search strategies on K-Means (K_true=8)

Benchmark on K-Means (SyntheticBlobs, K_true=8, range [2,30]):

Strategy	Complexity	Evals	vs Grid
Grid	O(N)	29	—
Binary	O(log N)	9	69%
Golden Section	O(log_φ N)	12	59%
Ternary	O(log_{1.5} N)	14	52%
Fibonacci	O(log_φ N)	10	66%
Interpolation	O(log log N)*	11	62%
Exponential	O(log K*)	10	66%
Predictive	O(1)+O(log Δ)	7	76%

All strategies recover K*=8 with identical silhouette score. Predictive search with PCA hot-start achieves the fewest evaluations.

Consistency Across Datasets

ATCND never catastrophically fails. Compare absolute errors |K* − K_true| across five benchmarks:

Method	Iris	Wine	Breast	Digits	Blobs	Avg	Max
ATCND-sil	1	1	0	1	0	0.6	1
Kneedle	2	2	2	1	0	1.4	2
X-Means	0	0	0	8	6	2.8	8
Gap Stat.	12	12	8	10	0	8.4	12
G-Means	12	4	8	10	22	11.2	22

Consistency analysis: ATCND never catastrophically fails

Real-World Demos

K-Means on Iris (3D)

K-Means on Iris (K*=2) Iris K-Means search curve

GMM on Wine (3D)

GMM on Wine (K*=2) Wine GMM search curve

PCA on Digits (3D)

PCA on Digits (K*=31 for 95% variance) Digits PCA cumulative variance

DBSCAN on Two-Moons

DBSCAN on Two-Moons DBSCAN eps search curve

Optimal Histogram Bins (AIC)

ATCND optimal bins vs Sturges rule

Smoothing Spline (SciPy)

Smoothing spline with ATCND-selected parameter

Rolling Window (Pandas)

Optimal rolling window on time series

NMF Topic Count (Gensim)

NMF topic search with c_v coherence

PyTorch Hidden Layer Size

PyTorch hidden layer size search

Run all demos with figures (SVG + PDF + PNG):

python examples/demo_all.py
# Output: examples/figures/*.{svg,pdf,png}

Key Features

8 search strategies: grid, binary, golden section, ternary, Fibonacci, interpolation, exponential, predictive
15 adapters spanning NumPy, SciPy, scikit-learn, PyTorch, Gensim, and Pandas
3 model families: LDA, NMF, K-Means (plus GMM, PCA, DBSCAN, etc.)
5+5 quality metrics: general (silhouette, coherence c_v, perplexity, reconstruction, combined) + K-Means-specific (silhouette_knee, bic, combined, silhouette_drop)
Adaptive strategy selection: automatically recommends best strategy + metric based on data profile (dimensionality, sparsity, separation)
Multi-objective optimization: simultaneously optimize multiple metrics (e.g., silhouette + BIC) with Pareto frontier
Ranked candidate set: returns multiple optimal K values (handles plateaus, ties, multi-optima)
First O(log N) method in the model-agnostic + exact-K class: 59–79% fewer evaluations than grid search
Predictive search with PCA hot-start: 76% reduction (7 evaluations vs 29 for grid)
CLI and Python API

Installation

pip install atcnd

For PyTorch adapters:

pip install atcnd[torch]

From source:

git clone https://github.com/CodeOfMe/ATCND.git
cd ATCND
pip install -e .

Quick Start

Python API

from atcnd import ATCNDConfig, atcnd_search

# K-Means on numeric data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, n_features=50, centers=8, random_state=42)

config = ATCNDConfig(
    k_min=2, k_max=30,
    model_type="kmeans",
    search_strategy="binary",
    metric="silhouette",
    n_candidates=3,
)
result = atcnd_search(X=X, config=config)

print(f"Optimal K: {result.optimal_k}")
print(f"Best score: {result.optimal_score:.4f}")
print(f"Top candidates: {result.candidate_ks}")
print(f"Evaluations: {len(result.search_history)}")

Low-level API (any callable)

from atcnd import search
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def f(k):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels)

result = search(f, k_min=2, k_max=30, strategy="binary")
print(f"K* = {result.optimal_k}, evals = {len(result.search_history)}")

Adapter API (scikit-learn, PyTorch, Gensim, ...)

from atcnd import search_model, search_gmm_components, search_nmf_topics
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# K-Means
r = search_model(KMeans, X, param_name="n_clusters", k_min=2, k_max=30, strategy="binary")

# GMM with BIC
r = search_gmm_components(X, k_min=2, k_max=15, strategy="binary")

# NMF topics with coherence
r = search_nmf_topics(texts, k_min=2, k_max=20, strategy="binary", metric="coherence")

# Predictive search with PCA hot-start
from atcnd import estimate_k_n_clusters
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
r = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Text data (LDA/NMF)

from atcnd import ATCNDConfig, atcnd_search

texts = ["your document texts here"] * 100

# NMF with coherence metric
config = ATCNDConfig(
    k_min=2, k_max=20,
    model_type="nmf",
    search_strategy="binary",
    metric="coherence",
)
result = atcnd_search(texts=texts, config=config)

Adaptive Strategy Selection

ATCND automatically recommends the best strategy + metric based on data characteristics (dimensionality, sparsity, cluster separation, intrinsic dimension):

from atcnd import adaptive_select, adaptive_search

# Get recommendation without running search
rec = adaptive_select(X, k_min=2, k_max=30)
print(f"Recommended: {rec.strategy} + {rec.metric} (confidence: {rec.confidence:.2f})")
print(f"Top 5: {[(r['strategy'], r['metric'], r['score']) for r in rec.all_recommendations[:5]}")

# Or run search directly with auto-selected strategy
result = adaptive_search(X, k_min=2, k_max=30)
print(f"K* = {result.optimal_k}, strategy = {result.strategy}")

CLI:

atcnd adaptive --k-min 2 --k-max 30
atcnd adaptive --k-min 2 --k-max 30 --run --json

Multi-Objective Optimization

Optimize multiple metrics simultaneously and find Pareto-optimal K values:

from atcnd import multi_objective_search
from sklearn.cluster import KMeans

# Silhouette + BIC with equal weights
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"], k_min=2, k_max=30)
print(f"Combined best: K* = {mo.optimal_k}")
print(f"Pareto frontier: {mo.pareto_ks}")  # Non-dominated K values

# Custom weights (prioritize silhouette)
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"],
                              weights={"silhouette": 0.7, "bic": 0.3}, k_min=2, k_max=30)

# Three metrics
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic", "silhouette_drop"],
                              k_min=2, k_max=30)

CLI:

atcnd multi --metrics silhouette bic --k-min 2 --k-max 30
atcnd multi --metrics silhouette bic --weights 0.7 0.3 --json

CLI

# Search with K-Means on synthetic data
atcnd search --model kmeans --strategy binary --k-min 2 --k-max 30

# Search with NMF
atcnd search --model nmf --strategy golden_section --metric silhouette

# JSON output
atcnd search --model kmeans --json

# Run benchmarks
atcnd benchmark --dataset blobs --k-min 2 --k-max 30

# All 8 strategies comparison
atcnd benchmark --dataset blobs --k-min 2 --k-max 30 --all-strategies

Search Strategies Detail

Predictive Search

Predictive search uses PCA-based hot-start to estimate K* before any model evaluation, then applies parabolic peak fitting:

PCA hot-start: Eigenvalue elbow method estimates K* from data structure
Probing: Evaluate f(K̂−1), f(K̂), f(K̂+1)
Parabolic peak fitting: Fit a parabola through the three best points and jump to the predicted peak
Binary refinement: Narrow the search around the predicted peak

from atcnd import estimate_k_n_clusters, search

# PCA hot-start estimates K from eigenvalue elbow
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)

# Predictive search uses hot_start to reduce evaluations
result = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Fibonacci Search

Classical optimal algorithm for discrete unimodal search (Kiefer 1953). Achieves minimum worst-case evaluations among all derivative-free methods on unimodal functions.

Exponential Search

Doubles the probe point (1, 2, 4, 8, ...) until f(K) decreases, then refines via binary search. Best when K* is near K_min.

Adapters

15 adapters wrap common model/parameter pairs with appropriate quality metrics:

Library	Adapter	Parameter	Metric
sklearn	`search_model`	n_clusters	silhouette
sklearn	`search_neighbors`	n_neighbors	CV accuracy
NumPy	`search_bins`	bins	AIC
sklearn	`search_components`	n_components	variance
SciPy	`search_knots`	internal knots	MSE
signal	`search_window`	window	BIC
any	`search_param`	any	user-defined
PyTorch	`search_hidden`	hidden_dim	−CE
PyTorch	`search_layers`	n_layers	−CE
sklearn	`search_trees`	n_estimators	CV accuracy
sklearn	`search_dbscan_eps`	eps	silhouette
sklearn	`search_gmm_components`	n_components	BIC
Pandas	`search_dataframe_bins`	bins	AIC
Pandas	`search_rolling_window`	window	BIC
Gensim	`search_nmf_topics`	n_topics	c_v coherence

Quality Metrics

Metric	LDA	NMF	K-Means	Description
Silhouette	Yes	Yes	Yes	Inter-cluster separation vs intra-cluster cohesion
Coherence (c_v)	Yes	Yes	No	Semantic coherence of top topic words
Perplexity	Yes	No	No	Negative log-likelihood per word
Reconstruction	No	Yes	Yes	Frobenius norm / inertia
Combined	Yes	Yes	No	0.5 × silhouette + 0.5 × coherence

Multiple Optima

K is a discrete integer parameter. Multiple values of K may achieve the same or nearly the same quality score. ATCND returns candidate_ks (ranked list) alongside optimal_k, handling plateaus, ties, and multiple local maxima.

Comparison with Baselines

ATCND is the first method in the model-agnostic + exact-K class to achieve O(log N) evaluations:

Method	Model-agnostic?	Exact K*?	Evals
ATCND (all)	Yes	Yes	O(log N)
Grid Search	Yes	Yes	O(N)
HDP	No (LDA only)	No	1 (costly)
Top2Vec	No	No	1
Black-box Opt.	No (LDA only)	Yes	Variable

Real-World Results

Dataset	Adapter	K*	Strategy	Evals	Reduction
Iris	search_model	2	binary	5	64%
Wine	search_gmm	2	binary	6	57%
Digits	search_components	31	binary	10	84%
Moons	search_dbscan	3	binary	10	64%
Wine	search_trees	300	binary	37	87%
Iris	search_neighbors	12	binary	9	70%

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests (75 tests)
python -m pytest tests/ -v

# Format code
black src/atcnd/ tests/

# Lint code
ruff check src/atcnd/ tests/

License

GNU General Public License v3.0 (GPLv3)

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.5.3

Jun 4, 2026

0.5.2

Jun 3, 2026

This version

0.5.1

Jun 3, 2026

0.5.0

Jun 3, 2026

0.4.0

Jun 3, 2026

0.3.0

Jun 3, 2026

0.2.0

Jun 3, 2026

0.1.0

Jun 3, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

atcnd-0.5.1.tar.gz (55.8 kB view details)

Uploaded Jun 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

atcnd-0.5.1-py3-none-any.whl (49.7 kB view details)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file atcnd-0.5.1.tar.gz.

File metadata

Download URL: atcnd-0.5.1.tar.gz
Upload date: Jun 3, 2026
Size: 55.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`77ab0fc78b3d5e9cb38015292952663027f988e1b42fa4825a7127f02dc0a709`
MD5	`e42ec3a46000906946ab91a622b1acad`
BLAKE2b-256	`e56ba0c47a8230c903f93b2ed5acbcaf6cc6d9387354036b5a9d4de45c9404ee`

See more details on using hashes here.

File details

Details for the file atcnd-0.5.1-py3-none-any.whl.

File metadata

Download URL: atcnd-0.5.1-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 49.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for atcnd-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c807311c1b2d47af0c84034cfb0f52096bb02949c674284616ffcaba5a198f95`
MD5	`05bbba7eaea3eca7f885686708ada338`
BLAKE2b-256	`8bcdac3bd4603b3e31ee0e99ce1da5c03958544d7ffccc4680542ef703c4df6e`

See more details on using hashes here.

atcnd 0.5.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ATCND

8-Strategy Comparison

Consistency Across Datasets

Real-World Demos

K-Means on Iris (3D)

GMM on Wine (3D)

PCA on Digits (3D)

DBSCAN on Two-Moons

Optimal Histogram Bins (AIC)

Smoothing Spline (SciPy)

Rolling Window (Pandas)

NMF Topic Count (Gensim)

PyTorch Hidden Layer Size

Key Features

Installation

Quick Start

Python API

Low-level API (any callable)

Adapter API (scikit-learn, PyTorch, Gensim, ...)

Text data (LDA/NMF)

Adaptive Strategy Selection

Multi-Objective Optimization

CLI

Search Strategies Detail

Predictive Search

Fibonacci Search

Exponential Search

Adapters

Quality Metrics

Multiple Optima

Comparison with Baselines

Real-World Results

Development

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes