cake-ensemble

CAKE: Confidence in Assignments via K-partition Ensembles

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

a.semoglou

These details have not been verified by PyPI

Project description

CAKE: Confidence in Assignments via K-partition Ensembles

Accepted for publication in Machine Learning with Applications.

Clustering assigns each data point to a group, but it does not tell us how reliable that assignment is.
While global validation metrics assess overall quality, they provide little insight into the trustworthiness of individual assignments.

Ensemble-based clustering improves robustness by aggregating multiple partitions. However, most ensemble-style uncertainty measures focus primarily on cross-run agreement (e.g., aligned vote counts or dispersion of labels across runs). Agreement alone is not sufficient: a point may be consistently assigned due to systematic bias or rigid decision boundaries, even if it is weakly integrated into the cluster structure.

Conversely, purely geometric signals such as the Silhouette score evaluate local separation within a single run but ignore cross-run instability. A point may appear geometrically well placed in one partition, yet switch clusters across runs because it lies near a boundary or because multiple partitions explain it nearly equally well.

Assignment Stability vs Geometric Consistency

These complementary failure modes motivate a confidence signal that jointly accounts for both assignment stability and consistency of geometric support at the point level.

CAKE (Confidence in Assignments via K-partition Ensembles) provides a principled, per-instance confidence score by fusing two complementary statistics computed over a clustering ensemble:

🔁 Assignment Stability: pairwise agreement across runs after optimal label alignment, using the Hungarian algorithm.
📐 Geometric Consistency: aggregated Silhouette statistics across runs.

These components are combined into a single, interpretable score in [0, 1], enabling:

Identification of stable core members.
Detection of ambiguous boundary points.
Filtering of unreliable assignments.
Ranking of instances by clustering confidence.

CAKE scores on synthetic data. Red ↑; Blue ↓.

Citation

If you find this work useful, please consider citing:

Semoglou, A., & Pavlopoulos, J. (2026).
Assigning Confidence: K-partition Ensembles.
Preprint: https://arxiv.org/abs/2602.18435

@misc{semoglou2026assigningconfidencekpartitionensembles,
   title={Assigning Confidence: K-partition Ensembles},
   author={Aggelos Semoglou and John Pavlopoulos},
   year={2026},
   eprint={2602.18435},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2602.18435},
}

Installation

Install CAKE from PyPI:

pip install cake-ensemble

Ιmport the main functions in Python as:

from cake_ensemble import (
    sil_samples,
    sil_samples_stats,
    align_labels,
    pairwise_stability,
    cake,
    consensus_labels,
    kmeans_ensemble,
    cake_with_consensus,
)

API Reference

CAKE provides utilities for computing sample-level silhouette statistics, aligning clustering labels, measuring assignment stability, computing CAKE confidence scores, deriving consensus labels, and building KMeans ensembles.

`sil_samples`

Computes silhouette scores for each sample, either exactly using scikit-learn or approximately using distances to cluster centroids.

sil_samples(X, labels, approximation=False, centers=None)

Inputs

X: array-like of shape (n_samples, n_features)
Input data matrix.
labels: array-like of shape (n_samples,)
Cluster label assigned to each sample.
approximation: bool, default False
If False, computes exact silhouette scores.
If True, computes a faster centroid-based approximation.
centers: array-like of shape (n_clusters, n_features) or None, default None
Optional cluster centers used when approximation=True.
If not provided, centers are computed from X and labels.

Returns

silhouette_scores: array of shape (n_samples,)
Silhouette score for each sample.

`sil_samples_stats`

Computes sample-level silhouette statistics across multiple clustering runs.

sil_samples_stats(X, labels_list, approximation=False, centers_list=None)

Inputs

X: array-like of shape (n_samples, n_features)
Input data matrix.
labels_list: list of array-like, each of shape (n_samples,)
List of label vectors from multiple clustering runs.
approximation: bool, default False
Whether to use centroid-based approximate silhouette scores.
centers_list: list of array-like or None, default None
Optional list of cluster-center arrays, one per clustering run.
Used when approximation=True.

Returns

mean_silhouette: array of shape (n_samples,)
Mean silhouette score of each sample across the ensemble.
std_silhouette: array of shape (n_samples,)
Standard deviation of each sample’s silhouette score across the ensemble.

`align_labels`

Aligns the labels of one clustering solution to another using the Hungarian algorithm.

align_labels(target, source)

Inputs

target: array-like of shape (n_samples,)
Reference label vector.
source: array-like of shape (n_samples,)
Label vector to be remapped so that it best matches target.

Returns

aligned_labels: array of shape (n_samples,)
The source labels remapped to match the target label space.

`pairwise_stability`

Computes pointwise assignment stability across all pairs of clustering runs.

pairwise_stability(labels_runs)

Inputs

labels_runs: list of array-like, each of shape (n_samples,)
List of cluster label vectors from multiple clustering runs.

Returns

assignment_stability: array of shape (n_samples,)
For each sample, the fraction of run-pairs where the sample receives the same aligned label.

`cake`

Computes CAKE confidence scores for each sample in a clustering ensemble.

cake(X, labels_list, method='product', approximation=False, centers_list=None, geom_norm='clip')

Inputs

X: array-like of shape (n_samples, n_features)
Input data matrix.
labels_list: list of array-like, each of shape (n_samples,)
List of cluster label vectors from multiple clustering runs.
method: str, default 'product'
Method used to combine assignment stability and geometric stability.
Options: 'product' or 'harmonic_mean'.
approximation: bool, default False
Whether to use centroid-based approximate silhouette scores.
centers_list: list of array-like or None, default None
Optional list of cluster centers, one per clustering run.
Used when approximation=True.
geom_norm: str, default 'clip'
Strategy for mapping the geometric component to [0, 1].
'clip' uses max(mean silhouette − std silhouette, 0).
'affine' maps the raw value from [-1, 1] to [0, 1].

Returns

cake_scores: array of shape (n_samples,)
Final CAKE confidence score for each sample.
assignment_stability: array of shape (n_samples,)
Pairwise label-agreement stability across clustering runs.
geometric_stability: array of shape (n_samples,)
Silhouette-based geometric reliability score.
summary: pandas DataFrame
Per-sample table containing mean silhouette, silhouette standard deviation, geometric stability, assignment stability, and CAKE score.

`consensus_labels`

Computes consensus clustering labels from multiple clustering runs using label alignment and majority vote.

consensus_labels(labels_list, method='medoid')

Inputs

labels_list: list of array-like, each of shape (n_samples,)
List of cluster label vectors from multiple clustering runs.
method: str, default 'medoid'
Consensus strategy.
'medoid' selects the most representative run as reference, aligns all runs to it, and computes majority-vote labels.
'best_ref' tries every run as reference and keeps the one with the highest average consensus strength.

Returns

consensus: array of shape (n_samples,)
Consensus cluster label for each sample.
consensus_strength: array of shape (n_samples,)
Fraction of aligned runs voting for the selected consensus label.
reference_index: int
Index of the selected reference run.

Note: majority-vote consensus does not guarantee that all reference cluster labels appear in the final consensus.

`kmeans_ensemble`

Builds a KMeans clustering ensemble by running KMeans multiple times with different random seeds.

kmeans_ensemble(
    X,
    n_clusters,
    n_runs=50,
    random_state=1,
    init='random',
    n_init=1,
    max_iter=300,
    tol=1e-4,
    algorithm='lloyd',
    return_models=False
)

Inputs

X: array-like of shape (n_samples, n_features)
Input data matrix.
n_clusters: int
Number of clusters for each KMeans run.
n_runs: int, default 50
Number of KMeans runs in the ensemble.
random_state: int or None, default 1
Base random seed used to generate run-specific seeds.
init: str or array-like, default 'random'
KMeans initialization strategy.
n_init: int or 'auto', default 1
Number of initializations per KMeans run.
For ensemble diversity, n_init=1 is recommended.
max_iter: int, default 300
Maximum number of iterations for each KMeans run.
tol: float, default 1e-4
Convergence tolerance.
algorithm: str, default 'lloyd'
KMeans algorithm.
return_models: bool, default False
If True, also returns the fitted KMeans models.

Returns

labels_list: list of arrays
One label vector per KMeans run. Each array has shape (n_samples,).
centers_list: list of arrays
Cluster centers for each KMeans run.
ensemble_summary: pandas DataFrame
Per-run metadata including run index, seed, inertia, and number of iterations.
models: list of fitted KMeans models
Returned only when return_models=True.

`cake_with_consensus`

Computes CAKE confidence scores together with consensus clustering labels.

cake_with_consensus(
    X,
    labels_list,
    cake_method='product',
    consensus_method='medoid',
    approximation=False,
    centers_list=None,
    geom_norm='clip'
)

Inputs

X: array-like of shape (n_samples, n_features)
Input data matrix.
labels_list: list of array-like, each of shape (n_samples,)
List of cluster label vectors from multiple clustering runs.
cake_method: str, default 'product'
Method used to combine assignment stability and geometric stability.
Options: 'product' or 'harmonic_mean'.
consensus_method: str, default 'medoid'
Consensus strategy passed to consensus_labels.
Options: 'medoid' or 'best_ref'.
approximation: bool, default False
Whether to use centroid-based approximate silhouette scores.
centers_list: list of array-like or None, default None
Optional list of cluster centers, one per clustering run.
Used when approximation=True.
geom_norm: str, default 'clip'
Strategy for normalizing the geometric stability component.
Options: 'clip' or 'affine'.

Returns

cake_scores: array of shape (n_samples,)
Final CAKE confidence score for each sample.
consensus: array of shape (n_samples,)
Consensus cluster label for each sample.
consensus_strength: array of shape (n_samples,)
Fraction of aligned runs voting for the consensus label.
assignment_stability: array of shape (n_samples,)
Pairwise assignment stability across the ensemble.
geometric_stability: array of shape (n_samples,)
Silhouette-based geometric reliability score.
summary: pandas DataFrame
Per-sample table containing consensus labels, consensus strength, CAKE components, and final CAKE score.
reference_index: int
Index of the reference run selected by the consensus method.

Quick Start

This example builds a KMeans ensemble and then computes CAKE confidence scores using cake.

from sklearn.datasets import make_blobs
from cake_ensemble import kmeans_ensemble, cake

# Create example data
X, _ = make_blobs(
    n_samples=300,
    centers=3,
    n_features=2,
    cluster_std=1.0,
    random_state=42,
)

# Build a KMeans ensemble
labels_list, centers_list, ensemble_summary = kmeans_ensemble(
    X,
    n_clusters=3,
    n_runs=50,
    random_state=1,
)

# Compute CAKE scores
cake_scores, assignment_stability, geometric_stability, summary = cake(
    X,
    labels_list,
    method="product",
    approximation=True,
    centers_list=centers_list,
    geom_norm="clip",
)

print(summary.head())

The returned summary is a pandas DataFrame with one row per data point:

	Mean Silhouette	STD Silhouette	Geometric Stability	Assignment Stability	CAKE
0	0.871807	0.176727	0.695080	0.865306	0.601457
1	0.886966	0.169302	0.717664	0.800000	0.574131
2	0.806492	0.023170	0.783321	0.832653	0.652235
3	0.842851	0.069890	0.772961	0.817143	0.631619
4	0.771558	0.196408	0.575151	0.800000	0.460120

The CAKE column contains the final confidence score for each clustering assignment. Higher values indicate assignments that are both stable across the ensemble and geometrically well supported.

License

This project is licensed under the MIT License.

Project details

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

a.semoglou

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.3

May 16, 2026

0.1.2

May 13, 2026

0.1.1

May 10, 2026

This version

0.1.0

May 10, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cake_ensemble-0.1.0.tar.gz (15.1 kB view details)

Uploaded May 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cake_ensemble-0.1.0-py3-none-any.whl (12.3 kB view details)

Uploaded May 10, 2026 Python 3

File details

Details for the file cake_ensemble-0.1.0.tar.gz.

File metadata

Download URL: cake_ensemble-0.1.0.tar.gz
Upload date: May 10, 2026
Size: 15.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cake_ensemble-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0ecbab5246d3c5a04b99e67283628f2808b1b620187325a3aa91db267c50f2de`
MD5	`b0f66632859bc49a0d3cb982a7c5b66f`
BLAKE2b-256	`edfab48965fd5e845f79cd9d37608ae1dce5632d433af1a9682e276aeffb6b03`

See more details on using hashes here.

Provenance

The following attestation bundles were made for cake_ensemble-0.1.0.tar.gz:

Publisher: python-publish.yml on semoglou/cake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cake_ensemble-0.1.0.tar.gz
- Subject digest: 0ecbab5246d3c5a04b99e67283628f2808b1b620187325a3aa91db267c50f2de
- Sigstore transparency entry: 1494403598
- Sigstore integration time: May 10, 2026
Source repository:
- Permalink: semoglou/cake@72f7af571232a9686c34794f187bc4fd98ab6c95
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/semoglou
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@72f7af571232a9686c34794f187bc4fd98ab6c95
- Trigger Event: release

File details

Details for the file cake_ensemble-0.1.0-py3-none-any.whl.

File metadata

Download URL: cake_ensemble-0.1.0-py3-none-any.whl
Upload date: May 10, 2026
Size: 12.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cake_ensemble-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`deb46415d0704a10381ff0f291635a95b85e3588cc123581a0640d000e658dcb`
MD5	`9feb72b9de4399a2c8389c20034a979e`
BLAKE2b-256	`dba945af2eaed60a8368c63d86faf64c53c6815275d4447407fac2628bfd1c54`

See more details on using hashes here.

Provenance

The following attestation bundles were made for cake_ensemble-0.1.0-py3-none-any.whl:

Publisher: python-publish.yml on semoglou/cake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cake_ensemble-0.1.0-py3-none-any.whl
- Subject digest: deb46415d0704a10381ff0f291635a95b85e3588cc123581a0640d000e658dcb
- Sigstore transparency entry: 1494404575
- Sigstore integration time: May 10, 2026
Source repository:
- Permalink: semoglou/cake@72f7af571232a9686c34794f187bc4fd98ab6c95
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/semoglou
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@72f7af571232a9686c34794f187bc4fd98ab6c95
- Trigger Event: release

cake-ensemble 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

CAKE: Confidence in Assignments via K-partition Ensembles

Citation

Installation

API Reference

sil_samples

sil_samples_stats

align_labels

pairwise_stability

cake

consensus_labels

kmeans_ensemble

cake_with_consensus

Quick Start

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

`sil_samples`

`sil_samples_stats`

`align_labels`

`pairwise_stability`

`cake`

`consensus_labels`

`kmeans_ensemble`

`cake_with_consensus`