BICO is a fast streaming algorithm to compute coresets for the k-means problem on very large sets of points.
BICO
BICO is a fast streaming algorithm to compute high quality solutions for the k-means problem on very large sets of points. It combines the tree data structure of BIRCH, a SIGMOD Test of Time Award winning algorithm, with insights from clustering theory to obtain solutions quickly while keeping the error with respect to the k-means cost function low.
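To give an intuition for the kind of summary a BIRCH-style tree maintains, here is a minimal sketch of a clustering feature: the point count, the per-dimension linear sum, and the sum of squared norms. From these three quantities the centroid and the k-means cost of the summarized points to that centroid can be recovered without storing the points. The class and method names are hypothetical and chosen for illustration; this is not the package's internal implementation.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Illustrative BIRCH-style summary: count, linear sum, sum of squared norms."""

    n: int = 0
    linear_sum: List[float] = field(default_factory=list)
    squared_sum: float = 0.0

    def insert(self, point: Sequence[float]) -> None:
        # Lazily size the linear sum on the first inserted point.
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(point)
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
        self.squared_sum += sum(x * x for x in point)

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def cost_to_centroid(self) -> float:
        # sum ||p - c||^2 = SS - ||LS||^2 / n, where c is the centroid.
        return self.squared_sum - sum(s * s for s in self.linear_sum) / self.n


points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cf = ClusteringFeature()
for p in points:
    cf.insert(p)

print(cf.centroid())          # [1.0, 1.0]
print(cf.cost_to_centroid())  # 8.0
```

Because each summary only needs constant memory per dimension, points can be absorbed one at a time, which is what makes this kind of structure suitable for streaming.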
You can try BICO out on our Clustering Toolkit!
Installation
pip install bico
Example
from bico import BICO
import numpy as np
import time
np.random.seed(42)
data = np.random.rand(10000, 10)
start = time.time()
bico = BICO(n_clusters=3, random_state=0, fit_coreset=True)
bico.fit(data)
print("Time:", time.time() - start)
# Time: 0.08275651931762695
# BICO returns a set of points that act as a summary of the entire dataset.
# By default, at most 200 * n_clusters points are returned.
# This behaviour can be changed by setting the `summary_size` parameter.
print(bico.coreset_points_)
# [[0.45224018 0.70183673 0.55506671 ... 0.70132665 0.57244196 0.66789088]
# [0.73712952 0.5250208 0.43809322 ... 0.61427161 0.67910981 0.56207661]
# [0.89905336 0.46942062 0.20677639 ... 0.74210482 0.75714522 0.49651055]
# ...
# [0.68744494 0.41508081 0.39197623 ... 0.44093386 0.21983902 0.37237243]
# [0.60820965 0.29406341 0.67067782 ... 0.66435474 0.2390822 0.20070476]
# [0.67385626 0.33474823 0.68238779 ... 0.3581703 0.65646253 0.41386131]]
# If the `fit_coreset` parameter is set to True, the cluster centers are
# computed using KMeans from sklearn based on the coreset.
print(bico.cluster_centers_)
# [[0.46892639 0.41968333 0.47302945 0.51782955 0.39390839 0.56209413
# 0.4481691 0.49521457 0.31394509 0.5104331 ]
# [0.54384638 0.518978 0.49456809 0.56677848 0.63881783 0.33627504
# 0.49873782 0.5541338 0.52913562 0.56017203]
# [0.48639347 0.55542596 0.54350474 0.41931257 0.48117255 0.60089563
# 0.55457724 0.44833238 0.67583389 0.43069267]]
Example with Large Datasets
For very large datasets, the data may not actually fit in memory. In this case, you can use partial_fit to stream the data in chunks. In this example, we use the US Census Data (1990) dataset. You can find more examples in the tests folder.
from bico import BICO
import pandas as pd
import time
start = time.time()
bico = BICO(n_clusters=3, random_state=0)
# Read the file in chunks of 10,000 rows so it never has to fit in memory at once
for chunk in pd.read_csv(
    "census.txt", delimiter=",", header=None, chunksize=10000
):
    bico.partial_fit(chunk.to_numpy(copy=False))
# If a final `partial_fit` is called with no data, the coreset is computed
bico.partial_fit()
print("Time:", time.time() - start)
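If pandas is not available, the chunked reading above can be approximated with the standard library alone. The following is a minimal sketch under that assumption: `iter_chunks` is a hypothetical helper (not part of the bico package) that yields lists of at most `chunksize` numeric rows from a comma-separated file without a header.

```python
import csv
from itertools import islice
from typing import Iterator, List


def iter_chunks(path: str, chunksize: int = 10000) -> Iterator[List[List[float]]]:
    """Yield lists of at most `chunksize` numeric rows from a comma-separated file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=",")
        while True:
            # islice pulls the next `chunksize` rows without reading the whole file.
            rows = [[float(value) for value in row] for row in islice(reader, chunksize)]
            if not rows:
                return
            yield rows


# Possible usage with BICO (assumes numpy and bico are installed):
# for chunk in iter_chunks("census.txt"):
#     bico.partial_fit(np.asarray(chunk))
# bico.partial_fit()
```

Only one chunk is held in memory at a time, so peak memory is governed by `chunksize` rather than by the size of the file.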
Development
Install poetry
curl -sSL https://install.python-poetry.org | python3 -
Install clang
sudo apt-get install clang
Set clang variables
export CXX=/usr/bin/clang++
export CC=/usr/bin/clang
Install the package
poetry install
If the installation fails and you do not see the C++ compiler output, build the package to see the full stack trace
poetry build
Run the tests
poetry run python -m unittest discover tests -v
Citation
If you use this code, please cite the following paper:
H. Fichtenberger, M. Gillé, M. Schmidt, C. Schwiegelshohn, and C. Sohler, "BICO: BIRCH Meets Coresets for k-Means Clustering," in Lecture Notes in Computer Science, 2013, pp. 481–492. doi: 10.1007/978-3-642-40450-4_41.
Download files
File details
Details for the file bico-0.1.3.tar.gz.
File metadata
- Download URL: bico-0.1.3.tar.gz
- Upload date:
- Size: 39.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.10.19 Linux/6.11.0-1018-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 806a5c27bd88ee6872da87acf1a466e3ad2d9d725b10c7a1830dc8cf16302fb0 |
| MD5 | 4b15d8022b548827e4a3ae8be34bd01c |
| BLAKE2b-256 | 15c85d08a09240d77f66b6a2141ddd6b95dfc4e1ff7fb7622069c3385f4c6f63 |
File details
Details for the file bico-0.1.3-cp310-cp310-manylinux_2_39_x86_64.whl.
File metadata
- Download URL: bico-0.1.3-cp310-cp310-manylinux_2_39_x86_64.whl
- Upload date:
- Size: 568.1 kB
- Tags: CPython 3.10, manylinux: glibc 2.39+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.10.19 Linux/6.11.0-1018-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b988205d60835112e1df6fdb8aa0d72d208e594b0695f8910ba1da3080d9f4f9 |
| MD5 | 240a4a4eab911e47f0a2093f86e9a2e8 |
| BLAKE2b-256 | c7f7cf3891b3162ea261514ecb9947763276573338c16047d32cba03349e2a2b |
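A downloaded file can be checked against the digests listed for it. The sketch below uses only Python's `hashlib`; `file_digests` is a hypothetical helper name, and BLAKE2b-256 is assumed to mean BLAKE2b with a 32-byte digest.

```python
import hashlib
from typing import Dict


def file_digests(path: str, block_size: int = 1 << 20) -> Dict[str, str]:
    """Return hex digests of a file, read in blocks to keep memory use small."""
    sha256 = hashlib.sha256()
    blake2b_256 = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            sha256.update(block)
            blake2b_256.update(block)
    return {"sha256": sha256.hexdigest(), "blake2b_256": blake2b_256.hexdigest()}


# Possible usage (assumes the source distribution was downloaded to the
# current directory):
# expected = "806a5c27bd88ee6872da87acf1a466e3ad2d9d725b10c7a1830dc8cf16302fb0"
# assert file_digests("bico-0.1.3.tar.gz")["sha256"] == expected
```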