BICO is a fast streaming algorithm to compute coresets for the k-means problem on very large sets of points.

License: GPL v3

BICO

BICO is a fast streaming algorithm to compute high-quality solutions for the k-means problem on very large sets of points. It combines the tree data structure of BIRCH, a SIGMOD Test of Time Award-winning algorithm, with insights from clustering theory to obtain solutions quickly while keeping the error with respect to the k-means cost function low.

You can try BICO out on our Clustering Toolkit!

Installation

pip install bico

Example

from bico import BICO
import numpy as np
import time

np.random.seed(42)

data = np.random.rand(10000, 10)

start = time.time()
bico = BICO(n_clusters=3, random_state=0, fit_coreset=True)
bico.fit(data)

print("Time:", time.time() - start)
# Time: 0.08275651931762695

print(bico.coreset_points_)
# BICO returns a set of points that act as a summary of the entire dataset.
# By default, at most 200 * n_clusters points are returned.
# This behaviour can be changed by setting the `summary_size` parameter.

# [[0.45224018 0.70183673 0.55506671 ... 0.70132665 0.57244196 0.66789088]
#  [0.73712952 0.5250208  0.43809322 ... 0.61427161 0.67910981 0.56207661]
#  [0.89905336 0.46942062 0.20677639 ... 0.74210482 0.75714522 0.49651055]
#  ...
#  [0.68744494 0.41508081 0.39197623 ... 0.44093386 0.21983902 0.37237243]
#  [0.60820965 0.29406341 0.67067782 ... 0.66435474 0.2390822  0.20070476]
#  [0.67385626 0.33474823 0.68238779 ... 0.3581703  0.65646253 0.41386131]]

print(bico.cluster_centers_)
# If the `fit_coreset` parameter is set to True, the cluster centers are computed using KMeans from sklearn based on the coreset.

# [[0.46892639 0.41968333 0.47302945 0.51782955 0.39390839 0.56209413
#   0.4481691  0.49521457 0.31394509 0.5104331 ]
#  [0.54384638 0.518978   0.49456809 0.56677848 0.63881783 0.33627504
#   0.49873782 0.5541338  0.52913562 0.56017203]
#  [0.48639347 0.55542596 0.54350474 0.41931257 0.48117255 0.60089563
#   0.55457724 0.44833238 0.67583389 0.43069267]]
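
One way to check how well a set of centers summarises the data is to evaluate the k-means cost, i.e. the sum of squared distances from each point to its nearest center. The following is a minimal NumPy sketch of that computation. `kmeans_cost` is a hypothetical helper written for illustration and is not part of the bico package.

```python
import numpy as np


def kmeans_cost(points, centers, weights=None):
    """Sum of (optionally weighted) squared distances to the nearest center.

    Hypothetical helper for illustration; not part of the bico API.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise differences, shape (n_points, n_centers, n_features).
    diffs = points[:, None, :] - centers[None, :, :]
    # Squared Euclidean distances, shape (n_points, n_centers).
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    nearest = sq_dists.min(axis=1)
    if weights is None:
        return float(nearest.sum())
    return float(np.dot(np.asarray(weights, dtype=float), nearest))


# Tiny hand-checkable example: three points on a line, two centers.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
cost = kmeans_cost(points, centers)
print(cost)  # 1.0
```

With the example above, one could call `kmeans_cost(data, bico.cluster_centers_)` and compare the result against the cost of centers obtained by running k-means on the full dataset. The pairwise array grows with the number of points times the number of centers, so for very large inputs the cost would need to be accumulated chunk by chunk.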

Example with Large Datasets

For very large datasets, the data may not actually fit in memory. In this case, you can use partial_fit to stream the data in chunks. In this example, we use the US Census Data (1990) dataset. You can find more examples in the tests folder.

from bico import BICO
import pandas as pd

bico = BICO(n_clusters=3, random_state=0)
# Read the CSV in chunks of 10,000 rows so the whole file never has to be in memory.
for chunk in pd.read_csv(
    "census.txt", delimiter=",", header=None, chunksize=10000
):
    bico.partial_fit(chunk.to_numpy(copy=False))
# A final call to `partial_fit` with no data computes the coreset.
bico.partial_fit()
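
When the data is not in a CSV file, the same streaming pattern only needs something that yields one chunk at a time. The sketch below shows a small chunk iterator over any sliceable sequence. `iter_chunks` is a hypothetical helper for illustration, not something provided by bico.

```python
def iter_chunks(data, chunk_size):
    """Yield consecutive slices of `data` with at most `chunk_size` rows each.

    Hypothetical helper for illustration; works on lists and NumPy arrays alike.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# Ten rows split into chunks of at most four rows.
rows = list(range(10))
chunk_lengths = [len(chunk) for chunk in iter_chunks(rows, 4)]
print(chunk_lengths)  # [4, 4, 2]
```

Assuming `data` is a two-dimensional NumPy array (for instance one opened with `np.load(..., mmap_mode="r")`), each chunk could be passed to `bico.partial_fit(chunk)`, followed by a final `bico.partial_fit()` with no data as in the example above.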

Development

Install poetry

curl -sSL https://install.python-poetry.org | python3 -

Install clang

sudo apt-get install clang

Set clang variables

export CXX=/usr/bin/clang++
export CC=/usr/bin/clang

Install the package

poetry install

If the installation does not work and you do not see the C++ output, you can build the package to see the stack trace

poetry build

Run the tests

poetry run python -m unittest discover tests -v

Citation

If you use this code, please cite the following paper:

H. Fichtenberger, M. Gillé, M. Schmidt, C. Schwiegelshohn, and C. Sohler, "BICO: BIRCH Meets Coresets for k-Means Clustering," in Lecture Notes in Computer Science, 2013, pp. 481–492. doi: 10.1007/978-3-642-40450-4_41.
