
CluStream, StreamKM++, and metrics utilities: C/C++ bindings for Python

ClusOpt Core

This package is used by ClusOpt for its CPU-intensive tasks, but it can easily be imported into any Python data stream clustering project. It is written mainly in C/C++ with bindings for Python, and features:

  • CluStream (based on MOA implementation)
  • StreamKM++ (wrapped around the original paper authors implementation)
  • Distance Matrix computation (in-place implementation using boost threads)
  • Silhouette score (custom in-place implementation inspired by BIRCH clustering vector)

Prerequisites

  • python >= 3.6
  • pip
  • boost-thread
  • gcc >= 6

On Debian-based systems, boost-thread can be installed with:

apt install libboost-thread-dev

Usage

See examples folder for more.

CluStream online clustering

from clusopt_core.cluster import CluStream
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

k = 32

dataset, _ = make_blobs(n_samples=64000, centers=k, random_state=42, cluster_std=0.1)

model = CluStream(
    m=k * 10,  # number of microclusters
    h=64000,  # horizon
    t=2,  # radius factor
)

chunks = np.split(dataset, len(dataset) // 4000)

model.init_offline(chunks.pop(0), seed=42)

for chunk in chunks:
    model.partial_fit(chunk)

clusters, _ = model.get_macro_clusters(k, seed=42)

plt.scatter(*dataset.T, marker=",", label="datapoints")

plt.scatter(*model.get_partial_cluster_centers().T, marker=".", label="microclusters")

plt.scatter(*clusters.T, marker="x", label="macro clusters", color="black")

plt.legend()
plt.show()

Output:

[Figure: CluStream clustering results]
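
The m and t parameters in the example are easier to read with the micro-cluster idea in mind. The following is a minimal, dependency-free sketch of that idea, not the package's C/C++ code: each micro-cluster keeps a count, a linear sum and a squared sum, and a new point is absorbed when it lies within t times the radius of the nearest micro-cluster. The names MicroCluster, insert_point and min_radius are hypothetical, and min_radius is an assumed fallback for single-point clusters. The sketch leaves out timestamps, the horizon h, and the deletion or merging that keeps the number of micro-clusters at m.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class MicroCluster:
    """Illustrative summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.squared_sum = [x * x for x in point]

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root-mean-square deviation of the absorbed points from the center.
        variance = sum(
            ss / self.n - (ls / self.n) ** 2
            for ls, ss in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(max(variance, 0.0))

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x


def insert_point(clusters, point, t=2.0, min_radius=0.1):
    """Absorb `point` into the nearest micro-cluster, or open a new one.

    `min_radius` is an assumption of this sketch: it stands in for the
    radius of a micro-cluster that holds a single point.
    """
    nearest = min(clusters, key=lambda mc: euclidean(mc.center(), point))
    boundary = t * max(nearest.radius(), min_radius)
    if euclidean(nearest.center(), point) <= boundary:
        nearest.absorb(point)
    else:
        clusters.append(MicroCluster(point))
    return clusters


clusters = [MicroCluster([0.0, 0.0])]
insert_point(clusters, [0.1, 0.0])  # close enough: absorbed
insert_point(clusters, [5.0, 5.0])  # far away: opens a new micro-cluster
print(len(clusters), clusters[0].n)
```

Because only the three sums are stored, absorbing a point costs time proportional to the dimension, regardless of how many points the micro-cluster already summarizes.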

Benchmarks

Some functions in clusopt_core are faster than their scikit-learn counterparts. See the benchmark folder for more info.

Silhouette

Each bar is labeled with a tuple (no_samples, dimension, no_groups). The clusopt_core implementation is faster across all benchmarked combinations of these three factors.

[Figure: clusopt_core silhouette versus scikit-learn silhouette, execution time]
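
For reference, the quantity being timed is the mean silhouette coefficient. The sketch below computes it directly from its definition in plain Python. It is a slow O(n^2) reference and does not reflect how clusopt_core or scikit-learn implement it. The function name silhouette_reference is hypothetical, and the sketch assumes at least two distinct labels.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_reference(points, labels):
    """Mean silhouette coefficient from the textbook definition.

    Assumes at least two distinct labels are present.
    """
    n = len(points)
    scores = []
    for i in range(n):
        # Accumulate distances from point i to every other point, per label.
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = euclidean(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            scores.append(0.0)  # a point alone in its cluster scores 0
            continue
        a = sums[own] / counts[own]  # mean distance within its own cluster
        b = min(sums[l] / counts[l] for l in sums if l != own)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
score = silhouette_reference(points, labels)
print(score)  # close to 1 for tight, well-separated clusters
```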

Distance Matrix

Each bar is labeled with the dataset dimension. The clusopt_core implementation is faster when the dimension is small (below roughly 150), even when scikit-learn uses 4 processes.

[Figure: clusopt_core distance matrix versus scikit-learn pairwise distance, execution time]
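
As a rough illustration of what is being computed, the sketch below builds a pairwise Euclidean distance matrix in plain Python. It allocates the output once and fills it in place, computing each pair only once. This is an assumption about what "in place" could look like, not the package's boost-threads implementation, and the function name distance_matrix is hypothetical.

```python
import math


def distance_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]  # allocated once, then filled in place
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            matrix[i][j] = d  # upper triangle
            matrix[j][i] = d  # mirrored into the lower triangle
    return matrix


matrix = distance_matrix([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
for row in matrix:
    print(row)
```

The work per pair grows linearly with the dimension, so the dimension of the dataset directly affects the running time of any such computation.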

Installation

You can install it directly from PyPI with:

pip install clusopt-core

Alternatively, you can clone this repo and install from the directory:

pip install ./clusopt_core
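
One way to confirm that the installation worked, sketched here with the standard library only, is to check whether the clusopt_core module (the import name used in the usage example) can be found. The helper name is_importable is hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(is_importable("clusopt_core"))
```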

Acknowledgments

Thanks to:

  • Marcel R. Ackermann et al. for the StreamKM++ algorithm
  • The University of Waikato for the MOA framework
