CluStream, StreamKM++ and metrics utilities: C/C++ bindings for Python

ClusOpt Core

This package is used by ClusOpt for its CPU-intensive tasks, but it can easily be imported into any Python data stream clustering project. It is coded mainly in C/C++ with bindings for Python, and features:

  • CluStream (based on the MOA implementation)
  • StreamKM++ (wrapped around the original paper authors' implementation)
  • Distance matrix computation (in-place implementation using boost threads)
  • Silhouette score (custom in-place implementation inspired by the BIRCH clustering vector)
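
As background for the StreamKM++ entry above, the "++" refers to k-means++ style seeding, where each new center is drawn with probability proportional to its squared distance from the centers already chosen. The following is a minimal pure-Python sketch of that seeding step only. It is not the code wrapped by this package, and the names squared_distance and kmeanspp_seed are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, seed=None):
    """Pick k initial centers with D^2 sampling (k-means++ seeding).

    Illustrative helper, not part of clusopt_core.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            centers.append(rng.choice(points))
            continue
        # Draw one index with probability proportional to d2[i].
        chosen = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[chosen])
        # Refresh nearest-center distances with the newly added center.
        d2 = [min(d, squared_distance(p, points[chosen])) for p, d in zip(points, d2)]
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
centers = kmeanspp_seed(points, k=3, seed=42)
```

Points that are already centers have zero weight, so they cannot be drawn again while other candidates remain.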

Prerequisites

  • python >= 3.6
  • pip
  • boost-thread
  • gcc >= 6

boost-thread can be installed on Debian-based systems with:

apt install libboost-thread-dev

Usage

See examples folder for more.

CluStream online clustering

from clusopt_core.cluster import CluStream
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

k = 32

dataset, _ = make_blobs(n_samples=64000, centers=k, random_state=42, cluster_std=0.1)

model = CluStream(
    m=k * 10,  # number of microclusters
    h=64000,  # horizon
    t=2,  # radius factor
)

chunks = np.split(dataset, len(dataset) // 4000)  # 16 equal chunks of 4000 samples

model.init_offline(chunks.pop(0), seed=42)

for chunk in chunks:
    model.partial_fit(chunk)

clusters, _ = model.get_macro_clusters(k, seed=42)

plt.scatter(*dataset.T, marker=",", label="datapoints")

plt.scatter(*model.get_partial_cluster_centers().T, marker=".", label="microclusters")

plt.scatter(*clusters.T, marker="x", label="macro clusters", color="black")

plt.legend()
plt.show()

Output:

[Figure: CluStream clustering results]
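
To give some intuition for the m and t parameters used above, here is a minimal sketch of a micro-cluster as described in the CluStream paper: an additive summary (count, linear sum, squared sum) that absorbs a point when it falls within t times the cluster's RMS deviation from its center. This is an assumption-laden illustration, not the package's C++ implementation, which follows MOA and may compute the radius differently. The MicroCluster class and its methods are hypothetical names, and the time statistics CluStream also tracks are omitted.

```python
import math


class MicroCluster:
    """Additive summary of a group of points (hypothetical, for illustration)."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.squared_sum = [x * x for x in point]

    def insert(self, point):
        """Fold a point into the summary without storing the point itself."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def rms_deviation(self):
        # Sum of per-dimension variances, clamped at 0 against rounding error.
        variance = sum(
            max(ss / self.n - (ls / self.n) ** 2, 0.0)
            for ls, ss in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(variance)

    def absorbs(self, point, t):
        """True if the point lies within t times the RMS deviation of the center."""
        distance = math.sqrt(sum((x - c) ** 2 for x, c in zip(point, self.center())))
        return distance <= t * self.rms_deviation()


mc = MicroCluster([0.0, 0.0])
mc.insert([2.0, 0.0])
near = mc.absorbs([2.5, 0.0], t=2)  # distance 1.5, boundary 2.0
far = mc.absorbs([4.0, 0.0], t=2)   # distance 3.0, boundary 2.0
```

Because the summary is additive, each micro-cluster uses constant memory regardless of how many points it has absorbed, which is what makes chunk-by-chunk processing with partial_fit feasible.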

Benchmarks

Some functions in clusopt_core are faster than their scikit-learn counterparts; see the benchmark folder for more info.

Silhouette

Each bar is labelled with a tuple of (no_samples, dimension, no_groups). Across all the combinations shown, the clusopt_core implementation is faster.

[Figure: clusopt_core silhouette versus scikit-learn silhouette execution time]
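
For readers unfamiliar with the metric being timed, the sketch below computes the standard mean silhouette coefficient in pure Python. It is a slow reference for the definition only, not the in-place algorithm used by this package, and silhouette_reference is a hypothetical name.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_reference(points, labels):
    """Mean silhouette coefficient over all points (quadratic-time reference)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_reference(points, [0, 0, 1, 1])  # tight, well-separated groups
bad = silhouette_reference(points, [0, 1, 0, 1])   # groups mixed across the gap
```

Scores lie in [-1, 1]; a good grouping scores close to 1, while a grouping that mixes distant points scores below 0.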

Distance Matrix

Each bar is labelled with the dataset dimension. The clusopt_core implementation is faster when the dimension is small (below roughly 150), even when scikit-learn uses 4 processes.

[Figure: clusopt_core distance matrix versus scikit-learn pairwise distance execution time]
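
As a reference for what is being benchmarked, the sketch below builds a full pairwise Euclidean distance matrix in pure Python. It assumes the benchmark measures Euclidean distances and is not the threaded C++ implementation from this package. The distance_matrix name is hypothetical.

```python
import math


def distance_matrix(points):
    """Return the symmetric n x n matrix of pairwise Euclidean distances."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the upper triangle is computed; the lower one is mirrored.
        for j in range(i + 1, n):
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(points[i], points[j])))
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


matrix = distance_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

The work is proportional to the number of point pairs times the dimension, so the per-pair cost grows with the dataset dimension.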

Installation

You can install it directly from PyPI with

pip install clusopt-core

or you can clone this repo and install from the directory

pip install ./clusopt_core
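
After either install route, a quick way to confirm that the package is visible to the current interpreter is to look it up on the import path. The sketch below uses only the standard library; is_installed is a hypothetical helper, and the module name clusopt_core is taken from the usage example above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("clusopt_core importable:", is_installed("clusopt_core"))
```

This only checks that the module can be located; a failed C/C++ build would still surface when the module is actually imported.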

Acknowledgments

Thanks to:

  • Marcel R. Ackermann et al. for the StreamKM++ algorithm - link
  • The University of Waikato for the MOA framework - link
