Skip to main content

Clustream, Streamkm++ and metrics utilities C/C++ bindings for python

Project description

ClusOpt Core

This package is used by ClusOpt for it's CPU intensive tasks, but it can be easily imported in any python data stream clustering project, it is coded mainly in C/C++ with bindings for python, and features:

  • CluStream (based on MOA implementation)
  • StreamKM++ (wrapped around the original paper authors implementation)
  • Distance Matrix computation (in place implementation using boost threads)
  • Silhouette score (custom in place implementation inspired by BIRCH clustering vector)

Prerequisites

  • python >= 3.6
  • pip
  • boost-thread
  • gcc >= 6

boost-thread can be installed in Debian based systems with :

apt install libboost-thread-dev

Usage

See examples folder for more.

CluStream online clustering

from clusopt_core.cluster import CluStream
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

k = 32

dataset, _ = make_blobs(n_samples=64000, centers=k, random_state=42, cluster_std=0.1)

model = CluStream(
    m=k * 10,  # no microclusters
    h=64000,  # horizon
    t=2,  # radius factor
)

chunks = np.split(dataset, len(dataset) / 4000)

model.init_offline(chunks.pop(0), seed=42)

for chunk in chunks:
    model.partial_fit(chunk)

clusters, _ = model.get_macro_clusters(k, seed=42)

plt.scatter(*dataset.T, marker=",", label="datapoints")

plt.scatter(*model.get_partial_cluster_centers().T, marker=".", label="microclusters")

plt.scatter(*clusters.T, marker="x", label="macro clusters", color="black")

plt.legend()
plt.show()

output:

clustream clustering results

Benchmarks

Some functions in clusopt_core are faster than scikit learn implementations, see the benchmark folder for more info.

Silhouette

Each bar have a tuple of (no_samples,dimension,no_groups), so independently of those 3 factors, clusopt implementation is faster.

clusopt silhouette versus scikit learn silhouette execution time

Distance Matrix

Each bar shows the dataset dimension, so clusopt_core implemetation is faster when the dataset dimension is small (<~150), even when using 4 processes in scikit-learn.

clusopt distance matrix versus scikit learn pairwise distance in execution time

Installation

You can install it directly from pypi with

pip install clusopt-core

or you can clone this repo and install from the directory

pip install ./clusopt_core

Acknowledgments

Thanks to:

  • Marcel R. Ackermann et al. for the StreamKM++ algorithm - link
  • The university of Waikato for the MOA framework - link

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

clusopt_core-1.0.0.tar.gz (35.6 kB view details)

Uploaded Source

File details

Details for the file clusopt_core-1.0.0.tar.gz.

File metadata

  • Download URL: clusopt_core-1.0.0.tar.gz
  • Upload date:
  • Size: 35.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for clusopt_core-1.0.0.tar.gz
Algorithm Hash digest
SHA256 d0bb659024cb811206fc9c3e5612c21487322d41cfb2148379b771f4a4f942a9
MD5 e22f03c8374bea1f9088fd67391fc023
BLAKE2b-256 9628baf60b8d73bea22cb211e12d024d42676be30594223a14fbd2e80d107700

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page