EDBSCAN

Enforced Density-Based Spatial Clustering of Applications with Noise.

This package extends the DBSCAN algorithm to support pre-labeled data points, or in other words, to enforce certain cluster assignments and splits. Its interface mimics the scikit-learn implementation, sklearn.cluster.DBSCAN.

Installation

You can either install the package through PyPI:

pip install edbscan

Or via this repository directly:

pip install git+https://github.com/RubenPants/EDBSCAN.git

Usage

The image below shows the result of EDBSCAN on a given input. The left panel shows the raw input data together with the few labeled samples. The right panel shows the clusters found by EDBSCAN, where the light-blue dots represent the detected noise.

Figure: Result of EDBSCAN

# Load the data
import numpy as np
data = np.load('data.npy')
print(data.shape)  # (220, 2)
# y mixes None and integers, so it is an object array and needs allow_pickle=True
y = np.load('y.npy', allow_pickle=True)
print(y.shape)  # (220,)
print(y)  # array([None, None, …, -1, None, …, 0, None, …, 1, …], dtype=object)

# Run the algorithm
from edbscan import edbscan
core_points, labels = edbscan(X=data, y=y)
print(labels)  # array([-1, 2, 2, 4, -1, -1, 6, 3, 4, …])

As shown in the code snippet above, a target vector y is provided alongside the raw data (data). This vector holds the known (pre-labeled) cluster assignments. Samples with a None label are not yet known and are the ones the EDBSCAN algorithm needs to cluster.
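To make the shape of the target vector concrete, here is a minimal sketch of how such a y could be built by hand with NumPy. The sample count and the chosen indices are made up for illustration; only the None / -1 / non-negative integer convention comes from this README.

```python
import numpy as np

# Hypothetical toy setup: 10 samples, only a few of them pre-labeled.
n_samples = 10

# dtype=object lets None and integers live in the same array.
y = np.full(n_samples, None, dtype=object)

y[[0, 1]] = 0   # samples 0 and 1 are known to belong to cluster 0
y[5] = 1        # sample 5 is known to belong to cluster 1
y[9] = -1       # sample 9 is known to be noise

n_unlabeled = sum(v is None for v in y)
n_noise = sum(v == -1 for v in y if v is not None)
print(y)
print(n_unlabeled, n_noise)
```

An array built this way can be saved with np.save and loaded again with allow_pickle=True, since object arrays are pickled on disk.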

For more detailed usage examples, see the notebooks in the examples/ folder.
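Once labels are returned, a quick summary can help check the result. The helper below is a hypothetical convenience function, not part of the edbscan package; it assumes the returned labels follow the same convention as y, with -1 marking noise. The input is the truncated sample output from the snippet above.

```python
from collections import Counter

def summarise_labels(labels):
    """Count clusters, noise points, and cluster sizes in a label vector."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # -1 is assumed to mark noise
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(sorted(counts.items())),
    }

# First nine labels from the example output above
labels = [-1, 2, 2, 4, -1, -1, 6, 3, 4]
summary = summarise_labels(labels)
print(summary)
```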

How EDBSCAN works

There are three concepts that define how EDBSCAN operates:

  • The DBSCAN algorithm, on which this algorithm is based. Read the paper or the scikit-learn documentation for more.
  • Semi-supervised annotations, represented by the y vector in the Usage section. This vector contains three types of values:
    • None if the given sample is not known to belong to a specific cluster and needs to get labeled by the EDBSCAN algorithm
    • -1 if the given sample is known to be noise
    • 0..N if the given sample is known to belong to cluster 0..N
  • Where DBSCAN expands its clusters in a FIFO fashion, EDBSCAN expands its clusters in a most-dense-first fashion. In other words, the points with the most detected nearest neighbours get expanded first. As a result, the denser areas are assigned a cluster earlier. This prevents two dense clusters that are near each other from merging if they have already been assigned different labels.
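The difference between FIFO and most-dense-first expansion can be sketched with a plain queue versus a priority queue keyed on neighbour count. This is a simplified illustration, not the package's implementation: it ignores the min_samples core-point check and pre-labeled points, and the function names and toy points are assumptions.

```python
import heapq
from collections import deque
from math import dist

def neighbours(points, eps):
    """For each point, list the indices of all points within eps (itself included)."""
    return [[j for j, q in enumerate(points) if dist(p, q) <= eps] for p in points]

def expansion_order(points, eps, seed, dense_first=True):
    """Return the order in which points reachable from `seed` get expanded."""
    nbrs = neighbours(points, eps)
    seen = {seed}
    order = []
    # Dense-first uses a heap keyed on negative neighbour count; FIFO uses a deque.
    queue = [(-len(nbrs[seed]), seed)] if dense_first else deque([seed])
    while queue:
        idx = heapq.heappop(queue)[1] if dense_first else queue.popleft()
        order.append(idx)
        for j in nbrs[idx]:
            if j not in seen:
                seen.add(j)
                if dense_first:
                    heapq.heappush(queue, (-len(nbrs[j]), j))
                else:
                    queue.append(j)
    return order

# Toy data: point 2 has the most neighbours, point 5 sits at the sparse end.
points = [(0, 0), (1, 0), (2, 0), (2, 1), (2, -1), (-1, 0)]
fifo_order = expansion_order(points, eps=1.0, seed=0, dense_first=False)
dense_order = expansion_order(points, eps=1.0, seed=0, dense_first=True)
print(fifo_order)   # [0, 1, 5, 2, 3, 4]
print(dense_order)  # [0, 1, 2, 3, 4, 5]
```

In the FIFO run the sparse point 5 is expanded before the dense point 2, whereas the dense-first run reaches point 2 and its neighbourhood first and leaves point 5 for last.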

Comparison

This section (1) compares EDBSCAN to other clustering algorithms, namely DBSCAN and HDBSCAN, and (2) shows its results on different clustering benchmarks.

1. DBSCAN, HDBSCAN, and EDBSCAN

This section compares the behaviour of the DBSCAN, HDBSCAN, and EDBSCAN algorithms on the data shown in the Usage section. The input data looks as follows:

Figure: Comparison between DBSCAN, HDBSCAN, and EDBSCAN

In each of the clustered results, light-blue points represent the detected noise.

Some observations on the DBSCAN result:

  • Green combined two clusters that should be separated
  • Purple combined two clusters that should be separated
  • Brown identified noise as a cluster

Some observations on the HDBSCAN result:

  • Yellow and Grey are now successfully separated
  • Brown and Pink are now successfully separated
  • Purple identified noise as a cluster

Some observations on the EDBSCAN result:

  • Grey and Orange are now successfully separated
  • Brown and Pink are now successfully separated
  • The noise that was previously detected as a cluster is now successfully identified as noise

2. Scikit-learn cluster benchmark

The following images show the results of the EDBSCAN algorithm on different scikit-learn clustering benchmarks.

Figures (one per benchmark): circles, moons, blobs, aniso, uniform, multi
