
Theoretically-Efficient and Practical Parallel DBSCAN

Overview

This repository hosts fast parallel DBSCAN clustering code for low-dimensional Euclidean space. The code automatically uses the available threads on a shared-memory parallel machine to speed up DBSCAN clustering. It stems from a paper presented in SIGMOD'20: Theoretically Efficient and Practical Parallel DBSCAN.

Our software on 1 thread is on par with serial state-of-the-art DBSCAN packages, and provides additional speedup via multi-threading. Below, we show a simple benchmark comparing our code with the DBSCAN implementation of scikit-learn, tested on a 6-core computer with 2-way hyperthreading on a 2-dimensional data set with 50,000 data points, where both implementations use all available threads. Our implementation is more than 32x faster. We also show a visualization of the clustering result on a smaller data set.

Data sets with dimensionality 2 to 20 are supported by default; this range can be changed by modifying DBSCAN_MIN_DIMS and DBSCAN_MAX_DIMS in the source code.

(Figure: timing comparison of our implementation and scikit-learn's DBSCAN.)

Tutorial

Option 1: Use the Python binding

There are two ways to install it:

  • Install it from PyPI: pip3 install --user dbscan (you can find the wheels here).
  • Build from source for testing: run pip3 install -e . from the project root directory.

An example for using the Python module is provided in example.py. It generates the clustering example above.

Python API

from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

Input

  • X: A 2-D Numpy array containing the input data points. The first dimension of X is the number of data points n, and the second dimension is the data set dimensionality (the maximum supported dimensionality is 20).
  • eps: The epsilon parameter (default 0.5).
  • min_samples: The minPts parameter (default 5).

Output

  • labels: A length-n Numpy array (dtype=np.int32) containing the cluster IDs of the data points, in the same ordering as the input data. Noise points are given a pseudo-ID of -1.
  • core_samples_mask: A length-n Numpy boolean array masking the core points, in the same ordering as the input data.
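To illustrate these output conventions (cluster IDs with -1 for noise, plus a boolean core-point mask), here is a minimal pure-Python brute-force sketch of DBSCAN. It only demonstrates the semantics of the two return values; it is not the parallel implementation this package provides, and the function name is our own:

```python
def dbscan_reference(points, eps, min_samples):
    """Brute-force 2-D DBSCAN, for illustration of the output format only."""
    n = len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    # A point is a core point if it has at least min_samples neighbors.
    core = [len(neighbors(i)) >= min_samples for i in range(n)]

    labels = [-1] * n  # -1 marks noise, matching the package's convention
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Expand a new cluster from unlabeled core point i; only core
        # points propagate the cluster, so border points get absorbed
        # but never expand it further.
        labels[i] = cluster
        queue = [i]
        while queue:
            j = queue.pop()
            for k in neighbors(j):
                if labels[k] == -1:
                    labels[k] = cluster
                    if core[k]:
                        queue.append(k)
        cluster += 1
    return labels, core

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
labels, core = dbscan_reference(pts, eps=0.5, min_samples=3)
# The three nearby points form one cluster; the far point is noise (-1).
```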

We provide a complete example below that generates a toy data set, computes the DBSCAN clustering, and visualizes the result as shown in the plot above.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN

# direct call of the C API:
from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

# OR calling our sklearn API:
# from dbscan import sklDBSCAN as DBSCAN
# db = DBSCAN(eps=0.3, min_samples=10).fit(X)
# core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
# core_samples_mask[db.core_sample_indices_] = True
# labels = db.labels_

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Option 2: Use the binary executable

Compile and run the program:

mkdir build
cd build
cmake ..
cd executable
make -j # this will take a while
./dbscan -eps 0.1 -minpts 10 -o clusters.txt <data-file>

The <data-file> can be any CSV-like point data file, where each line contains one data point -- see an example here. The data file can be with or without a header. The cluster output clusters.txt will contain one cluster ID per line (after the first-line header), giving the cluster assignment in the same ordering as the input file. A noise point has a cluster ID of -1.
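For instance, a valid input file can be generated with a few lines of Python. The filenames here are arbitrary, and the header line is optional:

```python
import csv
import random

# Write a toy 2-D data file in the CSV-like format the executable expects:
# one point per line, coordinates separated by commas.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]

with open("points.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y"])  # optional header
    writer.writerows(points)

# After running:  ./dbscan -eps 0.1 -minpts 10 -o clusters.txt points.csv
# clusters.txt holds one cluster ID per input point (-1 = noise),
# in the same order as the rows of points.csv.
```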

Option 3: Include directly in your own C++ program

Create your own caller header and source file by instantiating the DBSCAN template function in "dbscan/algo.h".

dbscan.h:

template<int dim>
int DBSCAN(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);

// equivalent to
// int DBSCAN(intT n, floatT PF[n][dim], double epsilon, intT minPts, bool coreFlagOut[n], intT coreFlag[n], intT cluster[n])
// if C++ syntax was a little more flexible

template<>
int DBSCAN<3>(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);

dbscan.cpp:

#include "dbscan/algo.h"
#include "dbscan.h"

Calling the instantiated function:

int n = ...; // number of data points
double data[n][3] = ...; // data points
int labels[n]; // label ids get saved here
bool core_samples[n]; // a flag determining whether or not the sample is a core sample is saved here
{
  int ignore[n];
  DBSCAN<3>(n, (double*)data, 70, 100, core_samples, ignore, labels);
}

Doing this compiles the function only for the dimensionality you need, which saves compilation time.

You can also include "dbscan/capi.h" and define your own DBSCAN_MIN_DIMS and DBSCAN_MAX_DIMS macros, the same way the Python extension does. The exported function has the following signature.

extern "C" int DBSCAN(int dim, int n, double* PF, double epsilon, int minPts, bool* coreFlag, int* cluster);

Right now, the only files guaranteed to remain in the C/C++ API are "dbscan/algo.h" and "dbscan/capi.h", and the functions named DBSCAN within them.

Citation

If you use our work in a publication, we would appreciate citations:

@inproceedings{wang2020theoretically,
  author = {Wang, Yiqiu and Gu, Yan and Shun, Julian},
  title = {Theoretically-Efficient and Practical Parallel DBSCAN},
  year = {2020},
  isbn = {9781450367356},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3318464.3380582},
  doi = {10.1145/3318464.3380582},
  booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
  pages = {2555–2571},
  numpages = {17},
  keywords = {parallel algorithms, spatial clustering, DBScan},
  location = {Portland, OR, USA},
  series = {SIGMOD ’20}
}
