Skip to main content

Theoretically efficient and practical parallel DBSCAN

Project description

Theoretically-Efficient and Practical Parallel DBSCAN

arXiv build

Overview

This repository hosts fast parallel DBSCAN clustering code for low dimensional Euclidean space. The code automatically uses the available threads on a parallel shared-memory machine to speedup DBSCAN clustering. It stems from a paper presented in SIGMOD'20: Theoretically Efficient and Practical Parallel DBSCAN.

Our software on 1 thread is on par with all serial state-of-the-art DBSCAN packages, and provides additional speedup via multi-threading. Below, we show a simple benchmark comparing our code with the DBSCAN implementation of Sklearn, tested on a 6-core computer using a 2-dimensional data set with 50000 data points. We also show a visualization of the clustering result on a smaller data set.

Data sets with dimensionality 2 - 20 are supported by default, which can be modified by modifying DBSCAN_MIN_DIMS and DBSCAN_MAX_DIMS in the source code.

timing example

Tutorial

Option 1: Use the binary executable

Compile and run the program:

mkdir build
cd build
cmake ..
cd executable
make -j # this will take a while
./dbscan -eps 0.1 -minpts 10 -o clusters.txt <data-file>

The <data-file> can be any CSV-like point data file, where each line contains a data point -- see an example here. The data file can be either with or without header. The cluster output clusters.txt will contain a cluster ID on each line (other than the first-line header), giving a cluster assignment in the same ordering as the input file. A noise point will have a cluster ID of -1.

Option 2: Use the Python binding

There are two ways to install it:

  • Install it using PyPI: pip3 install --user dbscan (you can find the wheels here)
  • (harder and not recommended) Compile it yourself: First install dependencies pip3 install -r requirements.txt and sudo apt install libpython3-dev. Run python3 setup.py build, The compilation will take a few minutes, and generate a .so library containing the DBSCAN module. To create a wheel that is supported universally across many Python versions for your given OS, run python setup.py bdist_wheel in an environment containing the oldest numpy version available for the version of Python that you are compiling for. For example, for Python 3.8, use numpy 1.17 to compile the wheel. Then, the wheel will work on all Python and numpy versions that are newer that that for your given OS. This is done automatically when installing via pip.

An example for using the Python module is provided in example.py. If the dependencies above are installed, simply run python3 example.py from the root directory to reproduce the plots above.

Python API

from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

Input

  • X: A 2-D Numpy array containing the input data points. The first dimension of X is the number of data points n, and the second dimension is the data set dimensionality (the maximum supported dimensionality is 20).
  • eps: The epsilon parameter (default 0.5).
  • min_samples: The minPts parameter (default 5).

Output

  • labels: A length n Numpy array (dtype=np.int32) containing cluster IDs of the data points, in the same ordering as the input data. Noise points are given a pseudo-ID of -1.
  • core_samples_mask: A length n Numpy array (dtype=np.bool) masking the core points, in the same ordering as the input data.

We provide a complete example below that generates a toy data set, computes the DBSCAN clustering, and visualizes the result as shown in the plot above.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN

from dbscan import sklDBSCAN as DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# OR direct call of the C API:
# from dbscan import DBSCAN
# labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Option 3: Include directly in your own C++ program

Create your own caller header and source file by instantiating the DBSCAN template function in "dbscan/algo.h".

dbscan.h:

template<int dim>
int DBSCAN(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);

// equivalent to
// int DBSCAN(intT n, floatT PF[n][dim], double epsilon, intT minPts, bool coreFlagOut[n], intT coreFlag[n], intT cluster[n])
// if C++ syntax was a little more flexible

template<>
int DBSCAN<3>(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);

dbscan.cpp:

#include "dbscan/algo.h"
#include "dbscan.h"

Calling the instantiated function:

int n = ...; // number of data points
double data[n][3] = ...; // data points
int labels[n]; // label ids get saved here
bool core_samples[n]; // a flag determining whether or not the sample is a core sample is saved here
{
  int ignore[n];
  DBSCAN<3>(n, (void*)data, 70, 100, core_samples, ignore, labels);
}

Doing this will only compile the function for the number of dimensions that you want, which saves on compilation time.

You can also include the "dbscan/capi.h" and define your own DBSCAN_MIN_DIMS and DBSCAN_MAX_DIMS macros the same way the Python extension uses it. The function exported has the following signature.

extern "C" int DBSCAN(int dim, int n, double* PF, double epsilon, int minPts, bool* coreFlag, int* cluster);

Right now, the only two files that are guaranteed to remain in the C/C++ API are "dbscan/algo.h" and "dbscan/capi.h" and the functions named DBSCAN within.

Citation

If you use our work in a publication, we would appreciate citations:

@inproceedings{wang2020theoretically,
  author = {Wang, Yiqiu and Gu, Yan and Shun, Julian},
  title = {Theoretically-Efficient and Practical Parallel DBSCAN},
  year = {2020},
  isbn = {9781450367356},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3318464.3380582},
  doi = {10.1145/3318464.3380582},
  booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
  pages = {2555–2571},
  numpages = {17},
  keywords = {parallel algorithms, spatial clustering, DBScan},
  location = {Portland, OR, USA},
  series = {SIGMOD ’20}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbscan-0.0.11.tar.gz (148.8 kB view details)

Uploaded Source

Built Distributions

dbscan-0.0.11-cp38-abi3-macosx_11_0_arm64.whl (1.2 MB view details)

Uploaded CPython 3.8+ macOS 11.0+ ARM64

dbscan-0.0.11-cp36-abi3-win_amd64.whl (786.7 kB view details)

Uploaded CPython 3.6+ Windows x86-64

dbscan-0.0.11-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB view details)

Uploaded CPython 3.6+ manylinux: glibc 2.17+ x86-64

dbscan-0.0.11-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (9.2 MB view details)

Uploaded CPython 3.6+ manylinux: glibc 2.17+ ARM64

dbscan-0.0.11-cp36-abi3-macosx_10_9_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.6+ macOS 10.9+ x86-64

File details

Details for the file dbscan-0.0.11.tar.gz.

File metadata

  • Download URL: dbscan-0.0.11.tar.gz
  • Upload date:
  • Size: 148.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for dbscan-0.0.11.tar.gz
Algorithm Hash digest
SHA256 97ce1b38f480ee06b60e39b9f97a4c1f1f236575f62606701c2b696389958d22
MD5 fc17ed2cf3a43aea022d547c56539710
BLAKE2b-256 6bc70950ebb73db78ae08f177776ee7a17ad8c8a786d07f230c4d2aebab5f048

See more details on using hashes here.

File details

Details for the file dbscan-0.0.11-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dbscan-0.0.11-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b961160ec2be1a03346366531d13c5616c6e026faf388f9ce203d4d2196c2ab7
MD5 5f9369413d0f98d08c284f3a32686114
BLAKE2b-256 e099d54b26ee76e28dc7dbebc3876ccd656312209b95bad6928e19b80b22824b

See more details on using hashes here.

File details

Details for the file dbscan-0.0.11-cp36-abi3-win_amd64.whl.

File metadata

  • Download URL: dbscan-0.0.11-cp36-abi3-win_amd64.whl
  • Upload date:
  • Size: 786.7 kB
  • Tags: CPython 3.6+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for dbscan-0.0.11-cp36-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b7fc3cf3b4503c22f6f0a02ade2969de3a11703a76b649f9b5103b2d441db579
MD5 64f32c3e0e237b3a01f358f99ce2b277
BLAKE2b-256 8dc26352b9e1a0e4685212a51087cae3088fa4034ac00bde48e4ff5b4c2b7121

See more details on using hashes here.

File details

Details for the file dbscan-0.0.11-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dbscan-0.0.11-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 44913956a14d23d65bd3080e7f9dcdd9540395b37cb0933445d20ff05b11503c
MD5 64d19298da0db32b9c687f61372c0066
BLAKE2b-256 8105992003907f5e3ab888e0e73805453bc7b320bf7f8c1a30c74bd34ebb8cc4

See more details on using hashes here.

File details

Details for the file dbscan-0.0.11-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for dbscan-0.0.11-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 22a683c0964560c3f0772f2865fe10d5f2cbe5ade8086968bf60c7a98c854519
MD5 5f01319c631d90d7c81e4123d415bafc
BLAKE2b-256 bc89243a6c6837226f1d6708327bf4f52b0db7229027f1a5ba3f14873c653cb8

See more details on using hashes here.

File details

Details for the file dbscan-0.0.11-cp36-abi3-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for dbscan-0.0.11-cp36-abi3-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 3d9ecdebc5dc289dabb15d7bdead52ce35393f4a3767b1d5df08363f5795cf33
MD5 11b41b5b8297a3a5b89b0c2b57fab550
BLAKE2b-256 00d2a1d23df80d4402133637eb28600a5ae49409935830b2acb174220a8b8f79

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page