
KScanner is a novel combination of DBSCAN and K-Means clustering that uses automated Epsilon and MinPts approximation

Project description

KScanner

A novel combination of DBSCAN and K-Means clustering that uses automated Epsilon and MinPts approximation to identify suitable model parameters for DBSCAN. The number of clusters found by DBSCAN is then passed to K-Means as the value for n_clusters, which reduces the number of iterations needed to determine an appropriate number of clusters.
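The two-stage idea described above can be sketched with scikit-learn directly. This is only an illustration of the general approach, not KScanner's own code: the function name dbscan_then_kmeans is hypothetical, eps and minpts are supplied by hand rather than approximated, and the K-Means settings are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def dbscan_then_kmeans(data, eps, minpts, random_state=0):
    """Hypothetical helper: count DBSCAN clusters, then fit K-Means with that count."""
    db = DBSCAN(eps=eps, min_samples=minpts).fit(data)
    # DBSCAN labels noise points as -1, so only non-negative labels are clusters.
    n_clusters = int(np.unique(db.labels_[db.labels_ >= 0]).size)
    if n_clusters == 0:
        raise ValueError("DBSCAN found no clusters; try a larger eps or a smaller minpts")
    # The cluster count goes straight into K-Means instead of searching over k.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(data)
    return n_clusters, km
```

Because K-Means receives a single value for n_clusters, there is no need to fit one model per candidate k and compare them afterwards.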

How to use

from kscanner import scanners

scanners.full_scan(data, graph=False, kmeans_n_init=100, kmeans_max_iter=1000, kmeans_tol=0.0001)

  • return [automated_eps, unique_clusters, kmeanModel_best, elapsed_time, fig1, fig2]

scanners.auto_epsilon(data, graph=False)

  • return [automated_eps, distances, nbrs, elapsed_time, fig]

scanners.auto_minpts(data)

  • return [minpts, elapsed_time]

scanners.dbscanner(data, epsilon, minpts, graph=False)

  • return [dbscan, unique_clusters, elapsed_time, fig]

scanners.kmeans_model(data, unique_clusters, kmeans_n_init=100, kmeans_max_iter=1000, kmeans_tol=0.0001)

  • return [kmeanModel_best, unique_clusters, elapsed_time]

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

A density-based clustering algorithm that separates the high-density regions of the data from the low-density regions. DBSCAN groups data points using two criteria: the distance between points (usually Euclidean) and a minimum number of points per neighborhood. Unlike K-Means clustering, DBSCAN is not sensitive to outliers, because outliers fall in low-density regions and are labeled as noise instead of being forced into a cluster.

  • DBSCAN Parameters
    • Epsilon (Eps): This is the main threshold used for DBSCAN and is the maximum distance between two points for them to be classified as neighbors.
      • To calculate the value of Eps, we take the distance from each data point to its closest neighbor using Nearest Neighbours.
      • Then we sort and plot these distances. From the plot, we identify Epsilon as the distance at the point of maximum curvature (the "elbow") of the graph.
    • MinPoints (MinPts): This parameter is the minimum number of points that must lie within Eps of a point for that point to seed a cluster. A region is only treated as a cluster in DBSCAN if it contains a point with at least MinPoints points in its Eps-neighborhood. Importantly, the point itself is included in the count.
      • Selecting MinPoints
        • If the dataset has 2 dimensions, use 4.
        • If the dataset has > 2 dimensions, choose MinPts = 2 * dim (Sander et al., 1998).
        • For larger datasets with a lot of noise, it is also suggested to go with MinPts = 2 * dim.
    • Distance metric: The distance metric used when calculating distance between instances of a feature array (typically Euclidean distance).
  • Post clustering we are left with 3 types of data
    • Core: A point that has at least MinPoints points (including itself) within Eps distance of itself
    • Border: A point that has fewer than MinPoints points within Eps distance, but has at least one Core point within Eps distance of itself
    • Noise: A point that has fewer than MinPoints points within Eps distance and no Core point within Eps distance of itself
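The parameter heuristics and the three point types above can be sketched in plain NumPy. This is a minimal illustration, not KScanner's implementation: all function names are hypothetical, brute-force pairwise distances stand in for a Nearest Neighbours search, and the "farthest below the chord" rule is just one assumed way of locating the point of maximum curvature.

```python
import numpy as np


def pairwise_distances(data):
    """Euclidean distance between every pair of rows (O(n^2) memory, small data only)."""
    diffs = data[:, None, :] - data[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


def suggest_minpts(n_dims):
    """Rule of thumb from the list above: 4 for 2-D data, otherwise 2 * dim."""
    return 4 if n_dims <= 2 else 2 * n_dims


def estimate_eps(data):
    """Sort nearest-neighbour distances and return the value at the elbow."""
    dist = pairwise_distances(data)
    np.fill_diagonal(dist, np.inf)          # ignore each point's distance to itself
    nearest = np.sort(dist.min(axis=1))     # the curve that would be plotted
    # Assumed elbow rule: rescale the curve to the unit square and take the
    # point that lies farthest below the straight line joining its two ends.
    x = np.linspace(0.0, 1.0, len(nearest))
    span = nearest[-1] - nearest[0]
    y = (nearest - nearest[0]) / span if span > 0 else np.zeros_like(nearest)
    elbow = int(np.argmax(x - y))
    return float(nearest[elbow]), nearest


def classify_points(data, eps, minpts):
    """Return boolean masks (core, border, noise) for the rows of data."""
    within = pairwise_distances(data) <= eps        # includes the point itself
    core = within.sum(axis=1) >= minpts
    # Border: not core, but at least one core point inside the Eps-neighborhood.
    border = ~core & (within & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise
```

For example, with five points spaced one unit apart on a line plus one distant point, eps=1.5 and minpts=3 make the three inner points core, the two end points border, and the distant point noise.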

K-Means Clustering

K-means clustering is a popular unsupervised machine learning algorithm. K-means identifies "k" centroids and then allocates every data point to the cluster with the nearest centroid, while keeping each cluster as compact as possible (that is, minimizing the sum of squared distances between points and their centroid). The "means" refers to averaging the data points in a cluster to find its centroid.
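The assign-then-average loop described above can be sketched in a few lines of NumPy. This is a bare-bones illustration under simple assumptions (random initial centroids drawn from the data, a single run), not the model KScanner fits. The function name kmeans and its parameters are hypothetical.

```python
import numpy as np


def kmeans(data, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-means: returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tol:
            break
    # Recompute labels so they match the final centroids.
    dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=-1)
    labels = dist.argmin(axis=1)
    # Inertia: the sum of squared distances that K-means tries to minimize.
    inertia = float((dist[np.arange(len(data)), labels] ** 2).sum())
    return centroids, labels, inertia
```

The returned inertia is the compactness measure mentioned above: lower values mean points sit closer to their centroids.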

File Descriptions:

  • package_name: represents the main package.
  • docs: includes documentation files on how to use the package.
  • scripts: your top-level scripts.
  • src: where your code goes. It contains packages, modules, sub-packages, and so on.
  • tests: where you can put unit tests.
  • LICENSE.txt: contains the text of the license (for example, MIT).
  • CHANGES.txt: reports the changes of each release.
  • MANIFEST.in: where you put instructions on what extra files you want to include (non-code files).
  • README.txt: contains the package description (markdown format).
  • pyproject.toml: to register your build tools.
  • setup.cfg: the configuration file of your build tools.

Authors

Peerapak Adsavakulchai, Duncan Calvert, and Charles Mudd

Project details


Download files


Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution


kscanner-1.0.0-py3-none-any.whl (6.0 kB view details)

Uploaded Python 3

File details

Details for the file kscanner-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: kscanner-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.12

File hashes

Hashes for kscanner-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2883b7a59075437e5101606beebd8b6368afd827b6ca6e2b94313167e7c2d58a
MD5 61019a2d909ddafd6d44876f8a03f03e
BLAKE2b-256 3bc6eba178427a35fc004b1bf01a8f1612beda9c611f088dd1ece4c2c5f83d6b
