KScanner is a novel combination of DBSCAN and K-Means clustering that uses automated Epsilon and MinPts approximation
KScanner
A novel combination of DBSCAN and K-Means clustering. KScanner automatically approximates Epsilon and MinPts to find suitable model parameters for DBSCAN. The number of clusters DBSCAN finds is then passed to K-Means as n_clusters, which reduces the number of iterations needed to determine an appropriate number of clusters.
How to use
from kscanner import scanners
scanners.full_scan(data, graph=False, kmeans_n_init=100, kmeans_max_iter=1000, kmeans_tol=0.0001)
- return [automated_eps, unique_clusters, kmeanModel_best, elapsed_time, fig1, fig2]
scanners.auto_epsilon(data, graph=False)
- return [automated_eps, distances, nbrs, elapsed_time, fig]
scanners.auto_minpts(data)
- return [minpts, elapsed_time]
scanners.dbscanner(data, epsilon, minpts, graph=False)
- return [dbscan, unique_clusters, elapsed_time, fig]
scanners.kmeans_model(data, unique_clusters, kmeans_n_init=100, kmeans_max_iter=1000, kmeans_tol=0.0001)
- return [kmeanModel_best, unique_clusters, elapsed_time]
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
A density-based clustering algorithm that separates high-density regions of the data from low-density regions. DBSCAN groups data points using two parameters: a distance threshold (usually measured with Euclidean distance) and a minimum number of points. Unlike K-Means clustering, DBSCAN is not sensitive to outliers, because they fall in low-density regions and are labelled as noise.
- DBSCAN Parameters
- Epsilon (EPS): This is the main threshold used for DBSCAN. It is the maximum distance at which two points are still classified as neighbors.
- To calculate the value of Eps, we take the distance from each data point to its closest neighbor using Nearest Neighbours
- Then we sort the distances and plot them. From the plot, we identify Epsilon as the distance at the point of maximum curvature (the "elbow") of the graph
- MinPoints: This parameter is the minimum number of points that must lie within Eps of a point for that point to start a dense region. A group of points is only a cluster in DBSCAN if it contains a point whose Eps-neighborhood holds at least MinPoints points. Importantly, the point itself is included in the count.
- Selecting MinPoints
- If the dataset has 2 dimensions, use 4
- If the dataset has > 2 dimensions, choose MinPts = 2*dim (Sander et al., 1998).
- For larger datasets with a lot of noise, it is also suggested to go with MinPts = 2 * dim.
- Distance metric: The distance metric used when calculating distance between instances of a feature array (typically Euclidean distance)
- After clustering, every point is one of 3 types
- Core: A point that has at least MinPoints points (including itself) within distance Eps
- Border: A point that has fewer than MinPoints points within distance Eps but has at least one Core point within distance Eps of itself
- Noise: A point that has fewer than MinPoints points within distance Eps of itself and no Core point within that distance
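The parameter heuristics and point types above can be sketched with scikit-learn. This is a minimal illustration, not KScanner's actual implementation: `estimate_eps` and `choose_min_pts` are hypothetical helper names, the elbow is located with a simple farthest-from-the-chord heuristic (an assumption, since the text only says Epsilon is read off the curvature of the sorted distance plot), and the data is a synthetic `make_blobs` sample.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors


def estimate_eps(data):
    """Hypothetical helper: return (eps, sorted nearest-neighbour distances)."""
    # n_neighbors=2 because each point's closest match is itself (distance 0).
    nbrs = NearestNeighbors(n_neighbors=2).fit(data)
    distances, _ = nbrs.kneighbors(data)
    nn_dist = np.sort(distances[:, 1])
    # Elbow heuristic (an assumption): rescale the curve to the unit square and
    # pick the point that lies farthest below the line joining its endpoints.
    x = np.linspace(0.0, 1.0, len(nn_dist))
    span = nn_dist[-1] - nn_dist[0]
    y = (nn_dist - nn_dist[0]) / span if span > 0 else x
    elbow = int(np.argmax(x - y))
    return float(nn_dist[elbow]), nn_dist


def choose_min_pts(n_dims):
    """Hypothetical helper: 4 for 2-D data, otherwise 2 * dim."""
    return 4 if n_dims <= 2 else 2 * n_dims


# Synthetic data, for illustration only.
data, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

eps, nn_dist = estimate_eps(data)
min_pts = choose_min_pts(data.shape[1])

db = DBSCAN(eps=eps, min_samples=min_pts).fit(data)
labels = db.labels_

# Core points are reported by scikit-learn; noise points carry the label -1;
# border points are whatever belongs to a cluster without being core.
core_mask = np.zeros(len(data), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

print(f"eps={eps:.3f}, min_pts={min_pts}")
print(f"core={core_mask.sum()}, border={border_mask.sum()}, noise={noise_mask.sum()}")
```

Plotting `nn_dist` gives the sorted distance curve described above, so the automatically chosen value can be checked by eye against the visible elbow.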
K-Means Clustering
K-means clustering is a popular unsupervised machine learning algorithm. K-means identifies "k" centroids and allocates every data point to the cluster of its nearest centroid, while keeping the distances between points and their centroid as small as possible. The "means" refers to averaging the data points in a cluster (i.e. finding the centroid).
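The hand-off from DBSCAN to K-Means can be sketched directly with scikit-learn. This is an illustration under stated assumptions rather than KScanner's own code: the data is synthetic, `eps` and `min_samples` are fixed by hand instead of being approximated automatically, and the K-Means settings simply mirror the default arguments listed under How to use.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs, for illustration only.
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
data, _ = make_blobs(
    n_samples=300, centers=true_centers, cluster_std=0.5, random_state=0
)

# Step 1: DBSCAN with hand-picked parameters (assumed values for this sketch).
db = DBSCAN(eps=1.5, min_samples=4).fit(data)

# Count the clusters DBSCAN found, ignoring the noise label -1.
unique_clusters = len(set(db.labels_) - {-1})

# Step 2: pass that count to K-Means as n_clusters, so no search over k is needed.
kmeans = KMeans(
    n_clusters=unique_clusters,
    n_init=100,
    max_iter=1000,
    tol=0.0001,
    random_state=0,
).fit(data)

print("clusters found by DBSCAN:", unique_clusters)
print("K-Means centroids:")
print(kmeans.cluster_centers_)
```

Because K-Means runs once with a known `n_clusters`, there is no need to fit a model for every candidate k and compare them afterwards.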
File Descriptions:
- package_name: represents the main package.
- docs: includes documentation files on how to use the package.
- scripts: your top-level scripts.
- src: where your code goes. It contains packages, modules, sub-packages, and so on.
- tests: where you can put unit tests.
- LICENSE.txt: contains the text of the license (for example, MIT).
- CHANGES.txt: reports the changes of each release.
- MANIFEST.in: where you put instructions on what extra files you want to include (non-code files).
- README.txt: contains the package description (markdown format).
- pyproject.toml: to register your build tools.
- setup.cfg: the configuration file of your build tools.
Authors
Peerapak Adsavakulchai, Duncan Calvert, and Charles Mudd
File details
Details for the file kscanner-1.0.0-py3-none-any.whl.
File metadata
- Download URL: kscanner-1.0.0-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2883b7a59075437e5101606beebd8b6368afd827b6ca6e2b94313167e7c2d58a |
| MD5 | 61019a2d909ddafd6d44876f8a03f03e |
| BLAKE2b-256 | 3bc6eba178427a35fc004b1bf01a8f1612beda9c611f088dd1ece4c2c5f83d6b |