Skip to main content

Knee-point detection in Python

Project description

kneed

Knee-point detection in Python

Downloads Downloads Binder Build Status CodeFactor

This repository is an attempt to implement the kneedle algorithm, published here. Given a set of x and y values, kneed will return the knee point of the function. The knee point is the point of maximum curvature.

Table of contents

Installation

Tested with Python 3.5 and 3.6

anaconda

$ conda install -c conda-forge kneed

pip

$ pip install kneed

Clone from GitHub

$ git clone https://github.com/arvkevi/kneed.git
$ python setup.py install

Usage

These steps introduce how to use kneed by reproducing Figure 2 from the manuscript.

Input Data

The DataGenerator class is only included as a utility to generate sample datasets.

Note: x and y must be equal length arrays.

from kneed import DataGenerator, KneeLocator

x, y = DataGenerator.figure2()

print([round(i, 3) for i in x])
print([round(i, 3) for i in y])

[0.0, 0.111, 0.222, 0.333, 0.444, 0.556, 0.667, 0.778, 0.889, 1.0]
[-5.0, 0.263, 1.897, 2.692, 3.163, 3.475, 3.696, 3.861, 3.989, 4.091]

Find Knee

The knee (or elbow) point is calculated simply by instantiating the KneeLocator class with x, y and the appropriate curve and direction.
Here, kneedle.knee and/or kneedle.elbow store the point of maximum curvature.

kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')

print(round(kneedle.knee, 3))
0.222

print(round(kneedle.elbow, 3))
0.222

Visualize

The KneeLocator class also has two plotting functions for quick visualizations. Note that all (x, y) are transformed for the normalized plots

# Normalized data, normalized knee, and normalized distance curve.
kneedle.plot_knee_normalized()

# Raw data and knee.
kneedle.plot_knee()

Examples

Sensitivity Parameter (S)

The knee point selected is tunable by setting the sensitivity parameter S. From the manuscript:

The sensitivity parameter allows us to adjust how aggressive we want Kneedle to be when detecting knees. Smaller values for S detect knees quicker, while larger values are more conservative. Put simply, S is a measure of how many “flat” points we expect to see in the unmodified data curve before declaring a knee.

import numpy as np
np.random.seed(23)

sensitivity = [1, 3, 5, 10, 100, 200, 400]
knees = []
norm_knees = []

n = 1000
x = range(1, n + 1)
y = sorted(np.random.gamma(0.5, 1.0, n), reverse=True)
for s in sensitivity:
    kl = KneeLocator(x, y, curve='convex', direction='decreasing', S=s)
    knees.append(kl.knee)
    norm_knees.append(kl.norm_knee)

print(knees)
[43, 137, 178, 258, 305, 482, 482]

print([nk.round(2) for nk in norm_knees])
[0.04, 0.14, 0.18, 0.26, 0.3, 0.48, 0.48]

import matplotlib.pyplot as plt
plt.style.use('ggplot');
plt.figure(figsize=(8, 6));
plt.plot(kl.x_normalized, kl.y_normalized);
plt.plot(kl.x_distance, kl.y_distance);
colors = ['r', 'g', 'k', 'm', 'c', 'orange']
for k, c, s in zip(norm_knees, colors, sensitivity):
    plt.vlines(k, 0, 1, linestyles='--', colors=c, label=f'S = {s}');
plt.legend();

Notice that any S>200 will result in a knee at 482 (0.48, normalized) in the plot above.

Polynomial Fit

Here is an example of a "bumpy" or "noisy" line where the default scipy.interpolate.interp1d spline fitting method does not provide the best estimate for the point of maximum curvature. This example demonstrates that setting the parameter interp_method='polynomial' will choose a more accurate point by smoothing the line.

The argument for interp_method parameter is a string of either "interp1d" or "polynomial".

x = list(range(90))
y = [
    7304, 6978, 6666, 6463, 6326, 6048, 6032, 5762, 5742,
    5398, 5256, 5226, 5001, 4941, 4854, 4734, 4558, 4491,
    4411, 4333, 4234, 4139, 4056, 4022, 3867, 3808, 3745,
    3692, 3645, 3618, 3574, 3504, 3452, 3401, 3382, 3340,
    3301, 3247, 3190, 3179, 3154, 3089, 3045, 2988, 2993,
    2941, 2875, 2866, 2834, 2785, 2759, 2763, 2720, 2660,
    2690, 2635, 2632, 2574, 2555, 2545, 2513, 2491, 2496,
    2466, 2442, 2420, 2381, 2388, 2340, 2335, 2318, 2319,
    2308, 2262, 2235, 2259, 2221, 2202, 2184, 2170, 2160,
    2127, 2134, 2101, 2101, 2066, 2074, 2063, 2048, 2031
]

# the default spline fit, `interp_method='interp1d'`
kneedle = KneeLocator(x, y, S=1.0, curve='convex', direction='decreasing', interp_method='interp1d')
kneedle.plot_knee_normalized()

# The same data, only using a polynomial fit this time.
kneedle = KneeLocator(x, y, S=1.0, curve='convex', direction='decreasing', interp_method='polynomial')
kneedle.plot_knee_normalized()

NoisyGaussian

Figure 3 from the manuscript estimates the knee to be x=60 for a NoisyGaussian. This simulate 5,000 NoisyGaussian instances and finds the average.

knees = []
for i in range(5):
    x, y = DataGenerator.noisy_gaussian(mu=50, sigma=10, N=1000)
    kneedle = KneeLocator(x, y, curve='concave', direction='increasing', interp_method='polynomial')
    knees.append(kneedle.knee)

# average knee point
round(sum(knees) / len(knees), 3)
60.921

Select k clusters

Find the optimal number of clusters (k) to use in k-means clustering. See the tutorial in the notebooks directory.

KneeLocator(x, y, curve='convex', direction='decreasing')

Contributing

Contributions are welcome, if you have suggestions or would like to make improvements please submit an issue or pull request.

Citation

Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior Ville Satopa † , Jeannie Albrecht† , David Irwin‡ , and Barath Raghavan§ †Williams College, Williamstown, MA ‡University of Massachusetts Amherst, Amherst, MA § International Computer Science Institute, Berkeley, CA

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kneed-0.4.0.tar.gz (13.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kneed-0.4.0-py2.py3-none-any.whl (8.7 kB view details)

Uploaded Python 2Python 3

File details

Details for the file kneed-0.4.0.tar.gz.

File metadata

  • Download URL: kneed-0.4.0.tar.gz
  • Upload date:
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.7

File hashes

Hashes for kneed-0.4.0.tar.gz
Algorithm Hash digest
SHA256 40af95e58bbca3f7897c1a121e5442629e848af23e57cd922ae8a1681d7f9821
MD5 2b7d65d1998509e43896933cf9aa47bd
BLAKE2b-256 f3230fcd1e2fa02e9f4f346669eb45e60240dcf3b52c0a3ad26f4b4ad1f538a6

See more details on using hashes here.

File details

Details for the file kneed-0.4.0-py2.py3-none-any.whl.

File metadata

  • Download URL: kneed-0.4.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 8.7 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.7

File hashes

Hashes for kneed-0.4.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 6aea8c495aab16b529c8b46caea83bd074d57a1f0cc5bc5ab551f8a2d53db23e
MD5 fe8e35e43b5486ab69cbb3c3d43d475a
BLAKE2b-256 c58bd4125a10afe735f8c50c77bf525ee67bd697811a54f6d894a3d914a4c121

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page