Skip to main content

Classifier based non-parametric change point detection

Project description

Classifier based non-parametric change point detection

Change point detection tries to identify times when the probability distribution of a stochastic process or time series changes. Existing methods either assume a parametric model for within-segment distributions or a based on ranks or distances, and thus fail in scenarios with reasonably large dimensionality.

changeforest implements a classifier based algorithm that consistently estimates change points without any parametric assumptions even in high-dimensional scenarios. See [1] for details.

Installation

changeforest is available on PyPI and conda-forge. To install from conda-forge (recommended), simply run

conda install -c conda-forge changeforest

Example

The following example performs random forest based change point detection on the iris dataset. This includes three classes setosa, versicolor and virginica with 50 observations each. We interpret this as a simulated time series with change points at t = 50, 100.

In [1]: from changeforest import changeforest
   ...: from sklearn.datasets import fetch_openml
   ...:
   ...: iris = fetch_openml(data_id=61)["frame"].drop(columns="class").to_numpy()
   ...: result = changeforest(iris, "random_forest", "bs")
   ...: result
Out[1]:
                    best_split max_gain p_value
(0, 150]                    50   96.212    0.01
 ¦--(0, 50]                 34    -4.65       1
 °--(50, 150]              100   51.557    0.01
     ¦--(50, 100]           80   -3.068       1
     °--(100, 150]         134   -2.063       1

In [2]: result.split_points()
Out[2]: [50, 100]

changeforest also implements methods change_in_mean and knn. While random_forest and knn implement the TwoStepSearch optimizer as described in [1], for change_in_mean the optimizer GridSearch is used. Both random_forest and knn perform model selection via a pseudo-permutation test (see [1]). For change_in_mean split candidates are kept whenever max_gain > control.minimal_gain_to_split.

The iris dataset allows for rather simple classification due to large mean shifts between classes. As a result, both change_in_mean and knn also correctly identify die true change points.

In [3]: result = changeforest(iris, "change_in_mean", "bs")
   ...: result.split_points()
Out[3]: [50, 100]

In [4]: result = changeforest(iris, "knn", "bs")
   ...: result.split_points()
Out[4]: [50, 100]

changeforest returns a tree-like object with attributes start, stop, best_split, max_gain, p_value, is_significant, optimizer_result, model_selection_result, left, right and segments. These can be interesting to further investigate the output of the algorithm. Here we plot the approximated gain curves of the first three segments:

In [5]: import matplotlib.pyplot as plt
   ...: result = changeforest(iris, "change_in_mean", "bs")
   ...: plt.plot(range(150), result.optimizer_result.gain_results[-1].gain)
   ...: plt.plot(range(50), result.left.optimizer_result.gain_results[-1].gain)
   ...: plt.plot(range(50, 150), result.right.optimizer_result.gain_results[-1].gain)
   ...: plt.legend([f"approx. gain for {x}" for x in ["(0, 150]", "(0, 50]", "(50, 150"]])
   ...: plt.show()

One can clearly observe that the approx. gain curves are piecewise linear, with maxima at the true underlying change points.

The changeforest algorithm can be tuned with hyperparameters. See here for their descriptions and default values. In Python, the parameters can be specified with the Control class which can be passed to changeforest. The following will build random forests with very few trees:

In [6]: from changeforest import Control
   ...: changeforest(iris, "random_forest", "bs", Control(random_forest_n_estimators=10))
Out[6]:
                    best_split max_gain p_value
(0, 150]                    50   96.071    0.01
 ¦--(0, 50]                 16   -3.788       1
 °--(50, 150]              100   46.544    0.01
     ¦--(50, 100]           66   -7.793    0.43
     °--(100, 150]         134   -9.329       1

References

[1] M. Londschien, S. Kovács and P. Bühlmann (2021), "Random Forests and other nonparametric classifiers for multivariate change point detection", working paper.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

changeforest-0.6.0.tar.gz (126.9 kB view hashes)

Uploaded Source

Built Distributions

changeforest-0.6.0-cp310-none-win_amd64.whl (297.8 kB view hashes)

Uploaded CPython 3.10 Windows x86-64

changeforest-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

changeforest-0.6.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

changeforest-0.6.0-cp310-cp310-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (796.7 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ universal2 (ARM64, x86-64) macOS 10.9+ x86-64 macOS 11.0+ ARM64

changeforest-0.6.0-cp310-cp310-macosx_10_7_x86_64.whl (410.7 kB view hashes)

Uploaded CPython 3.10 macOS 10.7+ x86-64

changeforest-0.6.0-cp39-none-win_amd64.whl (297.8 kB view hashes)

Uploaded CPython 3.9 Windows x86-64

changeforest-0.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

changeforest-0.6.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

changeforest-0.6.0-cp39-cp39-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (796.7 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ universal2 (ARM64, x86-64) macOS 10.9+ x86-64 macOS 11.0+ ARM64

changeforest-0.6.0-cp39-cp39-macosx_10_7_x86_64.whl (410.7 kB view hashes)

Uploaded CPython 3.9 macOS 10.7+ x86-64

changeforest-0.6.0-cp38-none-win_amd64.whl (297.9 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

changeforest-0.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

changeforest-0.6.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

changeforest-0.6.0-cp38-cp38-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (796.7 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ universal2 (ARM64, x86-64) macOS 10.9+ x86-64 macOS 11.0+ ARM64

changeforest-0.6.0-cp38-cp38-macosx_10_7_x86_64.whl (410.7 kB view hashes)

Uploaded CPython 3.8 macOS 10.7+ x86-64

changeforest-0.6.0-cp37-none-win_amd64.whl (297.8 kB view hashes)

Uploaded CPython 3.7 Windows x86-64

changeforest-0.6.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

changeforest-0.6.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ ARM64

changeforest-0.6.0-cp37-cp37m-macosx_10_7_x86_64.whl (410.7 kB view hashes)

Uploaded CPython 3.7m macOS 10.7+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page