
Coniferous forests for better machine learning


Coniferests

An attempt to make a slightly better isolation forest for anomaly detection. At the moment there are two forests:

Isolation forest

This is a reimplementation of scikit-learn's isolation forest. The low-level trees and builders are taken from the original isolation forest; what is reimplemented is the score evaluation, to provide better efficiency. Compare the runs (4-core Intel Core i5-6300U):

import sklearn.ensemble
import coniferest.isoforest
from coniferest.datasets import MalanchevDataset

# 1e6 data points
dataset = MalanchevDataset(inliers=2**20, outliers=2**6)

# %%time
isoforest = coniferest.isoforest.IsolationForest(n_subsamples=1024)
isoforest.fit(dataset.data)
scores = isoforest.score_samples(dataset.data)
# CPU times: user 16.4 s, sys: 26.1 ms, total: 16.4 s
# Wall time: 5.03 s

# %%time
skforest = sklearn.ensemble.IsolationForest(max_samples=1024)
skforest.fit(dataset.data)
skscores = skforest.score_samples(dataset.data)
# CPU times: user 32.3 s, sys: 4.48 s, total: 36.8 s
# Wall time: 36.8 s

And that's not the largest speedup. The more data we analyze, the more cores we have, and the more trees we build, the larger the speedup becomes. On one setup (30M objects with 100-dimensional features on an 80-core machine) the author has seen evaluation time drop from 24 hours to about one minute.
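For reference, the quantity both implementations compute is the standard isolation-forest anomaly score from Liu et al.: s(x) = 2^(−E(h(x))/c(ψ)), where E(h(x)) is the mean depth at which x is isolated across the trees and c(ψ) is a normalization constant depending on the subsample size ψ. A minimal sketch of the formula (illustrative only, not coniferest's actual implementation):

```python
import math

EULER_GAMMA = 0.5772156649015329

def c(n):
    # Expected path length of an unsuccessful search in a binary search tree
    # built on n points -- the normalization constant from the isolation-forest
    # paper (approximated via the harmonic number H(n-1)).
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n_subsamples):
    # Scores close to 1 indicate anomalies; scores around 0.5 and below
    # indicate regular points.
    return 2.0 ** (-mean_depth / c(n_subsamples))

# A point isolated after ~4 splits looks far more anomalous than one
# that takes ~12 splits to isolate:
shallow = anomaly_score(4.0, 1024)
deep = anomaly_score(12.0, 1024)
```

Note that scikit-learn's `score_samples` (and hence coniferest's) reports the negated score, which is why lower values mean "more anomalous" in the snippets above.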

The main object of optimization is the score evaluation. So if you'd like to test it without the isolation forest reimplementation, you may use just the evaluator:

# %%time
from coniferest.sklearn.isoforest import IsolationForestEvaluator

isoforest = sklearn.ensemble.IsolationForest(max_samples=1024)
isoforest.fit(dataset.data)
evaluator = IsolationForestEvaluator(isoforest)
scores = evaluator.score_samples(dataset.data)
# CPU times: user 17.1 s, sys: 13.9 ms, total: 17.2 s
# Wall time: 6.32 s
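Note the sign convention inherited from scikit-learn: `score_samples` returns values where lower means more anomalous, so the most anomalous points are found with an argsort, as in the PineForest examples below. A toy illustration with made-up scores:

```python
import numpy as np

# Made-up scores in the scikit-learn convention: lower = more anomalous.
scores = np.array([-0.45, -0.61, -0.40, -0.72, -0.43])

k = 2
most_anomalous = np.argsort(scores)[:k]  # indices of the k lowest scores
# most_anomalous is [3, 1]: the points scored -0.72 and -0.61
```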

Pine forest

PineForest is an attempt to make the isolation forest capable of taking a bit of prior information into account. Let's take a data sample:

dataset = MalanchevDataset(inliers=100, outliers=10)
                                Plain data
     ┌───────────────────────────────────────────────────────────────┐
 1.12┤  .           .                                        .       │
     │        .  .         .                        .  .    .  .     │
 0.88┤.   . .        .                                      .      . │
     │   .                                             .             │
     │                                                               │
 0.64┤                                                               │
     │                .     .                                        │
     │         ... ..  .... ... .....                                │
  0.4┤        ....  .. .. .. .    .                                  │
     │          . ...     ..   ... .                                 │
 0.17┤        .  .  ...  ..... .  ..                   .             │
     │         .    .... .  . .. . .                 .  ..           │
     │         .   .      . . . ...                 .     .         .│
-0.07┤                                                       .       │
     │                                                               │
-0.31┤                                               .               │
     └┬──────────────┬───────────────┬───────────────┬──────────────┬┘
     -0.2           0.16            0.53            0.89          1.26

Here we have one bunch of inliers and three bunches of outliers (10 points each). What happens when we use a regular isolation forest (or, equivalently, a PineForest without priors)?
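(As an aside: if you want to reproduce a similar sample without MalanchevDataset, a rough numpy stand-in — one dense inlier blob plus three small 10-point outlier bunches, not the exact generator the class uses — could look like this:)

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense inlier blob in the lower-left region of the plot above...
inliers = rng.uniform([0.1, 0.0], [0.6, 0.5], size=(100, 2))

# ...plus three small outlier bunches: upper-left, upper-right, lower-right.
bunch_boxes = [([0.0, 0.9], [0.3, 1.1]),
               ([0.8, 0.9], [1.1, 1.1]),
               ([0.8, -0.1], [1.1, 0.2])]
outliers = np.vstack([rng.uniform(lo, hi, size=(10, 2))
                      for lo, hi in bunch_boxes])

data = np.vstack([inliers, outliers])  # shape (130, 2), like the sample above
```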

import numpy as np

from coniferest.pineforest import PineForest

pineforest = PineForest(n_subsamples=16)
pineforest.fit(dataset.data)
scores = pineforest.score_samples(dataset.data)
np.argsort(scores)[:10]
                         PineForest without priors
     ┌───────────────────────────────────────────────────────────────┐
 1.12┤  *           .                                        *       │
     │        .  .         .                        .  *    *  *     │
 0.88┤*   . .        .                                      *      * │
     │   .                                             .             │
     │                                                               │
 0.64┤                                                               │
     │                .     .                                        │
     │         ... ..  .... ... .....                                │
  0.4┤        ....  .. .. .. .    .                                  │
     │          . ...     ..   ... .                                 │
 0.17┤        .  .  ...  ..... .  ..                   .             │
     │         .    .... .  . .. . .                 .  ..           │
     │         .   .      . . . ...                 .     .         *│
-0.07┤                                                       .       │
     │                                                               │
-0.31┤                                               .               │
     └┬──────────────┬───────────────┬───────────────┬──────────────┬┘
     -0.2           0.16            0.53            0.89          1.26

PineForest sees the upper points as the most anomalous, with some doubt about the two other bunches. Let's now add the prior information "the points (0, 1) and (1, 1) are regular and the point (1, 0) is anomalous":

from coniferest.label import Label

priors = np.array([[0.0, 1.0],
                   [1.0, 1.0],
                   [1.0, 0.0]])

prior_labels = np.array([Label.R, Label.R, Label.A])

And see what happens:

pineforest.fit_known(dataset.data, priors, prior_labels)
scores = pineforest.score_samples(dataset.data)
np.argsort(scores)[:10]
                         PineForest with 3 priors
     ┌───────────────────────────────────────────────────────────────┐
 1.12┤  .           .                                        .       │
     │        .  .         .                        .  .    .  *     │
 0.88┤.   . .        .                                      .      * │
     │   .                                             .             │
     │                                                               │
 0.64┤                                                               │
     │                .     .                                        │
     │         ... ..  .... ... .....                                │
  0.4┤        ....  .. .. .. .    .                                  │
     │          . ...     ..   ... .                                 │
 0.17┤        .  .  ...  ..... .  ..                   .             │
     │         .    .... .  . .. . .                 *  **           │
     │         .   .      . . . ...                 *     *         *│
-0.07┤                                                       *       │
     │                                                               │
-0.31┤                                               *               │
     └┬──────────────┬───────────────┬───────────────┬──────────────┬┘
     -0.2           0.16            0.53            0.89          1.26

Now PineForest sees the lower-right outliers as anomalous, and still has some doubts about the upper-right bunch. We may supply more labeled points; the more prior data we supply, the better the anomaly detection should become.

The plots can be reproduced with the plotext_pineforest.py script:

cd scripts
python plotext_pineforest.py
