
Coniferous forests for better machine learning


Coniferest

Trying to make a slightly better isolation forest for anomaly detection. At the moment there are two forests:

Isolation forest

This is a reimplementation of scikit-learn's isolation forest. The low-level trees and builders are those of the original isoforest; what is reimplemented is the score evaluation, which is made considerably more efficient. Compare the runs (4-core Intel Core i5-6300U):

import sklearn.ensemble
import coniferest.isoforest
from coniferest.datasets import MalanchevDataset

# 1e6 data points
dataset = MalanchevDataset(inliers=2**20, outliers=2**6)

# %%time
isoforest = coniferest.isoforest.IsolationForest(n_subsamples=1024)
isoforest.fit(dataset.data)
scores = isoforest.score_samples(dataset.data)
# CPU times: user 16.4 s, sys: 26.1 ms, total: 16.4 s
# Wall time: 5.03 s

# %%time
skforest = sklearn.ensemble.IsolationForest(max_samples=1024)
skforest.fit(dataset.data)
skscores = skforest.score_samples(dataset.data)
# CPU times: user 32.3 s, sys: 4.48 s, total: 36.8 s
# Wall time: 36.8 s

And that's not the largest speedup: the more data we analyze, the more cores we have, and the more trees we build, the larger the speedup gets. In one setup (scoring 30M objects with 100-dimensional features on an 80-core machine), the author has seen the evaluation time drop from 24 hours to 1 minute.
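If you want to probe that scaling on your own machine, here is a minimal sketch (the dataset sizes and the plain wall-clock measurement are illustrative, not part of the package):

import time

import coniferest.isoforest
from coniferest.datasets import MalanchevDataset

# Time score evaluation on progressively larger datasets.
for inliers in (2**16, 2**18, 2**20):
    dataset = MalanchevDataset(inliers=inliers, outliers=2**6)
    forest = coniferest.isoforest.IsolationForest(n_subsamples=1024)
    forest.fit(dataset.data)
    start = time.monotonic()
    forest.score_samples(dataset.data)
    elapsed = time.monotonic() - start
    print(f'inliers={inliers}: {elapsed:.2f} s')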

The main object of optimization is score evaluation. So if you'd like to try it without using the isolation forest reimplementation, you may use just the evaluator, as follows:

# %%time
from coniferest.sklearn.isoforest import IsolationForestEvaluator

isoforest = sklearn.ensemble.IsolationForest(max_samples=1024)
isoforest.fit(dataset.data)
evaluator = IsolationForestEvaluator(isoforest)
scores = evaluator.score_samples(dataset.data)
# CPU times: user 17.1 s, sys: 13.9 ms, total: 17.2 s
# Wall time: 6.32 s
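Since the evaluator walks the trees of the already fitted scikit-learn forest, its scores should agree with scikit-learn's own up to floating-point error. A quick sanity check (an assumption worth verifying, not a documented guarantee):

import numpy as np

# Both score sets come from the same fitted trees, so they should
# differ only by numerical noise.
assert np.allclose(scores, isoforest.score_samples(dataset.data))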

Pine forest

Pine forest is an attempt to make the isolation forest capable of taking a bit of prior information into account. Let's take a data sample:

dataset = MalanchevDataset(inliers=100, outliers=10)
                                Plain data
     ┌───────────────────────────────────────────────────────────────┐
 1.12┤  .           .                                        .       │
     │        .  .         .                        .  .    .  .     │
 0.88┤.   . .        .                                      .      . │
     │   .                                             .             │
     │                                                               │
 0.64┤                                                               │
     │                .     .                                        │
     │         ... ..  .... ... .....                                │
  0.4┤        ....  .. .. .. .    .                                  │
     │          . ...     ..   ... .                                 │
 0.17┤        .  .  ...  ..... .  ..                   .             │
     │         .    .... .  . .. . .                 .  ..           │
     │         .   .      . . . ...                 .     .         .│
-0.07┤                                                       .       │
     │                                                               │
-0.31┤                                               .               │
     └┬──────────────┬───────────────┬───────────────┬──────────────┬┘
     -0.2           0.16            0.53            0.89          1.26

Here we have one bunch of inliers and three bunches of outliers (10 points each). What happens when we use a regular isolation forest (or, equivalently, PineForest without priors)?

import numpy as np

from coniferest.pineforest import PineForest

pineforest = PineForest(n_subsamples=16)
pineforest.fit(dataset.data)
scores = pineforest.score_samples(dataset.data)
np.argsort(scores)[:10]
                         PineForest without priors
     ┌───────────────────────────────────────────────────────────────┐
 1.12┤  *           .                                        *       │
     │        .  .         .                        .  *    *  *     │
 0.88┤*   . .        .                                      *      * │
     │   .                                             .             │
     │                                                               │
 0.64┤                                                               │
     │                .     .                                        │
     │         ... ..  .... ... .....                                │
  0.4┤        ....  .. .. .. .    .                                  │
     │          . ...     ..   ... .                                 │
 0.17┤        .  .  ...  ..... .  ..                   .             │
     │         .    .... .  . .. . .                 .  ..           │
     │         .   .      . . . ...                 .     .         *│
-0.07┤                                                       .       │
     │                                                               │
-0.31┤                                               .               │
     └┬──────────────┬───────────────┬───────────────┬──────────────┬┘
     -0.2           0.16            0.53            0.89          1.26

PineForest sees the upper corner as the most anomalous region, with some doubt about the two other bunches. Let's now add the prior information "the points (0, 1) and (1, 1) are regular and the point (1, 0) is anomalous":

from coniferest.label import Label  # Label.R: regular, Label.A: anomalous

priors = np.array([[0.0, 1.0],
                   [1.0, 1.0],
                   [1.0, 0.0]])

prior_labels = np.array([Label.R, Label.R, Label.A])

And see what happens:

pineforest.fit_known(dataset.data, priors, prior_labels)
scores = pineforest.score_samples(dataset.data)
np.argsort(scores)[:10]
                         PineForest with 3 priors
     ┌───────────────────────────────────────────────────────────────┐
 1.12┤  .           .                                        .       │
     │        .  .         .                        .  .    .  *     │
 0.88┤.   . .        .                                      .      * │
     │   .                                             .             │
     │                                                               │
 0.64┤                                                               │
     │                .     .                                        │
     │         ... ..  .... ... .....                                │
  0.4┤        ....  .. .. .. .    .                                  │
     │          . ...     ..   ... .                                 │
 0.17┤        .  .  ...  ..... .  ..                   .             │
     │         .    .... .  . .. . .                 *  **           │
     │         .   .      . . . ...                 *     *         *│
-0.07┤                                                       *       │
     │                                                               │
-0.31┤                                               *               │
     └┬──────────────┬───────────────┬───────────────┬──────────────┬┘
     -0.2           0.16            0.53            0.89          1.26

Now PineForest sees the lower-right outliers as anomalous, and still has some doubts about the upper-right bunch. We may supply more labeled points, and the more prior data we supply, the better the anomaly detection should become.
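This lends itself to a simple expert-in-the-loop routine: score, inspect the most anomalous unlabeled candidate, label it, and refit with the grown label set. A minimal sketch, where expert_label() is a hypothetical stand-in for a human decision:

import numpy as np

from coniferest.label import Label
from coniferest.pineforest import PineForest

labeled = []       # indices labeled so far
known_data = []    # their coordinates
known_labels = []  # Label.A or Label.R for each of them

pineforest = PineForest(n_subsamples=16)
for _ in range(5):  # a few labeling rounds
    if known_data:
        pineforest.fit_known(dataset.data,
                             np.array(known_data),
                             np.array(known_labels))
    else:
        pineforest.fit(dataset.data)
    scores = pineforest.score_samples(dataset.data)
    # Pick the lowest-scored (most anomalous) point not yet labeled.
    idx = next(i for i in np.argsort(scores) if i not in labeled)
    labeled.append(idx)
    known_data.append(dataset.data[idx])
    # expert_label() is hypothetical: a human decides Label.A or Label.R.
    known_labels.append(expert_label(dataset.data[idx]))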

The plots may be reproduced with the plotext_pineforest.py script:

cd scripts
python plotext_pineforest.py
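For reference, a minimal sketch of how such a plot can be drawn with the plotext library (the marker choices and the highlighting of the ten lowest-scored points are illustrative, not the script's actual contents):

import numpy as np
import plotext as plt

# Mark the ten lowest-scored (most anomalous) points with '*'.
mask = np.zeros(len(dataset.data), dtype=bool)
mask[np.argsort(scores)[:10]] = True

plt.scatter(dataset.data[~mask, 0].tolist(), dataset.data[~mask, 1].tolist(), marker='.')
plt.scatter(dataset.data[mask, 0].tolist(), dataset.data[mask, 1].tolist(), marker='*')
plt.title('PineForest with 3 priors')
plt.show()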
