Skip to main content

Geometric nearest-neighbor workflows with threshold-based anomaly detection.

Project description

GeometricNearestNeighborsProcessor

A Python package for Geometric Nearest Neighbors (GNN) workflows: data rescaling, fast anomalies finding, similarity matrices derivation.

This Python package is a translation to Python of the Wolfram Language software monad "MonadicGeometricNearestNeighbors", [AAp1]. (The R package "GNNMon-R", [AAp4], is another translation of [AAp1].)

Remark: In order to keep outputs consistent with R and WL package the data frame outputs of the Python package use camel case for the column names. (That might change in the future.)


Purpose and theoretical background

Consider the following computational tasks for a given set of $n$-dimensional (nD) points $P$:

  1. Find the points of $P$ that are outliers or anomalies
  2. Find the points of another set $P_1$ that can be seen anomalies wrt to $P$
  3. For a given nD point find its Nearest Neighbors (NNs) in $P$
    • The points of $P$ can have labels
    • It might be desired to get the distances and labels of the NNs
  4. Plot the points $P$ with minimal setup or specification writing
  5. Give the (sparse) proximity matrix of $P$ for a specified number of neighbors

Let us define an anomalous point as one that is "too far" from the other points.
Which points are "too far" from the rest can be determined by examining statistics of distances between each point and $k$ nearest neighbors of it.

More concretely, point anomalies are found in the following way:

  1. Input:
    • Points P as a data frame, list, or dictionary
    • Number of nearest neighbors n
    • Distance function d
    • Aggregation function a
  2. For each point of P
    • Find its n nearest neighbors
    • Aggregate with a the corresponding n distance
  3. Using the statistics for the previous step -- a 1D array -- find outlier identification parameters
    • Like, Hampel-, SPLUS-, Quartile parameters
  4. Identify anomalies using the parameters of the previous step

Usage examples

Load packages:

from RandomDataGenerators import *
from OutlierIdentifiers import *

import numpy
import random

import pandas as pd
import plotly.express as px

Generate random points:

help(random_data_frame)
dfPoints = random_data_frame(n_rows=30, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints.shape)
print(dfPoints[1:6])

Here is a summary:

dfPoints.describe()

Here is a plot of the points:

fig = px.scatter(dfPoints, x="X", y="Y", template="plotly_dark")
fig.show()

A typical pipeline of geometric nearest neighbors processing:

gnnObj = (GeometricNearestNeighborsProcessor(dsPoints)
   .make_nearest_function(distance_function = "EuclideanDistance")
   .compute_thresholds(number_of_nearest_neighbors = 10, aggregation_function = "mean", outlier_identifier = "QuartileIdentifierParameters")
   .find_anomalies()
   .echo_function_value("Anomaly points:", lambda x: print(x))
   .plot(title="Random points", template="plotly_dark")
)

Show the plot obtained above:

gnnObj.take_value().show()

Here we generate another set of random points using the same random point generators:

dfPoints2 = random_data_frame(n_rows=40, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints2.shape)

Here the points of second set are classified into being anomalous or not:

gnnObj.classify(dfPoints2).take_value()

See the notebook "Usage-examples.ipynb" for more detailed examples.


Implementation details

  • The package provides the class GeometricNearestNeighborsProcessor that can be used to construct chainable, monadic pipeline-like behavior.

  • Plotting of the data points is done via "plotly" -- just a scatter plot of 2D points for now.

  • The chainable behavior of the methods of the class GeometricNearestNeighborsProcessoris implemented by following the principle that all methods return self, except the so-called "takers".

    • I.e., methods with names that start with "take_".
  • The core Nearest Neighbors (NNs) finding functionality is provided by the "scikit-learn" class NearestNeighbors.

  • The NNs finding algorithms used by GeometricNearestNeighborsProcessor are "scan" and "kdtree".

    • "scan" is implemented in the class GeometricNearestNeighborsProcessor instead of delegating to scikit-learn's NearestNeighbors "brute" algorithm.
    • "kdtree" delegates to NearestNeighbors "kd_tree" algorithm.

References

Wolfram Language

[AAp1] Anton Antonov, MonadicGeometricNearestNeighbors, Wolfram Language paclet, (2023-2025), Wolfram Language Paclet Repository.

[AAp2] Anton Antonov, OutlierIdentifiers, Wolfram Language paclet, (2023), Wolfram Language Paclet Repository.

R

[AAp3] Anton Antonov, OutlierIdentifiers, R package, (2019-2024), GitHub/antononcube.

[AAp4] Anton Antonov, GNNMon-R, R package, (2019-2025), GitHub/antononcube.

[AAp5] Anton Antonov, KDTreeAlgorithm, R package, (2025), GitHub/antononcube.

Python

[AAp6] Anton Antonov, RandomDataGenerators, Python package, (2021-2026), GitHub/antononcube. (PIPy.org.)

[AAp7] Anton Antonov, OutlierIdentifiers, Python package, (2024), GitHub/antononcube. (PIPy.org.)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

geometricnearestneighborsprocessor-0.1.0.tar.gz (8.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

File details

Details for the file geometricnearestneighborsprocessor-0.1.0.tar.gz.

File metadata

File hashes

Hashes for geometricnearestneighborsprocessor-0.1.0.tar.gz
Algorithm Hash digest
SHA256 480ecadf894fcf7ecdef62a518e2e0c5edae6545272f6161f83123bf696f5acc
MD5 fa48547c9aac2043fd65b6cbf26a8268
BLAKE2b-256 d67975fc4b361a1f196b12d7f1769fad2a2a0a35ad1db6452671bf11bf06bfc7

See more details on using hashes here.

File details

Details for the file geometricnearestneighborsprocessor-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for geometricnearestneighborsprocessor-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 03e13f373487197f2afe46dee897bf8c85fc4a500c19638c98a06dd75c5e4366
MD5 e9b875eb2f1670fbdf9e5906817f0633
BLAKE2b-256 80e15a170d4b2131db89457672b876a74d5c5ac970d4bb7c39de26190442b5e7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page