Python implementation of the BISECT algorithm for hidden outlier generation.

Hidden Outlier Generation

Requires Python 3.10+. License: MIT.

A Python library for generating synthetic hidden outliers using the BISECT algorithm.

Background

Douglas Hawkins defined an outlier as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." In two dimensions, outliers are easy to spot. But in high-dimensional data (hundreds or thousands of features), the notion of "far from everything else" breaks down. The curse of dimensionality makes traditional outlier detection unreliable.

Hidden outliers are anomalies whose outlier status depends on which features you examine: a detector flags them in the full feature space or in some feature subspace, but not in both. Consider a financial transaction: the amount, time, and location might each look normal individually, but the specific combination is anomalous. These are exactly the outliers that slip through conventional detection methods, and they appear frequently in fraud detection, infrastructure monitoring, and healthcare analytics.

This library implements the BISECT algorithm for generating hidden outliers, combined with manifold learning techniques to make generation tractable in high-dimensional spaces.

How It Works

The BISECT algorithm finds points in the "area of disagreement" between full-space and subspace outlier detection models:

  1. Origin Selection: Start from an inlier point (using strategies like weighted sampling toward the boundary)
  2. Direction: Pick a random direction by sampling from a d-dimensional unit sphere
  3. Bisection: Use binary search along the direction to find the boundary where outlier status changes
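The direction and bisection steps above can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: `random_unit_direction`, `bisect_boundary`, and `toy_is_outlier` are hypothetical names, and the toy detector (a fixed radius around the origin) stands in for a fitted outlier detection model. It searches for the boundary of a single detector; the library works with the disagreement between full-space and subspace models.

```python
import numpy as np


def random_unit_direction(d, rng):
    """Sample a direction uniformly from the unit sphere in d dimensions."""
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)


def bisect_boundary(origin, direction, is_outlier, max_length=10.0, tol=1e-6):
    """Binary-search the step length at which is_outlier flips along a ray.

    Assumes the origin is an inlier and that the point at max_length
    along the direction is an outlier.
    """
    lo, hi = 0.0, max_length
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_outlier(origin + mid * direction):
            hi = mid  # boundary is closer to the origin
        else:
            lo = mid  # boundary is farther out
    return 0.5 * (lo + hi)


def toy_is_outlier(x):
    """Toy stand-in detector: outlier if farther than 2.0 from the origin."""
    return np.linalg.norm(x) > 2.0


rng = np.random.default_rng(0)
origin = np.zeros(3)
direction = random_unit_direction(3, rng)
length = bisect_boundary(origin, direction, toy_is_outlier)
boundary_point = origin + length * direction
```

Normalizing a standard normal vector is a common way to get a uniformly distributed direction, because the multivariate normal distribution is rotationally symmetric.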

For high-dimensional data, the library supports projecting data onto a learned manifold (via autoencoders or PCA), generating hidden outliers in the tractable latent space, then decoding back to the original space.
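The project, generate, decode round trip can be illustrated with a linear manifold. The sketch below fits PCA with NumPy's SVD on toy data; `encode`, `decode`, and the random perturbation used as a placeholder for the generation step are assumptions made for illustration and do not reflect the library's own manifold API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 20 dimensions lying near a 3-dimensional linear subspace
latent_true = rng.standard_normal((200, 3))
mixing = rng.standard_normal((3, 20))
data = latent_true @ mixing + 0.01 * rng.standard_normal((200, 20))

# Fit PCA: centre the data, then keep the top 3 right singular vectors
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
components = vt[:3]  # shape (3, 20), rows are orthonormal


def encode(x):
    """Project points from the original space into the latent space."""
    return (x - mean) @ components.T


def decode(z):
    """Map latent points back to the original space."""
    return z @ components + mean


latent = encode(data)  # shape (200, 3)

# Placeholder for the generation step: perturb a few latent points
candidates_latent = latent[:5] + 0.5 * rng.standard_normal((5, 3))
candidates = decode(candidates_latent)  # shape (5, 20)

# How much information the 3-dimensional projection loses on the training data
reconstruction_error = np.abs(decode(latent) - data).max()
```

An autoencoder would replace `encode` and `decode` with learned non-linear maps; the surrounding workflow stays the same.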

What are Hidden Outliers?

Hidden outliers are data points that exhibit different outlier behavior depending on which feature subspace you examine:

  • H1 (Subspace Hidden): Outlier in some feature subspace but NOT in the full feature space
  • H2 (Full-space Hidden): Outlier in the full feature space but NOT in any subspace
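The two definitions can be written as a small labeling function. This is a sketch under stated assumptions: `hidden_type`, `proper_subspaces`, and the toy radius-based detectors are hypothetical and only illustrate the H1/H2 logic. The exhaustive loop over subspaces is practical only for a handful of features.

```python
import math
from itertools import combinations


def proper_subspaces(n_features):
    """Yield every non-empty proper subset of feature indices."""
    for size in range(1, n_features):
        yield from combinations(range(n_features), size)


def hidden_type(point, full_space_is_outlier, subspace_is_outlier):
    """Return "H1", "H2", or None for a point.

    Both callables are hypothetical stand-ins for fitted detectors:
    full_space_is_outlier(point) -> bool
    subspace_is_outlier(point, features) -> bool
    """
    in_full = full_space_is_outlier(point)
    in_some_subspace = any(
        subspace_is_outlier(point, features)
        for features in proper_subspaces(len(point))
    )
    if in_some_subspace and not in_full:
        return "H1"  # outlier in a subspace, inlier in the full space
    if in_full and not in_some_subspace:
        return "H2"  # outlier in the full space, inlier in every subspace
    return None  # the two views agree, so the point is not hidden


# Toy detectors for 2-D points: a radius threshold per subspace size
THRESHOLDS = {1: 1.8, 2: 2.0}


def toy_subspace_is_outlier(point, features):
    radius = math.sqrt(sum(point[i] ** 2 for i in features))
    return radius > THRESHOLDS[len(features)]


def toy_full_space_is_outlier(point):
    return toy_subspace_is_outlier(point, tuple(range(len(point))))


points = [(1.9, 0.0), (1.5, 1.5), (0.5, 0.5), (3.0, 3.0)]
labels = [
    hidden_type(p, toy_full_space_is_outlier, toy_subspace_is_outlier)
    for p in points
]
# labels == ["H1", "H2", None, None]
```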

These are useful for benchmarking outlier detection algorithms, especially subspace-aware methods.

Installation

pip install hidden-outlier-generation

Or with uv:

uv add hidden-outlier-generation

Quick Start

import numpy as np
from pyod.models.lof import LOF
from hog_bisect import BisectHOGen

# Your dataset
data = np.random.randn(200, 5)

# Create generator
generator = BisectHOGen(
    data=data,
    outlier_detection_method=LOF,
    seed=42
)

# Generate hidden outliers
hidden_outliers = generator.fit_generate(gen_points=50)

print(f"Generated {len(hidden_outliers)} hidden outliers")
generator.print_summary()

Features

  • Multiple origin strategies: centroid, least outlier, random, weighted
  • Flexible detection methods: Any PyOD detector (LOF, KNN, IForest, etc.)
  • Parallel processing: Use n_jobs=-1 for multi-core execution
  • Reproducible: Seed parameter for deterministic results
  • Type hints: Full typing support with py.typed marker

API Reference

BisectHOGen

BisectHOGen(
    data: np.ndarray,                    # Input dataset (n_samples, n_features)
    outlier_detection_method=LOF,        # PyOD detector class
    seed: int = 5,                       # Random seed
    max_dimensions: int = 11             # Threshold for random subspace sampling
)
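The `max_dimensions` comment suggests that subspaces are sampled randomly once the data has too many features to check them all. One plausible reading is sketched below; the function name `candidate_subspaces`, the `n_random` parameter, and the strict `<` comparison are assumptions, not the library's documented behavior.

```python
import random
from itertools import combinations


def candidate_subspaces(n_features, max_dimensions=11, n_random=100, seed=5):
    """Return feature-index tuples describing the subspaces to check.

    Assumption: enumerate every non-empty proper subset while the feature
    count is below max_dimensions, otherwise draw n_random distinct subsets.
    """
    if n_features < max_dimensions:
        return [
            features
            for size in range(1, n_features)
            for features in combinations(range(n_features), size)
        ]

    rng = random.Random(seed)
    # Never ask for more distinct subsets than exist
    n_random = min(n_random, 2 ** n_features - 2)
    sampled = set()
    while len(sampled) < n_random:
        size = rng.randint(1, n_features - 1)
        sampled.add(tuple(sorted(rng.sample(range(n_features), size))))
    return sorted(sampled)


small = candidate_subspaces(4)                # exhaustive: 2**4 - 2 = 14 subsets
large = candidate_subspaces(30, n_random=50)  # random sample of 50 subsets
```

The reason for a threshold of this kind is that the number of non-empty proper subspaces is 2^d - 2, which is 1,022 for d = 10 but already over a million for d = 20.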

fit_generate()

generator.fit_generate(
    gen_points: int = 100,               # Number of candidate points
    check_fast: bool = True,             # Fast subspace checking
    is_fixed_interval_length: bool = True,
    get_origin_type: str = "weighted",   # Origin strategy
    verbose: bool = False,
    n_jobs: int = 1                      # Parallel workers (-1 for all cores)
) -> np.ndarray                          # Array of hidden outliers

Origin Types

Type             Description
"centroid"       Use data mean as origin (deterministic)
"least outlier"  Use most normal point (stable)
"random"         Random inlier each iteration (diverse)
"weighted"       Weighted random toward normal points (recommended)

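The four origin strategies could look roughly like the sketch below. It is illustrative only: `pick_origin` is a hypothetical helper, the scores are assumed to be outlier scores where higher means more anomalous, the data is assumed to contain only inliers, and the weighting formula is one simple choice rather than the library's.

```python
import numpy as np


def pick_origin(data, scores, strategy, rng):
    """Choose a starting point for the bisection.

    data:   array of shape (n_samples, n_features), assumed to be inliers
    scores: array of shape (n_samples,), higher = more anomalous (assumption)
    """
    if strategy == "centroid":
        return data.mean(axis=0)
    if strategy == "least outlier":
        return data[np.argmin(scores)]
    if strategy == "random":
        return data[rng.integers(len(data))]
    if strategy == "weighted":
        # Hypothetical weighting: lower scores get a higher probability
        weights = scores.max() - scores + 1e-12
        return data[rng.choice(len(data), p=weights / weights.sum())]
    raise ValueError(f"unknown origin strategy: {strategy!r}")


rng = np.random.default_rng(0)
data = rng.standard_normal((50, 4))
# Toy score: distance from the data mean
scores = np.linalg.norm(data - data.mean(axis=0), axis=1)

origins = {
    name: pick_origin(data, scores, name, rng)
    for name in ("centroid", "least outlier", "random", "weighted")
}
```

The first two strategies return the same origin on every call, while the last two vary with the random generator, which matches the deterministic, stable, and diverse notes in the table.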
Examples

The repository includes example scripts in the examples/ directory:

# Clone the repo to access examples
git clone https://github.com/dschulmeist/hidden-outlier-generation
cd hidden-outlier-generation

# Run examples
python examples/basic_usage.py
python examples/compare_origins.py
python examples/compare_detectors.py
python examples/visualize_outliers.py  # requires matplotlib

Development

# Clone and install with dev dependencies
git clone https://github.com/dschulmeist/hidden-outlier-generation
cd hidden-outlier-generation
uv sync --all-extras

# Run tests
uv run pytest

# Run linting
uv run ruff check

See CONTRIBUTING.md for development guidelines and release process.

Research

This library was developed as part of a bachelor thesis at Karlsruhe Institute of Technology (KIT) exploring hidden outlier generation using deep learning and manifold learning. Key findings:

  • Autoencoders capture non-linear manifold structure better than linear methods like PCA
  • Hidden outliers can serve as a synthetic positive class for training classifiers, turning unsupervised anomaly detection into a supervised problem
  • Manifold projection makes high-dimensional hidden outlier generation tractable: the number of candidate subspaces grows exponentially with the number of features, and working in a low-dimensional latent space shrinks that search

Experiment notebooks demonstrating these findings are coming soon.

License

MIT License - see LICENSE for details.

Citation

If you use this library in your research, please cite:

@software{hidden_outlier_generation,
  title = {Hidden Outlier Generation},
  author = {Schulmeister, David},
  url = {https://github.com/dschulmeist/hidden-outlier-generation},
  year = {2023}
}
