Python implementation of the bisect algorithm for hidden outlier generation.
Hidden Outlier Generation
A Python library for generating synthetic hidden outliers using the BISECT algorithm.
Background
Douglas Hawkins defined an outlier as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." In two dimensions, outliers are easy to spot. But in high-dimensional data (hundreds or thousands of features), the notion of "far from everything else" breaks down. The curse of dimensionality makes traditional outlier detection unreliable.
Hidden outliers are anomalies that look perfectly normal in the full feature space but reveal their true nature in specific feature subspaces. Consider a financial transaction: the amount, time, and location might each look normal individually, but the specific combination creates an anomaly. These are exactly the outliers that slip through conventional detection methods, and they appear frequently in fraud detection, infrastructure monitoring, and healthcare analytics.
This library implements the BISECT algorithm for generating hidden outliers, combined with manifold learning techniques to make generation tractable in high-dimensional spaces.
How It Works
The BISECT algorithm finds points in the "area of disagreement" between full-space and subspace outlier detection models:
- Origin Selection: Start from an inlier point (using strategies like weighted sampling toward the boundary)
- Direction: Pick a random direction by sampling from a d-dimensional unit sphere
- Bisection: Use binary search along the direction to find the boundary where outlier status changes
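The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names (`random_unit_direction`, `bisect_boundary`, `toy_is_outlier`), the search interval, and the tolerance are all hypothetical. For brevity it uses a single toy detector (distance from the origin above a fixed radius) instead of comparing full-space and subspace models.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Sample a direction on the unit sphere by normalizing Gaussian draws."""
    while True:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0.0:
            return [v / norm for v in vec]


def bisect_boundary(origin, direction, is_outlier, max_length=10.0, tol=1e-9):
    """Binary-search the step t where origin + t * direction turns from inlier to outlier."""

    def point_at(t):
        return [o + t * d for o, d in zip(origin, direction)]

    lo, hi = 0.0, max_length
    # The search only works if the interval brackets a status change.
    if is_outlier(point_at(lo)) or not is_outlier(point_at(hi)):
        raise ValueError("interval does not bracket a boundary")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_outlier(point_at(mid)):
            hi = mid  # boundary lies closer to the origin
        else:
            lo = mid  # boundary lies further out
    return (lo + hi) / 2.0


def toy_is_outlier(point, radius=2.0):
    """Hypothetical stand-in detector: outlier if farther than `radius` from zero."""
    return math.sqrt(sum(x * x for x in point)) > radius


rng = random.Random(42)
origin = [0.0, 0.0, 0.0]                      # step 1: an inlier origin
direction = random_unit_direction(3, rng)     # step 2: a random direction
boundary_t = bisect_boundary(origin, direction, toy_is_outlier)  # step 3
boundary_point = [o + boundary_t * d for o, d in zip(origin, direction)]
```

Because the toy detector's boundary is a sphere of radius 2 around the origin, `boundary_t` converges to roughly 2 regardless of the sampled direction.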
For high-dimensional data, the library supports projecting data onto a learned manifold (via autoencoders or PCA), generating hidden outliers in the tractable latent space, then decoding back to the original space.
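The project-generate-decode idea can be illustrated with a linear manifold. The sketch below fits PCA with NumPy's SVD on toy data; the `encode` and `decode` helpers are hypothetical names, and the "generation" step is only a placeholder that scales a few latent points outward. It does not use the library's own manifold code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points lying near a 2-D plane embedded in 10 dimensions.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
data = latent_true @ mixing + 0.01 * rng.normal(size=(200, 10))

# Fit PCA via SVD of the centered data and keep the top-k components.
k = 2
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
components = vt[:k]  # shape (k, n_features), rows are orthonormal


def encode(x):
    """Project points from the original space into the k-dimensional latent space."""
    return (x - mean) @ components.T


def decode(z):
    """Map latent points back to the original feature space."""
    return z @ components + mean


latent = encode(data)                   # shape (200, 2)

# Placeholder for the generation step (assumption: any generator that
# works in the latent space could be plugged in here).
candidates_latent = latent[:5] * 3.0
candidates = decode(candidates_latent)  # shape (5, 10)
```

An autoencoder would replace `encode` and `decode` with a learned non-linear encoder and decoder while keeping the same three-stage flow.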
What are Hidden Outliers?
Hidden outliers are data points that exhibit different outlier behavior depending on which feature subspace you examine:
- H1 (Subspace Hidden): Outlier in some feature subspace but NOT in the full feature space
- H2 (Full-space Hidden): Outlier in the full feature space but NOT in any subspace
These are useful for benchmarking outlier detection algorithms, especially subspace-aware methods.
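The two definitions translate directly into a small labeling rule. The function below is a hypothetical helper written for illustration (it is not part of the library's API); it assumes you already have one outlier flag for the full space and one flag per examined subspace.

```python
from typing import Dict, Optional, Tuple


def classify_hidden(
    full_space_outlier: bool,
    subspace_outlier: Dict[Tuple[int, ...], bool],
) -> Optional[str]:
    """Return "H1", "H2", or None according to the definitions above."""
    any_subspace = any(subspace_outlier.values())
    if any_subspace and not full_space_outlier:
        return "H1"  # outlier in some subspace, inlier in the full space
    if full_space_outlier and not any_subspace:
        return "H2"  # outlier in the full space, inlier in every subspace
    return None      # the models agree, so the point is not hidden


# Subspaces are keyed by the feature indices they contain.
h1_example = classify_hidden(False, {(0, 1): True, (0, 2): False, (1, 2): False})
h2_example = classify_hidden(True, {(0, 1): False, (0, 2): False, (1, 2): False})
plain_outlier = classify_hidden(True, {(0, 1): True, (0, 2): False, (1, 2): False})
plain_inlier = classify_hidden(False, {(0, 1): False, (0, 2): False, (1, 2): False})
```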
Installation
pip install hidden-outlier-generation
Or with uv:
uv add hidden-outlier-generation
Quick Start
import numpy as np
from pyod.models.lof import LOF
from hog_bisect import BisectHOGen
# Your dataset
data = np.random.randn(200, 5)
# Create generator
generator = BisectHOGen(
    data=data,
    outlier_detection_method=LOF,
    seed=42
)
# Generate hidden outliers
hidden_outliers = generator.fit_generate(gen_points=50)
print(f"Generated {len(hidden_outliers)} hidden outliers")
generator.print_summary()
Features
- Multiple origin strategies: centroid, least outlier, random, weighted
- Flexible detection methods: Any PyOD detector (LOF, KNN, IForest, etc.)
- Parallel processing: Use n_jobs=-1 for multi-core execution
- Reproducible: Seed parameter for deterministic results
- Type hints: Full typing support with py.typed marker
API Reference
BisectHOGen
BisectHOGen(
    data: np.ndarray,              # Input dataset (n_samples, n_features)
    outlier_detection_method=LOF,  # PyOD detector class
    seed: int = 5,                 # Random seed
    max_dimensions: int = 11       # Threshold for random subspace sampling
)
fit_generate()
generator.fit_generate(
    gen_points: int = 100,              # Number of candidate points
    check_fast: bool = True,            # Fast subspace checking
    is_fixed_interval_length: bool = True,
    get_origin_type: str = "weighted",  # Origin strategy
    verbose: bool = False,
    n_jobs: int = 1                     # Parallel workers (-1 for all cores)
) -> np.ndarray                         # Array of hidden outliers
Origin Types
| Type | Description |
|---|---|
| "centroid" | Use data mean as origin (deterministic) |
| "least outlier" | Use most normal point (stable) |
| "random" | Random inlier each iteration (diverse) |
| "weighted" | Weighted random toward normal points (recommended) |
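To make the four strategies concrete, here is one way they could be written in plain Python. Everything in this sketch is an assumption for illustration: the `pick_origin` helper, the convention that higher scores mean "more anomalous", and in particular the weighting formula, which may differ from what the library does internally. The points passed in are assumed to be inliers already.

```python
import random


def pick_origin(points, scores, strategy, rng):
    """Pick a bisection origin from inlier `points` using their outlier `scores`."""
    if strategy == "centroid":
        dim = len(points[0])
        return [sum(p[i] for p in points) / len(points) for i in range(dim)]
    if strategy == "least outlier":
        best = min(range(len(points)), key=lambda i: scores[i])
        return list(points[best])
    if strategy == "random":
        return list(rng.choice(points))
    if strategy == "weighted":
        # Hypothetical weighting: lower score (more normal) means higher weight.
        top = max(scores)
        weights = [top - s + 1e-9 for s in scores]
        return list(rng.choices(points, weights=weights, k=1)[0])
    raise ValueError(f"unknown origin strategy: {strategy!r}")


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 0.3, 0.2, 0.9]  # higher = more anomalous (assumed convention)
rng = random.Random(0)

centroid = pick_origin(points, scores, "centroid", rng)
most_normal = pick_origin(points, scores, "least outlier", rng)
random_origin = pick_origin(points, scores, "random", rng)
weighted = pick_origin(points, scores, "weighted", rng)
```

The first two strategies return the same origin on every call, while the last two draw a fresh origin each time, which is why the table describes them as deterministic or stable versus diverse.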
Examples
The repository includes example scripts in the examples/ directory:
# Clone the repo to access examples
git clone https://github.com/dschulmeist/hidden-outlier-generation
cd hidden-outlier-generation
# Run examples
python examples/basic_usage.py
python examples/compare_origins.py
python examples/compare_detectors.py
python examples/visualize_outliers.py # requires matplotlib
Development
# Clone and install with dev dependencies
git clone https://github.com/dschulmeist/hidden-outlier-generation
cd hidden-outlier-generation
uv sync --all-extras
# Run tests
uv run pytest
# Run linting
uv run ruff check
See CONTRIBUTING.md for development guidelines and release process.
Research
This library was developed as part of a bachelor thesis at Karlsruhe Institute of Technology (KIT) exploring hidden outlier generation using deep learning and manifold learning. Key findings:
- Autoencoders capture non-linear manifold structure better than linear methods like PCA
- Hidden outliers can serve as a synthetic positive class for training classifiers, turning unsupervised anomaly detection into a supervised problem
- Manifold projection makes high-dimensional hidden outlier generation tractable by reducing the exponential subspace search problem
Experiment notebooks demonstrating these findings are coming soon.
License
MIT License - see LICENSE for details.
Citation
If you use this library in your research, please cite:
@software{hidden_outlier_generation,
title = {Hidden Outlier Generation},
author = {Schulmeister, David},
url = {https://github.com/dschulmeist/hidden-outlier-generation},
year = {2023}
}