SegmentAE: A Python Library for Anomaly Detection Optimization

Framework Overview

SegmentAE is designed to improve anomaly detection performance by optimizing reconstruction error: it combines clustering methods with tabular autoencoders so that reconstruction error is handled per cluster rather than over the dataset as a whole. It aims to be a versatile, scalable, and robust solution for anomaly detection applications in domains such as financial fraud detection, network security, and industrial monitoring.

Key Architectural Features (v2.0+)

  • Professional Architecture: Clean separation of concerns with robust principles
  • Type Safety: Comprehensive Pydantic validation and type hints throughout
  • Design Patterns: Registry, Strategy, and Template Method patterns
  • Enum-Based Configuration: Type-safe constants for all parameters
  • Custom Exceptions: Informative error messages with actionable suggestions
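
As a rough illustration of how enum-based configuration and informative exceptions can work together, the sketch below resolves a user-supplied string or enum member into an enum value and raises an error that lists the valid options. The names `ScalerKind`, `ConfigurationError`, and `resolve_option` are hypothetical and are not part of the SegmentAE API.

```python
from enum import Enum


class ScalerKind(Enum):
    """Hypothetical enum of scaler options (not SegmentAE's ScalerType)."""
    MINMAX = "MinMaxScaler"
    STANDARD = "StandardScaler"
    ROBUST = "RobustScaler"


class ConfigurationError(ValueError):
    """Hypothetical exception that carries a suggestion for the user."""

    def __init__(self, message, suggestion):
        super().__init__(f"{message} Suggestion: {suggestion}")
        self.suggestion = suggestion


def resolve_option(value, enum_cls):
    """Return an enum member given either a member or a matching string."""
    if isinstance(value, enum_cls):
        return value
    wanted = str(value).lower()
    for member in enum_cls:
        # Accept the member name or its value, case-insensitively.
        if wanted in (member.name.lower(), member.value.lower()):
            return member
    valid = ", ".join(m.value for m in enum_cls)
    raise ConfigurationError(
        f"Unknown option {value!r} for {enum_cls.__name__}.",
        f"use one of: {valid}",
    )


print(resolve_option("minmax", ScalerKind))
print(resolve_option(ScalerKind.ROBUST, ScalerKind))
try:
    resolve_option("zscore", ScalerKind)
except ConfigurationError as err:
    print(err)
```

Resolving strings into enum members at the boundary means the rest of the code only ever compares enum members, and a typo fails early with the list of accepted values.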

Key Features and Capabilities

1. General Applicability on Tabular Datasets

SegmentAE is designed to handle a wide range of tabular datasets, which makes it suitable for anomaly detection tasks across many use cases and straightforward to integrate into existing applications.

2. Optimization and Customization

The framework offers complete configurability for each component of the anomaly detection pipeline, including:

  • Data Preprocessing: Encoding, scaling, and imputation with Pydantic validation
  • Clustering Algorithms: Registry-based clustering with easy extensibility
  • Autoencoder Integration: Support for custom Keras/TensorFlow models or built-in implementations

Each component can be fine-tuned to achieve optimal performance tailored to specific use cases.

3. Enhanced Detection Performance

By combining clustering algorithms with autoencoder-based anomaly detection, SegmentAE aims to improve the accuracy and reliability of anomaly detection. Clustering first separates the input data into groups that follow different patterns; the reconstruction error is then optimized for each cluster, which is intended to improve predictive performance.
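
To make the per-cluster idea concrete, the following sketch uses plain Python and made-up numbers. It assumes, for illustration only, that each cluster's threshold is the mean training reconstruction error of that cluster multiplied by a threshold ratio. The function names and the exact thresholding rule are assumptions, not SegmentAE's implementation.

```python
from collections import defaultdict
from statistics import mean


def fit_cluster_thresholds(train_errors, train_clusters, threshold_ratio):
    """Compute one threshold per cluster from training reconstruction errors.

    Assumed rule (illustrative): threshold = ratio * mean error of the cluster.
    """
    grouped = defaultdict(list)
    for err, cluster in zip(train_errors, train_clusters):
        grouped[cluster].append(err)
    return {c: threshold_ratio * mean(errs) for c, errs in grouped.items()}


def flag_anomalies(errors, clusters, thresholds):
    """Return 1 where a row's error exceeds its own cluster's threshold."""
    return [int(err > thresholds[c]) for err, c in zip(errors, clusters)]


# Toy data: cluster 0 reconstructs well, cluster 1 is inherently noisier.
train_errors = [0.01, 0.02, 0.03, 0.20, 0.30, 0.40]
train_clusters = [0, 0, 0, 1, 1, 1]

thresholds = fit_cluster_thresholds(train_errors, train_clusters, threshold_ratio=2.0)

# An error of 0.10 is unusual for cluster 0 but ordinary for cluster 1.
new_errors = [0.10, 0.10, 0.90]
new_clusters = [0, 1, 1]
flags = flag_anomalies(new_errors, new_clusters, thresholds)
print(flags)
```

In this toy setup a single global threshold of twice the overall mean training error (about 0.32) would not flag the first row, whereas the cluster-specific threshold (about 0.04) does.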

Main Development Tools

Major frameworks used to build this project:

Where to Get It

A binary installer for the latest released version is available from the Python Package Index (PyPI).

GitHub Project Link: https://github.com/TsLu1s/SegmentAE

Installation

To install this package from the PyPI repository, run the following command:

pip install segmentae

SegmentAE - Technical Components and Pipeline Structure

The SegmentAE framework consists of several integrated components, each playing a critical role in the optimization of anomaly detection through clustering and tabular autoencoders. The pipeline is structured with professional design patterns to ensure seamless data flow and modular customization.

1. Data Preprocessing

Proper preprocessing is crucial for ensuring the quality and consistency of data. The preprocessing module now includes:

  • Pydantic Validation: Automatic type checking and conversion
  • Type-Safe Configuration: Enum-based parameter selection
  • Missing Value Imputation: Simple statistical imputation methods
  • Normalization: MinMax, Standard, and Robust scaling options
  • Categorical Encoding: Inverse Frequency, Label, and One-Hot Encoding

Example:

from segmentae.preprocessing import Preprocessing
from segmentae.core import EncoderType, ScalerType

# Type-safe configuration with enums
pr = Preprocessing(
    encoder=EncoderType.IFREQUENCY,  
    scaler=ScalerType.MINMAX,
    imputer="Simple"                # Strings are also supported
)
pr.fit(X_train)
X_transformed = pr.transform(X_test)
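
The fit/transform split in the example matters because imputation and scaling statistics should come from the training data only. The sketch below shows that idea for a single numeric column, using mean imputation followed by min-max scaling. `ColumnScaler` is a hypothetical helper written in plain Python for illustration and is not how SegmentAE's `Preprocessing` class is implemented.

```python
from statistics import mean


class ColumnScaler:
    """Hypothetical single-column mean imputer + min-max scaler."""

    def fit(self, values):
        # Learn statistics from observed (non-missing) training values only.
        observed = [v for v in values if v is not None]
        self.fill_ = mean(observed)
        self.min_ = min(observed)
        self.max_ = max(observed)
        return self

    def transform(self, values):
        span = (self.max_ - self.min_) or 1.0  # guard against a constant column
        filled = [self.fill_ if v is None else v for v in values]
        return [(v - self.min_) / span for v in filled]


train_col = [10.0, 20.0, None, 30.0]
test_col = [15.0, None, 40.0]

scaler = ColumnScaler().fit(train_col)
train_scaled = scaler.transform(train_col)
test_scaled = scaler.transform(test_col)
print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum are learned from the training column, a test value outside that range (40.0 here) maps outside the [0, 1] interval instead of silently redefining the scale.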

2. Clustering

Clustering forms the backbone of the SegmentAE framework and is designed for easy extensibility:

  • Registry Pattern: Clean model registration and instantiation
  • Type Safety: Pydantic validation for all parameters
  • Four Algorithms: K-Means, MiniBatch K-Means, Gaussian Mixture, Agglomerative
  • Extensible Design: Easy to add new clustering algorithms

Example:

from segmentae.clustering import Clustering
from segmentae.core import ClusterModel

cl = Clustering(
    cluster_model=[ClusterModel.KMEANS],  # Enum-based
    n_clusters=3
)
cl.clustering_fit(X_train)

3. Anomaly Detection - Autoencoders

The core of the SegmentAE framework employs advanced autoencoder architectures:

  • Three Baseline Implementations: Dense, BatchNorm, and Ensemble autoencoders
  • Custom Model Support: Integrate any Keras/TensorFlow model
  • Full Customization: Network architecture, training epochs, activation layers, and more
  • Type-Safe Integration: Validated through protocols

The framework includes three baseline autoencoder algorithms for user application, allowing complete customization of network architecture, training parameters, and activation functions.

Custom Model Integration: You can build your own autoencoder model (Keras-based) and integrate it seamlessly into the SegmentAE pipeline → Custom Model

Unlabeled Data Support: An application example for fully unlabeled data is available here → Unlabeled Example
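
An autoencoder scores a row by how poorly it reconstructs that row. The sketch below computes four common row-wise reconstruction error metrics (MSE, MAE, RMSE, and maximum absolute error) in plain Python. The `row_errors` helper and the stand-in reconstruction are illustrative only; in practice the reconstructed values would come from a trained autoencoder.

```python
from math import sqrt


def row_errors(original, reconstructed):
    """Return common reconstruction error metrics for a single row."""
    diffs = [o - r for o, r in zip(original, reconstructed)]
    n = len(diffs)
    mse = sum(d * d for d in diffs) / n
    return {
        "mse": mse,
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": sqrt(mse),
        "max_error": max(abs(d) for d in diffs),
    }


# Stand-in values: one feature out of four is reconstructed badly.
original = [0.0, 0.5, 1.0, 0.5]
reconstructed = [0.0, 0.5, 0.0, 0.5]
errors = row_errors(original, reconstructed)
print(errors)
```

The averaged metrics dilute a single badly reconstructed feature across all features, while the maximum error reports it directly, which is one reason the choice of metric can change which rows cross a threshold.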

SegmentAE - Predictive Application

The following example demonstrates the complete workflow from data loading to anomaly detection using a DenseAutoencoder integrated with KMeans clustering.

import pandas as pd
from segmentae import SegmentAE, Preprocessing, Clustering
from segmentae.autoencoders import DenseAutoencoder
from segmentae.core import EncoderType, ScalerType, ClusterModel, ThresholdMetric
from segmentae.data_sources import load_dataset
from segmentae.metrics import metrics_classification
from sklearn.model_selection import train_test_split

############################################################################################
### Data Loading

train, test, target = load_dataset(
    dataset_selection='htru2_dataset',
    split_ratio=0.75
)

test, future_data = train_test_split(test, train_size=0.9, random_state=5)

# Reset indices (required)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
future_data = future_data.reset_index(drop=True)

# Separate features and targets
X_train, y_train = train.drop(columns=[target]).copy(), train[target].astype(int)
X_test, y_test = test.drop(columns=[target]).copy(), test[target].astype(int)
X_future_data = future_data.drop(columns=[target]).copy()
y_future_data = future_data[target].astype(int)

############################################################################################
### Preprocessing with Type-Safe Configuration (v2.0+)

pr = Preprocessing(
    encoder=EncoderType.IFREQUENCY,  # Type-safe enum: EncoderType.LABEL, EncoderType.ONEHOT
    scaler=ScalerType.MINMAX,        # Type-safe enum: ScalerType.STANDARD, ScalerType.ROBUST
    imputer=None
)

pr.fit(X=X_train)
X_train = pr.transform(X=X_train)
X_test = pr.transform(X=X_test)
X_future_data = pr.transform(X=X_future_data)

############################################################################################
### Clustering Implementation with Type-Safe Registry

cl_model = Clustering(
    cluster_model=[ClusterModel.KMEANS],  # Type-safe enum: ClusterModel.MINIBATCH_KMEANS, ClusterModel.GMM, ClusterModel.AGGLOMERATIVE
    n_clusters=3
)
cl_model.clustering_fit(X=X_train)

############################################################################################
### Autoencoder Implementation

denseAutoencoder = DenseAutoencoder(
    hidden_dims=[16, 12, 8, 4],
    encoder_activation='relu',
    decoder_activation='relu',
    optimizer='adam',
    learning_rate=0.001,
    epochs=150,
    val_size=0.15,
    stopping_patient=20,
    dropout_rate=0.1,
    batch_size=None
)
denseAutoencoder.fit(input_data=X_train)
denseAutoencoder.summary()

############################################################################################
### Autoencoder + Clustering Integration

sg = SegmentAE(ae_model=denseAutoencoder, cl_model=cl_model)

############################################################################################
### Train Reconstruction with Type-Safe Metric (v2.0+)

sg.reconstruction(
    input_data=X_train,
    threshold_metric=ThresholdMetric.MSE  # Type-safe enum: ThresholdMetric.MAE, ThresholdMetric.RMSE, ThresholdMetric.MAX_ERROR
)

############################################################################################
### Reconstruction Performance Evaluation

results = sg.evaluation(
    input_data=X_test,
    target_col=y_test,
    threshold_ratio=2.0
)

# Access test metadata by cluster
preds_test, recon_metrics_test = sg.preds_test, sg.reconstruction_test

# View global and per-cluster metrics
print(results['global metrics'])
print(results['clusters metrics'])

############################################################################################
### Multiple Threshold Ratio Evaluation

threshold_ratios = [0.75, 1, 1.5, 2, 3, 4]

global_results = pd.concat([
    sg.evaluation(input_data=X_test, target_col=y_test, threshold_ratio=thr)["global metrics"]
    for thr in threshold_ratios
])

print("\nThreshold Optimization Results:")
print(global_results)

############################################################################################
### Anomaly Detection Predictions

best_ratio = global_results.sort_values(by="Accuracy", ascending=False).iloc[0]["Threshold Ratio"]

predictions = sg.detections(
    input_data=X_future_data,
    threshold_ratio=best_ratio
)

# Use the new metrics module for evaluation
final_metrics = metrics_classification(
    y_true=y_future_data,
    y_pred=predictions["Predicted Anomalies"]
)

print("\nFinal Performance Metrics:")
print(f"Accuracy: {final_metrics['Accuracy']}")
print(f"Precision: {final_metrics['Precision']}")
print(f"Recall: {final_metrics['Recall']}")
print(f"F1 Score: {final_metrics['F1 Score']}")
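
The workflow above selects the threshold ratio by accuracy. On imbalanced anomaly data it can be worth inspecting precision, recall, and F1 as well, since a detector that flags nothing can still score high accuracy. The sketch below illustrates this with a hand-rolled helper; `binary_metrics` is hypothetical and is not the library's `metrics_classification` function, although it reuses the same metric names for readability.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = anomaly)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "Accuracy": (tp + tn) / len(pairs),
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
    }


# Toy labels: 10 rows, of which the last 2 are anomalies.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
flags_nothing = [0] * 10                         # never raises an alert
flags_some = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]      # one false alarm, no misses

metrics_nothing = binary_metrics(y_true, flags_nothing)
metrics_some = binary_metrics(y_true, flags_some)
print(metrics_nothing)
print(metrics_some)
```

Here the detector that never alerts reaches 0.8 accuracy with zero recall, so sorting the threshold results by "F1 Score" or "Recall" instead of "Accuracy" may be a better fit for some use cases.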

Grid Search Optimizer

SegmentAE includes a grid search utility, the SegmentAE_Optimizer class, for systematically identifying the best-performing configuration.

The optimizer evaluates combinations of:

  • Multiple autoencoders
  • Different clustering algorithms
  • Various cluster numbers
  • Different threshold ratios

Example:

from segmentae.optimization import SegmentAE_Optimizer
from segmentae.core import ClusterModel

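# Type-safe enum-based optimization
# autoencoder1 and autoencoder2 are autoencoder instances defined beforehand
# (for example, DenseAutoencoder objects as in the workflow above)
optimizer = SegmentAE_Optimizer(
    autoencoder_models=[autoencoder1, autoencoder2],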
    n_clusters_list=[2, 3, 4],
    cluster_models=[ClusterModel.KMEANS, ClusterModel.GMM, ClusterModel.MINIBATCH_KMEANS],  # Type-safe enums
    threshold_ratios=[1, 1.5, 2, 3],
    performance_metric='f1_score'  # or 'Accuracy', 'Precision', 'Recall'
)
# Note: Strings are also supported 
# cluster_models=["KMeans", "GMM", "MiniBatchKMeans"]

# Run grid search
best_model = optimizer.optimize(X_train, X_test, y_test)

# View results
print(f"Best Performance: {optimizer.best_performance}")
print("Best Configuration:")
print(f"  - Clusters: {optimizer.best_n_clusters}")
print(f"  - Threshold: {optimizer.best_threshold_ratio}")
print("\nLeaderboard:")
print(optimizer.leaderboard.head(10))

For a complete optimizer example → Optimizer Application
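
Conceptually, a grid search of this kind enumerates every combination of the four parameter lists, scores each one, and ranks the results. The sketch below shows that loop with `itertools.product` and a placeholder scoring function. `grid_search` and `toy_score` are hypothetical names; the real optimizer trains and evaluates models rather than calling a formula.

```python
from itertools import product


def grid_search(autoencoders, cluster_models, n_clusters_list, threshold_ratios, score_fn):
    """Score every combination and return a leaderboard sorted best-first."""
    leaderboard = []
    for ae, cm, k, thr in product(
        autoencoders, cluster_models, n_clusters_list, threshold_ratios
    ):
        leaderboard.append({
            "autoencoder": ae,
            "cluster_model": cm,
            "n_clusters": k,
            "threshold_ratio": thr,
            "score": score_fn(ae, cm, k, thr),
        })
    leaderboard.sort(key=lambda row: row["score"], reverse=True)
    return leaderboard


def toy_score(ae, cm, k, thr):
    """Placeholder score; a real run would train, evaluate, and return e.g. F1."""
    penalty = 0.02 if ae == "dense" else 0.0
    return round(1.0 - abs(k - 3) * 0.1 - abs(thr - 2.0) * 0.05 - penalty, 4)


board = grid_search(
    autoencoders=["dense", "batchnorm"],
    cluster_models=["KMeans", "GMM"],
    n_clusters_list=[2, 3, 4],
    threshold_ratios=[1, 1.5, 2, 3],
    score_fn=toy_score,
)
print(len(board))
print(board[0])
```

The number of evaluations is the product of the list lengths (2 × 2 × 3 × 4 = 48 here), so each additional option multiplies the total cost of the search.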

Template Example Applications

1. Basic Custom Model

Use your own Keras autoencoder with SegmentAE:

  • Example: basic_model.py
  • Shows custom Sequential model integration
  • Demonstrates multiple threshold evaluation

2. Baseline Autoencoders

Use built-in DenseAutoencoder or BatchNormAutoencoder:

  • Example: baseline_models.py
  • Shows built-in autoencoder usage
  • Includes model summary and training visualization

3. Grid Search Optimization

Find optimal configuration automatically:

  • Example: optimizer_application.py
  • Evaluates multiple autoencoders and clustering configs
  • Multiple clustering algorithms
  • Generates performance leaderboard

4. Unlabeled Data Detection

Detect anomalies without ground truth labels:

Interactive Notebooks

For a more interactive experience, feel free to explore the Jupyter notebooks with step-by-step execution and guidelines:

📓 Interactive Notebooks

If you use SegmentAE in your research, please cite:

@software{segmentae2024,
  author = {Luís Fernando Santos},
  title = {SegmentAE: A Python Library for Anomaly Detection Optimization},
  year = {2024},
  publisher = {PyPI},
  url = {https://pypi.org/project/segmentae/}
}

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Luis Santos - LinkedIn
