SegmentAE: A Python Library for Anomaly Detection Optimization
Framework Overview
SegmentAE improves anomaly detection on tabular data by combining clustering methods with tabular autoencoders and optimizing the reconstruction error within each cluster. It targets anomaly detection applications in domains such as financial fraud detection, network security, and industrial monitoring.
Key Architectural Features (v2.0+)
- Professional Architecture: Clean separation of concerns across the preprocessing, clustering, and autoencoder components
- Type Safety: Comprehensive Pydantic validation and type hints throughout
- Design Patterns: Registry, Strategy, and Template Method patterns
- Enum-Based Configuration: Type-safe constants for all parameters
- Custom Exceptions: Informative error messages with actionable suggestions
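As a rough illustration of enum-based configuration combined with informative errors, the sketch below accepts either an enum member or its string value and fails early with a list of valid options. The names `ScalerKind` and `coerce_scaler` are hypothetical and chosen for this example only; they are not part of the SegmentAE API.

```python
from enum import Enum


class ScalerKind(str, Enum):
    """Hypothetical enum for illustration; not the library's own ScalerType."""
    MINMAX = "MinMax"
    STANDARD = "Standard"
    ROBUST = "Robust"


def coerce_scaler(value):
    """Accept an enum member or its string value; reject anything else early."""
    if isinstance(value, ScalerKind):
        return value
    try:
        return ScalerKind(value)
    except ValueError:
        options = ", ".join(member.value for member in ScalerKind)
        raise ValueError(
            f"Unknown scaler {value!r}; expected one of: {options}"
        ) from None


# String input resolves to the matching enum member.
assert coerce_scaler("MinMax") is ScalerKind.MINMAX
# Enum input passes through unchanged.
assert coerce_scaler(ScalerKind.ROBUST) is ScalerKind.ROBUST
```

Resolving strings to enum members at the boundary means the rest of a pipeline only ever sees validated values.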
Key Features and Capabilities
1. General Applicability on Tabular Datasets
SegmentAE handles a wide range of tabular datasets, so the same pipeline can be applied to anomaly detection tasks across different use cases.
2. Optimization and Customization
The framework offers complete configurability for each component of the anomaly detection pipeline, including:
- Data Preprocessing: Encoding, scaling, and imputation with Pydantic validation
- Clustering Algorithms: Registry-based clustering with easy extensibility
- Autoencoder Integration: Support for custom Keras/TensorFlow models or built-in implementations
Each component can be fine-tuned to achieve optimal performance tailored to specific use cases.
3. Enhanced Detection Performance
By combining clustering algorithms with tabular autoencoders, SegmentAE aims to improve the accuracy and reliability of anomaly detection. Clustering separates the input data into groups with different patterns, and the reconstruction error is then optimized for each cluster rather than for the dataset as a whole, which is intended to improve predictive performance.
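The idea of a per-cluster reconstruction threshold can be sketched in plain Python. This sketch assumes that a cluster's threshold is its mean training reconstruction error multiplied by a threshold ratio; that rule, and the helper names `cluster_thresholds` and `flag_anomalies`, are assumptions made for illustration rather than a description of SegmentAE's internals.

```python
from collections import defaultdict
from statistics import mean


def cluster_thresholds(train_errors, train_clusters, threshold_ratio):
    """Assumed rule: threshold = threshold_ratio * mean training error of the cluster."""
    grouped = defaultdict(list)
    for error, cluster in zip(train_errors, train_clusters):
        grouped[cluster].append(error)
    return {
        cluster: threshold_ratio * mean(errors)
        for cluster, errors in grouped.items()
    }


def flag_anomalies(errors, clusters, thresholds):
    """Return 1 where a row's error exceeds the threshold of its own cluster, else 0."""
    return [
        int(error > thresholds[cluster])
        for error, cluster in zip(errors, clusters)
    ]


# Illustrative numbers only: cluster 0 reconstructs well, cluster 1 is noisier.
train_errors = [0.010, 0.012, 0.008, 0.100, 0.120, 0.080]
train_clusters = [0, 0, 0, 1, 1, 1]
thresholds = cluster_thresholds(train_errors, train_clusters, threshold_ratio=2.0)

# The same error of 0.05 is anomalous in cluster 0 but normal in cluster 1.
flags = flag_anomalies([0.05, 0.05], [0, 1], thresholds)
print(flags)  # → [1, 0]
```

A single global threshold would have to treat both rows identically, which is the limitation that per-cluster thresholds are meant to address.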
Where to Get It
A binary installer for the latest released version is available on the Python Package Index (PyPI).
GitHub Project Link: https://github.com/TsLu1s/SegmentAE
Installation
To install this package from the PyPI repository, run the following command:
pip install segmentae
SegmentAE - Technical Components and Pipeline Structure
The SegmentAE framework consists of several integrated components, each contributing to the optimization of anomaly detection through clustering and tabular autoencoders. The pipeline is modular, so each stage can be configured or replaced independently.
1. Data Preprocessing
Proper preprocessing is crucial for ensuring the quality and consistency of data. The preprocessing module now includes:
- Pydantic Validation: Automatic type checking and conversion
- Type-Safe Configuration: Enum-based parameter selection
- Missing Value Imputation: Simple statistical imputation methods
- Normalization: MinMax, Standard, and Robust scaling options
- Categorical Encoding: Inverse Frequency, Label, and One-Hot Encoding
Example:
from segmentae.preprocessing import Preprocessing
from segmentae.core import EncoderType, ScalerType
# Type-safe configuration with enums
pr = Preprocessing(
    encoder=EncoderType.IFREQUENCY,
    scaler=ScalerType.MINMAX,
    imputer="Simple"  # Strings are also supported
)
pr.fit(X_train)
X_transformed = pr.transform(X_test)
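The fit-then-transform split in the example above matters because scaling statistics should come from the training data only. A minimal, library-independent sketch of min-max scaling follows; `MinMaxColumnScaler` is a toy class written for this illustration and is not SegmentAE's implementation.

```python
class MinMaxColumnScaler:
    """Toy min-max scaler for a list of numeric rows (illustration only)."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.mins_ = [min(col) for col in columns]
        # Fall back to 1.0 for constant columns to avoid division by zero.
        self.ranges_ = [(max(col) - min(col)) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - lo) / span
             for value, lo, span in zip(row, self.mins_, self.ranges_)]
            for row in rows
        ]


train_rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test_rows = [[2.5, 40.0]]

scaler = MinMaxColumnScaler().fit(train_rows)
scaled_test = scaler.transform(test_rows)
print(scaled_test)  # → [[0.25, 1.5]]
```

The second test value scales to 1.5 because it lies outside the range seen during fitting; the scaler does not refit on test data.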
2. Clustering
Clustering forms the backbone of the SegmentAE framework and is designed to be easily extended:
- Registry Pattern: Clean model registration and instantiation
- Type Safety: Pydantic validation for all parameters
- Four Algorithms: K-Means, MiniBatch K-Means, Gaussian Mixture, Agglomerative
- Extensible Design: Easy to add new clustering algorithms
Example:
from segmentae.clustering import Clustering
from segmentae.core import ClusterModel
cl = Clustering(
    cluster_model=[ClusterModel.KMEANS],  # Enum-based
    n_clusters=3
)
cl.clustering_fit(X_train)
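For readers unfamiliar with the registry pattern mentioned above, here is a minimal generic sketch. The registry, the decorator `register_clusterer`, the factory `create_clusterer`, and the toy `ThresholdSplit` class are all hypothetical names for this example and do not reflect SegmentAE's internal code.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a name to a class; not SegmentAE's internals.
_CLUSTERER_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_clusterer(name: str):
    """Decorator that records a clusterer class under the given name."""
    def decorator(cls):
        if name in _CLUSTERER_REGISTRY:
            raise KeyError(f"Clusterer {name!r} is already registered")
        _CLUSTERER_REGISTRY[name] = cls
        return cls
    return decorator


def create_clusterer(name: str, **params):
    """Instantiate a registered clusterer, listing valid names on failure."""
    try:
        factory = _CLUSTERER_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_CLUSTERER_REGISTRY))
        raise ValueError(f"Unknown clusterer {name!r}; registered: {known}") from None
    return factory(**params)


@register_clusterer("ThresholdSplit")
class ThresholdSplit:
    """Toy one-dimensional clusterer: values below the cut go to 0, others to 1."""

    def __init__(self, cut: float = 0.5):
        self.cut = cut

    def predict(self, values):
        return [int(value >= self.cut) for value in values]


model = create_clusterer("ThresholdSplit", cut=0.3)
print(model.predict([0.1, 0.3, 0.9]))  # → [0, 1, 1]
```

Adding a new algorithm under this design only requires defining a class and decorating it; the factory code stays unchanged.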
3. Anomaly Detection - Autoencoders
The core of the SegmentAE framework employs advanced autoencoder architectures:
- Three Baseline Implementations: Dense, BatchNorm, and Ensemble autoencoders
- Custom Model Support: Integrate any Keras/TensorFlow model
- Full Customization: Network architecture, training epochs, activation layers, and more
- Type-Safe Integration: Validated through protocols
Each baseline autoencoder exposes its network architecture, training parameters, and activation functions as configurable arguments.
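An autoencoder flags anomalies through the gap between a row and its reconstruction. The sketch below computes that gap per row using four common error measures (MSE, MAE, RMSE, and maximum absolute error). The function `row_errors` is a hypothetical helper written for this illustration, not a SegmentAE function.

```python
from math import sqrt


def row_errors(original, reconstructed):
    """Return MSE, MAE, RMSE and max absolute error for one row and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("Row and reconstruction must have the same number of features")
    diffs = [abs(x - x_hat) for x, x_hat in zip(original, reconstructed)]
    mse = sum(d * d for d in diffs) / len(diffs)
    return {
        "MSE": mse,
        "MAE": sum(diffs) / len(diffs),
        "RMSE": sqrt(mse),
        "MAX_ERROR": max(diffs),
    }


# Two of the four features are reconstructed with an error of 0.5.
row = [0.0, 1.0, 0.5, 0.5]
reconstruction = [0.0, 0.5, 0.5, 1.0]
errors = row_errors(row, reconstruction)
print(errors)
```

MSE and RMSE penalize a few large deviations more heavily than MAE, while the maximum error reacts to a single badly reconstructed feature.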
Custom Model Integration:
You can build your own Keras-based autoencoder and integrate it into the SegmentAE pipeline; see basic_model.py under Template Example Applications.
Unlabeled Data Support:
An application example for fully unlabeled data is provided in unlabeled_application.py under Template Example Applications.
SegmentAE - Predictive Application
The following example demonstrates the complete workflow from data loading to anomaly detection using a DenseAutoencoder integrated with KMeans clustering.
import pandas as pd
from segmentae import SegmentAE, Preprocessing, Clustering
from segmentae.autoencoders import DenseAutoencoder
from segmentae.core import EncoderType, ScalerType, ClusterModel, ThresholdMetric
from segmentae.data_sources import load_dataset
from segmentae.metrics import metrics_classification
from sklearn.model_selection import train_test_split
############################################################################################
### Data Loading
train, test, target = load_dataset(
    dataset_selection='htru2_dataset',
    split_ratio=0.75
)
test, future_data = train_test_split(test, train_size=0.9, random_state=5)
# Reset indices (required)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
future_data = future_data.reset_index(drop=True)
# Separate features and targets
X_train, y_train = train.drop(columns=[target]).copy(), train[target].astype(int)
X_test, y_test = test.drop(columns=[target]).copy(), test[target].astype(int)
X_future_data = future_data.drop(columns=[target]).copy()
y_future_data = future_data[target].astype(int)
############################################################################################
### Preprocessing with Type-Safe Configuration (v2.0+)
pr = Preprocessing(
    encoder=EncoderType.IFREQUENCY,  # Alternatives: EncoderType.LABEL, EncoderType.ONEHOT
    scaler=ScalerType.MINMAX,        # Alternatives: ScalerType.STANDARD, ScalerType.ROBUST
    imputer=None
)
pr.fit(X=X_train)
X_train = pr.transform(X=X_train)
X_test = pr.transform(X=X_test)
X_future_data = pr.transform(X=X_future_data)
############################################################################################
### Clustering Implementation with Type-Safe Registry
cl_model = Clustering(
    # Alternatives: ClusterModel.MINIBATCH_KMEANS, ClusterModel.GMM, ClusterModel.AGGLOMERATIVE
    cluster_model=[ClusterModel.KMEANS],
    n_clusters=3
)
cl_model.clustering_fit(X=X_train)
############################################################################################
### Autoencoder Implementation
denseAutoencoder = DenseAutoencoder(
    hidden_dims=[16, 12, 8, 4],
    encoder_activation='relu',
    decoder_activation='relu',
    optimizer='adam',
    learning_rate=0.001,
    epochs=150,
    val_size=0.15,
    stopping_patient=20,
    dropout_rate=0.1,
    batch_size=None
)
denseAutoencoder.fit(input_data=X_train)
denseAutoencoder.summary()
############################################################################################
### Autoencoder + Clustering Integration
sg = SegmentAE(ae_model=denseAutoencoder, cl_model=cl_model)
############################################################################################
### Train Reconstruction with Type-Safe Metric (v2.0+)
sg.reconstruction(
    input_data=X_train,
    # Alternatives: ThresholdMetric.MAE, ThresholdMetric.RMSE, ThresholdMetric.MAX_ERROR
    threshold_metric=ThresholdMetric.MSE
)
############################################################################################
### Reconstruction Performance Evaluation
results = sg.evaluation(
    input_data=X_test,
    target_col=y_test,
    threshold_ratio=2.0
)
# Access test metadata by cluster
preds_test, recon_metrics_test = sg.preds_test, sg.reconstruction_test
# View global and per-cluster metrics
print(results['global metrics'])
print(results['clusters metrics'])
############################################################################################
### Multiple Threshold Ratio Evaluation
threshold_ratios = [0.75, 1, 1.5, 2, 3, 4]
global_results = pd.concat([
    sg.evaluation(input_data=X_test, target_col=y_test, threshold_ratio=thr)["global metrics"]
    for thr in threshold_ratios
])
print("\nThreshold Optimization Results:")
print(global_results)
############################################################################################
### Anomaly Detection Predictions
best_ratio = global_results.sort_values(by="Accuracy", ascending=False).iloc[0]["Threshold Ratio"]
predictions = sg.detections(
    input_data=X_future_data,
    threshold_ratio=best_ratio
)
# Use the new metrics module for evaluation
final_metrics = metrics_classification(
    y_true=y_future_data,
    y_pred=predictions["Predicted Anomalies"]
)
print("\nFinal Performance Metrics:")
print(f"Accuracy: {final_metrics['Accuracy']}")
print(f"Precision: {final_metrics['Precision']}")
print(f"Recall: {final_metrics['Recall']}")
print(f"F1 Score: {final_metrics['F1 Score']}")
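To make the reported metrics concrete, the following sketch derives accuracy, precision, recall, and F1 from binary labels using only the standard library. It assumes anomalies are encoded as 1 and normal rows as 0. `binary_classification_metrics` is a hypothetical helper for illustration and is not the library's `metrics_classification` implementation.

```python
def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with anomalies encoded as 1."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("At least one label pair is required")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "Accuracy": (tp + tn) / len(pairs),
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
    }


# Illustrative labels: 2 true positives, 4 true negatives, 1 false positive, 1 false negative.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics["Accuracy"])  # → 0.75
```

Because anomalies are usually rare, accuracy alone can look high even when most anomalies are missed, so precision, recall, and F1 are worth inspecting when choosing a threshold ratio.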
Grid Search Optimizer
SegmentAE includes a grid search over pipeline configurations, provided by the SegmentAE_Optimizer class, to systematically identify the best-performing setup.
The optimizer evaluates combinations of:
- Multiple autoencoders
- Different clustering algorithms
- Various cluster numbers
- Different threshold ratios
Example:
from segmentae.optimization import SegmentAE_Optimizer
from segmentae.core import ClusterModel
# Type-safe enum-based optimization
# autoencoder1 and autoencoder2 are placeholders for autoencoder instances
# you have created beforehand (for example, two DenseAutoencoder configurations)
optimizer = SegmentAE_Optimizer(
    autoencoder_models=[autoencoder1, autoencoder2],
    n_clusters_list=[2, 3, 4],
    cluster_models=[ClusterModel.KMEANS, ClusterModel.GMM, ClusterModel.MINIBATCH_KMEANS],  # Type-safe enums
    threshold_ratios=[1, 1.5, 2, 3],
    performance_metric='f1_score'  # or 'Accuracy', 'Precision', 'Recall'
)
# Note: Strings are also supported
# cluster_models=["KMeans", "GMM", "MiniBatchKMeans"]
# Run grid search
best_model = optimizer.optimize(X_train, X_test, y_test)
# View results
print(f"Best Performance: {optimizer.best_performance}")
print("Best Configuration:")
print(f" - Clusters: {optimizer.best_n_clusters}")
print(f" - Threshold: {optimizer.best_threshold_ratio}")
print("\nLeaderboard:")
print(optimizer.leaderboard.head(10))
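Conceptually, a grid search of this kind scores every combination of the listed options and ranks the results. The sketch below shows that idea with `itertools.product` and a toy scoring function. `grid_search` and `toy_score` are hypothetical names for this illustration, and the scoring function stands in for a real train-and-evaluate step; this is not SegmentAE_Optimizer's implementation.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return rows sorted best-first."""
    names = list(param_grid)
    leaderboard = []
    for values in product(*(param_grid[name] for name in names)):
        config = dict(zip(names, values))
        leaderboard.append({**config, "score": score_fn(**config)})
    leaderboard.sort(key=lambda row: row["score"], reverse=True)
    return leaderboard


def toy_score(n_clusters, threshold_ratio):
    """Stand-in for a real evaluation; peaks at n_clusters=3, threshold_ratio=2."""
    return 1.0 - 0.1 * abs(n_clusters - 3) - 0.05 * abs(threshold_ratio - 2)


grid = {"n_clusters": [2, 3, 4], "threshold_ratio": [1, 1.5, 2, 3]}
board = grid_search(grid, toy_score)

print(len(board))  # 3 * 4 = 12 combinations
print(board[0])    # best-scoring configuration
```

The number of evaluations grows multiplicatively with each added option, so keeping each list short is a practical way to bound runtime.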
For a complete optimizer example, see optimizer_application.py under Template Example Applications.
Template Example Applications
1. Basic Custom Model
Use your own Keras autoencoder with SegmentAE:
- Example: basic_model.py
- Shows custom Sequential model integration
- Demonstrates multiple threshold evaluation
2. Baseline Autoencoders
Use built-in DenseAutoencoder or BatchNormAutoencoder:
- Example: baseline_models.py
- Shows built-in autoencoder usage
- Includes model summary and training visualization
3. Grid Search Optimization
Find optimal configuration automatically:
- Example: optimizer_application.py
- Evaluates multiple autoencoders, clustering algorithms, and cluster counts
- Generates performance leaderboard
4. Unlabeled Data Detection
Detect anomalies without ground truth labels:
- Example: unlabeled_application.py
- Shows reconstruction-only workflow
- Useful for production deployment
Interactive Notebooks
For a more interactive experience, explore the project's Jupyter notebooks, which offer step-by-step execution and guidance.
If you use SegmentAE in your research, please cite:
@software{segmentae2024,
  author = {Luís Fernando Santos},
  title = {SegmentAE: A Python Library for Anomaly Detection Optimization},
  year = {2024},
  publisher = {PyPI},
  url = {https://pypi.org/project/segmentae/}
}
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Luis Santos - LinkedIn