AutoImblearn

Automated machine learning system for imbalanced medical data with survival analysis, unsupervised learning, and hyperparameter optimization.
AutoImblearn is a comprehensive Automated Machine Learning (AutoML) system designed for imbalanced medical data with support for classification, survival analysis, and unsupervised learning. It automates the selection of preprocessing techniques, resampling strategies, model selection, and hyperparameter optimization across multiple learning paradigms.
Key Features
Multiple Learning Paradigms
- Supervised Classification: Imbalanced binary/multiclass classification
- Survival Analysis: Time-to-event analysis with censoring
- Unsupervised Learning: Clustering, dimensionality reduction, anomaly detection
- Hybrid Methods: Combined resampling and classification
- AutoML Integration: Out-of-the-box AutoML frameworks
Comprehensive Model Library (50+ Models)
- 20+ Classifiers: Logistic Regression, SVM, Random Forest, XGBoost, Neural Networks, etc.
- 15+ Resampling Methods: SMOTE variants, undersampling, oversampling, hybrid techniques
- 9 Survival Models: Cox Proportional Hazards, Random Survival Forest, SVM variants
- 6 Clustering Algorithms: KMeans, DBSCAN, Hierarchical, GMM, MeanShift, Spectral
- 6 Dimensionality Reduction: PCA, t-SNE, UMAP, TruncatedSVD, ICA, NMF
- 4 Anomaly Detection: IsolationForest, OneClassSVM, LOF, EllipticEnvelope
- 5+ Imputation Methods: Mean, Median, KNN, Iterative, HyperImpute
- 3 AutoML Frameworks: Auto-sklearn, TPOT, H2O AutoML
Advanced Capabilities
- Automated Pipeline Search: Greedy search with budget controls
- Docker-Based Architecture: Isolated, reproducible model training
- Survival-Aware Processing: Handles censored data and structured survival arrays
- Intelligent Caching: Reuses imputation results across experiments
- K-Fold Cross-Validation: Robust performance estimation
- Multiple Metrics: AUROC, F1, Precision, Recall, C-index, Silhouette, etc.
Installation
Basic Installation
pip install AutoImblearn
Installation with Optional Dependencies
For specific use cases, install with extras:
# For web-based visualization (Django frontend)
pip install AutoImblearn[web]
# For advanced imputation methods
pip install AutoImblearn[imputer]
# For all resampling techniques
pip install AutoImblearn[resampler]
# For survival analysis
pip install AutoImblearn[survival]
# For unsupervised learning (UMAP)
pip install AutoImblearn[unsupervised]
# For all features
pip install AutoImblearn[all]
Requirements
- Python ≥ 3.9
- Docker (for model training)
- scikit-learn ≥ 1.3.0
- pandas ≥ 2.0.0
- numpy ≥ 1.24.0
Quick Start
1. Classification Pipeline
from AutoImblearn.core.runpipe import RunPipe
from AutoImblearn.core.autoimblearn import AutoImblearn
class Args:
    dataset = "diabetes.csv"
    target = "outcome"
    path = "/data"
    metric = "auroc"
    n_splits = 5
    repeat = 0
    train_ratio = 1.0

args = Args()
# Initialize pipeline runner
run_pipe = RunPipe(args)
run_pipe.loadData()
# Run a specific pipeline: [imputer, resampler, classifier]
result = run_pipe.fit(['knn', 'smote', 'lr'])
print(f"AUROC: {result}")
# Or search for best pipeline automatically
automl = AutoImblearn(run_pipe, metric='auroc')
best_pipeline, n_evals, best_score = automl.find_best(max_iterations=50)
print(f"Best Pipeline: {best_pipeline}")
print(f"Best Score: {best_score}")
2. Survival Analysis Pipeline
# For time-to-event analysis with censored data
args.metric = "c_index" # Concordance index for survival
# Run survival pipeline: [imputer, survival_resampler, survival_model]
result = run_pipe.fit(['median', 'rus', 'CPH']) # Cox Proportional Hazards
print(f"C-index: {result}")
3. Unsupervised Learning Pipeline
# Clustering example
args.metric = "silhouette"
# Run clustering pipeline: [imputer, clustering_model]
result = run_pipe.fit(['knn', 'kmeans'])
print(f"Silhouette Score: {result}")
# Dimensionality reduction example
args.metric = "reconstruction"
result = run_pipe.fit(['median', 'pca'])
# Anomaly detection example
args.metric = "f1"
result = run_pipe.fit(['mean', 'isoforest'])
4. Hybrid Pipeline
# Combined resampling + classification in one step
# Run hybrid pipeline: [imputer, hybrid_method]
result = run_pipe.fit(['median', 'autosmote'])
5. AutoML Pipeline
# Pure AutoML approach (handles everything internally)
# Run AutoML: [automl_framework]
result = run_pipe.fit_automl(['autosklearn'])
Pipeline Types
AutoImblearn supports 8 distinct pipeline types:
| Pipeline Type | Structure | Example | Use Case |
|---|---|---|---|
| Classification | [imputer, resampler, classifier] | ['knn', 'smote', 'lr'] | Imbalanced classification |
| Survival | [imputer, survival_resampler, survival_model] | ['median', 'rus', 'CPH'] | Time-to-event analysis |
| Hybrid | [imputer, hybrid_method] | ['median', 'autosmote'] | Combined resampling + classification |
| AutoML | [automl_framework] | ['autosklearn'] | Automated ML |
| Clustering | [imputer, clustering_model] | ['knn', 'kmeans'] | Pattern discovery |
| Reduction | [imputer, reduction_model] | ['median', 'pca'] | Dimensionality reduction |
| Anomaly | [imputer, anomaly_model] | ['mean', 'isoforest'] | Outlier detection |
| Survival Clustering | [imputer, survival_unsupervised] | ['median', 'survival_tree'] | Risk stratification |
Available Models
Imputers (5)
- mean - Mean imputation
- median - Median imputation
- knn - K-Nearest Neighbors imputation
- iter - Iterative imputation
- hyperimpute - HyperImpute
Classifiers (20+)
Sklearn-based:
- lr - Logistic Regression
- svm - Support Vector Machine
- dt - Decision Tree
- rf - Random Forest
- ab - AdaBoost
- gb - Gradient Boosting
- knn_clf - K-Nearest Neighbors
- gnb - Gaussian Naive Bayes
- mlp - Multi-Layer Perceptron
- lda - Linear Discriminant Analysis
- qda - Quadratic Discriminant Analysis
XGBoost-based:
- xgb - XGBoost Classifier
- xgb_rf - XGBoost Random Forest
Resamplers (15+)
Imblearn-based:
- rus - Random Under-Sampling
- ros - Random Over-Sampling
- nm - Near Miss
- cnn - Condensed Nearest Neighbor
- enn - Edited Nearest Neighbors
- allknn - All K-NN
- smote_enn - SMOTE + ENN
- smote_tomek - SMOTE + Tomek Links
SMOTE-based:
- smote - SMOTE
- borderline_smote - Borderline-SMOTE
- svm_smote - SVM-SMOTE
- adasyn - ADASYN
- kmeans_smote - K-Means SMOTE
Survival Models (9)
- CPH - Cox Proportional Hazards
- RSF - Random Survival Forest
- SVM - Survival SVM
- KSVM - Kernel Survival SVM
- LASSO - LASSO Cox
- L1 - L1-penalized Cox
- L2 - L2-penalized Cox
- CSA - Component-wise Gradient Boosting
- LRSF - Linear Random Survival Forest
Survival Resamplers (3)
- rus - Random Under-Sampling (survival-aware)
- ros - Random Over-Sampling (survival-aware)
- smote - SMOTE (survival-aware)
Unsupervised Models
Clustering (6):
- kmeans - K-Means Clustering
- dbscan - DBSCAN
- hierarchical - Agglomerative Clustering
- gmm - Gaussian Mixture Model
- meanshift - Mean Shift
- spectral - Spectral Clustering
Dimensionality Reduction (6):
- pca - Principal Component Analysis
- tsne - t-SNE
- umap - UMAP
- svd - Truncated SVD
- ica - Independent Component Analysis
- nmf - Non-negative Matrix Factorization
Anomaly Detection (4):
- isoforest - Isolation Forest
- ocsvm - One-Class SVM
- lof - Local Outlier Factor
- elliptic - Elliptic Envelope
Survival Unsupervised (2):
- survival_tree - Survival Tree (subgroup discovery)
- survival_kmeans - K-Means on survival data
Hybrid Methods (2)
- autosmote - AutoSMOTE (adaptive SMOTE with RL)
- autorsp - Automated Resampler Selection
AutoML Frameworks (3)
- autosklearn - Auto-sklearn
- tpot - TPOT
- h2o - H2O AutoML
Architecture
Docker-Based Design
AutoImblearn uses a client-server architecture where each model runs in an isolated Docker container:
+------------------+
|  Python Client   |  <----->  Flask REST API in Docker
|    (run.py)      |           (Docker/app.py)
+------------------+
Benefits:
- Isolation: Each model has its own dependencies
- Reproducibility: Consistent environment across machines
- Scalability: Easy to deploy on clusters
- Security: Sandboxed execution
Pipeline Execution Flow
1. Data Loading
   ↓
2. K-Fold Splitting (on raw data)
   ↓
3. For each fold:
   a. Imputation (FIT on train, TRANSFORM both)
   b. Resampling (ONLY on train)
   c. Model Training
   d. Prediction & Evaluation
   ↓
4. Average Results
   ↓
5. Save & Cache
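The leakage-safe per-fold procedure above can be sketched with plain scikit-learn. This is an illustrative standalone example, not AutoImblearn's internal code; manual random oversampling stands in for the configured resampler, and the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data with some missing values
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
X[::7, 0] = np.nan

rng = np.random.default_rng(0)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    # a. Imputation: FIT on train only, TRANSFORM both (no leakage)
    imputer = SimpleImputer(strategy="median").fit(X[train_idx])
    X_tr, X_te = imputer.transform(X[train_idx]), imputer.transform(X[test_idx])

    # b. Resampling: applied ONLY to the training fold
    y_tr = y[train_idx]
    counts = np.bincount(y_tr)
    min_idx = np.where(y_tr == counts.argmin())[0]
    extra = rng.choice(min_idx, size=counts.max() - counts.min(), replace=True)
    X_res = np.vstack([X_tr, X_tr[extra]])
    y_res = np.concatenate([y_tr, y_tr[extra]])

    # c./d. Train, predict, and evaluate on the untouched test fold
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    scores.append(roc_auc_score(y[test_idx], clf.predict_proba(X_te)[:, 1]))

mean_auroc = float(np.mean(scores))  # averaged across folds
```

Fitting the imputer on the training fold only, and resampling only the training fold, is what keeps the cross-validated score honest.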
Intelligent Caching
Imputation results are cached per fold to avoid redundant computation:
# Cached file: interim/{dataset}/imp_{imputer}_fold{n}.p
if cached_file_exists:
    load_from_cache()   # fast path: reuse the stored result
else:
    run_imputation()
    save_to_cache()
Configuration
Metrics Supported
Classification:
- auroc - Area Under ROC Curve
- f1 - F1 Score
- precision - Precision
- recall - Recall
- accuracy - Accuracy
Survival:
- c_index - Concordance Index
- c_uno - Uno's C-index
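The concordance index rewards models that assign higher risk to subjects who fail earlier. A minimal pairwise sketch in plain Python (a naive O(n²) illustration of the idea, not the optimized scikit-survival implementation):

```python
def c_index(event, time, risk):
    """Naive concordance index: among comparable pairs (the
    earlier time is an observed event, not censored), count
    pairs where the earlier-failing subject got the higher risk."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i]:  # comparable pair
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5  # risk ties count half
    return concordant / comparable

# Risk ordering exactly matches failure order -> c-index of 1.0
perfect = c_index([1, 1, 1], [1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A c-index of 0.5 corresponds to random ranking; 1.0 means every comparable pair is ranked correctly.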
Unsupervised:
- silhouette - Silhouette Score (clustering)
- calinski - Calinski-Harabasz Index (clustering)
- davies_bouldin - Davies-Bouldin Index (clustering)
- reconstruction - Reconstruction Error (reduction)
- log_rank - Log-rank Test (survival clustering)
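For the clustering metrics, scikit-learn's implementations can be used directly. A small sanity check with the silhouette score on synthetic, well-separated clusters (illustrative; AutoImblearn computes this internally):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two clearly separated blobs: silhouette should approach 1
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # in [-1, 1]; higher is better
```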
Search Budget Controls
automl.find_best(
    max_iterations=100,          # Max pipeline evaluations
    time_budget_seconds=3600,    # Max time (1 hour)
    early_stopping_patience=10   # Stop if no improvement
)
Web Interface
AutoImblearn includes a Django web frontend for interactive pipeline configuration:
Features:
- Visual Pipeline Builder: Drag-and-drop interface
- Dataset Upload: CSV file handling
- Feature Analysis: Distribution plots and categorical detection
- Pipeline Type Selection: Choose from 8 pipeline types
- Model Selection: Multi-select from available models
- Training Dashboard: Real-time progress tracking
- Results Visualization: Performance metrics and comparisons
Launch Web Interface:
cd django_frontend
python manage.py runserver
Navigate to http://localhost:8000 to access the interface.
Advanced Usage
Custom Pipeline Search
from AutoImblearn.core.autoimblearn import AutoImblearn
# Restrict search space
automl.imputers = ['knn', 'median']
automl.resamplers = ['smote', 'adasyn']
automl.classifiers = ['lr', 'rf', 'xgb']
# Run search with custom space
best_pipeline, n_evals, best_score = automl.find_best(
    max_iterations=30,
    time_budget_seconds=1800
)
Survival Data Format
Survival data requires a structured array with two fields:
import numpy as np
from sksurv.util import Surv
# Create survival array
y = Surv.from_arrays(
    event=[True, False, True, False],  # Event occurred?
    time=[100, 200, 150, 300]          # Time to event/censoring
)
# Structured array format:
# dtype=[('Status', bool), ('Survival_in_days', float)]
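The same structured layout can be built with plain NumPy when scikit-survival is not at hand, matching the dtype shown above:

```python
import numpy as np

# Structured survival array: (event indicator, time) per subject
y = np.array(
    [(True, 100.0), (False, 200.0), (True, 150.0), (False, 300.0)],
    dtype=[("Status", "?"), ("Survival_in_days", "<f8")],
)

events = y["Status"]            # boolean event indicator
times = y["Survival_in_days"]   # float time to event or censoring
```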
Direct Model Usage
from AutoImblearn.pipelines import classifiers, resamplers, imputers
# Instantiate specific models
imputer_factory = imputers['knn']
imputer = imputer_factory(data_folder='/data')
resampler_factory = resamplers['smote']
resampler = resampler_factory(data_folder='/data')
classifier_factory = classifiers['lr']
classifier = classifier_factory(data_folder='/data')
# Use models
X_train_imputed = imputer.fit_transform(args, X_train)
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train_imputed, y_train)
classifier.fit(X_train_resampled, y_train_resampled)
predictions = classifier.predict(X_test)
Development
Project Structure
AutoImblearn/
├── components/
│   ├── classifiers/           # Classification models
│   ├── resamplers/            # Resampling techniques
│   ├── imputers/              # Imputation methods
│   ├── survival/              # Survival analysis models
│   │   ├── _supervised/       # Survival models (CPH, RSF, etc.)
│   │   ├── _resamplers/       # Survival-aware resampling
│   │   └── _unsupervised/     # Survival clustering
│   ├── unsupervised/          # Unsupervised learning
│   │   ├── _clustering/       # Clustering algorithms
│   │   ├── _reduction/        # Dimensionality reduction
│   │   └── _anomaly/          # Anomaly detection
│   ├── automls/               # AutoML frameworks
│   ├── hybrids/               # Hybrid methods
│   └── api/                   # Base API classes
├── core/
│   ├── runpipe.py             # Pipeline execution
│   ├── autoimblearn.py        # AutoML search
│   └── pipeline_strategies.py # Strategy pattern
├── pipelines/                 # Pipeline wrappers
├── processing/                # Data preprocessing utilities
└── utils/                     # Helper functions
Building Docker Images
Each model has its own Dockerfile:
# Build a specific model image
cd AutoImblearn/components/classifiers/_sklearnbased
docker build -t sklearn-classifier-api .
# Build all images
cd AutoImblearn
./build_all_images.sh # If script exists
Running Tests
# Install dev dependencies
pip install AutoImblearn[dev]
# Run tests
pytest tests/
# Run with coverage
pytest --cov=AutoImblearn tests/
Citation
If you use AutoImblearn in your research, please cite:
@software{autoimblearn2024,
  title   = {AutoImblearn: Automated Machine Learning for Imbalanced Medical Data},
  author  = {Wang, Hank},
  year    = {2024},
  version = {0.3.0},
  url     = {https://github.com/Wanghongkua/Auto-Imblearn2}
}
License
This project is licensed under the BSD 3-Clause License. See LICENSE for details.
Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request
Acknowledgments
- Built on top of scikit-learn, imbalanced-learn, and scikit-survival
- Docker-based architecture inspired by microservices design patterns
- AutoML search adapted from CASH (Combined Algorithm Selection and Hyperparameter optimization)
Contact
Author: Hank Wang (hankwang1991@gmail.com)
For bug reports and feature requests, please use the GitHub Issues page.
Happy AutoML-ing!