Automated machine learning system for imbalanced medical data with survival analysis, unsupervised learning, and hyperparameter optimization

AutoImblearn

AutoImblearn is a comprehensive Automated Machine Learning (AutoML) system designed for imbalanced medical data with support for classification, survival analysis, and unsupervised learning. It automates the selection of preprocessing techniques, resampling strategies, model selection, and hyperparameter optimization across multiple learning paradigms.


🌟 Key Features

Multiple Learning Paradigms

  • Supervised Classification: Imbalanced binary/multiclass classification
  • Survival Analysis: Time-to-event analysis with censoring
  • Unsupervised Learning: Clustering, dimensionality reduction, anomaly detection
  • Hybrid Methods: Combined resampling and classification
  • AutoML Integration: Out-of-the-box AutoML frameworks

Comprehensive Model Library (50+ Models)

  • 20+ Classifiers: Logistic Regression, SVM, Random Forest, XGBoost, Neural Networks, etc.
  • 15+ Resampling Methods: SMOTE variants, undersampling, oversampling, hybrid techniques
  • 9 Survival Models: Cox Proportional Hazards, Random Survival Forest, SVM variants
  • 6 Clustering Algorithms: KMeans, DBSCAN, Hierarchical, GMM, MeanShift, Spectral
  • 6 Dimensionality Reduction: PCA, t-SNE, UMAP, TruncatedSVD, ICA, NMF
  • 4 Anomaly Detection: IsolationForest, OneClassSVM, LOF, EllipticEnvelope
  • 5+ Imputation Methods: Mean, Median, KNN, Iterative, HyperImpute
  • 3 AutoML Frameworks: Auto-sklearn, TPOT, H2O AutoML

Advanced Capabilities

  • Automated Pipeline Search: Greedy search with budget controls
  • Docker-Based Architecture: Isolated, reproducible model training
  • Survival-Aware Processing: Handles censored data and structured survival arrays
  • Intelligent Caching: Reuses imputation results across experiments
  • K-Fold Cross-Validation: Robust performance estimation
  • Multiple Metrics: AUROC, F1, Precision, Recall, C-index, Silhouette, etc.

📦 Installation

Basic Installation

pip install AutoImblearn

Installation with Optional Dependencies

For specific use cases, install with extras:

# Extras are quoted so that shells such as zsh do not expand the brackets

# For web-based visualization (Django frontend)
pip install "AutoImblearn[web]"

# For advanced imputation methods
pip install "AutoImblearn[imputer]"

# For all resampling techniques
pip install "AutoImblearn[resampler]"

# For survival analysis
pip install "AutoImblearn[survival]"

# For unsupervised learning (UMAP)
pip install "AutoImblearn[unsupervised]"

# For all features
pip install "AutoImblearn[all]"

Requirements

  • Python ≥ 3.9
  • Docker (for model training)
  • scikit-learn ≥ 1.3.0
  • pandas ≥ 2.0.0
  • numpy ≥ 1.24.0

🚀 Quick Start

1. Classification Pipeline

from AutoImblearn.core.runpipe import RunPipe
from AutoImblearn.core.autoimblearn import AutoImblearn

class Args:
    dataset = "diabetes.csv"
    target = "outcome"
    path = "/data"
    metric = "auroc"
    n_splits = 5
    repeat = 0
    train_ratio = 1.0

args = Args()

# Initialize pipeline runner
run_pipe = RunPipe(args)
run_pipe.loadData()

# Run a specific pipeline: [imputer, resampler, classifier]
result = run_pipe.fit(['knn', 'smote', 'lr'])
print(f"AUROC: {result}")

# Or search for best pipeline automatically
automl = AutoImblearn(run_pipe, metric='auroc')
best_pipeline, n_evals, best_score = automl.find_best(max_iterations=50)
print(f"Best Pipeline: {best_pipeline}")
print(f"Best Score: {best_score}")

2. Survival Analysis Pipeline

# For time-to-event analysis with censored data

args.metric = "c_index"  # Concordance index for survival

# Run survival pipeline: [imputer, survival_resampler, survival_model]
result = run_pipe.fit(['median', 'rus', 'CPH'])  # Cox Proportional Hazards
print(f"C-index: {result}")

3. Unsupervised Learning Pipeline

# Clustering example
args.metric = "silhouette"

# Run clustering pipeline: [imputer, clustering_model]
result = run_pipe.fit(['knn', 'kmeans'])
print(f"Silhouette Score: {result}")

# Dimensionality reduction example
args.metric = "reconstruction"
result = run_pipe.fit(['median', 'pca'])

# Anomaly detection example
args.metric = "f1"
result = run_pipe.fit(['mean', 'isoforest'])

4. Hybrid Pipeline

# Combined resampling + classification in one step

# Run hybrid pipeline: [imputer, hybrid_method]
result = run_pipe.fit(['median', 'autosmote'])

5. AutoML Pipeline

# Pure AutoML approach (handles everything internally)

# Run AutoML: [automl_framework]
result = run_pipe.fit_automl(['autosklearn'])

๐Ÿ—๏ธ Pipeline Types

AutoImblearn supports 8 distinct pipeline types:

| Pipeline Type | Structure | Example | Use Case |
|---|---|---|---|
| Classification | [imputer, resampler, classifier] | ['knn', 'smote', 'lr'] | Imbalanced classification |
| Survival | [imputer, survival_resampler, survival_model] | ['median', 'rus', 'CPH'] | Time-to-event analysis |
| Hybrid | [imputer, hybrid_method] | ['median', 'autosmote'] | Combined resampling+classification |
| AutoML | [automl_framework] | ['autosklearn'] | Automated ML |
| Clustering | [imputer, clustering_model] | ['knn', 'kmeans'] | Pattern discovery |
| Reduction | [imputer, reduction_model] | ['median', 'pca'] | Dimensionality reduction |
| Anomaly | [imputer, anomaly_model] | ['mean', 'isoforest'] | Outlier detection |
| Survival Clustering | [imputer, survival_unsupervised] | ['median', 'survival_tree'] | Risk stratification |
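The stage layouts in the table can be written down as a small lookup for sanity-checking a pipeline list before passing it to `run_pipe.fit`. This is only a sketch: `PIPELINE_STRUCTURES` and `label_stages` are hypothetical names used for illustration and are not part of the AutoImblearn API.

```python
# Stage layout per pipeline type, copied from the table above.
# PIPELINE_STRUCTURES and label_stages are illustrative names, not AutoImblearn API.
PIPELINE_STRUCTURES = {
    "classification": ["imputer", "resampler", "classifier"],
    "survival": ["imputer", "survival_resampler", "survival_model"],
    "hybrid": ["imputer", "hybrid_method"],
    "automl": ["automl_framework"],
    "clustering": ["imputer", "clustering_model"],
    "reduction": ["imputer", "reduction_model"],
    "anomaly": ["imputer", "anomaly_model"],
    "survival_clustering": ["imputer", "survival_unsupervised"],
}


def label_stages(pipeline_type, pipeline):
    """Pair each model name with its stage; raise if the list has the wrong length."""
    stages = PIPELINE_STRUCTURES[pipeline_type]
    if len(pipeline) != len(stages):
        raise ValueError(
            f"{pipeline_type} expects {len(stages)} stages {stages}, "
            f"got {len(pipeline)}: {pipeline}"
        )
    return dict(zip(stages, pipeline))


labelled = label_stages("classification", ["knn", "smote", "lr"])
print(labelled)  # {'imputer': 'knn', 'resampler': 'smote', 'classifier': 'lr'}
```

A check like this catches a missing or extra stage early, before any Docker container is started.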

📊 Available Models

Imputers (5)

  • mean, median, knn, iter, hyperimpute

Classifiers (20+)

Sklearn-based:

  • lr - Logistic Regression
  • svm - Support Vector Machine
  • dt - Decision Tree
  • rf - Random Forest
  • ab - AdaBoost
  • gb - Gradient Boosting
  • knn_clf - K-Nearest Neighbors
  • gnb - Gaussian Naive Bayes
  • mlp - Multi-Layer Perceptron
  • lda - Linear Discriminant Analysis
  • qda - Quadratic Discriminant Analysis

XGBoost-based:

  • xgb - XGBoost Classifier
  • xgb_rf - XGBoost Random Forest

Resamplers (15+)

Imblearn-based:

  • rus - Random Under-Sampling
  • ros - Random Over-Sampling
  • nm - Near Miss
  • cnn - Condensed Nearest Neighbor
  • enn - Edited Nearest Neighbors
  • allknn - All K-NN
  • smote_enn - SMOTE + ENN
  • smote_tomek - SMOTE + Tomek Links

SMOTE-based:

  • smote - SMOTE
  • borderline_smote - Borderline-SMOTE
  • svm_smote - SVM-SMOTE
  • adasyn - ADASYN
  • kmeans_smote - K-Means SMOTE

Survival Models (9)

  • CPH - Cox Proportional Hazards
  • RSF - Random Survival Forest
  • SVM - Survival SVM
  • KSVM - Kernel Survival SVM
  • LASSO - LASSO Cox
  • L1 - L1-penalized Cox
  • L2 - L2-penalized Cox
  • CSA - Component-wise Gradient Boosting
  • LRSF - Linear Random Survival Forest

Survival Resamplers (3)

  • rus - Random Under-Sampling (survival-aware)
  • ros - Random Over-Sampling (survival-aware)
  • smote - SMOTE (survival-aware)

Unsupervised Models

Clustering (6):

  • kmeans - K-Means Clustering
  • dbscan - DBSCAN
  • hierarchical - Agglomerative Clustering
  • gmm - Gaussian Mixture Model
  • meanshift - Mean Shift
  • spectral - Spectral Clustering

Dimensionality Reduction (6):

  • pca - Principal Component Analysis
  • tsne - t-SNE
  • umap - UMAP
  • svd - Truncated SVD
  • ica - Independent Component Analysis
  • nmf - Non-negative Matrix Factorization

Anomaly Detection (4):

  • isoforest - Isolation Forest
  • ocsvm - One-Class SVM
  • lof - Local Outlier Factor
  • elliptic - Elliptic Envelope

Survival Unsupervised (2):

  • survival_tree - Survival Tree (subgroup discovery)
  • survival_kmeans - K-Means on survival data

Hybrid Methods (2)

  • autosmote - AutoSMOTE (adaptive SMOTE with RL)
  • autorsp - Automated Resampler Selection

AutoML Frameworks (3)

  • autosklearn - Auto-sklearn
  • tpot - TPOT
  • h2o - H2O AutoML

๐Ÿ›๏ธ Architecture

Docker-Based Design

AutoImblearn uses a client-server architecture where each model runs in an isolated Docker container:

┌─────────────────┐
│   Python Client │  ←→  Flask REST API in Docker
│   (run.py)      │      (Docker/app.py)
└─────────────────┘

Benefits:

  • Isolation: Each model has its own dependencies
  • Reproducibility: Consistent environment across machines
  • Scalability: Easy to deploy on clusters
  • Security: Sandboxed execution

Pipeline Execution Flow

1. Data Loading
   ↓
2. K-Fold Splitting (on raw data)
   ↓
3. For each fold:
   a. Imputation (FIT on train, TRANSFORM both)
   b. Resampling (ONLY on train)
   c. Model Training
   d. Prediction & Evaluation
   ↓
4. Average Results
   ↓
5. Save & Cache
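The ordering in this flow is what prevents data leakage: the imputer is fitted on the training fold only, and resampling never touches the test fold. The sketch below illustrates steps 2 to 4 with the standard library alone. It is not AutoImblearn's implementation: the mean imputer, random oversampler, nearest-centroid model, accuracy metric, and all function names are stand-ins chosen to keep the example self-contained, and the data is synthetic.

```python
import random
from statistics import mean


def kfold_indices(n, n_splits, seed=0):
    """Shuffle row indices and deal them into n_splits test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_splits] for i in range(n_splits)]


def fit_mean_imputer(rows):
    """Learn per-column means from the given rows (None marks a missing value)."""
    return [mean(v for v in col if v is not None) for col in zip(*rows)]


def impute(rows, col_means):
    """Replace missing values with the previously learned column means."""
    return [[m if v is None else v for v, m in zip(row, col_means)] for row in rows]


def random_oversample(X, y, rng):
    """Duplicate minority-class rows until both classes have equal counts."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return X, y  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def fit_centroids(X, y):
    """Stand-in model: one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each row to the class with the nearest centroid."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist(centroids[c], x)) for x in X]


def cross_validate(X, y, n_splits=5, seed=0):
    rng = random.Random(seed)
    scores = []
    for test_idx in kfold_indices(len(y), n_splits, seed):    # 2. split raw data
        test_set = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in test_set]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        col_means = fit_mean_imputer(X_tr)                    # 3a. fit on train only
        X_tr, X_te = impute(X_tr, col_means), impute(X_te, col_means)
        X_tr, y_tr = random_oversample(X_tr, y_tr, rng)       # 3b. train fold only
        model = fit_centroids(X_tr, y_tr)                     # 3c. train
        preds = predict(model, X_te)                          # 3d. evaluate
        scores.append(mean(int(p == t) for p, t in zip(preds, y_te)))
    return mean(scores)                                       # 4. average results


# Synthetic imbalanced data: 40 negatives, 10 positives, two missing values.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 40 + [1] * 10
X[0][1] = None
X[45][0] = None

score = cross_validate(X, y)
print(f"Mean accuracy over folds: {score:.3f}")
```

Fitting the imputer or resampling before the split would let information from the test rows influence training, which tends to inflate the reported score.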

Intelligent Caching

Imputation results are cached per fold to avoid redundant computation:

# Cached file: interim/{dataset}/imp_{imputer}_fold{n}.p
if cached_file_exists:
    load_from_cache()  # Fast!
else:
    run_imputation()
    save_to_cache()
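As a runnable illustration of the same load-or-compute pattern, the sketch below uses `pickle` and `pathlib` with the file naming shown above. The function `cached_imputation` and its parameters are assumptions made for this example; AutoImblearn's actual caching code may differ.

```python
import pickle
import tempfile
from pathlib import Path


def cached_imputation(cache_dir, dataset, imputer, fold, run_imputation):
    """Return the cached result for (dataset, imputer, fold), or compute and store it."""
    cache_file = Path(cache_dir) / dataset / f"imp_{imputer}_fold{fold}.p"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)          # cache hit: skip the imputation
    result = run_imputation()               # cache miss: do the work once
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def fake_imputation():
    """Stand-in for an expensive imputation; records how often it runs."""
    calls.append(1)
    return [[1.0, 2.0], [3.0, 2.0]]


with tempfile.TemporaryDirectory() as interim:
    first = cached_imputation(interim, "diabetes", "knn", 0, fake_imputation)
    second = cached_imputation(interim, "diabetes", "knn", 0, fake_imputation)

print(first == second, len(calls))  # True 1
```

Because the key includes the fold number, a cached result is only valid while the fold assignment stays the same, for example with a fixed random seed.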

🔧 Configuration

Metrics Supported

Classification:

  • auroc - Area Under ROC Curve
  • f1 - F1 Score
  • precision - Precision
  • recall - Recall
  • accuracy - Accuracy

Survival:

  • c_index - Concordance Index
  • c_uno - Uno's C-index

Unsupervised:

  • silhouette - Silhouette Score (clustering)
  • calinski - Calinski-Harabasz Index (clustering)
  • davies_bouldin - Davies-Bouldin Index (clustering)
  • reconstruction - Reconstruction Error (reduction)
  • log_rank - Log-rank Test (survival clustering)

Search Budget Controls

automl.find_best(
    max_iterations=100,           # Max pipeline evaluations
    time_budget_seconds=3600,     # Max time (1 hour)
    early_stopping_patience=10    # Stop if no improvement
)
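How the three budgets interact can be sketched with a generic loop. This is an assumption-laden illustration rather than AutoImblearn's search code: `budgeted_search` is a hypothetical function, patience is assumed to count consecutive evaluations without improvement, and the scores are invented.

```python
import time


def budgeted_search(candidates, evaluate, max_iterations=100,
                    time_budget_seconds=3600, early_stopping_patience=10):
    """Evaluate candidates in order until any one of the three budgets runs out."""
    start = time.monotonic()
    best_pipeline, best_score = None, float("-inf")
    n_evals = 0
    stale = 0  # consecutive evaluations without improvement
    for pipeline in candidates:
        if n_evals >= max_iterations:
            break
        if time.monotonic() - start >= time_budget_seconds:
            break
        if stale >= early_stopping_patience:
            break
        score = evaluate(pipeline)
        n_evals += 1
        if score > best_score:
            best_pipeline, best_score, stale = pipeline, score, 0
        else:
            stale += 1
    return best_pipeline, n_evals, best_score


# Made-up scores standing in for real pipeline evaluations.
toy_scores = {
    ("knn", "smote", "lr"): 0.71,
    ("knn", "smote", "rf"): 0.78,
    ("knn", "adasyn", "lr"): 0.70,
    ("knn", "adasyn", "rf"): 0.74,
    ("median", "smote", "rf"): 0.80,
}

result = budgeted_search(list(toy_scores), toy_scores.get, early_stopping_patience=2)
print(result)  # (('knn', 'smote', 'rf'), 4, 0.78)
```

With a patience of 2 the loop stops after four evaluations and never reaches the 0.80 candidate, which shows the trade-off: a small patience saves time but can miss later improvements.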

๐ŸŒ Web Interface

AutoImblearn includes a Django web frontend for interactive pipeline configuration:

Features:

  • Visual Pipeline Builder: Drag-and-drop interface
  • Dataset Upload: CSV file handling
  • Feature Analysis: Distribution plots and categorical detection
  • Pipeline Type Selection: Choose from 8 pipeline types
  • Model Selection: Multi-select from available models
  • Training Dashboard: Real-time progress tracking
  • Results Visualization: Performance metrics and comparisons

Launch Web Interface:

cd django_frontend
python manage.py runserver

Navigate to http://localhost:8000 to access the interface.


📚 Advanced Usage

Custom Pipeline Search

from AutoImblearn.core.autoimblearn import AutoImblearn

# Reuse the RunPipe instance (run_pipe) created in Quick Start
automl = AutoImblearn(run_pipe, metric='auroc')

# Restrict search space
automl.imputers = ['knn', 'median']
automl.resamplers = ['smote', 'adasyn']
automl.classifiers = ['lr', 'rf', 'xgb']

# Run search with custom space
best_pipeline, n_evals, best_score = automl.find_best(
    max_iterations=30,
    time_budget_seconds=1800
)
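To get a feel for how much the restriction shrinks the search, the candidate classification pipelines can be enumerated directly. The sketch below only counts combinations of the three lists used in this example; it does not call AutoImblearn, and a greedy search may well visit fewer pipelines than the full product.

```python
from itertools import product

imputers = ["knn", "median"]
resamplers = ["smote", "adasyn"]
classifiers = ["lr", "rf", "xgb"]

# Every [imputer, resampler, classifier] combination in the restricted space
search_space = [list(p) for p in product(imputers, resamplers, classifiers)]

print(len(search_space))  # 12
print(search_space[0])    # ['knn', 'smote', 'lr']
```

With 2 x 2 x 3 = 12 combinations, a budget of 30 iterations is larger than the space itself, assuming each iteration evaluates one pipeline.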

Survival Data Format

Survival data requires a structured array with two fields:

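from sksurv.util import Surv

# Create survival array; field names are set explicitly so they match
# the dtype shown below (the sksurv defaults are 'event' and 'time')
y = Surv.from_arrays(
    event=[True, False, True, False],      # Event occurred?
    time=[100, 200, 150, 300],             # Time to event/censoring
    name_event='Status',
    name_time='Survival_in_days'
)

# Structured array format:
# dtype=[('Status', bool), ('Survival_in_days', float)]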

Direct Model Usage

from AutoImblearn.pipelines import classifiers, resamplers, imputers

# Assumes args, X_train, y_train, and X_test are already defined,
# and that X_test contains no missing values (impute it first otherwise)

# Instantiate specific models
imputer_factory = imputers['knn']
imputer = imputer_factory(data_folder='/data')

resampler_factory = resamplers['smote']
resampler = resampler_factory(data_folder='/data')

classifier_factory = classifiers['lr']
classifier = classifier_factory(data_folder='/data')

# Use models: impute, then resample the training data only, then fit
X_train_imputed = imputer.fit_transform(args, X_train)
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train_imputed, y_train)
classifier.fit(X_train_resampled, y_train_resampled)
predictions = classifier.predict(X_test)

๐Ÿ› Development

Project Structure

AutoImblearn/
├── components/
│   ├── classifiers/          # Classification models
│   ├── resamplers/           # Resampling techniques
│   ├── imputers/             # Imputation methods
│   ├── survival/             # Survival analysis models
│   │   ├── _supervised/      # Survival models (CPH, RSF, etc.)
│   │   ├── _resamplers/      # Survival-aware resampling
│   │   └── _unsupervised/    # Survival clustering
│   ├── unsupervised/         # Unsupervised learning
│   │   ├── _clustering/      # Clustering algorithms
│   │   ├── _reduction/       # Dimensionality reduction
│   │   └── _anomaly/         # Anomaly detection
│   ├── automls/              # AutoML frameworks
│   ├── hybrids/              # Hybrid methods
│   └── api/                  # Base API classes
├── core/
│   ├── runpipe.py            # Pipeline execution
│   ├── autoimblearn.py       # AutoML search
│   └── pipeline_strategies.py # Strategy pattern
├── pipelines/                # Pipeline wrappers
├── processing/               # Data preprocessing utilities
└── utils/                    # Helper functions

Building Docker Images

Each model has its own Dockerfile:

# Build a specific model image
cd AutoImblearn/components/classifiers/_sklearnbased
docker build -t sklearn-classifier-api .

# Build all images
cd AutoImblearn
./build_all_images.sh  # If script exists

Running Tests

# Install dev dependencies
pip install "AutoImblearn[dev]"

# Run tests
pytest tests/

# Run with coverage
pytest --cov=AutoImblearn tests/

📖 Citation

If you use AutoImblearn in your research, please cite:

@software{autoimblearn2024,
  title = {AutoImblearn: Automated Machine Learning for Imbalanced Medical Data},
  author = {Wang, Hank},
  year = {2024},
  version = {0.3.0},
  url = {https://github.com/Wanghongkua/Auto-Imblearn2}
}

📄 License

This project is licensed under the BSD 3-Clause License. See LICENSE for details.


๐Ÿค Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ™ Acknowledgments

  • Built on top of scikit-learn, imbalanced-learn, and scikit-survival
  • Docker-based architecture inspired by microservices design patterns
  • AutoML search adapted from CASH (Combined Algorithm Selection and Hyperparameter optimization)

📧 Contact

Author: Hank Wang
Email: hankwang1991@gmail.com

For bug reports and feature requests, please use the GitHub Issues page.


Happy AutoML-ing! 🚀
