
AutoImblearn

AutoImblearn is a comprehensive Automated Machine Learning (AutoML) system designed for imbalanced medical data, with support for classification, survival analysis, and unsupervised learning. It automates preprocessing, resampling, model selection, and hyperparameter optimization across multiple learning paradigms.



🌟 Key Features

Multiple Learning Paradigms

  • Supervised Classification: Imbalanced binary/multiclass classification
  • Survival Analysis: Time-to-event analysis with censoring
  • Unsupervised Learning: Clustering, dimensionality reduction, anomaly detection
  • Hybrid Methods: Combined resampling and classification
  • AutoML Integration: Out-of-the-box AutoML frameworks

Comprehensive Model Library (50+ Models)

  • 20+ Classifiers: Logistic Regression, SVM, Random Forest, XGBoost, Neural Networks, etc.
  • 15+ Resampling Methods: SMOTE variants, undersampling, oversampling, hybrid techniques
  • 9 Survival Models: Cox Proportional Hazards, Random Survival Forest, SVM variants
  • 6 Clustering Algorithms: KMeans, DBSCAN, Hierarchical, GMM, MeanShift, Spectral
  • 6 Dimensionality Reduction: PCA, t-SNE, UMAP, TruncatedSVD, ICA, NMF
  • 4 Anomaly Detection: IsolationForest, OneClassSVM, LOF, EllipticEnvelope
  • 5+ Imputation Methods: Mean, Median, KNN, Iterative, HyperImpute
  • 3 AutoML Frameworks: Auto-sklearn, TPOT, H2O AutoML

Advanced Capabilities

  • Automated Pipeline Search: Greedy search with budget controls
  • Docker-Based Architecture: Isolated, reproducible model training
  • Survival-Aware Processing: Handles censored data and structured survival arrays
  • Intelligent Caching: Reuses imputation results across experiments
  • K-Fold Cross-Validation: Robust performance estimation
  • Multiple Metrics: AUROC, F1, Precision, Recall, C-index, Silhouette, etc.

📦 Installation

Basic Installation

pip install AutoImblearn

Installation with Optional Dependencies

For specific use cases, install with extras:

# For web-based visualization (Django frontend)
pip install AutoImblearn[web]

# For advanced imputation methods
pip install AutoImblearn[imputer]

# For all resampling techniques
pip install AutoImblearn[resampler]

# For survival analysis
pip install AutoImblearn[survival]

# For unsupervised learning (UMAP)
pip install AutoImblearn[unsupervised]

# For all features
pip install AutoImblearn[all]

Requirements

  • Python ≥ 3.9
  • Docker (for model training)
  • scikit-learn ≥ 1.3.0
  • pandas ≥ 2.0.0
  • numpy ≥ 1.24.0

🚀 Quick Start

1. Classification Pipeline

from AutoImblearn.core.runpipe import RunPipe
from AutoImblearn.core.autoimblearn import AutoImblearn

class Args:
    dataset = "diabetes.csv"
    target = "outcome"
    path = "/data"
    metric = "auroc"
    n_splits = 5
    repeat = 0
    train_ratio = 1.0

args = Args()

# Initialize pipeline runner
run_pipe = RunPipe(args)
run_pipe.loadData()

# Run a specific pipeline: [imputer, resampler, classifier]
result = run_pipe.fit(['knn', 'smote', 'lr'])
print(f"AUROC: {result}")

# Or search for best pipeline automatically
automl = AutoImblearn(run_pipe, metric='auroc')
best_pipeline, n_evals, best_score = automl.find_best(max_iterations=50)
print(f"Best Pipeline: {best_pipeline}")
print(f"Best Score: {best_score}")

2. Survival Analysis Pipeline

# For time-to-event analysis with censored data

args.metric = "c_index"  # Concordance index for survival

# Run survival pipeline: [imputer, survival_resampler, survival_model]
result = run_pipe.fit(['median', 'rus', 'CPH'])  # Cox Proportional Hazards
print(f"C-index: {result}")

3. Unsupervised Learning Pipeline

# Clustering example
args.metric = "silhouette"

# Run clustering pipeline: [imputer, clustering_model]
result = run_pipe.fit(['knn', 'kmeans'])
print(f"Silhouette Score: {result}")

# Dimensionality reduction example
args.metric = "reconstruction"
result = run_pipe.fit(['median', 'pca'])

# Anomaly detection example
args.metric = "f1"
result = run_pipe.fit(['mean', 'isoforest'])

4. Hybrid Pipeline

# Combined resampling + classification in one step

# Run hybrid pipeline: [imputer, hybrid_method]
result = run_pipe.fit(['median', 'autosmote'])

5. AutoML Pipeline

# Pure AutoML approach (handles everything internally)

# Run AutoML: [automl_framework]
result = run_pipe.fit_automl(['autosklearn'])

๐Ÿ—๏ธ Pipeline Types

AutoImblearn supports 8 distinct pipeline types:

  • Classification: [imputer, resampler, classifier] (e.g. ['knn', 'smote', 'lr']) for imbalanced classification
  • Survival: [imputer, survival_resampler, survival_model] (e.g. ['median', 'rus', 'CPH']) for time-to-event analysis
  • Hybrid: [imputer, hybrid_method] (e.g. ['median', 'autosmote']) for combined resampling + classification
  • AutoML: [automl_framework] (e.g. ['autosklearn']) for fully automated ML
  • Clustering: [imputer, clustering_model] (e.g. ['knn', 'kmeans']) for pattern discovery
  • Reduction: [imputer, reduction_model] (e.g. ['median', 'pca']) for dimensionality reduction
  • Anomaly: [imputer, anomaly_model] (e.g. ['mean', 'isoforest']) for outlier detection
  • Survival Clustering: [imputer, survival_unsupervised] (e.g. ['median', 'survival_tree']) for risk stratification

📊 Available Models

Imputers (5)

  • mean, median, knn, iter, hyperimpute

Classifiers (20+)

Sklearn-based:

  • lr - Logistic Regression
  • svm - Support Vector Machine
  • dt - Decision Tree
  • rf - Random Forest
  • ab - AdaBoost
  • gb - Gradient Boosting
  • knn_clf - K-Nearest Neighbors
  • gnb - Gaussian Naive Bayes
  • mlp - Multi-Layer Perceptron
  • lda - Linear Discriminant Analysis
  • qda - Quadratic Discriminant Analysis

XGBoost-based:

  • xgb - XGBoost Classifier
  • xgb_rf - XGBoost Random Forest

Resamplers (15+)

Imblearn-based:

  • rus - Random Under-Sampling
  • ros - Random Over-Sampling
  • nm - Near Miss
  • cnn - Condensed Nearest Neighbor
  • enn - Edited Nearest Neighbors
  • allknn - All K-NN
  • smote_enn - SMOTE + ENN
  • smote_tomek - SMOTE + Tomek Links

SMOTE-based:

  • smote - SMOTE
  • borderline_smote - Borderline-SMOTE
  • svm_smote - SVM-SMOTE
  • adasyn - ADASYN
  • kmeans_smote - K-Means SMOTE

Survival Models (9)

  • CPH - Cox Proportional Hazards
  • RSF - Random Survival Forest
  • SVM - Survival SVM
  • KSVM - Kernel Survival SVM
  • LASSO - LASSO Cox
  • L1 - L1-penalized Cox
  • L2 - L2-penalized Cox
  • CSA - Component-wise Gradient Boosting
  • LRSF - Linear Random Survival Forest

Survival Resamplers (3)

  • rus - Random Under-Sampling (survival-aware)
  • ros - Random Over-Sampling (survival-aware)
  • smote - SMOTE (survival-aware)

Unsupervised Models

Clustering (6):

  • kmeans - K-Means Clustering
  • dbscan - DBSCAN
  • hierarchical - Agglomerative Clustering
  • gmm - Gaussian Mixture Model
  • meanshift - Mean Shift
  • spectral - Spectral Clustering

Dimensionality Reduction (6):

  • pca - Principal Component Analysis
  • tsne - t-SNE
  • umap - UMAP
  • svd - Truncated SVD
  • ica - Independent Component Analysis
  • nmf - Non-negative Matrix Factorization

Anomaly Detection (4):

  • isoforest - Isolation Forest
  • ocsvm - One-Class SVM
  • lof - Local Outlier Factor
  • elliptic - Elliptic Envelope

Survival Unsupervised (2):

  • survival_tree - Survival Tree (subgroup discovery)
  • survival_kmeans - K-Means on survival data

Hybrid Methods (2)

  • autosmote - AutoSMOTE (adaptive SMOTE with RL)
  • autorsp - Automated Resampler Selection (macro F1 reinforcement)

AutoML Frameworks (3)

  • autosklearn - Auto-sklearn
  • tpot - TPOT
  • h2o - H2O AutoML

๐Ÿ›๏ธ Architecture

Docker-Based Design

AutoImblearn uses a client-server architecture where each model runs in an isolated Docker container:

┌─────────────────┐
│  Python Client  │  ←→  Flask REST API in Docker
│  (run.py)       │      (Docker/app.py)
└─────────────────┘

Benefits:

  • Isolation: Each model has its own dependencies
  • Reproducibility: Consistent environment across machines
  • Scalability: Easy to deploy on clusters
  • Security: Sandboxed execution

Pipeline Execution Flow

1. Data Loading
   ↓
2. K-Fold Splitting (on raw data)
   ↓
3. For each fold:
   a. Imputation (FIT on train, TRANSFORM both)
   b. Resampling (ONLY on train)
   c. Model Training
   d. Prediction & Evaluation
   ↓
4. Average Results
   ↓
5. Save & Cache
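The per-fold discipline above (fit the imputer on the training split only, resample only the training split) can be sketched with plain scikit-learn pieces. This is a minimal illustration, not AutoImblearn's internals: the naive random oversampler below stands in for the package's Docker-backed resamplers, and the synthetic data stands in for a real dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data with missing values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)  # minority class ~25%
X[rng.random(X.shape) < 0.1] = np.nan                              # 10% missing entries

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    X_tr, X_te, y_tr, y_te = X[train_idx], X[test_idx], y[train_idx], y[test_idx]

    # a. Imputation: FIT on train only, TRANSFORM both splits
    imputer = SimpleImputer(strategy="median").fit(X_tr)
    X_tr, X_te = imputer.transform(X_tr), imputer.transform(X_te)

    # b. Resampling: applied ONLY to the training split
    minority = np.flatnonzero(y_tr == 1)
    extra = rng.choice(minority, size=np.sum(y_tr == 0) - minority.size, replace=True)
    X_tr, y_tr = np.vstack([X_tr, X_tr[extra]]), np.concatenate([y_tr, y_tr[extra]])

    # c/d. Train, predict, evaluate on the untouched test split
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print(f"Mean AUROC over folds: {np.mean(scores):.3f}")
```

Resampling before splitting would leak synthetic copies of test samples into training, which is why step 3b only ever sees the training fold.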

Intelligent Caching

Imputation results are cached per fold to avoid redundant computation:

import os
import pickle

# Cached file: interim/{dataset}/imp_{imputer}_fold{n}.p
cache_path = f"interim/{dataset}/imp_{imputer}_fold{fold}.p"
if os.path.exists(cache_path):
    with open(cache_path, "rb") as f:
        imputed = pickle.load(f)   # fast path: reuse cached result
else:
    imputed = run_imputation()     # slow path: compute once
    with open(cache_path, "wb") as f:
        pickle.dump(imputed, f)    # cache for later runs

🔧 Configuration

Metrics Supported

Classification:

  • auroc - Area Under ROC Curve
  • f1 - F1 Score
  • precision - Precision
  • recall - Recall
  • accuracy - Accuracy

Survival:

  • c_index - Concordance Index
  • c_uno - Uno's C-index

Unsupervised:

  • silhouette - Silhouette Score (clustering)
  • calinski - Calinski-Harabasz Index (clustering)
  • davies_bouldin - Davies-Bouldin Index (clustering)
  • reconstruction - Reconstruction Error (reduction)
  • log_rank - Log-rank Test (survival clustering)
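The metric names above are AutoImblearn's own keys, but most map onto standard scikit-learn scorers. A minimal sketch, assuming the standard definitions apply (the synthetic data and models here are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_blobs, make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, silhouette_score, davies_bouldin_score

# Classification metrics: auroc / f1
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
print(f"auroc = {roc_auc_score(y, proba):.3f}, f1 = {f1_score(y, proba > 0.5):.3f}")

# Unsupervised metrics: silhouette / davies_bouldin (need no ground-truth labels)
Xc, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xc)
print(f"silhouette = {silhouette_score(Xc, labels):.3f}, "
      f"davies_bouldin = {davies_bouldin_score(Xc, labels):.3f}")
```

Note the direction of each metric when comparing pipelines: higher is better for auroc, f1, and silhouette, while lower is better for davies_bouldin.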

Search Budget Controls

automl.find_best(
    max_iterations=100,           # Max pipeline evaluations
    time_budget_seconds=3600,     # Max time (1 hour)
    early_stopping_patience=10    # Stop if no improvement
)

๐ŸŒ Web Interface

AutoImblearn includes a Django web frontend for interactive pipeline configuration:

Features:

  • Visual Pipeline Builder: Drag-and-drop interface
  • Dataset Upload: CSV file handling
  • Feature Analysis: Distribution plots and categorical detection
  • Pipeline Type Selection: Choose from 8 pipeline types
  • Model Selection: Multi-select from available models
  • Training Dashboard: Real-time progress tracking
  • Results Visualization: Performance metrics and comparisons

Launch Web Interface:

cd django_frontend
python manage.py runserver

Navigate to http://localhost:8000 to access the interface.


📚 Advanced Usage

Custom Pipeline Search

from AutoImblearn.core.autoimblearn import AutoImblearn

# Restrict search space
automl.imputers = ['knn', 'median']
automl.resamplers = ['smote', 'adasyn']
automl.classifiers = ['lr', 'rf', 'xgb']

# Run search with custom space
best_pipeline, n_evals, best_score = automl.find_best(
    max_iterations=30,
    time_budget_seconds=1800
)

Survival Data Format

Survival data requires a structured array with two fields:

import numpy as np
from sksurv.util import Surv

# Create survival array
y = Surv.from_arrays(
    event=[True, False, True, False],      # Event occurred?
    time=[100, 200, 150, 300]              # Time to event/censoring
)

# Structured array format:
# dtype=[('Status', bool), ('Survival_in_days', float)]
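If scikit-survival is not at hand, the same structured layout can be built directly with NumPy; the field names below follow the dtype comment above:

```python
import numpy as np

# Structured survival array: dtype=[('Status', bool), ('Survival_in_days', float)]
y = np.array(
    [(True, 100.0), (False, 200.0), (True, 150.0), (False, 300.0)],
    dtype=[("Status", bool), ("Survival_in_days", float)],
)

events = y["Status"]              # event indicator per sample
times = y["Survival_in_days"]     # observed or censored time
print(times[events].mean())       # mean time among uncensored samples -> 125.0
```

Field access by name (`y["Status"]`) is what makes structured arrays convenient for survival models, which need both the event flag and the time in a single `y`.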

Direct Model Usage

from AutoImblearn.pipelines import classifiers, resamplers, imputers

# Instantiate specific models
imputer_factory = imputers['knn']
imputer = imputer_factory(data_folder='/data')

resampler_factory = resamplers['smote']
resampler = resampler_factory(data_folder='/data')

classifier_factory = classifiers['lr']
classifier = classifier_factory(data_folder='/data')

# Use models
X_train_imputed = imputer.fit_transform(args, X_train)
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train_imputed, y_train)
classifier.fit(X_train_resampled, y_train_resampled)
predictions = classifier.predict(X_test)

๐Ÿ› Development

Project Structure

AutoImblearn/
├── components/
│   ├── classifiers/          # Classification models
│   ├── resamplers/           # Resampling techniques
│   ├── imputers/             # Imputation methods
│   ├── survival/             # Survival analysis models
│   │   ├── _supervised/      # Survival models (CPH, RSF, etc.)
│   │   ├── _resamplers/      # Survival-aware resampling
│   │   └── _unsupervised/    # Survival clustering
│   ├── unsupervised/         # Unsupervised learning
│   │   ├── _clustering/      # Clustering algorithms
│   │   ├── _reduction/       # Dimensionality reduction
│   │   └── _anomaly/         # Anomaly detection
│   ├── automls/              # AutoML frameworks
│   ├── hybrids/              # Hybrid methods
│   └── api/                  # Base API classes
├── core/
│   ├── runpipe.py            # Pipeline execution
│   ├── autoimblearn.py       # AutoML search
│   └── pipeline_strategies.py # Strategy pattern
├── pipelines/                # Pipeline wrappers
├── processing/               # Data preprocessing utilities
└── utils/                    # Helper functions

Building Docker Images

Each model has its own Dockerfile:

# Build a specific model image
cd AutoImblearn/components/classifiers/_sklearnbased
docker build -t sklearn-classifier-api .

# Build all images
cd AutoImblearn
./build_all_images.sh  # If script exists

Running Tests

# Install dev dependencies
pip install AutoImblearn[dev]

# Run tests
pytest tests/

# Run with coverage
pytest --cov=AutoImblearn tests/

📖 Citation

If you use AutoImblearn in your research, please cite:

@software{autoimblearn2024,
  title = {AutoImblearn: Automated Machine Learning for Imbalanced Medical Data},
  author = {Wang, Hank},
  year = {2024},
  version = {0.3.0},
  url = {https://github.com/Wanghongkua/Auto-Imblearn2}
}

📄 License

This project is licensed under the BSD 3-Clause License. See LICENSE for details.


๐Ÿค Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ™ Acknowledgments

  • Built on top of scikit-learn, imbalanced-learn, and scikit-survival
  • Docker-based architecture inspired by microservices design patterns
  • AutoML search adapted from CASH (Combined Algorithm Selection and Hyperparameter optimization)

📧 Contact

Author: Hank Wang
Email: hankwang1991@gmail.com

For bug reports and feature requests, please use the GitHub Issues page.


Happy AutoML-ing! 🚀
