Automated machine learning system for imbalanced medical data with survival analysis, unsupervised learning, and hyperparameter optimization

AutoImblearn

AutoImblearn is a comprehensive Automated Machine Learning (AutoML) system designed for imbalanced medical data with support for classification, survival analysis, and unsupervised learning. It automates the selection of preprocessing techniques, resampling strategies, model selection, and hyperparameter optimization across multiple learning paradigms.

🌟 Key Features

Multiple Learning Paradigms

Supervised Classification: Imbalanced binary/multiclass classification
Survival Analysis: Time-to-event analysis with censoring
Unsupervised Learning: Clustering, dimensionality reduction, anomaly detection
Hybrid Methods: Combined resampling and classification
AutoML Integration: Out-of-the-box AutoML frameworks

Comprehensive Model Library (50+ Models)

20+ Classifiers: Logistic Regression, SVM, Random Forest, XGBoost, Neural Networks, etc.
15+ Resampling Methods: SMOTE variants, undersampling, oversampling, hybrid techniques
9 Survival Models: Cox Proportional Hazards, Random Survival Forest, SVM variants
6 Clustering Algorithms: KMeans, DBSCAN, Hierarchical, GMM, MeanShift, Spectral
6 Dimensionality Reduction: PCA, t-SNE, UMAP, TruncatedSVD, ICA, NMF
4 Anomaly Detection: IsolationForest, OneClassSVM, LOF, EllipticEnvelope
5+ Imputation Methods: Mean, Median, KNN, Iterative, HyperImpute
3 AutoML Frameworks: Auto-sklearn, TPOT, H2O AutoML

Advanced Capabilities

Automated Pipeline Search: Greedy search with budget controls
Docker-Based Architecture: Isolated, reproducible model training
Survival-Aware Processing: Handles censored data and structured survival arrays
Intelligent Caching: Reuses imputation results across experiments
K-Fold Cross-Validation: Robust performance estimation
Multiple Metrics: AUROC, F1, Precision, Recall, C-index, Silhouette, etc.

📦 Installation

Basic Installation

pip install AutoImblearn

Installation with Optional Dependencies

For specific use cases, install with extras:

# For web-based visualization (Django frontend)
pip install AutoImblearn[web]

# For advanced imputation methods
pip install AutoImblearn[imputer]

# For all resampling techniques
pip install AutoImblearn[resampler]

# For survival analysis
pip install AutoImblearn[survival]

# For unsupervised learning (UMAP)
pip install AutoImblearn[unsupervised]

# For all features
pip install AutoImblearn[all]

Requirements

Python ≥ 3.9
Docker (for model training)
scikit-learn ≥ 1.3.0
pandas ≥ 2.0.0
numpy ≥ 1.24.0
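
One way to confirm that an environment meets the minimums listed above is to compare installed versions before installing. This is a minimal sketch using only the standard library; `parse_version` and `check_requirements` are illustrative helpers, not part of AutoImblearn, and the simplified parsing ignores pre-release suffixes. Docker is not checked here.

```python
import sys
from importlib import metadata

# Minimum versions copied from the Requirements list above.
MINIMUMS = {"scikit-learn": (1, 3, 0), "pandas": (2, 0, 0), "numpy": (1, 24, 0)}


def parse_version(text):
    """Turn '1.24.3' into (1, 24, 3); trailing non-digits such as 'rc1' are dropped."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return a dict mapping each requirement to 'ok', 'too old', or 'missing'."""
    report = {"python": "ok" if sys.version_info >= (3, 9) else "too old"}
    for name, minimum in minimums.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```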

🚀 Quick Start

1. Classification Pipeline

from AutoImblearn.core.runpipe import RunPipe
from AutoImblearn.core.autoimblearn import AutoImblearn

class Args:
    dataset = "diabetes.csv"
    target = "outcome"
    path = "/data"
    metric = "auroc"
    n_splits = 5
    repeat = 0
    train_ratio = 1.0

args = Args()

# Initialize pipeline runner
run_pipe = RunPipe(args)
run_pipe.loadData()

# Run a specific pipeline: [imputer, resampler, classifier]
result = run_pipe.fit(['knn', 'smote', 'lr'])
print(f"AUROC: {result}")

# Or search for best pipeline automatically
automl = AutoImblearn(run_pipe, metric='auroc')
best_pipeline, n_evals, best_score = automl.find_best(max_iterations=50)
print(f"Best Pipeline: {best_pipeline}")
print(f"Best Score: {best_score}")

2. Survival Analysis Pipeline

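# For time-to-event analysis with censored data
# (`args` and `run_pipe` are the objects created in the classification example;
# the examples in sections 3-5 reuse them as well)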

args.metric = "c_index"  # Concordance index for survival

# Run survival pipeline: [imputer, survival_resampler, survival_model]
result = run_pipe.fit(['median', 'rus', 'CPH'])  # Cox Proportional Hazards
print(f"C-index: {result}")

3. Unsupervised Learning Pipeline

# Clustering example
args.metric = "silhouette"

# Run clustering pipeline: [imputer, clustering_model]
result = run_pipe.fit(['knn', 'kmeans'])
print(f"Silhouette Score: {result}")

# Dimensionality reduction example
args.metric = "reconstruction"
result = run_pipe.fit(['median', 'pca'])

# Anomaly detection example
args.metric = "f1"
result = run_pipe.fit(['mean', 'isoforest'])

4. Hybrid Pipeline

# Combined resampling + classification in one step

# Run hybrid pipeline: [imputer, hybrid_method]
result = run_pipe.fit(['median', 'autosmote'])

5. AutoML Pipeline

# Pure AutoML approach (handles everything internally)

# Run AutoML: [automl_framework]
result = run_pipe.fit_automl(['autosklearn'])

🏗️ Pipeline Types

AutoImblearn supports 8 distinct pipeline types:

Pipeline Type	Structure	Example	Use Case
Classification	`[imputer, resampler, classifier]`	`['knn', 'smote', 'lr']`	Imbalanced classification
Survival	`[imputer, survival_resampler, survival_model]`	`['median', 'rus', 'CPH']`	Time-to-event analysis
Hybrid	`[imputer, hybrid_method]`	`['median', 'autosmote']`	Combined resampling+classification
AutoML	`[automl_framework]`	`['autosklearn']`	Automated ML
Clustering	`[imputer, clustering_model]`	`['knn', 'kmeans']`	Pattern discovery
Reduction	`[imputer, reduction_model]`	`['median', 'pca']`	Dimensionality reduction
Anomaly	`[imputer, anomaly_model]`	`['mean', 'isoforest']`	Outlier detection
Survival Clustering	`[imputer, survival_unsupervised]`	`['median', 'survival_tree']`	Risk stratification
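
The table above can also be read as data: each pipeline type fixes how many components a pipeline list has and what role each position plays. The sketch below encodes that and checks a pipeline list against it. `PIPELINE_STRUCTURES`, its lowercase keys, and `describe_pipeline` are hypothetical names chosen for illustration; they are not AutoImblearn's API.

```python
# Stage layout for each pipeline type, transcribed from the table above.
PIPELINE_STRUCTURES = {
    "classification": ("imputer", "resampler", "classifier"),
    "survival": ("imputer", "survival_resampler", "survival_model"),
    "hybrid": ("imputer", "hybrid_method"),
    "automl": ("automl_framework",),
    "clustering": ("imputer", "clustering_model"),
    "reduction": ("imputer", "reduction_model"),
    "anomaly": ("imputer", "anomaly_model"),
    "survival_clustering": ("imputer", "survival_unsupervised"),
}


def describe_pipeline(pipeline_type, pipeline):
    """Pair each component name with its stage, checking the length first."""
    stages = PIPELINE_STRUCTURES[pipeline_type]
    if len(pipeline) != len(stages):
        raise ValueError(
            f"{pipeline_type} expects {len(stages)} components "
            f"{list(stages)}, got {len(pipeline)}: {list(pipeline)}"
        )
    return dict(zip(stages, pipeline))


print(describe_pipeline("classification", ["knn", "smote", "lr"]))
print(describe_pipeline("survival", ["median", "rus", "CPH"]))
```

A check like this catches a two-element list passed where a three-stage pipeline is expected before any training starts.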

📊 Available Models

Imputers (5)

mean, median, knn, iter, hyperimpute

Classifiers (20+)

Sklearn-based:

lr - Logistic Regression
svm - Support Vector Machine
dt - Decision Tree
rf - Random Forest
ab - AdaBoost
gb - Gradient Boosting
knn_clf - K-Nearest Neighbors
gnb - Gaussian Naive Bayes
mlp - Multi-Layer Perceptron
lda - Linear Discriminant Analysis
qda - Quadratic Discriminant Analysis

XGBoost-based:

xgb - XGBoost Classifier
xgb_rf - XGBoost Random Forest

Resamplers (15+)

Imblearn-based:

rus - Random Under-Sampling
ros - Random Over-Sampling
nm - Near Miss
cnn - Condensed Nearest Neighbor
enn - Edited Nearest Neighbors
allknn - All K-NN
smote_enn - SMOTE + ENN
smote_tomek - SMOTE + Tomek Links

SMOTE-based:

smote - SMOTE
borderline_smote - Borderline-SMOTE
svm_smote - SVM-SMOTE
adasyn - ADASYN
kmeans_smote - K-Means SMOTE

Survival Models (9)

CPH - Cox Proportional Hazards
RSF - Random Survival Forest
SVM - Survival SVM
KSVM - Kernel Survival SVM
LASSO - LASSO Cox
L1 - L1-penalized Cox
L2 - L2-penalized Cox
CSA - Component-wise Gradient Boosting
LRSF - Linear Random Survival Forest

Survival Resamplers (3)

rus - Random Under-Sampling (survival-aware)
ros - Random Over-Sampling (survival-aware)
smote - SMOTE (survival-aware)

Unsupervised Models

Clustering (6):

kmeans - K-Means Clustering
dbscan - DBSCAN
hierarchical - Agglomerative Clustering
gmm - Gaussian Mixture Model
meanshift - Mean Shift
spectral - Spectral Clustering

Dimensionality Reduction (6):

pca - Principal Component Analysis
tsne - t-SNE
umap - UMAP
svd - Truncated SVD
ica - Independent Component Analysis
nmf - Non-negative Matrix Factorization

Anomaly Detection (4):

isoforest - Isolation Forest
ocsvm - One-Class SVM
lof - Local Outlier Factor
elliptic - Elliptic Envelope

Survival Unsupervised (2):

survival_tree - Survival Tree (subgroup discovery)
survival_kmeans - K-Means on survival data

Hybrid Methods (2)

autosmote - AutoSMOTE (adaptive SMOTE with RL)
autorsp - Automated Resampler Selection

AutoML Frameworks (3)

autosklearn - Auto-sklearn
tpot - TPOT
h2o - H2O AutoML

🏛️ Architecture

Docker-Based Design

AutoImblearn uses a client-server architecture where each model runs in an isolated Docker container:

┌─────────────────┐
│   Python Client │  ←→  Flask REST API in Docker
│   (run.py)      │      (Docker/app.py)
└─────────────────┘

Benefits:

Isolation: Each model has its own dependencies
Reproducibility: Consistent environment across machines
Scalability: Easy to deploy on clusters
Security: Sandboxed execution
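
To make the client-server split concrete, here is a self-contained sketch of the request/response pattern. It is not AutoImblearn's code: the standard library's `http.server` stands in for the Flask app that would run inside a container, and the `/predict` route, the JSON fields `X` and `predictions`, and the toy model are all assumptions made for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ModelHandler(BaseHTTPRequestHandler):
    """Stand-in for the model API that would run inside a container."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Toy "model": predict 1 when the feature sum is positive.
        predictions = [int(sum(row) > 0) for row in payload["X"]]
        body = json.dumps({"predictions": predictions}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # Silence per-request logging.


def call_model(base_url, rows):
    """Client side: POST feature rows as JSON and return the predictions."""
    request = urllib.request.Request(
        f"{base_url}/predict",  # Hypothetical route name.
        data=json.dumps({"X": rows}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler keeps the loopback call from going through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["predictions"]


def demo():
    server = HTTPServer(("127.0.0.1", 0), ModelHandler)  # Port 0 = any free port.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}"
        return call_model(url, [[1.0, 2.0], [-3.0, 1.0]])
    finally:
        server.shutdown()
        server.server_close()


print(demo())  # → [1, 0]
```

Because the client only exchanges JSON over HTTP, the model side can carry whatever dependencies it needs without affecting the caller's environment.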

Pipeline Execution Flow

1. Data Loading
   ↓
2. K-Fold Splitting (on raw data)
   ↓
3. For each fold:
   a. Imputation (FIT on train, TRANSFORM both)
   b. Resampling (ONLY on train)
   c. Model Training
   d. Prediction & Evaluation
   ↓
4. Average Results
   ↓
5. Save & Cache
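
The ordering in this flow is what prevents data leakage: the imputer learns its statistics from the training fold only, and resampling never touches the test fold. The sketch below walks through steps 2 to 4 in plain Python. Median imputation, random oversampling, and a nearest-centroid classifier scored by accuracy are simple stand-ins chosen for brevity; every function name here is illustrative rather than taken from AutoImblearn.

```python
import random
from statistics import median


def kfold_indices(n_rows, n_splits, seed=0):
    """Shuffle row indices and deal them into n_splits folds."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[i::n_splits] for i in range(n_splits)]


def fit_median_imputer(rows):
    """Learn one median per column from the observed (non-None) values."""
    return [median(v for v in col if v is not None) for col in zip(*rows)]


def impute(rows, medians):
    """Replace each None with the median learned for its column."""
    return [[m if v is None else v for v, m in zip(row, medians)] for row in rows]


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows += members + extra
        out_labels += [label] * target
    return out_rows, out_labels


def fit_centroids(rows, labels):
    """Toy classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def cross_validate(X, y, n_splits=3):
    scores = []
    for test_idx in kfold_indices(len(X), n_splits):            # 2. split raw data
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]
        medians = fit_median_imputer(X_train)                   # 3a. fit on train
        X_train, X_test = impute(X_train, medians), impute(X_test, medians)
        X_train, y_train = random_oversample(X_train, y_train)  # 3b. train only
        model = fit_centroids(X_train, y_train)                 # 3c. train
        hits = sum(predict(model, r) == t for r, t in zip(X_test, y_test))
        scores.append(hits / len(y_test))                       # 3d. evaluate
    return sum(scores) / len(scores)                            # 4. average


# Small imbalanced toy dataset (9 negatives, 3 positives) with missing values.
X = [[0.1, 1.0], [0.2, None], [0.0, 1.2], [0.3, 0.9], [None, 1.1], [0.2, 1.0],
     [0.1, 0.8], [0.25, 1.05], [0.15, 0.95], [5.0, 6.0], [5.2, None], [4.8, 6.1]]
y = [0] * 9 + [1] * 3
print(round(cross_validate(X, y), 3))
```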

Intelligent Caching

Imputation results are cached per fold to avoid redundant computation:

# Cached file: interim/{dataset}/imp_{imputer}_fold{n}.p
if cached_file_exists:
    load_from_cache()  # Fast!
else:
    run_imputation()
    save_to_cache()
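
A runnable version of the pseudocode above might look like the following. It is a sketch, assuming the cache is a pickle file laid out as in the comment (`interim/{dataset}/imp_{imputer}_fold{n}.p`); `cached_imputation` and `slow_imputation` are hypothetical names, and a temporary directory stands in for `interim/`.

```python
import pickle
import tempfile
from pathlib import Path


def cached_imputation(cache_dir, dataset, imputer_name, fold, compute):
    """Return the cached result for (dataset, imputer, fold), computing it once."""
    path = Path(cache_dir) / dataset / f"imp_{imputer_name}_fold{fold}.p"
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)  # Cache hit: skip the imputation.
    result = compute()                  # Cache miss: run the imputation...
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(result, handle)     # ...and store it for next time.
    return result


calls = []


def slow_imputation():
    calls.append(1)                     # Count how often the real work runs.
    return [[1.0, 2.0], [3.0, 2.0]]


with tempfile.TemporaryDirectory() as interim:
    first = cached_imputation(interim, "diabetes", "knn", 0, slow_imputation)
    second = cached_imputation(interim, "diabetes", "knn", 0, slow_imputation)

print(first == second, len(calls))  # → True 1
```

Keying the file name on dataset, imputer, and fold means any pipeline that shares an imputer can reuse the same imputed folds, whatever resampler or model follows.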

🔧 Configuration

Metrics Supported

Classification:

auroc - Area Under ROC Curve
f1 - F1 Score
precision - Precision
recall - Recall
accuracy - Accuracy

Survival:

c_index - Concordance Index
c_uno - Uno's C-index

Unsupervised:

silhouette - Silhouette Score (clustering)
calinski - Calinski-Harabasz Index (clustering)
davies_bouldin - Davies-Bouldin Index (clustering)
reconstruction - Reconstruction Error (reduction)
log_rank - Log-rank Test (survival clustering)
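
For readers unfamiliar with the survival metrics, the concordance index is the least obvious of those listed above. The sketch below computes one common definition (Harrell's C) in plain Python; whether AutoImblearn's `c_index` uses exactly this estimator is not stated here, so treat it as an illustration of the idea only. Pairs with tied times are skipped for simplicity.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair is comparable when the subject with the shorter time had an
    observed event. The pair is concordant when that subject also has the
    higher risk score; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [100, 200, 150, 300]
events = [True, False, True, False]   # False = censored

# Risk ordered exactly opposite to survival time: every pair is concordant.
print(concordance_index(times, events, [0.9, 0.3, 0.6, 0.1]))  # → 1.0
# One of the five comparable pairs is now ranked the wrong way round.
print(concordance_index(times, events, [0.9, 0.7, 0.6, 0.1]))  # → 0.8
```

Censored subjects still contribute: they can be the longer-surviving member of a pair, but never the one whose event time anchors the comparison.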

Search Budget Controls

automl.find_best(
    max_iterations=100,           # Max pipeline evaluations
    time_budget_seconds=3600,     # Max time (1 hour)
    early_stopping_patience=10    # Stop if no improvement
)
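
How the three budgets could interact is easiest to see in a loop. The following is a minimal sketch, not the library's implementation: `search_with_budget` is a hypothetical function that reuses the parameter names above and returns values in the same order as `find_best` (best pipeline, number of evaluations, best score).

```python
import time


def search_with_budget(candidates, evaluate, max_iterations=100,
                       time_budget_seconds=3600, early_stopping_patience=10):
    """Evaluate candidates in order until one of the three budgets runs out."""
    start = time.monotonic()
    best, best_score = None, float("-inf")
    n_evals = stale = 0
    for candidate in candidates:
        if n_evals >= max_iterations:
            break                                   # Evaluation budget spent.
        if time.monotonic() - start >= time_budget_seconds:
            break                                   # Wall-clock budget spent.
        score = evaluate(candidate)
        n_evals += 1
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= early_stopping_patience:
                break                               # No recent improvement.
    return best, n_evals, best_score


# Made-up scores for five candidates, purely to exercise the stopping rules.
toy_scores = {"a": 0.6, "b": 0.7, "c": 0.65, "d": 0.64, "e": 0.9}

print(search_with_budget(toy_scores, toy_scores.get, early_stopping_patience=2))
# → ('b', 4, 0.7)
```

With a patience of 2 the loop stops after `c` and `d` fail to beat `b`, so `e` is never evaluated. That is the trade-off early stopping makes: less compute in exchange for the risk of missing a later improvement.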

🌐 Web Interface

AutoImblearn includes a Django web frontend for interactive pipeline configuration:

Features:

Visual Pipeline Builder: Drag-and-drop interface
Dataset Upload: CSV file handling
Feature Analysis: Distribution plots and categorical detection
Pipeline Type Selection: Choose from 8 pipeline types
Model Selection: Multi-select from available models
Training Dashboard: Real-time progress tracking
Results Visualization: Performance metrics and comparisons

Launch Web Interface:

cd django_frontend
python manage.py runserver

Navigate to http://localhost:8000 to access the interface.

📚 Advanced Usage

Custom Pipeline Search

from AutoImblearn.core.autoimblearn import AutoImblearn

# `run_pipe` is the RunPipe instance created in the Quick Start
automl = AutoImblearn(run_pipe, metric='auroc')

# Restrict search space
automl.imputers = ['knn', 'median']
automl.resamplers = ['smote', 'adasyn']
automl.classifiers = ['lr', 'rf', 'xgb']

# Run search with custom space
best_pipeline, n_evals, best_score = automl.find_best(
    max_iterations=30,
    time_budget_seconds=1800
)
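
This README describes the pipeline search as greedy but does not spell out the procedure. One plausible reading, sketched below, optimises one stage at a time while holding the others fixed, which needs far fewer evaluations than trying every combination. `greedy_search`, `toy_evaluate`, and the scores are hypothetical; the numbers are made up for the demonstration and are not benchmark results.

```python
def greedy_search(search_space, evaluate):
    """Optimise one stage at a time, keeping the other stages fixed.

    search_space maps stage name -> list of options (dict order = stage order).
    evaluate takes a list of component names and returns a score to maximise.
    """
    current = [options[0] for options in search_space.values()]
    best_score = evaluate(current)
    n_evals = 1
    for position, options in enumerate(search_space.values()):
        for option in options:
            if option == current[position]:
                continue                  # Already scored as part of current.
            trial = current[:position] + [option] + current[position + 1:]
            score = evaluate(trial)
            n_evals += 1
            if score > best_score:
                current, best_score = trial, score
    return current, n_evals, best_score


space = {
    "imputers": ["knn", "median"],
    "resamplers": ["smote", "adasyn"],
    "classifiers": ["lr", "rf", "xgb"],
}

# Made-up per-component contributions standing in for a real pipeline run.
TOY_SCORES = {"knn": 0.02, "median": 0.01, "smote": 0.03, "adasyn": 0.01,
              "lr": 0.70, "rf": 0.75, "xgb": 0.74}


def toy_evaluate(pipeline):
    return round(sum(TOY_SCORES[name] for name in pipeline), 2)


print(greedy_search(space, toy_evaluate))
# → (['knn', 'smote', 'rf'], 5, 0.8)
```

The restricted space has 2 x 2 x 3 = 12 combinations, yet this pass evaluates only 5. The cost is that a stage-wise search can miss combinations whose components only work well together.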

Survival Data Format

Survival data requires a structured array with two fields:

from sksurv.util import Surv

# Create survival array
y = Surv.from_arrays(
    event=[True, False, True, False],      # Event occurred?
    time=[100, 200, 150, 300],             # Time to event/censoring
    name_event='Status',                   # Field names; sksurv's defaults
    name_time='Survival_in_days',          # are 'event' and 'time'
)

# Structured array format:
# dtype=[('Status', bool), ('Survival_in_days', float)]

Direct Model Usage

from AutoImblearn.pipelines import classifiers, resamplers, imputers

# `args`, `X_train`, `y_train`, and `X_test` are assumed to be defined already

# Instantiate specific models
imputer_factory = imputers['knn']
imputer = imputer_factory(data_folder='/data')

resampler_factory = resamplers['smote']
resampler = resampler_factory(data_folder='/data')

classifier_factory = classifiers['lr']
classifier = classifier_factory(data_folder='/data')

# Use models: impute, resample the training data only, then fit
X_train_imputed = imputer.fit_transform(args, X_train)
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train_imputed, y_train)
classifier.fit(X_train_resampled, y_train_resampled)

# X_test must not contain missing values at this point; if it does,
# impute it with the fitted imputer before predicting
predictions = classifier.predict(X_test)

🐛 Development

Project Structure

AutoImblearn/
├── components/
│   ├── classifiers/          # Classification models
│   ├── resamplers/           # Resampling techniques
│   ├── imputers/             # Imputation methods
│   ├── survival/             # Survival analysis models
│   │   ├── _supervised/      # Survival models (CPH, RSF, etc.)
│   │   ├── _resamplers/      # Survival-aware resampling
│   │   └── _unsupervised/    # Survival clustering
│   ├── unsupervised/         # Unsupervised learning
│   │   ├── _clustering/      # Clustering algorithms
│   │   ├── _reduction/       # Dimensionality reduction
│   │   └── _anomaly/         # Anomaly detection
│   ├── automls/              # AutoML frameworks
│   ├── hybrids/              # Hybrid methods
│   └── api/                  # Base API classes
├── core/
│   ├── runpipe.py            # Pipeline execution
│   ├── autoimblearn.py       # AutoML search
│   └── pipeline_strategies.py # Strategy pattern
├── pipelines/                # Pipeline wrappers
├── processing/               # Data preprocessing utilities
└── utils/                    # Helper functions

Building Docker Images

Each model has its own Dockerfile:

# Build a specific model image
cd AutoImblearn/components/classifiers/_sklearnbased
docker build -t sklearn-classifier-api .

# Build all images
cd AutoImblearn
./build_all_images.sh  # If script exists

Running Tests

# Install dev dependencies
pip install AutoImblearn[dev]

# Run tests
pytest tests/

# Run with coverage
pytest --cov=AutoImblearn tests/

📖 Citation

If you use AutoImblearn in your research, please cite:

@software{autoimblearn2024,
  title = {AutoImblearn: Automated Machine Learning for Imbalanced Medical Data},
  author = {Wang, Hank},
  year = {2024},
  version = {0.3.0},
  url = {https://github.com/Wanghongkua/Auto-Imblearn2}
}

📄 License

This project is licensed under the BSD 3-Clause License. See LICENSE for details.

🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

🙏 Acknowledgments

Built on top of scikit-learn, imbalanced-learn, and scikit-survival
Docker-based architecture inspired by microservices design patterns
AutoML search adapted from CASH (Combined Algorithm Selection and Hyperparameter optimization)

📧 Contact

Author: Hank Wang
Email: hankwang1991@gmail.com

For bug reports and feature requests, please use the GitHub Issues page.

Happy AutoML-ing! 🚀

This version: 0.3.0 (released Oct 14, 2025)
