# AutoPrognosis

A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.
## :key: Features
- :rocket: Automatically learns ensembles of pipelines for classification or survival analysis.
- :cyclone: Easy-to-extend, pluginable architecture.
- :fire: Interpretability tools.
## :rocket: Installation
### Using pip
The library can be installed from PyPI using

```bash
$ pip install autoprognosis
```

or from source, using

```bash
$ pip install .
```
### Redis (Optional, but recommended)
AutoPrognosis can use Redis as a backend to improve the performance and quality of the searches.
For that, install the redis-server package following the steps described on the official site.
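For example, on a Debian/Ubuntu-based system (an assumption; other platforms should follow the official instructions), the server can be installed and checked with:

```bash
$ sudo apt-get install redis-server
$ redis-server --version
```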
## :boom: Sample Usage
More advanced use cases can be found in our tutorials section.
### List the available classifiers
```python
from autoprognosis.plugins.prediction.classifiers import Classifiers

print(Classifiers().list_available())
```
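Individual plugins can also be used directly, outside of a study. A minimal sketch, assuming the classifier plugins follow a scikit-learn style `fit`/`predict` interface:

```python
from sklearn.datasets import load_breast_cancer

from autoprognosis.plugins.prediction.classifiers import Classifiers

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# instantiate a single classifier plugin by name
model = Classifiers().get("logistic_regression")

# assumed scikit-learn style interface
model.fit(X, y)
print(model.predict(X.head()))
```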
### Create a study for classifiers
```python
from pathlib import Path

from sklearn.datasets import load_breast_cancer

from autoprognosis.studies.classifiers import ClassifierStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator

X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

df = X.copy()
df["target"] = Y

workspace = Path("workspace")
study_name = "example"

study = ClassifierStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
    num_iter=100,  # how many trials to do for each candidate
    timeout=60,  # seconds
    classifiers=["logistic_regression", "lda", "qda"],
    workspace=workspace,
)

study.run()

output = workspace / study_name / "model.p"
model = load_model_from_file(output)

metrics = evaluate_estimator(model, X, Y)

print(f"model {model.name()} -> {metrics['clf']}")
```
### List available survival analysis estimators
```python
from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation

print(RiskEstimation().list_available())
```
### Survival analysis study
```python
# stdlib
from pathlib import Path

# third party
import numpy as np
from pycox import datasets

# autoprognosis absolute
from autoprognosis.studies.risk_estimation import RiskEstimationStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator

df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]

X = df.drop(columns=["duration"])
T = df["duration"]
Y = df["event"]

eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]

workspace = Path("workspace")
study_name = "example_risks"

study = RiskEstimationStudy(
    study_name=study_name,
    dataset=df,
    target="event",
    time_to_event="duration",
    time_horizons=eval_time_horizons,
    num_iter=10,
    num_study_iter=1,
    timeout=10,
    risk_estimators=["cox_ph", "survival_xgboost"],
    score_threshold=0.5,
    workspace=workspace,
)

study.run()

output = workspace / study_name / "model.p"
model = load_model_from_file(output)

metrics = evaluate_survival_estimator(model, X, T, Y, eval_time_horizons)

print(f"Model {model.name()} score: {metrics['clf']}")
```
## :high_brightness: Tutorials
### Plugins

### AutoML

- Classification tasks
- Classification tasks with imputation
- Survival analysis tasks
- Survival analysis tasks with imputation
## :cyclone: Building a demonstrator
After running a study, a model template will be available in the workspace, in the `model.p` file. Based on this template, you can create a demonstrator using the `scripts/build_demonstrator.py` script:
```
Usage: build_demonstrator.py [OPTIONS]

Options:
  --name TEXT                The title of the demonstrator
  --task_type TEXT           classification/risk_estimation
  --dashboard_type TEXT      streamlit or dash. Default: streamlit
  --dataset_path TEXT        Path to the dataset csv
  --model_path TEXT          Path to the model template, usually model.p
  --time_column TEXT         Only for risk_estimation tasks. Which column in
                             the dataset is used for time-to-event
  --target_column TEXT       Which column in the dataset is the outcome
  --horizons TEXT            Only for risk_estimation tasks. Which time
                             horizons to plot.
  --explainers TEXT          Which explainers to include. There can be multiple
                             explainer names, separated by a comma. Available
                             explainers:
                             kernel_shap,invase,shap_permutation_sampler,lime.
  --imputers TEXT            Which imputer to use. Available imputers:
                             ['sinkhorn', 'EM', 'mice', 'ice', 'hyperimpute',
                             'most_frequent', 'median', 'missforest',
                             'softimpute', 'nop', 'mean', 'gain']
  --plot_alternatives TEXT   Only for risk_estimation. List of categorical
                             columns by which to split the graphs. For example,
                             plot outcome for different treatments available.
  --output TEXT              Where to save the demonstrator files. The content
                             of the folder can be directly used for
                             deployments (for example, to Heroku).
  --help                     Show this message and exit.
```
### Build a demonstrator for a classification task
For this task, the script needs access to the model template `workspace/model.p` (generated after running a study), the baseline dataset `dataset.csv`, and the target column `target` in the dataset, which contains the outcomes. Based on that, the demonstrator can be built using:
```bash
python ./scripts/build_demonstrator.py \
    --model_path=workspace/model.p \
    --dataset_path=dataset.csv \
    --target_column=target \
    --task_type=classification
```
The result is a folder, `output/image_bin`, containing all the files necessary for running the demonstrator. You can start the demonstrator using

```bash
cd output/image_bin/
pip install -r ./requirements.txt
python ./app.py
```

The contents of `output/image_bin` can be used for cloud deployments, for example, to Heroku. Optionally, you can use the `--output` option to change where the output files are stored; the default is `output/image_bin`.
### Build a demonstrator for a survival analysis task
For this task, the script needs access to the model template `workspace/model.p` (generated after running a study), the baseline dataset `dataset.csv`, the target column `target` in the dataset, the time-to-event column `time_to_event`, and the time horizons to plot. Based on that, the demonstrator can be built using:
```bash
# use your own time horizons here, separated by a comma
python ./scripts/build_demonstrator.py \
    --model_path=workspace/model.p \
    --dataset_path=dataset.csv \
    --time_column=time_to_event \
    --target_column=target \
    --horizons="14,27,41" \
    --task_type=risk_estimation
```
The result is a folder, `output/image_bin`, containing all the files necessary for running the demonstrator. You can start the demonstrator using

```bash
cd output/image_bin/
pip install -r ./requirements.txt
python ./app.py
```

The contents of `output/image_bin` can be used for cloud deployments, for example, to Heroku.
### Customizing the demonstrator
You can customize your demonstrator by selecting multiple explainers:
```bash
python ./scripts/build_demonstrator.py \
    --model_path=workspace/model.p \
    --dataset_path=dataset.csv \
    --target_column=target \
    --task_type=classification \
    --explainers="invase,kernel_shap"
```
## :zap: Plugins
### Imputation methods
```python
from autoprognosis.plugins.imputers import Imputers

imputer = Imputers().get(<NAME>)
```
Name | Description |
---|---|
hyperimpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets |
mean | Replace the missing values using the mean along each column with SimpleImputer |
median | Replace the missing values using the median along each column with SimpleImputer |
most_frequent | Replace the missing values using the most frequent value along each column with SimpleImputer |
missforest | Iterative imputation method based on Random Forests using IterativeImputer and ExtraTreesRegressor |
ice | Iterative imputation method based on regularized linear regression using IterativeImputer and BayesianRidge |
mice | Multiple imputations based on ICE using IterativeImputer and BayesianRidge |
softimpute | Low-rank matrix approximation via nuclear-norm regularization |
EM | Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization) - EM imputation algorithm |
gain | GAIN: Missing Data Imputation using Generative Adversarial Nets |
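A minimal usage sketch, assuming the imputer plugins expose a `fit_transform` method:

```python
import numpy as np
import pandas as pd

from autoprognosis.plugins.imputers import Imputers

# toy DataFrame with missing cells
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

imputer = Imputers().get("mean")
print(imputer.fit_transform(df))  # missing values replaced by column means
```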
### Preprocessing methods
```python
from autoprognosis.plugins.preprocessors import Preprocessors

preprocessor = Preprocessors().get(<NAME>)
```
Name | Description |
---|---|
maxabs_scaler | Scale each feature by its maximum absolute value. MaxAbsScaler |
scaler | Standardize features by removing the mean and scaling to unit variance. StandardScaler |
feature_normalizer | Normalize samples individually to unit norm. Normalizer |
normal_transform | Transform features using quantile information, mapped to a normal distribution. QuantileTransformer |
uniform_transform | Transform features using quantile information, mapped to a uniform distribution. QuantileTransformer |
minmax_scaler | Transform features by scaling each feature to a given range. MinMaxScaler |
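A similar sketch for preprocessors, under the same `fit_transform` assumption:

```python
import pandas as pd

from autoprognosis.plugins.preprocessors import Preprocessors

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

scaler = Preprocessors().get("minmax_scaler")
print(scaler.fit_transform(df))  # each column rescaled to [0, 1]
```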
### Classification
```python
from autoprognosis.plugins.prediction.classifiers import Classifiers

classifier = Classifiers().get(<NAME>)
```
Name | Description |
---|---|
neural_nets | PyTorch based neural net classifier. |
logistic_regression | LogisticRegression |
catboost | Gradient boosting on decision trees - CatBoost |
random_forest | A random forest classifier. RandomForestClassifier |
tabnet | TabNet : Attentive Interpretable Tabular Learning |
xgboost | XGBoostClassifier |
### Survival Analysis
```python
from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation

predictor = RiskEstimation().get(<NAME>)
```
Name | Description |
---|---|
survival_xgboost | XGBoost Survival Embeddings |
loglogistic_aft | Log-Logistic AFT model |
deephit | DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks |
cox_ph | Cox’s proportional hazard model |
weibull_aft | Weibull AFT model. |
lognormal_aft | Log-Normal AFT model |
coxnet | CoxNet is a Cox proportional hazards model also referred to as DeepSurv |
### Regression
```python
from autoprognosis.plugins.prediction.regression import Regression

regressor = Regression().get(<NAME>)
```
Name | Description |
---|---|
tabnet_regressor | TabNet : Attentive Interpretable Tabular Learning |
catboost_regressor | Gradient boosting on decision trees - CatBoost |
random_forest_regressor | RandomForestRegressor |
xgboost_regressor | XGBoostRegressor |
neural_nets_regression | PyTorch-based neural net regressor. |
linear_regression | LinearRegression |
### Explainers
```python
from autoprognosis.plugins.explainers import Explainers

explainer = Explainers().get(<NAME>)
```
Name | Description |
---|---|
risk_effect_size | Feature importance using Cohen's distance between probabilities |
lime | Lime: Explaining the predictions of any machine learning classifier |
symbolic_pursuit | Symbolic Pursuit - Learning outside the black-box: at the pursuit of interpretable models |
shap_permutation_sampler | SHAP Permutation Sampler |
kernel_shap | SHAP KernelExplainer |
invase | INVASE: Instance-wise Variable Selection |
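A usage sketch for explainers; the constructor arguments here (model, background data, labels, task type) are an assumption based on the tutorials, so check the plugin signatures before relying on them:

```python
from sklearn.datasets import load_breast_cancer

from autoprognosis.plugins.explainers import Explainers
from autoprognosis.plugins.prediction.classifiers import Classifiers

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = Classifiers().get("logistic_regression")

# assumed signature: name, model, background data, labels, task type
explainer = Explainers().get(
    "kernel_shap",
    model,
    X,
    y,
    task_type="classification",
)
print(explainer.explain(X.head()))  # per-feature attributions for a few rows
```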
### Uncertainty
```python
from autoprognosis.plugins.uncertainty import UncertaintyQuantification

model = UncertaintyQuantification().get(<NAME>)
```
Name | Description |
---|---|
cohort_explainer | |
conformal_prediction | |
jackknife | |
## :hammer: Test
After installing the library, the tests can be executed using `pytest`:

```bash
$ pip install .[testing]
$ pytest -vxs -m "not slow"
```
## Citing
If you use this code, please cite the associated paper:
TODO
## References
- AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning
- Prognostication and Risk Factors for Cystic Fibrosis via Automated Machine Learning
- Cardiovascular Disease Risk Prediction using Automated Machine Learning: A Prospective Study of 423,604 UK Biobank Participants