INGOT-DR (INterpretable GrOup Testing for Drug Resistance)

INGOT-DR (INterpretable GrOup Testing for Drug Resistance) is an interpretable rule-based predictive model based on Group Testing and Boolean Compressed Sensing. For more details and citation, please see the INGOT-DR paper. To access the scripts used to produce the results in the paper, please visit the INGOT-DR Project. To access the data used in the paper, please visit/cite the M. tuberculosis dataset for drug resistance.

Installation

INGOT-DR can be installed from PyPI.

pip install ingotdr

Usage

INGOT-DR is implemented as a scikit-learn classifier. As a result, it is compatible with most scikit-learn tools (e.g. cross-validation and hyper-parameter tuning). The following sections provide some usage examples:
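
Conceptually, the fitted model is a small disjunctive (OR) rule over binary features: an isolate is predicted resistant if it carries at least one of the selected features. The NumPy sketch below illustrates that idea only; the toy matrix, rule vector, and labels are made up and this is not the library's internal implementation.

```python
import numpy as np

# Toy binary SNP matrix: rows = isolates, columns = SNPs (made-up data)
X = np.array([[1, 0, 0],   # isolate carrying SNP 0
              [0, 1, 0],   # isolate carrying SNP 1
              [0, 0, 0],   # isolate carrying no SNPs
              [1, 1, 0]])  # isolate carrying SNPs 0 and 1

# Hypothetical binary feature-importance vector, analogous to solution()
w = np.array([1, 0, 1])

# Disjunctive rule: predict resistant iff the isolate has any selected SNP
y_pred = (X @ w >= 1).astype(int)
print(y_pred)  # [1 0 0 1]
```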

Arguments

ingot.INGOTClassifier( w_weight=1, lambda_p=1, lambda_z=1, lambda_e=1, false_positive_rate_upper_bound=None,
                       false_negative_rate_upper_bound=None, max_rule_size=None, rounding_threshold=1e-5,
                       lp_relaxation=False, only_slack_lp_relaxation=False, lp_rounding_threshold=0,
                       is_it_noiseless=False, solver_name='PULP_CBC_CMD', solver_options=None)
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| w_weight | vector, float | A vector or float providing prior weights for w. | 1.0 |
| lambda_p | float | Regularization coefficient for positive labels. | 1.0 |
| lambda_z | float | Regularization coefficient for negative/zero labels. | 1.0 |
| lambda_e | float | Regularization coefficient for all slack variables. | 1.0 |
| false_positive_rate_upper_bound | float | False positive rate (FPR) upper bound. | None |
| false_negative_rate_upper_bound | float | False negative rate (FNR) upper bound. | None |
| max_rule_size | int | Maximum rule size. | None |
| rounding_threshold | float | Threshold for rounding ILP solutions to 0 and 1. | 1e-5 |
| lp_relaxation | bool | A flag to use the LP-relaxed version. | False |
| only_slack_lp_relaxation | bool | A flag to LP-relax only the slack variables. | False |
| lp_rounding_threshold | float | Threshold for rounding LP solutions to 0 and 1. Ranges from 0 to 1. | 0.0 |
| is_it_noiseless | bool | A flag specifying whether the problem is noisy or noiseless. | False |
| solver_name | str | Solver name, as provided by PuLP. | 'PULP_CBC_CMD' |
| solver_options | dict | Solver options, as provided by PuLP. | None |

Methods

| Method | Description |
| --- | --- |
| fit(X, y) | Fit the model to the given data. |
| get_params_dictionary(variable_type='w') | Return a dictionary of individuals and their status as obtained by the decoder. variable_type selects the type of variable, e.g. 'w', 'ep' or 'en'. |
| solution() | Return a binary feature-importance vector: 1 if the feature is used in the model, 0 otherwise. |
| predict(X) | Return predicted labels for X. |
| score(X, y) | Return the accuracy of self.predict(X) with respect to y. |
| learned_rule(return_type='feature_name') | Return a list of rules. return_type can be 'feature_name' or 'feature_id'. |
| write(fileType='mps', **kwargs) | Write the problem to a file. fileType can be 'mps', 'lp', 'json' or 'display'; 'display' prints the ILP/LP problem to the screen. |

Training and evaluation

Example: The following trains a classifier to predict resistance to the second-line drug Ciprofloxacin in TB isolates. In this example, the feature matrix indicates the presence/absence of SNPs in each isolate, and the label vector gives the drug-resistance phenotype. Sample data is available here.

from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot

feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector =  'ciprofloxacinLabel.csv'

X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)

clf = ingot.INGOTClassifier(lambda_p=10, lambda_z=0.01, false_positive_rate_upper_bound=0.1,
                            max_rule_size=20, solver_name='CPLEX_PY')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print("Accuracy: {}".format(clf.score(X_test, y_test)))
print("Features in the learned rule: {}".format(clf.learned_rule()))

Output:

Note: Results may slightly vary for different solvers. Please see Choosing the solver.

Balanced accuracy: 0.8449477351916377
Accuracy: 0.9550561797752809
Features in the learned rule: ['7570, C, T', '7572, T, C', '7581, G, T', '7582, A, C', '7582, A, G']
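
Balanced accuracy, reported above alongside plain accuracy because the resistance classes are imbalanced, is the mean of sensitivity (recall on positives) and specificity (recall on negatives). A quick NumPy check with made-up labels (not the Ciprofloxacin data):

```python
import numpy as np

# Toy labels for illustration only
y_true = np.array([1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0])

sensitivity = np.mean(y_pred[y_true == 1] == 1)  # 1 of 2 positives -> 0.5
specificity = np.mean(y_pred[y_true == 0] == 0)  # 4 of 4 negatives -> 1.0

# Balanced accuracy = mean of sensitivity and specificity
print((sensitivity + specificity) / 2)  # 0.75
```

This matches `sklearn.metrics.balanced_accuracy_score` on the same labels.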

Hyper-parameter tuning

Hyper-parameter tuning via scikit-learn Grid Search CV:

Example:

from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector =  'ciprofloxacinLabel.csv'

X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)

clf = ingot.INGOTClassifier(false_positive_rate_upper_bound=0.1, max_rule_size=20, solver_name='CPLEX_PY',
                            solver_options={'timeLimit': 1800})

scoring = dict(Accuracy='accuracy', balanced_accuracy=make_scorer(balanced_accuracy_score))
param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)

print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print('Best params: {}'.format(grid.best_params_))

Output:

Balanced accuracy: 0.8449477351916377
Best params: {'lambda_p': 10, 'lambda_z': 0.01}
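
Note that each candidate point in the grid requires solving the ILP once per fold, so the grid size directly multiplies solver time. The stdlib sketch below just counts the work implied by the grid above (the `* 5` assumes `cv=5` as in the example):

```python
from itertools import product

# Same grid as in the example above
param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}

# Enumerate all parameter combinations, as GridSearchCV would
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# 3 x 3 = 9 combinations; with cv=5 that is 45 fits (plus one final refit)
print(len(combos), "combinations ->", len(combos) * 5, "CV fits")
```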

Optimizing for different target metric

Note: w_weight and lambda_e are not part of the main ILP (Eq (11)) defined in the INGOT-DR paper. These two variables provide extra freedom when optimizing for a different target metric (Section 1.4) is needed. The complete objective function with these two variables is:

[Image: complete objective function]

Example: The classifier corresponding to Eq (16), with maximum rule size k = 20 and specificity lower bound t = 90% (i.e. false_positive_rate_upper_bound=0.1, since FPR = 1 − specificity), can be defined as follows:

clf = ingot.INGOTClassifier(w_weight=0, lambda_z=0, false_positive_rate_upper_bound=0.1, max_rule_size=20,
                            solver_name='CPLEX_PY')

The following table shows the combinations of arguments needed to define some of the ILPs in the paper:

| lp_relaxation | only_slack_lp_relaxation | is_it_noiseless | Equation number in the paper |
| --- | --- | --- | --- |
| False | False | False | Eq (11) |
| False | True | True | Eq (3) |
| False | True | False | Eq (4) with the objective function of Eq (11) |
| False | False | True | Eq (3) |
| True | True | False | LP relaxation of Eq (4) with the objective function of Eq (11) |
| True | False | False | LP relaxation of Eq (4) with the objective function of Eq (11) |
| True | False | True | LP relaxation of Eq (3) |
| True | True | True | LP relaxation of Eq (3) |

Note: A True value of lp_relaxation or is_it_noiseless overrides only_slack_lp_relaxation, i.e. if either of them is True, the value of only_slack_lp_relaxation does not matter.

Note: To recreate and work with Eq (4), use the combination in row 3 and use or tune lambda_e instead of lambda_p and lambda_z. For example:

param_grid={'lambda_e': [0.01, 0.1,  1, 10, 100 ]}
grid = GridSearchCV(estimator=clf, param_grid= param_grid, scoring=scoring, cv=5, refit ='balanced_accuracy',
                    n_jobs=-1, verbose= 3)
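
Assuming ingotdr is installed, the rows of the table above map to constructor calls like the following. This is a configuration sketch only (no data is fitted here), using the argument names documented earlier:

```python
import ingot

# Row 1 -- Eq (11): the default noisy ILP
clf_eq11 = ingot.INGOTClassifier()

# Row 2 -- Eq (3): noiseless ILP (only_slack_lp_relaxation is ignored here)
clf_eq3 = ingot.INGOTClassifier(is_it_noiseless=True)

# Row 3 -- Eq (4) with Eq (11)'s objective: relax only the slack variables
clf_eq4 = ingot.INGOTClassifier(only_slack_lp_relaxation=True)

# Rows 5-6 -- full LP relaxation of Eq (4) with Eq (11)'s objective
clf_lp = ingot.INGOTClassifier(lp_relaxation=True)
```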

Choosing the solver

INGOT-DR supports a variety of solvers through the PuLP application programming interface (API), including GLPK, COIN-OR CLP/CBC, CPLEX, GUROBI, MOSEK, XPRESS, CHOCO, MIPCL and SCIP.

To list the available solvers on your machine:

import pulp as pl
solver_list = pl.listSolvers(onlyAvailable=True)
print(solver_list)

The name and options of the solver can be specified via solver_name and solver_options, e.g.:

clf = ingot.INGOTClassifier(solver_name='CPLEX_PY', solver_options={'timeLimit': 1800})

In the INGOT-DR paper, 'CPLEX_PY' is the main solver. Results may vary slightly between solvers. IBM CPLEX for academic use is available here.

Citation:

For general use, please cite our paper: INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis. (bibtex)
