INGOT-DR (INterpretable GrOup Testing for Drug Resistance)
INGOT-DR
INGOT-DR (INterpretable GrOup Testing for Drug Resistance) is an interpretable rule-based predictive model based on Group Testing and Boolean Compressed Sensing. For more details and citation, please see the INGOT-DR paper. To access the scripts used to produce the results in the paper, please visit the INGOT-DR Project. To access the data used in the paper, please visit/cite the M. tuberculosis dataset for drug resistance.
Installation
INGOT-DR can be installed from PyPI.
pip install ingotdr
Usage
INGOT-DR is implemented as a scikit-learn classifier. As a result, it is compatible with most scikit-learn tools (e.g. cross-validation and hyper-parameter tuning tools). The following sections provide some usage examples.
Arguments
ingot.INGOTClassifier(w_weight=1, lambda_p=1, lambda_z=1, lambda_e=1, false_positive_rate_upper_bound=None,
                      false_negative_rate_upper_bound=None, max_rule_size=None, rounding_threshold=1e-5,
                      lp_relaxation=False, only_slack_lp_relaxation=False, lp_rounding_threshold=0,
                      is_it_noiseless=False, solver_name='PULP_CBC_CMD', solver_options=None)
Name | Type | Description | Default |
---|---|---|---|
w_weight | vector or float | Prior weight for w, given either as a vector or as a single float. | 1.0 |
lambda_p | float | Regularization coefficient for positive labels. | 1.0 |
lambda_z | float | Regularization coefficient for negative/zero labels. | 1.0 |
lambda_e | float | Regularization coefficient for all slack variables. | 1.0 |
false_positive_rate_upper_bound | float | False positive rate (FPR) upper bound. | None |
false_negative_rate_upper_bound | float | False negative rate (FNR) upper bound. | None |
max_rule_size | int | Maximum rule size. | None |
rounding_threshold | float | Threshold for rounding ILP solutions to 0 or 1. | 1e-5 |
lp_relaxation | bool | Flag to use the LP-relaxed version of the problem. | False |
only_slack_lp_relaxation | bool | Flag to LP-relax only the slack variables. | False |
lp_rounding_threshold | float | Threshold for rounding LP solutions to 0 or 1. Must lie in the range 0 to 1. | 0.0 |
is_it_noiseless | bool | A flag to specify whether the problem is noisy or noiseless. | False |
solver_name | str | Name of the solver, as provided by PuLP. | 'PULP_CBC_CMD' |
solver_options | dict | Solver options, as provided by PuLP. | None |
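The rounding thresholds in the table can be pictured with a small sketch. It shows one plausible reading of lp_rounding_threshold, assuming that a fractional LP value strictly greater than the threshold is rounded to 1 and everything else to 0. This is an assumption made for illustration, not the package's code; round_lp_solution and the sample values are hypothetical.

```python
def round_lp_solution(values, threshold=0.0):
    """Map fractional LP values in [0, 1] to binary values.

    Assumption (not taken from the package): a value strictly greater
    than `threshold` becomes 1, anything else becomes 0.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in the range 0 to 1")
    return [1 if v > threshold else 0 for v in values]


# Hypothetical fractional solution of a relaxed problem.
fractional_w = [0.0, 0.2, 0.75, 1.0]

print(round_lp_solution(fractional_w))       # [0, 1, 1, 1]
print(round_lp_solution(fractional_w, 0.5))  # [0, 0, 1, 1]
```

Under this reading, the default of 0.0 keeps every feature that receives any non-zero weight, while a larger threshold yields a shorter rule.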
Methods
Method | Description |
---|---|
fit(X, y) | Fit the model to the given data. |
get_params_dictionary(variable_type='w') | Return a dictionary of individuals with their status as obtained by the decoder. variable_type selects the type of variable, e.g. 'w', 'ep' or 'en'. |
solution() | Return a binary feature-importance vector: 1 if the feature was used in the model, 0 otherwise. |
predict(X) | Return the predicted labels for X. |
score(X, y) | Return the accuracy of self.predict(X) with respect to y. |
learned_rule(return_type='feature_name') | Return a list of rules. return_type can be 'feature_name' or 'feature_id'. |
write(fileType='mps', **kwargs) | Write the problem to a file. fileType can be 'mps', 'lp', 'json' or 'display'; 'display' shows the ILP/LP problem on screen. |
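As a rough illustration of how the outputs of solution(), learned_rule() and predict(X) relate, the sketch below applies a rule as a logical OR over its features, which is the usual group-testing reading (a pooled test is positive if any member is positive). This is an assumption about the semantics, not code from the package; rule_from_solution, predict_with_rule and the toy isolates are hypothetical.

```python
def rule_from_solution(solution, feature_names):
    """Keep the names of the features whose binary indicator is 1."""
    return [name for used, name in zip(solution, feature_names) if used == 1]


def predict_with_rule(samples, rule_features):
    """Predict 1 if a sample carries any feature in the rule, else 0.

    Assumption: the rule is a disjunction (logical OR) of its features.
    """
    return [
        int(any(sample.get(feature, 0) == 1 for feature in rule_features))
        for sample in samples
    ]


feature_names = ["7570, C, T", "7572, T, C", "7582, A, G"]
solution = [1, 0, 1]  # hypothetical binary vector: 1 = feature used in the rule

# Hypothetical isolates: feature name -> presence (1) or absence (0).
isolates = [
    {"7570, C, T": 1, "7572, T, C": 0, "7582, A, G": 0},
    {"7570, C, T": 0, "7572, T, C": 1, "7582, A, G": 0},
    {"7570, C, T": 0, "7572, T, C": 0, "7582, A, G": 1},
]

rule = rule_from_solution(solution, feature_names)
print(rule)                               # ['7570, C, T', '7582, A, G']
print(predict_with_rule(isolates, rule))  # [1, 0, 1]
```

The second isolate carries a SNP that is not part of the rule, so it is predicted as 0.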
Training and evaluation
Example: The following is an example of training a classifier to predict resistance to the second-line drug ciprofloxacin in TB isolates. In this example, the feature matrix indicates the presence/absence of SNPs in TB isolates, and the label vector represents the drug resistance phenotype. Sample data is available here.
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)
clf = ingot.INGOTClassifier(lambda_p=10, lambda_z=0.01, false_positive_rate_upper_bound=0.1,
                            max_rule_size=20, solver_name='CPLEX_PY')
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print("Accuracy: {}".format(clf.score(X_test,y_test)))
print("Features in the learned rule: {}".format(clf.learned_rule()))
Output:
Note: Results may vary slightly between solvers. Please see Choosing the solver.
Balanced accuracy: 0.8449477351916377
Accuracy: 0.9550561797752809
Features in the learned rule: ['7570, C, T', '7572, T, C', '7581, G, T', '7582, A, C', '7582, A, G']
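The gap between accuracy and balanced accuracy in the output above is typical of imbalanced labels. For binary labels, balanced accuracy is the mean of sensitivity and specificity, so missed positives weigh more than they do in plain accuracy. The sketch below recomputes both metrics on hypothetical labels (not the ciprofloxacin data); the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced example: 2 resistant and 8 susceptible isolates,
# with one resistant isolate missed by the classifier.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```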
Hyper-parameter tuning
Hyper-parameter tuning via scikit-learn Grid Search CV:
Example:
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)
clf = ingot.INGOTClassifier(false_positive_rate_upper_bound=0.1, max_rule_size=20, solver_name='CPLEX_PY',
                            solver_options={'timeLimit': 1800})
scoring = dict(Accuracy='accuracy', balanced_accuracy=make_scorer(balanced_accuracy_score))
param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print('Best params: {}'.format(grid.best_params_))
Output:
Balanced accuracy: 0.8449477351916377
Best params: {'lambda_p': 10, 'lambda_z': 0.01}
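To get a feel for the cost of the grid search above, the sketch below enumerates the parameter combinations and counts the cross-validation fits. It uses only the standard library. The worst-case estimate assumes that timeLimit is given in seconds, that each fit solves one problem bounded by that limit, and that fits run one after another; with n_jobs=-1 the wall-clock time would be lower.

```python
from itertools import product

param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}
cv = 5
time_limit_s = 1800  # assumed to be seconds, as in solver_options above

# Expand the grid into one dictionary per candidate setting.
names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

total_fits = len(candidates) * cv  # excludes the final refit on the full training set
worst_case_hours = total_fits * time_limit_s / 3600

print(len(candidates))   # 9
print(total_fits)        # 45
print(worst_case_hours)  # 22.5
```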
Optimizing for different target metric
Note: w_weight and lambda_e are not part of the main ILP (Eq (11)) defined in the INGOT-DR paper. These two variables provide additional freedom when optimizing for a different target metric (section 1.4). The complete objective function with these two variables would be:
Example: The classifier corresponding to Eq (16), with maximum rule size k = 20 and specificity lower bound t = 90%, can be defined as follows:
clf = ingot.INGOTClassifier(w_weight=0, lambda_z=0, false_positive_rate_upper_bound=0.1, max_rule_size=20,
                            solver_name='CPLEX_PY')
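The specificity lower bound t = 90% maps to false_positive_rate_upper_bound=0.1 because specificity equals 1 minus the false positive rate. A minimal sketch with hypothetical labels makes the relationship explicit; false_positive_rate is an illustrative helper, not part of the package.

```python
def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN) for binary 0/1 labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("y_true contains no negative samples")
    return fp / (fp + tn)


# Hypothetical labels: 10 susceptible isolates (one misclassified) and 2 resistant.
y_true = [0] * 10 + [1] * 2
y_pred = [1] + [0] * 9 + [1, 1]

fpr = false_positive_rate(y_true, y_pred)
specificity = 1 - fpr

print(fpr)          # 0.1
print(specificity)  # 0.9
```

A classifier with these predictions sits exactly on the bound: its false positive rate is 0.1, so its specificity is 90%.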
The following table shows the combinations of arguments needed to define some of the ILPs in the paper:
lp_relaxation | only_slack_lp_relaxation | is_it_noiseless | Equation number in the paper |
---|---|---|---|
False | False | False | Eq (11) |
False | True | True | Eq (3) |
False | True | False | Eq (4) with objective function of Eq (11) |
False | False | True | Eq (3) |
True | True | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | True | LP relaxation of Eq (3) |
True | True | True | LP relaxation of Eq (3) |
Note: A True value of lp_relaxation or is_it_noiseless will override only_slack_lp_relaxation, i.e. if either of them is True, the value of only_slack_lp_relaxation has no effect.
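The table and the override note can be restated as a small lookup function. The sketch below only encodes the eight rows of the table; it is not how the package selects a formulation internally, and the function name formulation is hypothetical.

```python
def formulation(lp_relaxation, only_slack_lp_relaxation, is_it_noiseless):
    """Return the equation named in the table for a combination of flags."""
    if is_it_noiseless:
        # The noiseless flag overrides only_slack_lp_relaxation.
        base = "Eq (3)"
    elif lp_relaxation or only_slack_lp_relaxation:
        base = "Eq (4) with objective function of Eq (11)"
    else:
        base = "Eq (11)"
    return f"LP relaxation of {base}" if lp_relaxation else base


print(formulation(False, False, False))  # Eq (11)
print(formulation(False, True, False))   # Eq (4) with objective function of Eq (11)
print(formulation(True, True, True))     # LP relaxation of Eq (3)
```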
Note: To recreate and work with Eq (4), you only need to use the combination in row 3 and use or tune lambda_e instead of lambda_p and lambda_z. For example:
param_grid = {'lambda_e': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
Choosing the solver
INGOT-DR supports a variety of solvers through the PuLP application programming interface (API), including GLPK, COIN-OR CLP/CBC, CPLEX, GUROBI, MOSEK, XPRESS, CHOCO, MIPCL and SCIP.
To list the solvers available on your machine:
import pulp as pl
solver_list = pl.listSolvers(onlyAvailable=True)
The name and properties of the solver can be specified via solver_name and solver_options, e.g.:
clf = ingot.INGOTClassifier(solver_name='CPLEX_PY', solver_options={'timeLimit': 1800})
In the INGOT-DR paper, 'CPLEX_PY' is the main solver. Results may vary slightly between solvers. IBM CPLEX for academic use is available here.
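Since a commercial solver may not be installed everywhere, one option is to fall back to the default solver when the preferred one is missing. The sketch below is a hypothetical helper, not part of INGOT-DR: pick_solver and its preference order are assumptions, and the list of available names is expected to come from pl.listSolvers(onlyAvailable=True).

```python
def pick_solver(available, preferred=("CPLEX_PY", "GUROBI", "PULP_CBC_CMD")):
    """Return the first preferred solver name that is available."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError("none of the preferred solvers is available")


# Hypothetical result of pl.listSolvers(onlyAvailable=True) on a machine without CPLEX.
available = ["GLPK_CMD", "PULP_CBC_CMD"]

print(pick_solver(available))  # PULP_CBC_CMD
```

The returned name can then be passed as solver_name when constructing the classifier.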
Citation:
For general use, please cite our paper: INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis. (bibtex)