# INGOT-DR

**INGOT-DR** (**IN**terpretable **G**r**O**up **T**esting for **D**rug **R**esistance) is an interpretable rule-based
predictive model based on **Group Testing** and **Boolean Compressed Sensing**. For more details and citation information, please see the
INGOT-DR paper. The scripts used to produce the results in the paper are available in the
INGOT-DR Project. If you use the data from the paper,
please visit and cite the
M. tuberculosis dataset for drug resistance.

## Installation

INGOT-DR can be installed from PyPI.

```
pip install ingotdr
```

## Usage

INGOT-DR is implemented as a scikit-learn classifier, so it is compatible with most scikit-learn tools (e.g. cross-validation and hyper-parameter tuning). The following sections provide some usage examples.

### Arguments

```
ingot.INGOTClassifier(w_weight=1, lambda_p=1, lambda_z=1, lambda_e=1, false_positive_rate_upper_bound=None,
                      false_negative_rate_upper_bound=None, max_rule_size=None, rounding_threshold=1e-5,
                      lp_relaxation=False, only_slack_lp_relaxation=False, lp_rounding_threshold=0,
                      is_it_noiseless=False, solver_name='PULP_CBC_CMD', solver_options=None)
```

Name | Type | Description | Default |
---|---|---|---|
w_weight | vector or float | Prior weight(s) for w, given as a vector or a single float. | 1.0 |
lambda_p | float | Regularization coefficient for positive labels. | 1.0 |
lambda_z | float | Regularization coefficient for negative/zero labels. | 1.0 |
lambda_e | float | Regularization coefficient for all slack variables. | 1.0 |
false_positive_rate_upper_bound | float | False positive rate (FPR) upper bound. | None |
false_negative_rate_upper_bound | float | False negative rate (FNR) upper bound. | None |
max_rule_size | int | Maximum rule size. | None |
rounding_threshold | float | Threshold for rounding ILP solutions to 0 or 1. | 1e-5 |
lp_relaxation | bool | Flag to use the LP-relaxed version. | False |
only_slack_lp_relaxation | bool | Flag to LP-relax only the slack variables. | False |
lp_rounding_threshold | float | Threshold for rounding LP solutions to 0 or 1. Ranges from 0 to 1. | 0.0 |
is_it_noiseless | bool | Flag specifying whether the problem is noiseless (True) or noisy (False). | False |
solver_name | str | Name of a solver provided by PuLP. | 'PULP_CBC_CMD' |
solver_options | dict | Options for the solver, as provided by PuLP. | None |
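
The two rate bounds are easier to interpret with the quantities written out. The sketch below assumes the standard definitions, FPR = FP / (FP + TN) and FNR = FN / (FN + TP), with label 1 meaning resistant. The helper `error_rates` and the toy labels are hypothetical and not part of the `ingot` package; the code only illustrates what the bounds refer to, not how the ILP enforces them.

```python
def error_rates(y_true, y_pred):
    """Return (FPR, FNR) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty classes to avoid division by zero
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical labels: 3 resistant and 7 susceptible isolates
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

fpr, fnr = error_rates(y_true, y_pred)
# One susceptible isolate out of seven is misclassified, so the FPR exceeds 0.1
satisfies_fpr_bound = fpr <= 0.1
print(fpr, fnr, satisfies_fpr_bound)
```

Under these definitions, a false positive rate upper bound of 0.1 corresponds to a specificity of at least 90%.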

### Methods

Method | Description |
---|---|
`fit(X,y)` | Fit the model to the given data. |
`get_params_dictionary(variable_type='w')` | Provide a dictionary of individuals with their status obtained by the decoder. `variable_type` is the type of the variable, e.g. 'w', 'ep' or 'en'. |
`solution()` | Provide a binary feature-importance vector, i.e. 1 if the feature was used in the model and 0 otherwise. |
`predict(X)` | Provide predicted labels for X. |
`score(X,y)` | Provide the accuracy of `self.predict(X)` with respect to `y`. |
`learned_rule(return_type='feature_name')` | Return a list of rules. `return_type` can be 'feature_name' or 'feature_id'. |
`write(fileType='mps', **kwargs)` | Create a file from the problem. `fileType` can be 'mps', 'lp', 'json' or 'display'; 'display' shows the ILP/LP problem on screen. |

### Training and evaluation

**Example:**
The following example trains a classifier to predict resistance to the second-line drug *Ciprofloxacin* in TB isolates. In this example, the
feature matrix indicates the presence/absence of SNPs in TB isolates, and the label vector represents the drug resistance phenotype.
Sample data is available here.

```
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

import ingot

# SNP presence/absence matrix and drug resistance labels
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()

# Stratified 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)

# Rule of at most 20 features with a false positive rate upper bound of 0.1
clf = ingot.INGOTClassifier(lambda_p=10, lambda_z=0.01, false_positive_rate_upper_bound=0.1,
                            max_rule_size=20, solver_name='CPLEX_PY')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print("Accuracy: {}".format(clf.score(X_test, y_test)))
print("Features in the learned rule: {}".format(clf.learned_rule()))
```

Output:

**Note:** Results may vary slightly between solvers. Please see Choosing the solver.

```
Balanced accuracy: 0.8449477351916377
Accuracy: 0.9550561797752809
Features in the learned rule: ['7570, C, T', '7572, T, C', '7581, G, T', '7582, A, C', '7582, A, G']
```
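
The learned rule can be read as a disjunction: in the group testing setting that INGOT-DR builds on, a test is positive when it contains at least one defective item, so an isolate would be predicted resistant when it carries at least one SNP from the rule. The following minimal sketch applies the rule printed above to made-up isolates under that assumption. `predict_with_rule` and the isolate dictionaries are hypothetical and not part of the `ingot` package; in practice `clf.predict(X)` should be used.

```python
def predict_with_rule(isolates, rule):
    """Predict 1 (resistant) if an isolate carries any SNP in `rule`, else 0."""
    return [int(any(isolate.get(snp, 0) == 1 for snp in rule)) for isolate in isolates]


# Rule reported in the example output above
rule = ['7570, C, T', '7572, T, C', '7581, G, T', '7582, A, C', '7582, A, G']

# Hypothetical isolates: SNP name -> 1 if present, 0 (or missing) if absent
isolates = [
    {'7570, C, T': 1, '1234, A, G': 1},  # carries one SNP from the rule
    {'1234, A, G': 1},                   # carries no SNP from the rule
    {'7581, G, T': 0, '7582, A, G': 1},  # carries one SNP from the rule
]

predictions = predict_with_rule(isolates, rule)
print(predictions)
```

Reading the rule this way is what makes the model interpretable: each positive prediction can be traced back to the specific SNPs that triggered it.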

### Hyper-parameter tuning

Hyper-parameter tuning via scikit-learn Grid Search CV:

**Example:**

```
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

import ingot

# SNP presence/absence matrix and drug resistance labels
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()

# Stratified 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)

clf = ingot.INGOTClassifier(false_positive_rate_upper_bound=0.1, max_rule_size=20, solver_name='CPLEX_PY',
                            solver_options={'timeLimit': 1800})

# Track both metrics; refit the final model on the best balanced accuracy
scoring = dict(Accuracy='accuracy', balanced_accuracy=make_scorer(balanced_accuracy_score))
param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print('Best params: {}'.format(grid.best_params_))
```

Output:

```
Balanced accuracy: 0.8449477351916377
Best params: {'lambda_p': 10, 'lambda_z': 0.01}
```
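
Because every fit solves an optimization problem that may run until the solver's time limit, it can help to estimate the size of a grid search before launching it. The sketch below enumerates the grid from the example with the standard library only; it relies on the fact that `GridSearchCV` evaluates every parameter combination once per fold. The variable names are illustrative.

```python
from itertools import product

param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}
cv_folds = 5

# Cartesian product of all parameter values, one dict per candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

# Each candidate is fitted once per cross-validation fold
n_cv_fits = len(candidates) * cv_folds
print(len(candidates), n_cv_fits)
```

With `refit` enabled, one additional fit on the whole training set follows the cross-validation fits.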

### Optimizing for different target metric

**Note:** *w_weight* and *lambda_e* are not part of the main ILP (Eq (11)) defined in the INGOT-DR paper. These two variables
are defined to provide freedom when *Optimizing for different target metric* (section 1.4) is needed. The
complete objective function with these two variables would be:

**Example:**
The classifier corresponding to Eq (16), with maximum rule size k = 20 and specificity lower bound t = 90%, can be defined as follows:

```
clf = ingot.INGOTClassifier(w_weight=0, lambda_z=0, false_positive_rate_upper_bound=0.1, max_rule_size=20,
solver_name='CPLEX_PY')
```

The following table shows the combinations of arguments needed to define some of the ILPs in the paper:

lp_relaxation | only_slack_lp_relaxation | is_it_noiseless | Equation number in the paper |
---|---|---|---|
False | False | False | Eq (11) |
False | True | True | Eq (3) |
False | True | False | Eq (4) with objective function of Eq (11) |
False | False | True | Eq (3) |
True | True | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | True | LP relaxation of Eq (3) |
True | True | True | LP relaxation of Eq (3) |

**Note:** A True value of *lp_relaxation* or *is_it_noiseless* overrides *only_slack_lp_relaxation*, i.e. if either of them is True,
the value of *only_slack_lp_relaxation* is ignored.
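
The table and the override note can be condensed into a small lookup. The helper below is a hypothetical illustration and not part of the `ingot` package; it only restates the table as code, so that the effect of each flag combination can be checked quickly.

```python
def formulation(lp_relaxation=False, only_slack_lp_relaxation=False, is_it_noiseless=False):
    """Return the formulation from the paper selected by a flag combination."""
    if is_it_noiseless:
        # The noiseless problem ignores only_slack_lp_relaxation
        base = 'Eq (3)'
    elif lp_relaxation or only_slack_lp_relaxation:
        base = 'Eq (4) with objective function of Eq (11)'
    else:
        base = 'Eq (11)'
    # lp_relaxation relaxes the whole problem, whatever the other flags are
    return 'LP relaxation of ' + base if lp_relaxation else base


print(formulation())                               # default flags
print(formulation(only_slack_lp_relaxation=True))  # row 3 of the table
print(formulation(lp_relaxation=True, is_it_noiseless=True))
```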

**Note:** To recreate and work with Eq (4), you only need to use the combination in row 3 of the table above and use or tune `lambda_e` instead of `lambda_p` and `lambda_z`. For example:

```
param_grid={'lambda_e': [0.01, 0.1, 1, 10, 100 ]}
grid = GridSearchCV(estimator=clf, param_grid= param_grid, scoring=scoring, cv=5, refit ='balanced_accuracy',
n_jobs=-1, verbose= 3)
```

### Choosing the solver

INGOT-DR supports a variety of solvers through the PuLP application programming interface (API), such as GLPK, COIN-OR CLP/CBC, CPLEX, GUROBI, MOSEK, XPRESS, CHOCO, MIPCL and SCIP.

To list the solvers available on your machine:

```
import pulp as pl

# Names of the solvers that PuLP can find on this machine
solver_list = pl.listSolvers(onlyAvailable=True)
print(solver_list)
```

The name and properties of the solver can be specified via `solver_name` and `solver_options`, e.g.:

```
clf = ingot.INGOTClassifier(solver_name='CPLEX_PY', solver_options={'timeLimit': 1800})
```

In the INGOT-DR paper, `'CPLEX_PY'` is the main solver. Results may vary slightly between solvers. IBM CPLEX for academic use is available
here.

## Citation

For general use, please cite our paper: INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis. (bibtex)

