Ruleset covering algorithms for explainable machine learning
wittgenstein
And is there not also the case where we play and--make up the rules as we go along?
-Ludwig Wittgenstein
Summary
This package implements two iterative coverage-based ruleset algorithms: IREP and RIPPERk.
Performance is similar to sklearn's DecisionTree CART implementation (see Performance Tests).
For explanation of the algorithms, see my article in Towards Data Science, or the papers below, under Useful References.
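To give a feel for what "coverage-based" means, here is a minimal sketch of the basic separate-and-conquer loop: grow a rule, remove the positive examples it covers, and repeat. It is a simplified illustration, not this package's implementation. It scores candidate conditions by plain precision, whereas IREP and RIPPER use an information-gain criterion, prune each rule on held-out data, and (for RIPPER) add an optimization stage; all of that is left out here. The function names and toy data are made up for the example.

```python
def covers(rule, example):
    """A rule is a dict of feature -> value; it covers an example if all conditions hold."""
    return all(example[feat] == val for feat, val in rule.items())

def precision(rule, pos, neg):
    """Fraction of the examples covered by the rule that are positive."""
    p = sum(covers(rule, ex) for ex in pos)
    n = sum(covers(rule, ex) for ex in neg)
    return p / (p + n) if p + n else 0.0

def grow_rule(pos, neg):
    """Greedily add conditions until the rule covers no negative examples."""
    rule = {}
    while any(covers(rule, ex) for ex in neg):
        # Candidate conditions come from positives the rule still covers
        candidates = {(f, v) for ex in pos if covers(rule, ex)
                      for f, v in ex.items() if f not in rule}
        if not candidates:
            break  # no features left to add
        best = max(sorted(candidates),
                   key=lambda fv: precision({**rule, fv[0]: fv[1]}, pos, neg))
        rule[best[0]] = best[1]
    return rule

def sequential_cover(pos, neg):
    """Build a ruleset one rule at a time, dropping covered positives after each rule."""
    ruleset, remaining = [], list(pos)
    while remaining:
        rule = grow_rule(remaining, neg)
        if any(covers(rule, ex) for ex in neg):
            break  # could not separate the remaining positives from the negatives
        ruleset.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return ruleset

# Toy data (hypothetical): three positive examples, one negative
pos = [{"fee": "n", "budget": "y"}, {"fee": "n", "budget": "n"}, {"fee": "y", "budget": "y"}]
neg = [{"fee": "y", "budget": "n"}]

ruleset = sequential_cover(pos, neg)
print(ruleset)
```

Every positive example ends up covered by at least one rule, and no negative example is covered by any.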
Installation
To install, use
$ pip install wittgenstein
To uninstall, use
$ pip uninstall wittgenstein
Requirements
- pandas
- numpy
- python version>=3.6
Usage
Usage syntax is similar to sklearn's.
Training
Once you have loaded and split your data...
>>> import pandas as pd
>>> df = pd.read_csv(dataset_filename)
>>> from sklearn.model_selection import train_test_split # Or any other mechanism you want to use for data partitioning
>>> train, test = train_test_split(df, test_size=.33)
Use the fit method to train a RIPPER or IREP classifier:
>>> import wittgenstein as lw
>>> ripper_clf = lw.RIPPER() # Or irep_clf = lw.IREP() to build a model using IREP
>>> ripper_clf.fit(train, class_feat='Party') # Or pass X and y data to .fit
>>> ripper_clf
<RIPPER with fit ruleset (k=2, prune_size=0.33, dl_allowance=64)> # Hyperparameter details available in the docstrings and TDS article below
Access the underlying trained model with the ruleset_ attribute, or output it with out_model(). A ruleset is a disjunction of conjunctions: 'V' represents 'or' and '^' represents 'and'.
In other words, the model predicts the positive class if every condition in at least one of the bracketed rules is true:
>>> ripper_clf.ruleset_
<Ruleset [physician-fee-freeze=n] V [synfuels-corporation-cutback=y^adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n]>
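If the "disjunction of conjunctions" reading is unfamiliar, the following sketch spells it out in plain Python. It mirrors the ruleset printed above with ordinary dicts and lists; these are stand-ins for illustration, not wittgenstein's own Ruleset or Rule classes, and the two example voters are invented.

```python
# Each rule is a dict of feature -> required value (a conjunction, '^').
# The ruleset is a list of rules (a disjunction, 'V').
ruleset = [
    {"physician-fee-freeze": "n"},
    {
        "synfuels-corporation-cutback": "y",
        "adoption-of-the-budget-resolution": "y",
        "anti-satellite-test-ban": "n",
    },
]

def rule_covers(rule, example):
    """True if every condition in the rule holds for the example."""
    return all(example.get(feat) == val for feat, val in rule.items())

def ruleset_predicts(ruleset, example):
    """True if at least one rule covers the example."""
    return any(rule_covers(rule, example) for rule in ruleset)

# Hypothetical examples
example_a = {"physician-fee-freeze": "n", "synfuels-corporation-cutback": "n"}
example_b = {
    "physician-fee-freeze": "y",
    "synfuels-corporation-cutback": "y",
    "adoption-of-the-budget-resolution": "y",
    "anti-satellite-test-ban": "y",
}

print(ruleset_predicts(ruleset, example_a))  # True: the first rule is satisfied
print(ruleset_predicts(ruleset, example_b))  # False: neither rule is fully satisfied
```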
IREP models tend to be higher bias, RIPPER models higher variance.
Scoring
To score the trained model, use the score method:
>>> X_test = test.drop('Party', axis=1)
>>> y_test = test['Party']
>>> ripper_clf.score(X_test, y_test)
0.9985686906328078
Default scoring metric is accuracy. You can pass in alternate scoring functions, including those available through sklearn:
>>> from sklearn.metrics import precision_score, recall_score
>>> precision = ripper_clf.score(X_test, y_test, precision_score)
>>> recall = ripper_clf.score(X_test, y_test, recall_score)
>>> print(f'precision: {precision}, recall: {recall}')
precision: 0.9914..., recall: 0.9953...
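Conceptually, passing a scoring function amounts to predicting on the test set and handing the true and predicted labels to that function. The sketch below shows the pattern with hand-written metrics and a toy predictor. It assumes the metric is called with true labels first and predictions second, as sklearn's metrics expect; the score helper and the toy data are hypothetical and are not wittgenstein's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """Of the examples predicted positive, the fraction that really are positive."""
    hits = [t for t, p in zip(y_true, y_pred) if p]
    return sum(hits) / len(hits) if hits else 0.0

def recall(y_true, y_pred):
    """Of the truly positive examples, the fraction predicted positive."""
    found = [p for t, p in zip(y_true, y_pred) if t]
    return sum(found) / len(found) if found else 0.0

def score(predict, X, y, score_function=accuracy):
    """Hypothetical helper: predict on X, then pass (y_true, y_pred) to the metric."""
    return score_function(y, predict(X))

def toy_predict(X):
    """Toy 'model': predicts positive when a single numeric feature exceeds 0.5."""
    return [x > 0.5 for x in X]

X_toy = [0.9, 0.8, 0.2, 0.7, 0.1]
y_toy = [True, True, False, False, False]

print(score(toy_predict, X_toy, y_toy))             # 0.8
print(score(toy_predict, X_toy, y_toy, precision))  # 2 of 3 positive predictions are correct
print(score(toy_predict, X_toy, y_toy, recall))     # 1.0
```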
Model selection
wittgenstein is compatible with sklearn model_selection tools such as cross_val_score and GridSearchCV, as well as ensemblers like StackingClassifier.
Cross validation:
>>> from sklearn.model_selection import cross_val_score
>>> X_train, y_train = train.drop('Party', axis=1), train['Party']
>>> # First dummify your categorical features and booleanize your class values to make sklearn happy
>>> X_train = pd.get_dummies(X_train, columns=X_train.select_dtypes('object').columns)
>>> y_train = y_train.map(lambda x: 1 if x=='democrat' else 0)
>>> cross_val_score(ripper_clf, X_train, y_train)
Grid search:
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {"prune_size": [0.33, 0.5], "k": [1, 2]}
>>> grid = GridSearchCV(estimator=ripper_clf, param_grid=param_grid)
>>> grid.fit(X_train, y_train)
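For reference, a grid like the one above expands to every combination of the listed values, so two values each for prune_size and k means four candidate models to fit and compare. The sketch below shows that expansion with the standard library only. expand_grid is a hypothetical helper, and the order in which sklearn itself visits the candidates may differ.

```python
from itertools import product

param_grid = {"prune_size": [0.33, 0.5], "k": [1, 2]}

def expand_grid(param_grid):
    """Return one dict of hyperparameters per combination in the grid."""
    names = list(param_grid)
    value_lists = (param_grid[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(param_grid)
for params in candidates:
    print(params)
# {'prune_size': 0.33, 'k': 1}
# {'prune_size': 0.33, 'k': 2}
# {'prune_size': 0.5, 'k': 1}
# {'prune_size': 0.5, 'k': 2}
```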
Ensemble:
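>>> from sklearn.ensemble import StackingClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> tree = DecisionTreeClassifier(random_state=42)
>>> nb = GaussianNB() # GaussianNB does not take a random_state argument
>>> estimators = [("rip", ripper_clf), ("tree", tree), ("nb", nb)]
>>> ensemble_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
>>> ensemble_clf.fit(X_train, y_train)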
Prediction
To perform predictions, use predict:
>>> ripper_clf.predict(new_data)[:5]
[True, True, False, True, False]
Predict class probabilities with predict_proba:
>>> ripper_clf.predict_proba(test)
# Pairs of negative and positive class probabilities
array([[0.01212121, 0.98787879],
[0.01212121, 0.98787879],
[0.77777778, 0.22222222],
[0.2 , 0.8 ],
...
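Because each row is a (negative, positive) pair, you can turn the probabilities into class predictions with whatever cutoff suits your problem. The sketch below does this on a plain list copied from the rows shown above. to_labels and its threshold argument are hypothetical helpers for this example, not part of wittgenstein.

```python
# Rows copied from the predict_proba output above: [P(negative), P(positive)]
probas = [
    [0.01212121, 0.98787879],
    [0.01212121, 0.98787879],
    [0.77777778, 0.22222222],
    [0.2, 0.8],
]

def to_labels(probas, threshold=0.5):
    """Predict positive when the positive-class probability reaches the threshold."""
    return [pos >= threshold for _neg, pos in probas]

print(to_labels(probas))                 # [True, True, False, True]
print(to_labels(probas, threshold=0.9))  # [True, True, False, False]
```

Raising the threshold trades recall for precision: the fourth example, at 0.8, is no longer predicted positive.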
We can also ask our model to tell us why it made each positive prediction using give_reasons:
>>> ripper_clf.predict(new_data[:5], give_reasons=True)
([True, True, False, True, False],
[[<Rule [physician-fee-freeze=n]>],
[<Rule [physician-fee-freeze=n]>,
<Rule [synfuels-corporation-cutback=y^adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n]>], # This example met multiple sufficient conditions for a positive prediction
[],
[<Rule [physician-fee-freeze=n]>],
[]])
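The reasons are simply the rules that cover each example: a positive prediction has at least one, and a negative prediction has none. A minimal sketch of that bookkeeping, using plain dicts as stand-ins for rules (the helper name, rules, and examples are hypothetical, and this is not the library's implementation):

```python
def predict_with_reasons(ruleset, examples):
    """Return (predictions, reasons), where reasons[i] lists the rules covering examples[i]."""
    predictions, reasons = [], []
    for example in examples:
        covering = [
            rule for rule in ruleset
            if all(example.get(feat) == val for feat, val in rule.items())
        ]
        predictions.append(bool(covering))
        reasons.append(covering)
    return predictions, reasons

# Hypothetical two-rule model and three examples
toy_ruleset = [{"fee-freeze": "n"}, {"synfuels": "y", "budget": "y"}]
toy_examples = [
    {"fee-freeze": "n", "synfuels": "n", "budget": "y"},  # meets the first rule only
    {"fee-freeze": "n", "synfuels": "y", "budget": "y"},  # meets both rules
    {"fee-freeze": "y", "synfuels": "y", "budget": "n"},  # meets neither rule
]

predictions, reasons = predict_with_reasons(toy_ruleset, toy_examples)
print(predictions)  # [True, True, False]
for rules in reasons:
    print(rules)
```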
Issues
If you encounter any issues, or if you have feedback or improvement requests for how wittgenstein could be more helpful for you, please post them to issues, and I'll respond.
Contributing
Contributions are welcome! If you are interested in contributing, let me know at ilan.moscovitz@gmail.com or on LinkedIn.
Useful references
- My article in Towards Data Science explaining IREP, RIPPER, and wittgenstein
- Fürnkranz-Widmer IREP paper
- Cohen's RIPPER paper
- Partial decision trees
- Bayesian Rulesets
- C4.5 paper including all the gory details on MDL
- Philosophical Investigations
Changelog
v0.2.1: 5/19/2020; v0.2.2: 5/21/2020
- Minor bugfixes and optimizations
v0.2.0: 5/4/2020
- Algorithmic optimizations to improve training speed (~10x - ~100x)
- Support for training on iterable datatypes besides DataFrames, such as numpy arrays and python lists
- Compatibility with sklearn ensembling metalearners and sklearn model_selection
- .predict_proba returns probas in neg, pos order
- Certain parameters (hyperparameters, random_state, etc.) should now be passed into IREP/RIPPER constructors rather than the .fit method.
- Sundry bugfixes