
wittgenstein

Ruleset covering algorithms for explainable machine learning

And is there not also the case where we play and--make up the rules as we go along? -Ludwig Wittgenstein

[Image: the duck-rabbit]

Summary

This package implements two iterative coverage-based ruleset algorithms: IREP and RIPPERk.

Performance is similar to sklearn's DecisionTree CART implementation (see Performance Tests).

For an explanation of the algorithms, see my article in Towards Data Science, or the papers listed below under Useful References.

Installation

To install, use

$ pip install wittgenstein

To uninstall, use

$ pip uninstall wittgenstein

Requirements

  • pandas
  • numpy
  • Python version >= 3.6

Usage

Training

Usage syntax is similar to sklearn's. Once you have loaded and split your data...

>>> import pandas as pd
>>> df = pd.read_csv(dataset_filename)
>>> from sklearn.model_selection import train_test_split # Or any other mechanism you want to use for data partitioning
>>> train, test = train_test_split(df, test_size=.33)

We can fit a ruleset classifier using RIPPER or IREP.

>>> import wittgenstein as lw
>>> ripper_clf = lw.RIPPER() # Or irep_clf = lw.IREP() to build a model using IREP
>>> ripper_clf.fit(train, class_feat='Party') # Or pass X and y data to .fit
>>> ripper_clf
<RIPPER with fit ruleset (k=2, prune_size=0.33, dl_allowance=64)> # Hyperparameter details available in the docstrings and TDS article below
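
As the comment above notes, you can also pass X and y data directly to .fit. A minimal sketch of that form, assuming a pos_class parameter names the positive class (check the docstrings for the exact signature your version accepts):

>>> X_train = train.drop('Party', axis=1)
>>> y_train = train['Party']
>>> ripper_clf.fit(X_train, y_train, pos_class='democrat') # pos_class assumed; see docstrings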

Access the underlying trained model with the .ruleset_ attribute, or output it with .out_model(). A ruleset is a disjunction of conjunctions -- 'V' represents 'or'; '^' represents 'and'.

In other words, the model predicts the positive class whenever all of the conditions in at least one of the bracketed rules are true:

>>> ripper_clf.ruleset_
<Ruleset [physician-fee-freeze=n] V [synfuels-corporation-cutback=y^adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n]>
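
You can also print the trained model with .out_model(). The layout below is a sketch rather than guaranteed output; expect the same rules, one disjunct per line:

>>> ripper_clf.out_model()
[[physician-fee-freeze=n] V
[synfuels-corporation-cutback=y ^ adoption-of-the-budget-resolution=y ^ anti-satellite-test-ban=n]]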

Scoring

To score our fit model:

>>> X_test = test.drop('Party', axis=1)
>>> y_test = test['Party']
>>> ripper_clf.score(X_test, y_test)
0.9985686906328078

The default scoring metric is accuracy. You can pass in alternate scoring functions, including those available through sklearn:

>>> from sklearn.metrics import precision_score, recall_score
>>> precision = ripper_clf.score(X_test, y_test, precision_score)
>>> recall = ripper_clf.score(X_test, y_test, recall_score)
>>> print(f'precision: {precision} recall: {recall}')
precision: 0.9914... recall: 0.9953...

Model selection

wittgenstein classifiers are also compatible with sklearn model_selection tools such as cross_val_score and GridSearchCV, as well as ensemblers like StackingClassifier.

Cross validation:

>>> from sklearn.model_selection import cross_val_score
>>> # First dummify your categorical features to make sklearn happy
>>> X_train = pd.get_dummies(X_train, columns=X_train.select_dtypes('object').columns)
>>> y_train = y_train.map(lambda x: 1 if x=='democrat' else 0)
>>> cross_val_score(ripper_clf, X_train, y_train)

Grid search:

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {"prune_size": [0.33, 0.5], "k": [1, 2]}
>>> grid = GridSearchCV(estimator=ripper_clf, param_grid=param_grid)
>>> grid.fit(X_train, y_train)
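
After the search finishes, the standard sklearn attributes are available (the values shown here are illustrative only):

>>> grid.best_params_
{'k': 2, 'prune_size': 0.33}
>>> best_ripper = grid.best_estimator_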

Ensemble:

>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.ensemble import StackingClassifier
>>> tree = DecisionTreeClassifier(random_state=42)
>>> estimators = [("rip", ripper_clf), ("tree", tree)]
>>> ensemble_clf = StackingClassifier(
...     estimators=estimators, final_estimator=LogisticRegression()
... )
>>> ensemble_clf.fit(X_train, y_train)
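
To score the stacked model, give the test set the same preprocessing as the training set. A sketch; reindexing against X_train.columns is an assumption to keep the dummy columns aligned between train and test:

>>> X_test_dum = pd.get_dummies(X_test, columns=X_test.select_dtypes('object').columns)
>>> X_test_dum = X_test_dum.reindex(columns=X_train.columns, fill_value=0) # Align columns with training data
>>> y_test_bin = y_test.map(lambda x: 1 if x=='democrat' else 0)
>>> ensemble_clf.score(X_test_dum, y_test_bin)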

Prediction

To perform predictions:

>>> ripper_clf.predict(new_data)[:5]
[True, True, False, True, False]

Predict class probabilities:

>>> ripper_clf.predict_proba(test)
# Pairs of negative and positive class probabilities
array([[0.01212121, 0.98787879],
       [0.01212121, 0.98787879],
       [0.77777778, 0.22222222],
       [0.2       , 0.8       ],
       ...
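
Because the columns come back in (negative, positive) order, the positive-class probabilities are just the second column of the returned array:

>>> ripper_clf.predict_proba(test)[:, 1] # Positive-class probabilities only
array([0.98787879, 0.98787879, 0.22222222, 0.8       , ...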

We can also ask our model to tell us why it made each positive prediction:

>>> ripper_clf.predict(new_data[:5], give_reasons=True)
([True, True, False, True, False],
 [[<Rule object: [physician-fee-freeze=n]>],
  [<Rule object: [physician-fee-freeze=n]>,
   <Rule object: [synfuels-corporation-cutback=y^adoption-of-the-budget-resolution=y^anti-satellite-test-ban=n]>], # This example met multiple sufficient conditions for a positive prediction
  [],
  [<Rule object: [physician-fee-freeze=n]>],
  []])

Issues

If you encounter any issues, or have feedback or feature requests that would make wittgenstein more helpful for you, please post them to the issue tracker, and I'll respond.

Changelog

v0.7.0: 5/4/2020

  • Algorithmic optimizations to improve training speed (~10x to ~100x)
  • Support for training on iterable datatypes besides DataFrames, such as numpy arrays and Python lists
  • Compatibility with sklearn ensembling metalearners and sklearn model_selection
  • .predict_proba now returns probabilities in (negative, positive) order
  • Certain parameters (hyperparameters, random_state, etc.) should now be passed into the IREP/RIPPER constructors rather than the .fit method
  • Sundry bugfixes

Contributing

Contributions are welcome! If you are interested in contributing, let me know at ilan.moscovitz@gmail.com or on LinkedIn.

Useful references

  • Cohen, W. W. (1995). Fast Effective Rule Induction. Proceedings of the Twelfth International Conference on Machine Learning. (RIPPER)
  • Fürnkranz, J., & Widmer, G. (1994). Incremental Reduced Error Pruning. Proceedings of the Eleventh International Conference on Machine Learning. (IREP)
