MSI based machine learning algorithms
msitrees
msitrees is a set of machine learning models based on the minimum surfeit and inaccuracy (MSI) decision tree algorithm. The main difference from other CART methods is that the base learner has no hyperparameters to optimize. The tree is regularized internally, so it avoids overfitting by design. Quoting the authors of the paper:
To achieve this, the algorithm must automatically understand when growing the decision tree adds needless complexity, and must measure such complexity in a way that is commensurate to some prediction quality aspect, e.g., inaccuracy. We argue that a natural way to achieve the above objectives is to define both the inaccuracy and the complexity using the concept of Kolmogorov complexity.
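Kolmogorov complexity itself is not computable, so a practical proxy is needed. The sketch below illustrates the idea in the quote using compressed length as that proxy. It is an assumption made for illustration, not the msitrees implementation: the function names, the string encoding of the trees, and the choice of zlib are all hypothetical.

```python
import zlib


def approx_complexity(data: bytes) -> int:
    """Upper-bound proxy for Kolmogorov complexity: zlib-compressed length."""
    return len(zlib.compress(data, 9))


def msi_like_cost(model_description: bytes, errors: bytes) -> int:
    """Hypothetical combined cost: complexity of the model plus complexity of its mistakes."""
    return approx_complexity(model_description) + approx_complexity(errors)


# Toy comparison: a one-split tree with a few mistakes
# versus a 40-split tree with none (both encodings are made up).
small_tree = b"x0<=2.5?A:B"
small_tree_errors = bytes([0] * 95 + [1] * 5)  # 5 mistakes in 100 samples

large_tree = b"".join(b"x%d<=%d.5?" % (i % 4, i) for i in range(40)) + b"A"
large_tree_errors = bytes(100)  # no mistakes

small_cost = msi_like_cost(small_tree, small_tree_errors)
large_cost = msi_like_cost(large_tree, large_tree_errors)
print(small_cost, large_cost)
```

Because both terms are measured in the same unit (compressed bytes), extra splits are only worth adding while they shrink the error term by more than they grow the model term.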
Installation
With pip
pip install msitrees
From source
git clone https://github.com/xadrianzetx/msitrees.git
cd msitrees
python setup.py install
Windows builds require at least MSVC2015
Quick start
from msitrees.tree import MSIDecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
data = load_iris()
clf = MSIDecisionTreeClassifier()
cross_val_score(clf, data['data'], data['target'], cv=10)
# array([1. , 1. , 1. , 0.93333333, 0.93333333,
# 0.8 , 0.93333333, 0.86666667, 0.8 , 1. ])
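To read the ten fold scores at a glance, they can be reduced to a mean and a spread. This is a small sketch using only the standard library, with the fold accuracies copied from the output above.

```python
from statistics import mean, stdev

# Fold accuracies copied from the quick start output above.
scores = [1.0, 1.0, 1.0, 0.93333333, 0.93333333,
          0.8, 0.93333333, 0.86666667, 0.8, 1.0]

mean_accuracy = mean(scores)
spread = stdev(scores)  # sample standard deviation across the 10 folds

print(f"accuracy: {mean_accuracy:.3f} +/- {spread:.3f}")
# accuracy: 0.927 +/- 0.080
```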
Reference documentation
API documentation is available here.
Zero hyperparameter based approach
An MSI based algorithm should perform comparably to a CART decision tree whose hyperparameters were tuned with some sort of search. We are going to compare MSIRandomForestClassifier with the scikit-learn implementation of the random forest algorithm, with hyperparameters searched using optuna. Both algorithms are limited to 100 estimators and compared by accuracy on a validation split of the handwritten digits dataset bundled with scikit-learn (load_digits).
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_digits()
x_train, x_valid, y_train, y_valid = train_test_split(data['data'], data['target'], random_state=42)

def objective(trial):
    # search space for the scikit-learn random forest
    params = {
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 10),
        'max_depth': trial.suggest_int('max_depth', 8, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
        'random_state': 42,
        'n_estimators': 100
    }
    clf = RandomForestClassifier(**params)
    clf.fit(x_train, y_train)
    pred = clf.predict(x_valid)
    score = accuracy_score(y_valid, pred)
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_jobs=-1, show_progress_bar=True, n_trials=500)

# fit benchmark model on best params
# (best_params holds only the tuned values; n_estimators and random_state fall back to defaults)
benchmark = RandomForestClassifier(**study.best_params)
benchmark = benchmark.fit(x_train, y_train)
pred = benchmark.predict(x_valid)
accuracy_score(y_valid, pred)
# 0.9711111111111111
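One detail in the benchmark refit is worth noting: optuna's study.best_params contains only the values suggested through trial.suggest_*, so the fixed random_state and n_estimators from the objective are not carried over. A minimal sketch of merging them back in is shown below. The helper name and the example tuned values are hypothetical and are not results from the study above.

```python
# Fixed settings used inside the objective function.
FIXED_PARAMS = {"random_state": 42, "n_estimators": 100}


def with_fixed_params(best_params, fixed=FIXED_PARAMS):
    """Merge tuned values with the fixed settings used during the search."""
    return {**fixed, **best_params}


# Hypothetical tuned values, for illustration only.
example_best = {"min_samples_leaf": 2, "max_depth": 12, "min_samples_split": 4}

params = with_fixed_params(example_best)
print(params)
```

The merged dictionary could then be passed as RandomForestClassifier(**params), so the refit uses the same seed and estimator count as the trials.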
Since the MSI based algorithm has no additional hyperparameters, the code is short.
from msitrees.ensemble import MSIRandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_digits()
x_train, x_valid, y_train, y_valid = train_test_split(data['data'], data['target'], random_state=42)

# only the ensemble size is set; there is nothing to tune for the base learner
clf = MSIRandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
pred = clf.predict(x_valid)
accuracy_score(y_valid, pred)
# 0.9733333333333334
Results for both random forest algorithms are comparable. Furthermore, the median depth of a tree estimator is equal for both methods, even though MSI has no explicit parameter controlling tree depth.
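import numpy as np

# median tree depth in the tuned scikit-learn forest
np.median([e.get_depth() for e in benchmark.estimators_])
# 12.0

# median tree depth in the MSI forest
np.median([e.get_depth() for e in clf._estimators])
# 12.0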
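For readers unfamiliar with the depth measure, the sketch below computes it on a toy forest. It assumes depth means the longest root-to-leaf distance, with a lone leaf at depth 0. The nested-dict tree layout is made up for illustration and is not how either library stores trees.

```python
from statistics import median


def tree_depth(node):
    """Longest root-to-leaf distance in a nested-dict tree; a lone leaf has depth 0."""
    children = [node[key] for key in ("left", "right") if node.get(key) is not None]
    if not children:
        return 0
    return 1 + max(tree_depth(child) for child in children)


leaf = {}
stump = {"left": leaf, "right": leaf}    # depth 1
deeper = {"left": stump, "right": leaf}  # depth 2

forest = [stump, deeper, deeper]
print(median(tree_depth(tree) for tree in forest))
# 2
```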