MSI based machine learning algorithms

Project description

msitrees

msitrees is a set of machine learning models based on the minimum surfeit and inaccuracy (MSI) decision tree algorithm. The main difference from other CART methods is that the base learner has no hyperparameters to optimize. The tree is regularized internally, so it avoids overfitting by design. Quoting the authors of the paper:

To achieve this, the algorithm must automatically understand when growing the decision tree adds needless complexity, and must measure such complexity in a way that is commensurate to some prediction quality aspect, e.g., inaccuracy. We argue that a natural way to achieve the above objectives is to define both the inaccuracy and the complexity using the concept of Kolmogorov complexity.
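
Kolmogorov complexity is not computable, so any practical method has to work with an approximation. The sketch below is not the cost function from the paper or from msitrees. It only illustrates the general idea, assuming zlib-compressed length as a rough stand-in for complexity and scoring a model as the complexity of its error pattern plus the complexity of its own description. The names approx_complexity and toy_cost are hypothetical.

```python
import random
import zlib


def approx_complexity(data: bytes) -> int:
    """Compressed size in bytes: a crude, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def toy_cost(y_true, y_pred, model_repr: str) -> int:
    """Hypothetical cost: complexity of the error pattern plus complexity of the model text."""
    errors = bytes(int(t != p) for t, p in zip(y_true, y_pred))
    inaccuracy = approx_complexity(errors)
    surfeit = approx_complexity(model_repr.encode("utf-8"))
    return inaccuracy + surfeit


rng = random.Random(0)
y_true = [rng.randrange(2) for _ in range(500)]

# A small rule that is wrong on every tenth sample (illustrative assumption).
small_pred = [1 - y if i % 10 == 0 else y for i, y in enumerate(y_true)]
small_model = "if x0 > 0.5: 1 else: 0"

# A "memorizing" model: zero errors, but its description lists every label.
big_pred = list(y_true)
big_model = ";".join(f"row{i}->{y}" for i, y in enumerate(y_true))

small_cost = toy_cost(y_true, small_pred, small_model)
big_cost = toy_cost(y_true, big_pred, big_model)
print(small_cost, big_cost)
```

Under this toy measure the memorizing model pays for its long description even though it makes no errors, which is the trade-off between inaccuracy and complexity that the quote describes.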

Installation

With pip

pip install msitrees

From source

git clone https://github.com/xadrianzetx/msitrees.git
cd msitrees
python setup.py install

Windows builds require at least MSVC2015
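
After either installation route, a quick way to confirm that the package is visible to the current interpreter is to ask the import system for it. This is a generic standard-library check, not something msitrees provides; is_installed is a hypothetical helper name.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package) is not None


# Prints True once msitrees has been installed into this environment.
print(is_installed("msitrees"))
```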

Quick start

from msitrees.tree import MSIDecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

data = load_iris()
clf = MSIDecisionTreeClassifier()
cross_val_score(clf, data['data'], data['target'], cv=10)

# array([1.        , 1.        , 1.        , 0.93333333, 0.93333333,
#        0.8       , 0.93333333, 0.86666667, 0.8       , 1.        ])
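
The ten fold scores are easier to compare against other models when reduced to a mean and a spread. A small sketch using only the standard library, with the scores copied from the output above:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
scores = [1.0, 1.0, 1.0, 0.93333333, 0.93333333,
          0.8, 0.93333333, 0.86666667, 0.8, 1.0]

mean_score = mean(scores)
spread = stdev(scores)  # sample standard deviation across folds

print(f"{mean_score:.3f}")
# 0.927
print(f"{spread:.3f}")
```

With the array returned by cross_val_score, scores.mean() and scores.std() give the same kind of summary directly (note that NumPy's std defaults to the population formula).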

Reference documentation

API documentation is available here.

Zero hyperparameter based approach

An MSI-based algorithm should perform comparably to a CART decision tree whose hyperparameters were tuned with some sort of search. We compare MSIRandomForestClassifier with the scikit-learn implementation of the random forest algorithm, with hyperparameters tuned using optuna. Both algorithms are limited to 100 estimators and are compared by accuracy on a validation split of the handwritten digits dataset bundled with scikit-learn (load_digits).

   import optuna
   from sklearn.ensemble import RandomForestClassifier

   from sklearn.datasets import load_digits
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import accuracy_score

   data = load_digits()
   x_train, x_valid, y_train, y_valid = train_test_split(data['data'], data['target'], random_state=42)

   def objective(trial):
      params = {
          'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 10),
          'max_depth': trial.suggest_int('max_depth', 8, 20),
          'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
          'random_state': 42,
          'n_estimators': 100
      }

      clf = RandomForestClassifier(**params)
      clf.fit(x_train, y_train)
      pred = clf.predict(x_valid)
      score = accuracy_score(y_valid, pred)

      return score

   study = optuna.create_study(direction='maximize')
   study.optimize(objective, n_jobs=-1, show_progress_bar=True, n_trials=500)

   # fit benchmark model on best params
   benchmark = RandomForestClassifier(**study.best_params)
   benchmark = benchmark.fit(x_train, y_train)

   pred = benchmark.predict(x_valid)
   accuracy_score(y_valid, pred)
   # 0.9711111111111111
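
One detail of the search above: optuna's study.best_params contains only the values suggested through trial.suggest_*, so the random_state and n_estimators set inside objective are not passed along when the benchmark is refit. If the refit should use exactly the configuration evaluated during the search, the fixed settings can be merged back in. A minimal sketch, with made-up values standing in for a real study; with_fixed_params is a hypothetical helper.

```python
def with_fixed_params(best_params: dict, fixed_params: dict) -> dict:
    """Return tuned values plus fixed settings; fixed settings win on a name clash."""
    return {**best_params, **fixed_params}


# Hypothetical values standing in for study.best_params.
best_params = {"min_samples_leaf": 2, "max_depth": 14, "min_samples_split": 3}
fixed_params = {"random_state": 42, "n_estimators": 100}

params = with_fixed_params(best_params, fixed_params)
print(params)
```

The merged dictionary could then be passed as RandomForestClassifier(**params). The accuracy reported above comes from the original refit, so a refit with a pinned random_state may give a slightly different number.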

Since the MSI-based algorithm has no additional hyperparameters, the code is short.

   from msitrees.ensemble import MSIRandomForestClassifier

   from sklearn.datasets import load_digits
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import accuracy_score

   data = load_digits()
   x_train, x_valid, y_train, y_valid = train_test_split(data['data'], data['target'], random_state=42)

   clf = MSIRandomForestClassifier(n_estimators=100)
   clf.fit(x_train, y_train)
   pred = clf.predict(x_valid)
   accuracy_score(y_valid, pred)
   # 0.9733333333333334

Results for both random forest algorithms are comparable. Furthermore, the median depth of a tree estimator is equal for both methods, even though MSI has no explicit parameter controlling tree depth.

   import numpy as np

   np.median([e.get_depth() for e in benchmark.estimators_])
   # 12.0
   np.median([e.get_depth() for e in clf._estimators])
   # 12.0
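
For readers unfamiliar with the metric: in scikit-learn, get_depth() returns the maximum distance between the root and any leaf, and the comparison above assumes the msitrees estimators report depth the same way. The sketch below shows the computation on a toy representation, where a leaf is any non-tuple value and a split is a (left, right) tuple. This representation is purely illustrative and is not how either library stores its trees.

```python
from statistics import median


def depth(node) -> int:
    """Depth of a toy tree: leaves have depth 0, each split adds one level."""
    if not isinstance(node, tuple):
        return 0
    left, right = node
    return 1 + max(depth(left), depth(right))


# Three hypothetical trees standing in for the estimators of a forest.
forest = [
    ("a", "b"),                        # one split
    (("a", "b"), "c"),                 # two levels of splits
    ((("a", "b"), "c"), ("d", "e")),   # three levels on the deepest path
]

depths = [depth(tree) for tree in forest]
median_depth = median(depths)
print(depths, median_depth)
```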

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • msitrees-0.2-cp38-cp38-win_amd64.whl (116.6 kB): CPython 3.8, Windows x86-64
  • msitrees-0.2-cp38-cp38-manylinux2010_x86_64.whl (2.1 MB): CPython 3.8, manylinux: glibc 2.12+ x86-64
  • msitrees-0.2-cp37-cp37m-win_amd64.whl (116.9 kB): CPython 3.7m, Windows x86-64
  • msitrees-0.2-cp37-cp37m-manylinux2010_x86_64.whl (2.2 MB): CPython 3.7m, manylinux: glibc 2.12+ x86-64
  • msitrees-0.2-cp36-cp36m-win_amd64.whl (116.9 kB): CPython 3.6m, Windows x86-64
  • msitrees-0.2-cp36-cp36m-manylinux2010_x86_64.whl (2.2 MB): CPython 3.6m, manylinux: glibc 2.12+ x86-64
  • msitrees-0.2-cp35-cp35m-win_amd64.whl (116.9 kB): CPython 3.5m, Windows x86-64
  • msitrees-0.2-cp35-cp35m-manylinux2010_x86_64.whl (2.2 MB): CPython 3.5m, manylinux: glibc 2.12+ x86-64
