Scikit-learn Wrapper for Regularized Greedy Forest

rgf_python

A Python wrapper for the Regularized Greedy Forest (RGF) [1] machine learning algorithm.

Features

Scikit-learn interface and support for multiclass classification problems.

rgf_python contains both the original RGF implementation from the paper [1] and the FastRGF implementation.

Note that FastRGF was developed for large (and sparse) datasets, so on small datasets it often performs worse than vanilla RGF.

The original RGF implementations support only regression and binary classification, but rgf_python also supports multiclass classification via the “One-vs-Rest” method.
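The “One-vs-Rest” idea can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the code rgf_python uses internally: the function names are hypothetical, and fit_centroid_scorer is a toy stand-in for a real binary learner such as RGF.

```python
def one_vs_rest_fit(X, y, fit_binary):
    """Train one binary scorer per class: class c versus all other classes."""
    models = {}
    for c in sorted(set(y)):
        binary_targets = [1 if label == c else 0 for label in y]
        models[c] = fit_binary(X, binary_targets)
    return models


def one_vs_rest_predict(models, x):
    """Return the class whose binary scorer gives the highest score for x."""
    return max(models, key=lambda c: models[c](x))


def fit_centroid_scorer(X, targets):
    """Toy stand-in for a binary learner (an assumption for this sketch).

    The score is the negative squared distance to the centroid of the
    positive samples, so points closer to the positives score higher.
    """
    positives = [row for row, t in zip(X, targets) if t == 1]
    n_features = len(X[0])
    centroid = [sum(row[j] for row in positives) / float(len(positives))
                for j in range(n_features)]

    def score(x):
        return -sum((a - b) ** 2 for a, b in zip(x, centroid))

    return score


# Three well-separated classes in two dimensions.
X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [0.0, 9.0], [0.3, 9.1]]
y = [0, 0, 1, 1, 2, 2]

models = one_vs_rest_fit(X, y, fit_centroid_scorer)
predictions = [one_vs_rest_predict(models, x) for x in X]
```

With a real binary learner in place of the toy scorer, the same pattern turns any binary classifier into a multiclass one at the cost of training one model per class.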

Examples

from sklearn import datasets
from sklearn.utils.validation import check_random_state
from sklearn.model_selection import StratifiedKFold, cross_val_score
from rgf.sklearn import RGFClassifier

iris = datasets.load_iris()
# Shuffle the samples with a fixed seed so the run is reproducible
rng = check_random_state(0)
perm = rng.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

rgf = RGFClassifier(max_leaf=400,
                    algorithm="RGF_Sib",
                    test_interval=100,
                    verbose=True)

n_folds = 3

rgf_scores = cross_val_score(rgf,
                             iris.data,
                             iris.target,
                             cv=StratifiedKFold(n_folds))

rgf_score = sum(rgf_scores)/n_folds
print('RGF Classifier score: {0:.5f}'.format(rgf_score))

More examples of using RGF estimators can be found here.

Examples of using FastRGF estimators can be found here.

Software Requirements

  • Python (2.7 or >= 3.5)

  • scikit-learn (>= 0.18)

Installation

From PyPI using pip:

pip install rgf_python

or from GitHub:

git clone https://github.com/RGF-team/rgf.git
cd rgf/python-package
python setup.py install

macOS users: rgf_python versions after 3.5.0 are built with g++-9 and cannot run on systems with g++-8 or earlier. Either update your g++ compiler, build from source, or install rgf_python 3.5.0 from PyPI, which is the last version built with g++-8.

macOS users: rgf_python versions after 3.1.0 are built with g++-8 and cannot run on systems with g++-7 or earlier. Either update your g++ compiler, build from source, or install rgf_python 3.1.0 from PyPI, which is the last version built with g++-7.

If you have problems installing by the methods listed above, build the RGF and FastRGF executable files from source yourself and place the compiled executables in a directory listed in the PATH environment variable or in the directory of the installed package. Alternatively, you may specify the actual locations of the executable files and the directory for temporary files with the corresponding flags in the configuration file .rgfrc, which you should create in your home directory. The default values are: rgf_location=$HOME/rgf ($HOME/rgf.exe on Windows), fastrgf_location=$HOME, temp_location=tempfile.gettempdir() (more details about tempfile.gettempdir() are here). Here is an example of the .rgfrc file:

rgf_location=C:/Program Files/RGF/bin/rgf.exe
fastrgf_location=C:/Program Files/FastRGF/bin
temp_location=C:/Program Files/RGF/temp

Note that while rgf_location should point to a specific RGF executable file, fastrgf_location should point to the folder that contains the forest_train.exe and forest_predict.exe FastRGF executable files.
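To make the key=value layout of .rgfrc concrete, here is a minimal sketch of how such a file could be read, falling back to the defaults listed above. The read_rgfrc function is hypothetical and written only for illustration; it is not the loader rgf_python itself uses, and the demo writes to a throwaway directory rather than to your home directory.

```python
import os
import shutil
import tempfile


def read_rgfrc(path):
    """Parse a key=value config file, keeping documented defaults for missing keys."""
    home = os.path.expanduser("~")
    config = {
        # Defaults as described in the text above.
        "rgf_location": os.path.join(home, "rgf.exe" if os.name == "nt" else "rgf"),
        "fastrgf_location": home,
        "temp_location": tempfile.gettempdir(),
    }
    if not os.path.isfile(path):
        return config
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            key, value = line.split("=", 1)
            key = key.strip()
            if key in config:  # ignore unknown keys
                config[key] = value.strip()
    return config


# Demo: write a sample file to a temporary directory and read it back.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".rgfrc")
with open(demo_path, "w") as f:
    f.write("rgf_location=C:/Program Files/RGF/bin/rgf.exe\n")
    f.write("fastrgf_location=C:/Program Files/FastRGF/bin\n")

settings = read_rgfrc(demo_path)
shutil.rmtree(demo_dir)
```

Because temp_location is absent from the demo file, settings["temp_location"] keeps the tempfile.gettempdir() default.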

You may also request installation without automatic compilation:

pip install rgf_python --install-option=--nocompilation

or

git clone https://github.com/RGF-team/rgf.git
cd rgf/python-package
python setup.py install --nocompilation

sudo (or administrator privileges in Windows) may be needed to perform installation commands.

Detailed guides on building the RGF and FastRGF executable files from source can be found in their folders here and here, respectively.

Docker image

We provide a Docker image with rgf_python installed.

# Run docker image
docker run -it rgfteam/rgf /bin/bash
# Run RGF example
python ./rgf/python-package/examples/RGF/comparison_RGF_and_RF_regressors_on_boston_dataset.py
# Run FastRGF example
python ./rgf/python-package/examples/FastRGF/FastRGF_classifier_on_iris_dataset.py

Tuning Hyperparameters

RGF

You can tune hyperparameters as follows.

  • max_leaf: Appropriate values are data-dependent and usually vary from 1000 to 10000.

  • test_interval: For efficiency, it must be either a multiple or a divisor of 100 (the default value of the optimization interval).

  • algorithm: You can select “RGF”, “RGF_Opt” or “RGF_Sib”.

  • loss: You can select “LS”, “Log”, “Expo” or “Abs”.

  • reg_depth: Must be no smaller than 1. Meant to be used with algorithm = “RGF_Opt” or “RGF_Sib”.

  • l2: Either 1, 0.1, or 0.01 often produces good results, though with exponential loss (loss = “Expo”) and logistic loss (loss = “Log”) some data requires smaller values such as 1e-10 or 1e-20.

  • sl2: Default value is equal to l2. On some data, l2/100 works well.

  • normalize: If turned on, training targets are normalized so that the average becomes zero.

  • min_samples_leaf: Smaller values may slow down training. Too large values may degrade model accuracy.

  • n_iter: Number of iterations of coordinate descent to optimize weights.

  • n_tree_search: Number of trees to be searched for the nodes to split. The most recently grown trees are searched first.

  • opt_interval: Weight optimization interval in terms of the number of leaf nodes.

  • learning_rate: Step size of Newton updates used in coordinate descent to optimize weights.

Detailed instructions for tuning hyperparameters are here.
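As a rough starting point, the guidance above can be turned into a list of candidate settings. The sketch below is an assumption-laden illustration, not a recommended search space: the helper names and the specific max_leaf values are made up for the example, and the algorithm names are written with underscores as in the Examples section. Each dictionary uses the parameter names from the list above, so it could be passed to an estimator as keyword arguments.

```python
from itertools import product


def is_efficient_test_interval(test_interval, opt_interval=100):
    """True if test_interval is a multiple or a divisor of opt_interval."""
    return test_interval % opt_interval == 0 or opt_interval % test_interval == 0


def rgf_candidates(test_interval=100):
    """Enumerate candidate RGF settings based on the tuning notes above."""
    if not is_efficient_test_interval(test_interval):
        raise ValueError("test_interval should be a multiple or divisor of 100")
    max_leaf_values = [1000, 5000, 10000]       # illustrative picks from 1000..10000
    l2_values = [1.0, 0.1, 0.01]                # values that often work well
    algorithms = ["RGF", "RGF_Opt", "RGF_Sib"]
    candidates = []
    for max_leaf, l2, algorithm in product(max_leaf_values, l2_values, algorithms):
        for sl2 in (l2, l2 / 100):              # default sl2 = l2, or l2/100
            candidates.append({
                "max_leaf": max_leaf,
                "l2": l2,
                "sl2": sl2,
                "algorithm": algorithm,
                "test_interval": test_interval,
            })
    return candidates


candidates = rgf_candidates()
```

Trying every candidate can be slow on large data, so in practice you would likely sample from this list or narrow it after a first pass.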

FastRGF

  • n_estimators: Typical range is [100, 10000], and a typical value is 1000.

  • max_depth: Controls the tree depth.

  • max_leaf: Controls the tree size.

  • tree_gain_ratio: Controls when to start a new tree.

  • min_samples_leaf: Controls the tree growth process.

  • loss: You can select “LS”, “MODLS” or “LOGISTIC”.

  • l1: Typical range is [0, 1000], and a large value induces sparsity.

  • l2: Use a relatively large value such as 1000 or 10000. The larger the value, the larger the n_estimators you need: the resulting accuracy is often better, at the cost of a longer training time.

  • opt_algorithm: You can select “rgf” or “epsilon-greedy”.

  • learning_rate: Step size of epsilon-greedy boosting. Meant to be used with opt_algorithm = “epsilon-greedy”.

  • max_bin: Typical range for dense data is [10, 65000] and for sparse data is [10, 250].

  • min_child_weight: Controls the process of discretization (creating bins).

  • data_l2: Controls the degree of L2 regularization for discretization (creating bins).

  • sparse_max_features: Typical range is [1000, 10000000]. Meant to be used with sparse data.

  • sparse_min_occurences: Controls which features will be selected. Meant to be used with sparse data.
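The typical ranges quoted in the list above can be encoded as a quick sanity check before fitting. This is only a sketch: outside_typical_range is a hypothetical helper, and a value outside a typical range is not necessarily wrong for your data.

```python
# Typical ranges taken from the FastRGF list above.
TYPICAL_RANGES = {
    "n_estimators": (100, 10000),
    "l1": (0, 1000),
    "sparse_max_features": (1000, 10000000),
}


def outside_typical_range(params, sparse=False):
    """Return the names of parameters whose values fall outside the typical ranges."""
    ranges = dict(TYPICAL_RANGES)
    # max_bin has a much narrower typical range for sparse data.
    ranges["max_bin"] = (10, 250) if sparse else (10, 65000)
    flagged = []
    for name, (low, high) in sorted(ranges.items()):
        if name in params and not (low <= params[name] <= high):
            flagged.append(name)
    return flagged


params = {"n_estimators": 1000, "l1": 0, "l2": 1000, "max_bin": 1000}
dense_flags = outside_typical_range(params)
sparse_flags = outside_typical_range(params, sparse=True)
```

Here the same max_bin value passes for dense data but is flagged for sparse data, which mirrors the two ranges given for max_bin.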

Using at Kaggle Kernels

Kaggle Kernels support rgf_python. Please see this page.

Troubleshooting

If you encounter an error, please try running test_rgf_python.py to confirm that the package was installed successfully.

Then feel free to open a new issue.

Known Issues

  • FastRGF crashes if the training dataset is too small (#data < 28). (rgf#92)

  • FastRGFClassifier and FastRGFRegressor do not provide any built-in method to calculate feature importances. (rgf#109)
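Until the first issue is resolved, one option is to check the dataset size yourself before fitting a FastRGF estimator. The guard below is a hypothetical helper, not part of rgf_python; the threshold of 28 comes from the issue description above.

```python
MIN_FASTRGF_SAMPLES = 28  # FastRGF is reported to crash below this many samples


def ensure_enough_samples(n_samples):
    """Raise a clear error instead of letting FastRGF crash on tiny datasets."""
    if n_samples < MIN_FASTRGF_SAMPLES:
        raise ValueError(
            "FastRGF needs at least {0} training samples, got {1}".format(
                MIN_FASTRGF_SAMPLES, n_samples))
    return n_samples


checked = ensure_enough_samples(150)  # e.g. the size of the iris dataset
```

Note that with cross-validation the check applies to each training fold, not to the full dataset.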

FAQ

  • Q: Temporary files use too much space on my hard drive (Kaggle Kernels disk space is exhausted while fitting an rgf_python model).

    A: Please see rgf#75.

  • Q: GridSearchCV/RandomizedSearchCV/RFECV or another scikit-learn tool with an n_jobs parameter hangs/freezes/crashes when run with an rgf_python estimator.

    A: This is a known general problem of multiprocessing in Python. You should set n_jobs=1 on either the estimator or the scikit-learn tool.

License

rgf_python is distributed under the MIT license. Please read the LICENSE file for more information.

Many thanks to Rie Johnson and Tong Zhang (the authors of RGF).

Other

Shamelessly, some parts of the implementation are based on the following code. Thanks!

References

[1] Rie Johnson and Tong Zhang. Learning Nonlinear Functions Using Regularized Greedy Forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):942-954, May 2014.
