Skip to main content

Scikit-learn Wrapper for Regularized Greedy Forest

Project description

[![Build Status Travis](https://travis-ci.org/fukatani/rgf_python.svg?branch=master)](https://travis-ci.org/fukatani/rgf_python)
[![Build Status AppVeyor](https://ci.appveyor.com/api/projects/status/vpanb9hnncjr16hn/branch/master?svg=true)](https://ci.appveyor.com/project/fukatani/rgf-python)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/fukatani/rgf_python/blob/master/LICENSE)
[![GitHub Version](https://badge.fury.io/gh/fukatani%2Frgf_python.svg)](https://github.com/fukatani/rgf_python/blob/master/rgf/VERSION)

# rgf_python
The wrapper of machine learning algorithm ***Regularized Greedy Forest (RGF)*** for Python.

## Features

##### Scikit-learn interface and possibility of usage for multi-label classification problem.

Original RGF implementation is available only for regression and binary classification, but rgf_python is **also available for multi-label classification** by "One-vs-Rest" method.

Example:
```python
from sklearn import datasets
from sklearn.utils.validation import check_random_state
from sklearn.model_selection import StratifiedKFold, cross_val_score
from rgf.sklearn import RGFClassifier

iris = datasets.load_iris()
rng = check_random_state(0)
perm = rng.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

rgf = RGFClassifier(max_leaf=400,
algorithm="RGF_Sib",
test_interval=100,
verbose=True)

n_folds = 3

rgf_scores = cross_val_score(rgf,
iris.data,
iris.target,
cv=StratifiedKFold(n_folds))

rgf_score = sum(rgf_scores)/n_folds
print('RGF Classfier score: {0:.5f}'.format(rgf_score))
```
More examples could be found [here](https://github.com/fukatani/rgf_python/tree/master/examples).

## Software Requirements

* Python (2.7 or >= 3.4)
* scikit-learn (>= 0.18)
* RGF C++ ([link](http://tongzhang-ml.org/software/rgf/index.html))

If you can't access the above URL, alternatively, you can get RGF C++ by downloading it from this [page](https://github.com/fukatani/rgf_python/releases/download/0.2.0/rgf1.2.zip).
Please see README in the zip file to build RGF executional.


## Installation

```
git clone https://github.com/fukatani/rgf_python.git
python setup.py install
```
or using `pip`:
```
pip install git+git://github.com/fukatani/rgf_python@master
```

You have to place RGF execution file in directory which is included in **environmental variable 'PATH'.**
Or you can direct specify path by **manual editing rgf/sklearn.py**

```python
## Edit this ##################################################
#Location of the RGF executable
loc_exec = 'C:\\Program Files\\RGF\\bin\\rgf.exe'
#Location for RGF temp files
loc_temp = 'temp/'
## End Edit ##################################################
```

You need to set actual location of RGF execution file by editing 'loc_exec'.
And the variable 'loc_temp' can be changed to specify the directory for placing temp files.

## Tuning Hyper-parameters
You can tune hyper-parameters as follows.

* _max_leaf_: Appropriate values are data-dependent and usually varied from 1000 to 10000.

* _test_interval_: For efficiency, it must be either multiple or divisor of 100 (default value of the optimization interval).

* _algorithm_: You can select "RGF", "RGF Opt" or "RGF Sib".

* _loss_: You can select "LS", "Log" or "Expo".

* _reg_depth_: Must be no smaller than 1. Meant for being used with _algorithm_ = "RGF Opt" or "RGF Sib".

* _l2_: Either 1, 0.1, or 0.01 often produces good results though with exponential loss (_loss_ = "Expo") and logistic loss (_loss_ = "Log"), some data requires smaller values such as 1e-10 or 1e-20.

* _sl2_: Default value is equal to _l2_. On some data, _l2_/100 works well.

* _normalize_: If turned on, training targets are normalized so that the average becomes zero.

* _min_samples_leaf_: Smaller values may slow down training. Too large values may degrade model accuracy.

* _n_iter_: Number of iterations of coordinate descent to optimize weights.

* _n_tree_search_: Number of trees to be searched for the nodes to split. The most recently grown trees are searched first.

* _opt_interval_: Weight optimization interval in terms of the number of leaf nodes.

* _learning_rate_: Step size of Newton updates used in coordinate descent to optimize weights.

Detailed instruction of tuning hyper-parameters is [here](http://tongzhang-ml.org/software/rgf/rgf1.2-guide.pdf).

## Using at Kaggle Kernel
Now, Kaggle Kernel supports rgf_python. Please see [this page](https://www.kaggle.com/fukatani/d/uciml/iris/classification-by-regularized-greedy-forest).

## Other
Shamelessly, much part of the implementation is based on the following [code](https://github.com/MLWave/RGF-sklearn). Thanks!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rgf_python-1.2.2.tar.gz (9.5 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page