
Sklearn forests with partial fits


Incremental trees


Adds a .partial_fit() method to sklearn's forest estimators (currently RandomForestClassifier/Regressor and ExtraTreesClassifier/Regressor) to allow incremental training without being limited to linear models. Works with or without Dask-ml's Incremental.

These methods don't try to implement partial fitting for individual decision trees; rather, they remove the requirement that all decision trees within a forest are trained on the same data (or equally sized bootstraps). This reduces memory burden, training time, and variance, at the cost of generally requiring a larger number of weak learners.

The resulting forests are not "true" online learners, as batch size affects performance. However, they should have similar (possibly better) performance to their standard versions after seeing an equivalent number of training rows.

Installing package

Quick start:

  1. Install from PyPI:
     pip install incremental_trees

Usage Examples

Currently implemented:

  • Streaming versions of RandomForestClassifier (StreamingRFC) and ExtraTreesClassifier (StreamingEXTC). These should work for binary and multi-class classification, but not yet for multi-output.
  • Streaming versions of RandomForestRegressor (StreamingRFR) and ExtraTreesRegressor (StreamingEXTR); a regressor sketch follows this list.
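
The regressors take the same streaming parameters as the classifiers but need no classes argument. A minimal sketch of StreamingRFR using the .fit() feeding mechanism described below; the import path is assumed by analogy with the classifier modules:

import numpy as np
from sklearn.datasets import make_regression
# Import path assumed to mirror the classifier module layout
from incremental_trees.models.regression.streaming_rfr import StreamingRFR

# Generate some regression data in memory
x, y = make_regression(n_samples=int(2e5), random_state=0, n_features=40)

srfr = StreamingRFR(n_estimators_per_chunk=3,
                    max_n_estimators=np.inf,
                    spf_n_fits=30,  # Number of calls to .partial_fit()
                    spf_sample_prop=0.3)  # Proportion of rows sampled per call

srfr.fit(x, y)  # No classes to specify for regressors

# Should be n_estimators_per_chunk * spf_n_fits
print(len(srfr.estimators_))
print(srfr.score(x, y))  # R^2 on the training data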


Data feeding mechanisms

Fitting with .fit()

Feeds .partial_fit() with randomly sampled rows.

import numpy as np
from sklearn.datasets import make_blobs
from incremental_trees.models.classification.streaming_rfc import StreamingRFC

# Generate some data in memory
x, y = make_blobs(n_samples=int(2e5), random_state=0, n_features=40, centers=2, cluster_std=100)

srfc = StreamingRFC(n_estimators_per_chunk=3,
                    max_n_estimators=np.inf,
                    spf_n_fits=30,  # Number of calls to .partial_fit()
                    spf_sample_prop=0.3)  # Proportion of rows to sample on each .partial_fit() call

srfc.fit(x, y, sample_weight=np.ones_like(y))  # Optional, gets sampled along with the data

# Should be n_estimators_per_chunk * spf_n_fits
print(len(srfc.estimators_))
print(srfc.score(x, y))

Fitting with .fit() and Dask

Call .fit() directly and let Dask handle sending data to .partial_fit().

import numpy as np
import dask_ml.datasets
from dask_ml.wrappers import Incremental
from dask.distributed import Client, LocalCluster
from dask import delayed
from incremental_trees.models.classification.streaming_rfc import StreamingRFC

# Generate some data out-of-core
x, y = dask_ml.datasets.make_blobs(n_samples=int(2e5), chunks=int(1e4), random_state=0,
                                   n_features=40, centers=2, cluster_std=100)

# Create throwaway cluster and client to run on                                  
with LocalCluster(processes=False, n_workers=2, 
                  threads_per_worker=2) as cluster, Client(cluster) as client:

    # Wrap model with Dask Incremental
    srfc = Incremental(StreamingRFC(dask_feeding=True,  # Turn dask on
                                    n_estimators_per_chunk=10,
                                    max_n_estimators=np.inf,
                                    n_jobs=4))
    
    # Call fit directly, specifying the expected classes
    srfc.fit(x, y,
             classes=delayed(np.unique)(y).compute())
             
    print(len(srfc.estimators_))
    print(srfc.score(x, y))

Feeding .partial_fit() manually

.partial_fit() can be called directly and fed data manually.

For example, this can be used to feed .partial_fit() sequentially (although the example below selects random rows, similar to the non-Dask example above; a strictly sequential variant follows it).

import numpy as np
from sklearn.datasets import make_blobs
from incremental_trees.models.classification.streaming_rfc import StreamingRFC

srfc = StreamingRFC(n_estimators_per_chunk=20,
                    max_n_estimators=np.inf,
                    n_jobs=4)

# Generate some data in memory
x, y = make_blobs(n_samples=int(2e5), random_state=0, n_features=40,
                  centers=2, cluster_std=100)

# Feed .partial_fit() with random samples of the data
n_chunks = 30
chunk_size = int(2e3)
for i in range(n_chunks):
    sample_idx = np.random.randint(0, x.shape[0], chunk_size)
    # Call .partial_fit(), specifying the expected classes; other .fit() args
    # such as sample_weight are also supported
    srfc.partial_fit(x[sample_idx, :], y[sample_idx],
                     classes=np.unique(y))

# Should be n_chunks * n_estimators_per_chunk             
print(len(srfc.estimators_))
print(srfc.score(x, y))
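
The same loop can instead walk the rows in order, feeding strictly sequential, non-overlapping chunks (closer to a real streaming setting). A minimal sketch, assuming the rows are shuffled or i.i.d.; otherwise early trees only ever see early data:

import numpy as np
from sklearn.datasets import make_blobs
from incremental_trees.models.classification.streaming_rfc import StreamingRFC

x, y = make_blobs(n_samples=int(2e5), random_state=0, n_features=40,
                  centers=2, cluster_std=100)

srfc = StreamingRFC(n_estimators_per_chunk=4,
                    max_n_estimators=np.inf,
                    n_jobs=4)

# One non-overlapping chunk per .partial_fit() call, in row order.
# In a true streaming setting, pass the known set of expected classes up
# front rather than computing np.unique(y) over data not yet seen.
chunk_size = int(1e4)
for start in range(0, x.shape[0], chunk_size):
    srfc.partial_fit(x[start:start + chunk_size, :],
                     y[start:start + chunk_size],
                     classes=np.unique(y))

# Should be (n_samples / chunk_size) * n_estimators_per_chunk
print(len(srfc.estimators_))
print(srfc.score(x, y))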

Possible model setups

There are a couple of different model setups worth considering; it's not yet clear which works best.

"Incremental forest"

For each chunk/fit, sample rows from X and fit a number of single trees (each with a different column subset), e.g.:

srfc = StreamingRFC(n_estimators_per_chunk=10, max_features='sqrt')    

"Incremental decision trees"

Single (or few) decision trees per data subset, with all features.

srfc = StreamingRFC(n_estimators_per_chunk=1, max_features=x.shape[1])
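
One rough way to pick between the two setups is to fit each on the same training split and compare hold-out scores. A sketch using the .fit() feeding mechanism from above (parameter values are illustrative, not tuned):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from incremental_trees.models.classification.streaming_rfc import StreamingRFC

x, y = make_blobs(n_samples=int(2e5), random_state=0, n_features=40,
                  centers=2, cluster_std=100)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

setups = {
    'incremental forest': StreamingRFC(n_estimators_per_chunk=10,
                                       max_features='sqrt',
                                       max_n_estimators=np.inf,
                                       spf_n_fits=30,
                                       spf_sample_prop=0.3),
    'incremental decision trees': StreamingRFC(n_estimators_per_chunk=1,
                                               max_features=x_train.shape[1],
                                               max_n_estimators=np.inf,
                                               spf_n_fits=30,
                                               spf_sample_prop=0.3),
}

# Fit both on the same split and compare hold-out accuracy
for name, model in setups.items():
    model.fit(x_train, y_train)
    print(name, model.score(x_test, y_test))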

Version history

v0.6.0

  • Update to work with scikit-learn==1.2, dask==2022.12, dask-glm==0.2.0, dask-ml==2022.5.27. Supports Python 3.8 and 3.9.

v0.5.1

  • Add support for passing fit args/kwargs via .fit() (specifically, sample_weight)

v0.5.0

  • Add support for passing fit args/kwargs via .partial_fit() (specifically, sample_weight)

v0.4.0

  • Refactor and tidy, try with new versions of Dask/sklearn

v0.3.1-3

  • Update Dask versions

v0.3.0

  • Updated unit tests
  • Added performance benchmark tests for classifiers (not finished)
  • Added regressor versions of RandomForest (StreamingRFR) and ExtraTrees (StreamingEXTR); also renamed StreamingEXT to StreamingEXTC
  • Overloaded .fit() to handle feeding .partial_fit() with random row samples, without using Dask; adds compatibility with sklearn SearchCV objects

v0.2.0

  • Add ExtraTreesClassifier (StreamingEXT)

v0.1.0

  • .partial_fit() for RandomForestClassifier (StreamingRFC)
  • .predict_proba() for RandomForestClassifier
