Skip to main content

Distributed scikit-learn meta-estimators with PySpark

Project description

License PyPI Package Downloads Python Versions

What is it?

sk-dist is a Python package for machine learning built on top of scikit-learn and is distributed under the Apache 2.0 software license. The sk-dist module can be thought of as “distributed scikit-learn” as its core functionality is to extend the scikit-learn built-in joblib parallelization of meta-estimator training to spark. A popular use case is the parallelization of grid search as shown here:

Check out the blog post for more information on the motivation and use cases of sk-dist.

Main Features

  • Distributed Training - sk-dist parallelizes the training of scikit-learn meta-estimators with PySpark. This allows distributed training of these estimators without any constraint on the physical resources of any one machine. In all cases, spark artifacts are automatically stripped from the fitted estimator. These estimators can then be pickled and un-pickled for prediction tasks, operating identically at predict time to their scikit-learn counterparts. Supported tasks are:

  • Distributed Prediction - sk-dist provides a prediction module which builds vectorized UDFs for PySpark DataFrames using fitted scikit-learn estimators. This distributes the predict and predict_proba methods of scikit-learn estimators, enabling large scale prediction with scikit-learn.

  • Feature Encoding - sk-dist provides a flexible feature encoding utility called Encoderizer which encodes mix-typed feature spaces using either default behavior or user defined customizable settings. It is particularly aimed at text features, but it additionally handles numeric and dictionary type feature spaces.

Installation

Dependencies

sk-dist requires:

Dependency Notes

  • versions of numpy, scipy and joblib that are compatible with any supported version of scikit-learn should be sufficient for sk-dist

  • sk-dist is not supported with Python 2

Spark Dependencies

Most sk-dist functionality requires a spark installation as well as PySpark. Some functionality can run without spark, so spark related dependencies are not required. The connection between sk-dist and spark relies solely on a sparkContext as an argument to various sk-dist classes upon instantiation.

A variety of spark configurations and setups will work. It is left up to the user to configure their own spark setup. The testing suite runs spark 2.4 and spark 3.0, though any spark 2.0+ versions are expected to work.

Additional spark related dependecies are pyarrow, which is used only for skdist.predict functions. This uses vectorized pandas UDFs which require pyarrow>=0.8.0, tested with pyarrow==0.16.0. Depending on the spark version, it may be necessary to set spark.conf.set("spark.sql.execution.arrow.enabled", "true") in the spark configuration.

User Installation

The easiest way to install sk-dist is with pip:

pip install --upgrade sk-dist

You can also download the source code:

git clone https://github.com/Ibotta/sk-dist.git

Testing

With pytest installed, you can run tests locally:

pytest sk-dist

Examples

The package contains numerous examples on how to use sk-dist in practice. Examples of note are:

Gradient Boosting

sk-dist has been tested with a number of popular gradient boosting packages that conform to the scikit-learn API. This includes xgboost and catboost. These will need to be installed in addition to sk-dist on all nodes of the spark cluster via a node bootstrap script. Version compatibility is left up to the user.

Support for lightgbm is not guaranteed, as it requires additional installations on all nodes of the spark cluster. This may work given proper installation but has not beed tested with sk-dist.

Background

The project was started at Ibotta Inc. on the machine learning team and open sourced in 2019.

It is currently maintained by the machine learning team at Ibotta. Special thanks to those who contributed to sk-dist while it was initially in development at Ibotta:

Thanks to James Foley for logo artwork.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sk_dist_ems-0.1.11.tar.gz (46.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sk_dist_ems-0.1.11-py3-none-any.whl (57.6 kB view details)

Uploaded Python 3

File details

Details for the file sk_dist_ems-0.1.11.tar.gz.

File metadata

  • Download URL: sk_dist_ems-0.1.11.tar.gz
  • Upload date:
  • Size: 46.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for sk_dist_ems-0.1.11.tar.gz
Algorithm Hash digest
SHA256 3fecb72e412ed5a0417d5d66e0e6943124b825fed70d95873a8149784cd9ec90
MD5 4f92b9bcc4d855dbe289dd145b1dc58d
BLAKE2b-256 0b84882bc3fe7d8573efed91129c3d4e884bd86448be0c1128e36b5bc903ae5e

See more details on using hashes here.

File details

Details for the file sk_dist_ems-0.1.11-py3-none-any.whl.

File metadata

  • Download URL: sk_dist_ems-0.1.11-py3-none-any.whl
  • Upload date:
  • Size: 57.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for sk_dist_ems-0.1.11-py3-none-any.whl
Algorithm Hash digest
SHA256 877b132fb7c8c654f7bd6c8b9c3db93bdf832a8480c0435fd8871dec72dff13e
MD5 74c59d2a3f166d4a2eb6550d39d1f6f7
BLAKE2b-256 8ff3e91a78c0a652fb60e41d430ac88fede671c9bdd97827024588dc952a5a93

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page