Spark acceleration for Scikit-Learn cross validation techniques
Spark acceleration for Scikit-Learn
This project is a major rewrite of the spark-sklearn project, which appears to be no longer under development. It focuses specifically on accelerating Scikit-Learn's cross validation functionality using PySpark.
Improvements over spark-sklearn
- scikit-spark supports scikit-learn versions past 0.19; the spark-sklearn maintainers have stated that they are probably not going to support newer versions.
- The functionality in scikit-spark is based on the sklearn.model_selection module rather than the deprecated and soon to be removed sklearn.grid_search. The new model_selection versions contain several nicer features, and scikit-spark maintains full compatibility with them.
Installation
The package can be installed through pip:
pip install scikit-spark
It has so far only been tested with Spark 2.2.0 and up, but may work with older versions.
Supported scikit-learn versions
- 0.18 untested, likely doesn't work
- 0.19 supported
- 0.20 supported
- 0.21 in progress
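As a quick way to check an environment against the list above, here is a minimal sketch that tests whether a scikit-learn version string falls in one of the supported series. The helper `is_supported` and the constant `SUPPORTED_SERIES` are hypothetical names introduced for illustration and are not part of scikit-spark; the sketch assumes the version string starts with `major.minor`.

```python
import re

# Series marked "supported" in the list above (0.21 is still in progress).
SUPPORTED_SERIES = {(0, 19), (0, 20)}


def is_supported(version):
    """Return True if the version string belongs to a supported series.

    Hypothetical helper: only the leading "major.minor" part is inspected.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return (int(match.group(1)), int(match.group(2))) in SUPPORTED_SERIES


print(is_supported("0.20.3"))  # True
print(is_supported("0.21.2"))  # False
```

In practice the string would come from `sklearn.__version__`.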
Usage
The functionality here is meant to resemble using Scikit-Learn as closely as possible. By default (with spark=True) the SparkSession is obtained internally by calling SparkSession.builder.getOrCreate(), so the instantiation and calling of the functions is the same as in Scikit-Learn (you will preferably have already created a SparkSession).
This example is adapted from the Scikit-Learn documentation. It instantiates a local SparkSession and uses it to distribute the cross validation folds and iterations. In actual use, to get the benefit of this package, it should be run distributed across several machines with Spark, as running it locally is slower than the Scikit-Learn parallelisation implementation.
from sklearn import svm, datasets
from pyspark.sql import SparkSession

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [0.01, 0.1, 1, 10, 100]}
svc = svm.SVC()

spark = SparkSession.builder\
    .master("local[*]")\
    .appName("skspark-grid-search-doctests")\
    .getOrCreate()

# How to run grid search
from skspark.model_selection import GridSearchCV
gs = GridSearchCV(svc, parameters)
gs.fit(iris.data, iris.target)
# How to run random search
from skspark.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(svc, parameters)
rs.fit(iris.data, iris.target)
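
To get a feel for how much work there is to distribute in the example above, the following sketch counts the independent fit-and-score tasks that a grid search over `parameters` generates: one per parameter combination per cross validation fold. It uses only the standard library. The helper names `expand_grid` and `build_tasks` and the fold count of 3 are assumptions made for illustration, and the sketch says nothing about how scikit-spark actually schedules this work on Spark.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


def build_tasks(param_grid, n_folds):
    """Pair each candidate with each fold index (hypothetical helper)."""
    return [(params, fold)
            for params in expand_grid(param_grid)
            for fold in range(n_folds)]


parameters = {'kernel': ('linear', 'rbf'), 'C': [0.01, 0.1, 1, 10, 100]}
candidates = expand_grid(parameters)
# The fold count is an assumption chosen for illustration.
tasks = build_tasks(parameters, n_folds=3)

print(len(candidates))  # 10 candidates: 2 kernels x 5 values of C
print(len(tasks))       # 30 independent fits
```

Each of these tasks can be evaluated independently of the others, which is what makes the search a natural fit for spreading across several machines.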
Current and upcoming functionality
- Current
  - model_selection.RandomizedSearchCV
  - model_selection.GridSearchCV
- Upcoming
  - model_selection.cross_val_predict
  - model_selection.cross_val_score
The docstrings are modifications of the Scikit-Learn ones and are still being converted to specifically refer to this project.
Performance optimisations
Reducing RAM usage
Coming soon
File details
Details for the file scikit-spark-0.2.0.tar.gz.
File metadata
- Download URL: scikit-spark-0.2.0.tar.gz
- Upload date:
- Size: 26.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3539e4f5490acd106825cf2819ca07e312ed7e57533e092cd7a4f0fc50562bea
MD5 | 1a79de57b476478ff3aa36b2ebdc730e
BLAKE2b-256 | e8e4312fecf0cb4e1518dc4c62d29b71a52c9351fc6053534a98cb2dc7bcd08f
File details
Details for the file scikit_spark-0.2.0-py3-none-any.whl.
File metadata
- Download URL: scikit_spark-0.2.0-py3-none-any.whl
- Upload date:
- Size: 46.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1596f0ae99f666b284d765997915e7987170ad5b4b62fe5713723a610d36f2a1
MD5 | 59aa199aae107214d4aa94b6042e07f7
BLAKE2b-256 | 126e20ecfe9dc9f95963827cc581d38e1fca8d5897f1e7b1ccb2dcb2cd75fb9c