RayDP: Distributed Data Processing on Ray

RayDP is a distributed data processing library that provides simple APIs for running Spark on Ray and integrating Spark with distributed deep learning and machine learning frameworks. RayDP makes it simple to build distributed end-to-end data analytics and AI pipelines. Instead of writing lots of glue code or relying on an orchestration framework to stitch together multiple distributed programs, RayDP lets you write Spark, PyTorch, TensorFlow, and XGBoost code in a single Python program with increased productivity and performance. You can build an end-to-end pipeline on a single Ray cluster by using Spark for data preprocessing, RaySGD or Horovod for distributed deep learning, RayTune for hyperparameter tuning, and RayServe for model serving.

Spark on Ray

RayDP provides an API for starting a Spark job on Ray in your Python program without needing to set up a Spark cluster manually. RayDP supports Ray as a Spark resource manager and runs Spark executors in Ray actors. RayDP utilizes Ray's in-memory object store to efficiently exchange data between Spark and other Ray libraries. You can use Spark to read the input data, process the data using SQL, Spark DataFrame, or Pandas (via Koalas) APIs, extract and transform features using Spark MLlib, and feed the output to deep learning and machine learning frameworks.
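
For example, here is a minimal sketch of starting Spark on Ray and running a SQL query (the application name and table are illustrative; the init_spark arguments follow the Getting Started example below):

import ray
import raydp

ray.init(address='auto')

# Spark executors run as Ray actors; the resources below are requested
# from the Ray cluster rather than from a standalone Spark cluster.
spark = raydp.init_spark('sql_example',
                         num_executors=2,
                         executor_cores=2,
                         executor_memory='1G')

# Use the returned SparkSession as usual, e.g. with Spark SQL.
df = spark.range(0, 1000)
df.createOrReplaceTempView('numbers')
evens = spark.sql('SELECT id FROM numbers WHERE id % 2 = 0')
print(evens.count())

raydp.stop_spark()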

Integrating Spark with Deep Learning and Machine Learning Frameworks

MLDataset API

RayDP provides an API for creating a Ray MLDataset from a Spark DataFrame. MLDataset represents a distributed dataset stored in Ray's in-memory object store. It supports transformations on each shard and can be converted to a PyTorch or TensorFlow dataset for distributed training. If you prefer to use Horovod on Ray or RaySGD for distributed training, you can use MLDataset to seamlessly integrate Spark with them.
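
As a rough sketch of how this fits together (the RayMLDataset.from_spark helper, its import path, and the to_torch conversion shown here are assumptions that may differ across RayDP versions):

import ray
import raydp
from raydp.spark import RayMLDataset  # assumed import path for this sketch

ray.init(address='auto')
spark = raydp.init_spark('mldataset_example',
                         num_executors=2,
                         executor_cores=2,
                         executor_memory='1G')

df = spark.range(0, 100).selectExpr('CAST(id AS FLOAT) AS x',
                                    'CAST(id * 2 AS FLOAT) AS y')

# Create a sharded MLDataset backed by Ray's in-memory object store
# (the method name and num_shards argument are assumptions).
ds = RayMLDataset.from_spark(df, num_shards=2)

# Convert to a PyTorch-compatible dataset for distributed training
# (to_torch and its arguments are likewise assumptions).
torch_ds = ds.to_torch(feature_columns=['x'], label_column='y')

raydp.stop_spark()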

Estimator API

RayDP also provides high-level, scikit-learn-style Estimator APIs for distributed training. The Estimator APIs allow you to train a deep neural network directly on a Spark DataFrame, leveraging Ray's ability to scale out across the cluster. The Estimator APIs are wrappers around RaySGD and hide the complexity of converting a Spark DataFrame to a PyTorch/TensorFlow dataset and distributing the training.
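
A minimal sketch of the Estimator workflow on a toy DataFrame (the num_workers, loss, feature_columns, label_column, batch_size, and num_epochs keyword arguments are assumptions for illustration and may vary by version):

import ray
import raydp
import torch
from raydp.torch import TorchEstimator

ray.init(address='auto')
spark = raydp.init_spark('estimator_example',
                         num_executors=2,
                         executor_cores=2,
                         executor_memory='1G')

# A toy Spark DataFrame where the label is z = x + y.
train_df = spark.range(0, 1000).selectExpr('CAST(id AS FLOAT) AS x',
                                           'CAST(id * 2 AS FLOAT) AS y',
                                           'CAST(id * 3 AS FLOAT) AS z')

# An ordinary PyTorch model and optimizer.
model = torch.nn.Sequential(torch.nn.Linear(2, 1))
optimizer = torch.optim.Adam(model.parameters())

# Train directly on the Spark DataFrame; the keyword names beyond
# model/optimizer are assumptions for this sketch.
estimator = TorchEstimator(num_workers=2,
                           model=model,
                           optimizer=optimizer,
                           loss=torch.nn.MSELoss(),
                           feature_columns=['x', 'y'],
                           label_column='z',
                           batch_size=64,
                           num_epochs=5)
estimator.fit_on_spark(train_df)

raydp.stop_spark()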

Installation

You can install the latest RayDP release using pip. RayDP requires Ray (>= 1.1.0) and PySpark (>= 3.0.0). Please also make sure Java is installed and JAVA_HOME is set properly.

pip install raydp

Or you can install our nightly build:

pip install raydp-nightly

If you'd like to build and install the latest master, use the following commands:

./build.sh
pip install dist/raydp*.whl

Getting Started

To start a Spark job on Ray, you can use the raydp.init_spark API. You can write Spark, PyTorch/TensorFlow, and Ray code in the same Python program to easily implement an end-to-end pipeline.

Classic Spark Word Count Example

After we use RayDP to initialize a Spark cluster, we can use Spark as usual.

import ray
import raydp

ray.init(address='auto')

spark = raydp.init_spark('word_count',
                         num_executors=2,
                         executor_cores=2,
                         executor_memory='1G')

df = spark.createDataFrame([('look',), ('spark',), ('tutorial',), ('spark',), ('look', ), ('python', )], ['word'])
df.show()
word_count = df.groupBy('word').count()
word_count.show()

raydp.stop_spark()

Integration with PyTorch

Beyond standard Spark workloads, RayDP can be combined with other Ray components, such as RaySGD and RayServe, to build an end-to-end deep learning pipeline. In this example, we show how to use our Estimator API, a wrapper around RaySGD, to preprocess data with Spark and train a model with PyTorch.

import ray
import raydp
import torch
from raydp.torch import TorchEstimator

ray.init()
spark = raydp.init_spark(app_name="RayDP example",
                         num_executors=2,
                         executor_cores=2,
                         executor_memory="4GB")

# Spark DataFrame code: read the input data and derive training features.
df = spark.read.parquet(...)
train_df = df.withColumn(...)

# PyTorch code: define an ordinary model and optimizer.
model = torch.nn.Sequential(torch.nn.Linear(2, 1))
optimizer = torch.optim.Adam(model.parameters())

# You can use the RayDP Estimator API or libraries like RaySGD for distributed training.
estimator = TorchEstimator(model=model, optimizer=optimizer, ...)
estimator.fit_on_spark(train_df)

raydp.stop_spark()

More Examples

Not sure how to use RayDP? Check the examples folder. We have added many examples showing how RayDP works together with PyTorch, TensorFlow, XGBoost, Horovod, and so on. If you still cannot find what you want, feel free to open an issue and ask us!
