Skip to main content

RayDP: Distributed Data Processing on Ray

Project description

RayDP

RayDP brings popular big data frameworks including Apache Spark to Ray ecosystem and integrates with other Ray libraries seamlessly. RayDP makes it simple to build distributed end-to-end data analytics and AI pipeline on Ray by using Spark for data preprocessing, RayTune for hyperparameter tunning, RaySGD for distributed deep learning, RLlib for reinforcement learning and RayServe for model serving.

stack

Key Features

Spark on Ray

RayDP enables you to start a Spark job on Ray in your python program without a need to setup a Spark cluster manually. RayDP supports Ray as a Spark resource manger and starts Spark executors using Ray actor directly. RayDP utilizes Ray's in-memory object store to efficiently exchange data between Spark and other Ray libraries. You can use Spark to read the input data, process the data using SQL, Spark DataFrame, or Pandas (via Koalas) API, extract and transform features using Spark MLLib, and use RayDP Estimator API for distributed training on the preprocessed dataset.

Estimator APIs for Distributed Training

RayDP provides high level scikit-learn style Estimator APIs for distributed training. The Estimator APIs allow you to train a deep neural network directly on a Spark DataFrame, leveraging Ray’s ability to scale out across the cluster. The Estimator APIs are wrappers of RaySGD and hide the complexity of converting a Spark DataFrame to a PyTorch/Tensorflow dataset and distributing the training.

Installation

You can install latest RayDP using pip. RayDP requires Ray (>=1.1.0) and PySpark (3.0.0 or 3.0.1).

pip install raydp

If you'd like to build and install the latest master, use the following command:

./build.sh
pip install dist/raydp*.whl

Getting Started

To start a Spark job on Ray, you can use the raydp.init_spark API. You can write Spark, PyTorch/Tensorflow, Ray code in the same python program to easily implement an end to end pipeline.

import ray
import raydp
from raydp.torch import TorchEstimator

ray.init()
spark = raydp.init_spark(app_name="RayDP example",
                         num_executors=2,
                         executor_cores=2,
                         executor_memory="4GB")

# Spark DataFrame Code 
df = spark.read.parquet() 
train_df = df.withColumn()

# PyTorch Code 
model = torch.nn.Sequential(torch.nn.Linear(2, 1)) 
optimizer = torch.optim.Adam(model.parameters())

# You can use the RayDP Estimator API or libraries like RaySGD for distributed training.
estimator = TorchEstimator(model=model, optimizer=optimizer, ...) 
estimator.fit_on_spark(train_df)

You can find more examples under the examples folder.

Project details


Release history Release notifications | RSS feed

This version

0.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

raydp-0.1.0-py3-none-any.whl (10.1 MB view details)

Uploaded Python 3

raydp-0.1-py3-none-any.whl (10.1 MB view details)

Uploaded Python 3

File details

Details for the file raydp-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: raydp-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 10.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for raydp-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cc1b16083fc234dba6e4f4bae8df623e87d7a1721c3500076012f3e7edcd921b
MD5 e2bfd7330d37d37e44e84e66e66b4265
BLAKE2b-256 168e6a26f5a67f1f7c8f9c0828edec5b37d6f83993ba9f4bca168544d0cc6084

See more details on using hashes here.

File details

Details for the file raydp-0.1-py3-none-any.whl.

File metadata

  • Download URL: raydp-0.1-py3-none-any.whl
  • Upload date:
  • Size: 10.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for raydp-0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c00d4f6ee066a88a624e487c3dd3f0fd64d6370244f03a780a8590cefa7de613
MD5 0a49f300844a940e1f4eeb0f04322f9b
BLAKE2b-256 906409a50f1af912b6850ca3967a4ef469d84bbe8d755fa0548dcbd358c4b262

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page