Skip to main content

RaySQL: DataFusion on Ray

Project description

RaySQL: DataFusion on Ray

This is a research project to evaluate performing distributed SQL queries from Python, using Ray and DataFusion.

Goals

  • Demonstrate how easily new systems can be built on top of DataFusion. See the design documentation to understand how RaySQL works.
  • Drive requirements for DataFusion's Python bindings.
  • Create content for an interesting blog post or conference talk.

Non Goals

  • Build and support a production system.

Example

Run the following example live in your browser using a Google Colab notebook.

import ray
from raysql.context import RaySqlContext
from raysql.worker import Worker

# Start our cluster
ray.init()

# create some remote Workers
workers = [Worker.remote() for i in range(2)]

# create a remote context and register a table
ctx = RaySqlContext.remote(workers)
ray.get(ctx.register_csv.remote('tips', 'tips.csv', True))

# Parquet is also supported
# ctx.register_parquet('tips', 'tips.parquet')

result_set = ray.get(ctx.sql.remote('select sex, smoker, avg(tip/total_bill) as tip_pct from tips group by sex, smoker'))
print(result_set)

Status

  • RaySQL can run 21 of the 22 TPC-H benchmark queries (query 15 needs DDL and that is not yet supported).

Features

  • Mature SQL support (CTEs, joins, subqueries, etc) thanks to DataFusion
  • Support for CSV and Parquet files

Limitations

  • Requires a shared file system currently

Performance

This chart shows the performance of RaySQL compared to Apache Spark for SQLBench-H at a very small data set (10GB), running on my desktop (Threadripper with 24 physical cores). Both RaySQL and Spark are configured with 24 executors.

Note that query 15 is excluded from both results since RaySQL does not support DDL yet.

Overall Time

RaySQL is ~65% faster overall for this scale factor and environment.

SQLBench-H Total

Per Query Time

Spark is much faster on some queries, likely due to broadcast exchanges, which RaySQL hasn't implemented yet.

SQLBench-H Per Query

Performance Plan

I'm planning on experimenting with the following changes to improve performance:

  • Make better use of Ray futures to run more tasks in parallel
  • Use Ray object store for shuffle data transfer to reduce disk I/O cost
  • Keep upgrading to newer versions of DataFusion to pick up the latest optimizations

Building

# prepare development environment (used to build wheel / install in development)
python3 -m venv venv
# activate the venv
source venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies (for Python 3.8+)
python -m pip install -r requirements-in.txt

Whenever rust code changes (your changes or via git pull):

# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest

Benchmarking

Create a release build when running benchmarks, then use pip to install the wheel.

maturin develop --release

How to update dependencies

To change test dependencies, change the requirements.in and run

# install pip-tools (this can be done only once), also consider running in venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt

To update dependencies, run with -U

python -m piptools compile -U --generate-hashes -o requirements-310.txt

More details here

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

raysql-0.5.0.tar.gz (127.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

raysql-0.5.0-cp37-abi3-manylinux_2_34_x86_64.whl (13.9 MB view details)

Uploaded CPython 3.7+manylinux: glibc 2.34+ x86-64

File details

Details for the file raysql-0.5.0.tar.gz.

File metadata

  • Download URL: raysql-0.5.0.tar.gz
  • Upload date:
  • Size: 127.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/0.14.15

File hashes

Hashes for raysql-0.5.0.tar.gz
Algorithm Hash digest
SHA256 e3aee31c2bb3240766d7597b4b3a211c7bf1ce0336c9808dc3ea865fad686ff2
MD5 90ed1ff4970ab5a27cf7be053b5472fa
BLAKE2b-256 eb20f5fbc6c166c0109b149c688bff3cccf4ebd3eca7412983ae22cc14d803b8

See more details on using hashes here.

File details

Details for the file raysql-0.5.0-cp37-abi3-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for raysql-0.5.0-cp37-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 2a85e9e54543c2d8c2baa1040c48349aca6cce6ec5a1eaa6aea7f53c0e74f949
MD5 aa89d8071cee75713a6b5b29c1e414ae
BLAKE2b-256 d20b9e0b88bfaee749f4bd3c64e4241464e35c5337e0c124652f8d57aa83ce4a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page