PathsData Distributed Query Engine - Python client for distributed SQL execution

PathsData Distributed

This project is versioned and released independently from the main Ballista project and is intentionally not part of the default Cargo workspace so that it doesn't cause overhead for maintainers of the main Ballista codebase.

Creating a SessionContext

[!IMPORTANT] The current approach is to support the DataFusion Python API. This approach has known limitations, and some cases produce errors. We are trying to find the best way to support the DataFusion Python interface. More details can be found in #1142.

Creates a new context and connects to a Ballista scheduler process.

>>> from pathsdata_distributed import BallistaBuilder
>>> ctx = BallistaBuilder().standalone()

Example SQL Usage

>>> ctx.sql("create external table t stored as parquet location './testdata/test.parquet'")
>>> df = ctx.sql("select * from t limit 5")
>>> pyarrow_batches = df.collect()
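When several files need to be registered, the `create external table` statement can be generated instead of typed by hand. The following is a minimal sketch: `external_table_ddl` is a hypothetical helper, not part of `pathsdata_distributed`, and it only formats the SQL string used in the example.

```python
import re

# Conservative pattern for table and format names (an assumption of this sketch).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def external_table_ddl(table: str, location: str, file_format: str = "parquet") -> str:
    """Build a `create external table` statement as a plain string.

    Hypothetical helper: it does not contact the query engine.
    """
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsupported table name: {table!r}")
    if not _IDENTIFIER.fullmatch(file_format):
        raise ValueError(f"unsupported file format: {file_format!r}")
    # Double any single quotes so the path stays inside the SQL string literal.
    escaped = location.replace("'", "''")
    return f"create external table {table} stored as {file_format} location '{escaped}'"


ddl = external_table_ddl("t", "./testdata/test.parquet")
print(ddl)
```

The resulting string can then be passed to `ctx.sql(...)` in the same way as the hand-written statement.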

Example DataFrame Usage

>>> df = ctx.read_parquet('./testdata/test.parquet').limit(5)
>>> pyarrow_batches = df.collect()
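Assuming `collect()` returns a list of `pyarrow.RecordBatch` objects, as the variable name suggests, the result can be summarized with ordinary Python. The sketch below counts rows across batches; `total_rows` is a hypothetical helper, and the stand-in objects only mimic the `num_rows` attribute of a record batch so that the snippet runs without a cluster.

```python
from types import SimpleNamespace


def total_rows(batches) -> int:
    """Sum the `num_rows` attribute over an iterable of record batches."""
    return sum(batch.num_rows for batch in batches)


# Stand-ins for record batches; real code would pass the result of df.collect().
fake_batches = [SimpleNamespace(num_rows=3), SimpleNamespace(num_rows=2)]
print(total_rows(fake_batches))
```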

Scheduler and Executor

The scheduler and executors can be configured and started from Python code.

To start the scheduler:

from pathsdata_distributed import BallistaScheduler

scheduler = BallistaScheduler()

scheduler.start()
scheduler.wait_for_termination()

To start an executor:

from pathsdata_distributed import BallistaExecutor

executor = BallistaExecutor()

executor.start()
executor.wait_for_termination()
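Both snippets follow the same start-then-wait pattern, so a small wrapper can serve either role. This is a sketch under stated assumptions: `run_service` and `_FakeService` are hypothetical names, only the `start()` and `wait_for_termination()` methods shown above are used, and whether Ctrl+C actually reaches Python while the real service is waiting has not been verified here.

```python
def run_service(service) -> None:
    """Start a scheduler or executor object and block until it terminates."""
    service.start()
    try:
        service.wait_for_termination()
    except KeyboardInterrupt:
        # Ctrl+C: exit quietly instead of printing a traceback.
        print("interrupted, shutting down")


class _FakeService:
    """Stand-in exposing the two methods used above, so the sketch runs anywhere."""

    def __init__(self) -> None:
        self.calls = []

    def start(self) -> None:
        self.calls.append("start")

    def wait_for_termination(self) -> None:
        self.calls.append("wait_for_termination")


fake = _FakeService()
run_service(fake)
print(fake.calls)
```

In real use, `run_service(BallistaScheduler())` or `run_service(BallistaExecutor())` would replace the stand-in object.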

Development Process

Creating Virtual Environment

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

Building

maturin develop

Note that you can also run maturin develop --release to get a release build locally.

Testing

python3 -m pytest

Download files

Source Distribution

pathsdata_distributed-43.0.0.tar.gz (47.0 kB view details)

Uploaded Source

Built Distribution

pathsdata_distributed-43.0.0-cp38-abi3-manylinux_2_39_x86_64.whl (55.2 MB view details)

Uploaded CPython 3.8+ manylinux: glibc 2.39+ x86-64

File details

Details for the file pathsdata_distributed-43.0.0.tar.gz.

File metadata

File hashes

Hashes for pathsdata_distributed-43.0.0.tar.gz
Algorithm Hash digest
SHA256 a0eb0f566d8687f9a4a7741d0b2c163bd9c68352def15272328a4a767dd3c4d1
MD5 c19b5efa9c9c428f946365248f87f8de
BLAKE2b-256 1a0f3365f5c199219d8ed05641b8392701cd8294a25ccab4a31c1120dd8ae573
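A downloaded file can be checked against a published digest with Python's standard `hashlib` module. The following is a minimal sketch; `sha256_of` and `matches` are hypothetical helper names, and the demonstration runs against a temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_ok = matches(demo_path, hashlib.sha256(b"hello").hexdigest())
os.remove(demo_path)
print(demo_ok)
```

For the source distribution, the call would be `matches("pathsdata_distributed-43.0.0.tar.gz", ...)` with the SHA256 value listed for that file as the second argument.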

File details

Details for the file pathsdata_distributed-43.0.0-cp38-abi3-manylinux_2_39_x86_64.whl.

File metadata

File hashes

Hashes for pathsdata_distributed-43.0.0-cp38-abi3-manylinux_2_39_x86_64.whl
Algorithm Hash digest
SHA256 9324c4de6942fb67c29f065407f0650e3bb7e2258449fe5cde9518636b8ff1cb
MD5 56272f5b2ee5eb529288054eab316b5c
BLAKE2b-256 8d101722f5d554b745503364f8d990fe57b7adae9a238823ee3f63b561354d2d
