An Apache Beam pipeline runner built on Apache Spark's Python API
PySpark Apache Beam Runner
Overview
(WHY? Doesn't Beam ship with a Spark runner?)
This project introduces a custom Apache Beam runner that uses PySpark directly. It is not a runner that complies with Beam's portability framework! It is designed for environments where a SparkSession is available but a Spark master server is not. This is useful in, for example, serverless environments where jobs are triggered without a long-running cluster, because it sidesteps the expectations of Beam's default Spark runner.
The other benefit is that building a runner this way keeps the stack as Python-centric as possible. The compilation process, the optimizations, the execution planning: these all happen in Python (for better or worse). Depending on your needs, this might be a significant advantage.
Features
- Direct Integration with PySpark: Works directly with a PySpark SparkSession that is assumed to already exist.
- Serverless Compatibility: Ideal for environments without a dedicated Spark master, supporting execution in serverless frameworks.
- Simplified Setup: Potentially reduces the complexity of job submission by avoiding the need for port listening on a Spark master.
Getting Started
Prerequisites
- Apache Spark
- Apache Beam
- Python 3.8 or later
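The prerequisites above can be checked from Python before installing the runner. The following is a minimal sketch, not part of this project: the helper name `check_prerequisites` is hypothetical, and it assumes that Spark and Beam are provided through the `pyspark` and `apache_beam` Python packages.

```python
import sys
from importlib.util import find_spec

# Import names of the two library prerequisites (assumed to be the
# standard PyPI distributions of PySpark and the Beam Python SDK).
REQUIRED_MODULES = ("pyspark", "apache_beam")
MIN_PYTHON = (3, 8)


def check_prerequisites(version_info=sys.version_info):
    """Hypothetical helper: map each prerequisite to True if it is met."""
    results = {"python>=3.8": tuple(version_info[:2]) >= MIN_PYTHON}
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module cannot be located.
        results[name] = find_spec(name) is not None
    return results


if __name__ == "__main__":
    for requirement, ok in check_prerequisites().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```

This only confirms that the modules can be located; it does not verify that their versions are compatible with each other or with this runner.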
Installation
To use this custom runner, install it with pip as you would any other library:
pip install beam-pyspark-runner