
An Apache Beam pipeline runner built on Apache Spark's Python API

PySpark Apache Beam Runner

Overview

(WHY? Doesn't Beam ship with a Spark runner?)

This project introduces a custom Apache Beam runner that uses PySpark directly. It is not a runner that complies with Beam's 'portability' framework! It is designed for environments where a SparkSession is available but a Spark master server is not, e.g. serverless environments where jobs are triggered without a long-running cluster. This sidesteps the default Beam Spark runner's expectation that a Spark master is reachable.

The other benefit is that this strategy for building a runner keeps the stack as Python-centric as possible. The compilation process, the optimizations, and the execution planning all happen in Python (for better or worse). Depending on your needs, this might be a significant advantage.
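To give a feel for what "planning in Python" can mean, here is a toy sketch. It is not this project's code and it uses neither Beam nor Spark. The step format, `plan`, `run_fused`, and `execute` are all hypothetical names made up for illustration. It assumes a pipeline is a flat list of steps, fuses consecutive element-wise steps into one stage, and treats a group-by-key as a stage boundary.

```python
# Toy planner: NOT the runner's implementation, just an illustration.
# A step is (kind, fn); kind is "map", "filter", or "group_by_key".


def plan(steps):
    """Split steps into stages. Element-wise steps are fused together;
    a group_by_key always starts a new stage."""
    stages, current = [], []
    for kind, fn in steps:
        if kind == "group_by_key":
            if current:
                stages.append(("fused", current))
                current = []
            stages.append(("group_by_key", None))
        else:
            current.append((kind, fn))
    if current:
        stages.append(("fused", current))
    return stages


def run_fused(ops, elements):
    """Apply a chain of map/filter ops to each element in a single pass."""
    for element in elements:
        keep = True
        for kind, fn in ops:
            if kind == "map":
                element = fn(element)
            elif not fn(element):  # kind == "filter"
                keep = False
                break
        if keep:
            yield element


def execute(stages, elements):
    """Run the planned stages over an in-memory list."""
    data = list(elements)
    for kind, ops in stages:
        if kind == "fused":
            data = list(run_fused(ops, data))
        else:  # group_by_key over (key, value) pairs
            grouped = {}
            for key, value in data:
                grouped.setdefault(key, []).append(value)
            data = list(grouped.items())
    return data


steps = [
    ("map", lambda word: (word[0], len(word))),
    ("filter", lambda kv: kv[1] > 2),
    ("group_by_key", None),
    ("map", lambda kv: (kv[0], sum(kv[1]))),
]
stages = plan(steps)
result = execute(stages, ["beam", "spark", "by", "bolt", "sky"])
print(len(stages), result)  # 3 [('b', 8), ('s', 8)]
```

In a real runner the fused stages would be handed to Spark rather than run over a list, but the planning step itself is ordinary Python that can be read and debugged like any other Python code.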

Features

  • Direct Integration with PySpark: Uses a PySpark SparkSession directly and assumes one is available.
  • Serverless Compatibility: Ideal for environments without a dedicated Spark master, supporting execution in serverless frameworks.
  • Simplified Setup: Potentially reduces the complexity of job submission by avoiding the need for port listening on a Spark master.

Getting Started

Prerequisites

  • Apache Spark
  • Apache Beam
  • Python 3.8 or later

Installation

To use this custom runner, install it with pip as you would any other library:

pip install beam-pyspark-runner
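After installing, a quick sanity check can confirm that the prerequisites are in place. The sketch below uses only the standard library. It assumes Spark and Beam are provided by the `pyspark` and `apache-beam` distributions from PyPI (a cluster-provided Spark may not show up this way), and `check_environment` is a hypothetical helper name, not part of this package.

```python
import sys
from importlib import metadata

# Distribution names as published on PyPI (an assumption about how
# Spark and Beam were installed in your environment).
REQUIRED_DISTRIBUTIONS = ("pyspark", "apache-beam", "beam-pyspark-runner")
MIN_PYTHON = (3, 8)


def check_environment():
    """Report whether Python is new enough and which distributions are installed."""
    report = {"python_ok": sys.version_info >= MIN_PYTHON, "versions": {}}
    for name in REQUIRED_DISTRIBUTIONS:
        try:
            report["versions"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["versions"][name] = None  # not installed
    return report


if __name__ == "__main__":
    report = check_environment()
    print("Python >= 3.8:", report["python_ok"])
    for name, version in report["versions"].items():
        print(f"{name}: {version or 'not installed'}")
```

Any entry reported as not installed points to a prerequisite that still needs to be installed before running a pipeline.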
