
An Apache Beam pipeline runner built on Apache Spark's Python API


PySpark Apache Beam Runner

Overview

(WHY? Doesn't Beam ship with a Spark runner?)

This project introduces a custom Apache Beam runner that drives PySpark directly. It is not compliant with Beam's portability framework! It is designed for environments where a SparkSession is available but a Spark master server is not. Serverless environments are one example: jobs are triggered without a long-running cluster, which sidesteps the expectations of Beam's default Spark runner.

The other benefit is that building a runner this way keeps the stack as Python-centric as possible. The compilation process, the optimizations, and the execution planning all happen in Python (for better or worse). Depending on your needs, this might be a significant advantage.

Features

  • Direct Integration with PySpark: Uses a SparkSession that is assumed to be available already, rather than connecting to a cluster.
  • Serverless Compatibility: Suited to environments without a dedicated Spark master, including serverless frameworks.
  • Simplified Setup: Potentially reduces the complexity of job submission, since no Spark master needs to be listening on a port.

Getting Started

Prerequisites

  • Apache Spark
  • Apache Beam
  • Python 3.8 or later
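
Before installing, it can help to confirm that the prerequisites above are actually present in the target environment. The following is a minimal sketch using only the standard library. The function name check_prerequisites and the report format are made up for illustration; pyspark and apache_beam are the usual import names for Spark's Python API and the Beam Python SDK.

```python
import importlib.util
import sys

# Minimum interpreter version listed under Prerequisites.
MIN_PYTHON = (3, 8)
# Import names for Spark's Python API and the Beam Python SDK.
REQUIRED_MODULES = ("pyspark", "apache_beam")


def check_prerequisites(modules=REQUIRED_MODULES, min_python=MIN_PYTHON):
    """Return a dict mapping each requirement to True (met) or False (missing).

    Hypothetical helper for illustration; it is not part of this package.
    """
    report = {"python>=%d.%d" % min_python: sys.version_info >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8} {requirement}")
```

This only checks that the modules can be located; it does not verify that a SparkSession can actually be created in the environment.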

Installation

To use this custom runner, pip install it as you would any other library:

pip install beam-pyspark-runner
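
Once installed, the runner would presumably be handed to a Beam pipeline like any other runner object. The sketch below is an illustration under stated assumptions, not documented usage: the module path beam_pyspark_runner.pyspark_runner and the class name PySparkRunner are assumptions that this description does not confirm, and it does not say which transforms version 0.0.2 supports. The Beam calls themselves (beam.Pipeline, beam.Create, beam.FlatMap, beam.Map) are standard Beam Python SDK API.

```python
def split_words(line):
    """Lower-case a line and split it on whitespace."""
    return line.lower().split()


def run_word_split(lines):
    """Run a tiny Beam pipeline over `lines` and print each word.

    Imports live inside the function so this module can still be imported
    where Beam or Spark is not installed.
    """
    import apache_beam as beam

    # ASSUMPTION: module path and class name are not confirmed by this
    # project description; check the package source before relying on them.
    from beam_pyspark_runner.pyspark_runner import PySparkRunner

    # A runner instance is passed straight to the pipeline, so no Spark
    # master address or job server endpoint is configured here.
    with beam.Pipeline(runner=PySparkRunner()) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(split_words)
            | "Print" >> beam.Map(print)
        )


# Example call (requires Spark, Beam, and this package to be installed):
# run_word_split(["Hello Beam", "hello Spark"])
```

Keeping split_words as a plain function means the per-element logic can be tested without Spark or Beam present.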
