An Apache Beam pipeline runner built on Apache Spark's Python API

PySpark Apache Beam Runner

Overview

(WHY? Doesn't Beam ship with a Spark runner?)

This project introduces a custom Apache Beam runner that uses PySpark directly. It is not a runner compliant with Beam's portability framework! It is designed for environments where a SparkSession is available but a Spark master server is not. This is useful, for example, in serverless environments where jobs are triggered without a long-running cluster, sidestepping the expectations of Beam's default Spark runner.

The other benefit is that this strategy for building a runner helps to keep the stack as Python-centric as possible. The compilation process, the optimizations, the execution planning - these all happen in Python (for better or worse). Depending on your needs, this might be a significant advantage.

Features

  • Direct Integration with PySpark: Uses a SparkSession that is assumed to already exist in the environment, rather than connecting to a cluster.
  • Serverless Compatibility: Suited to environments without a dedicated Spark master, including serverless frameworks.
  • Simplified Setup: Potentially reduces the complexity of job submission, since no Spark master needs to be listening on a port.

Getting Started

Prerequisites

  • Apache Spark
  • Apache Beam
  • Python 3.8 or later
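
One way to confirm these prerequisites before installing the runner is a small check built on the standard library. This is only a sketch: it assumes Apache Beam and Apache Spark are provided through the `apache-beam` and `pyspark` distributions on PyPI, and the function name `check_prerequisites` is made up for illustration.

```python
import sys
from importlib import metadata

# Assumption: Beam and Spark are installed as these PyPI distributions.
REQUIRED_DISTRIBUTIONS = ("apache-beam", "pyspark")


def check_prerequisites(distributions=REQUIRED_DISTRIBUTIONS, min_python=(3, 8)):
    """Map each requirement to its installed version, or None if it is not satisfied."""
    report = {}

    # Python itself: report the version only if it meets the minimum.
    if sys.version_info >= min_python:
        report["python"] = ".".join(str(part) for part in sys.version_info[:3])
    else:
        report["python"] = None

    # Installed distributions: look each one up in the package metadata.
    for name in distributions:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in check_prerequisites().items():
        print(f"{name}: {version or 'missing or too old'}")
```

Any entry reported as missing needs to be installed before the runner can be used.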

Installation

To use this custom runner, install it with pip as you would any other library:

pip install beam-pyspark-runner
