Project description

Introduction

This Spark package processes data from various sources, performs transformations, and writes the results to different sinks. It provides extension points for Source, Sink, and Transformer and follows the pipeline design pattern for a flexible, modular approach to data processing.

Design

The package is structured as follows:

Source, Sink and Transformer Abstraction

The package defines abstract classes Source, Sink, and Transformer to represent data sources, sinks, and transformers. It also provides concrete implementations, including CsvSource, CsvSink, and SQLTransformer, which inherit from these abstract classes. This design lets you add new source, sink, and transformer types with ease.
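
As an illustrative sketch of this design (the method names read and write below are assumptions, not the package's actual signatures):

from abc import ABC, abstractmethod
from pyspark.sql import DataFrame, SparkSession

class Source(ABC):
    # Abstract data source: concrete sources produce a Spark DataFrame.
    @abstractmethod
    def read(self, spark: SparkSession) -> DataFrame: ...

class Sink(ABC):
    # Abstract data sink: concrete sinks persist a Spark DataFrame.
    @abstractmethod
    def write(self, df: DataFrame) -> None: ...

class CsvSource(Source):
    # Hypothetical CSV source mirroring the package's CsvSource.
    def __init__(self, path: str):
        self.path = path

    def read(self, spark: SparkSession) -> DataFrame:
        return spark.read.csv(self.path, header=True, inferSchema=True)

A new source type is then just another subclass, e.g. a Parquet source overriding read.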

Configuration via recipe.yml

The package reads its configuration from a recipe.yml file. This YAML file defines the data sources, sinks, and transformation queries for a pipeline run.
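
For illustration, a recipe.yml might look like the following; the keys shown here are hypothetical, so check the package's actual schema before relying on them:

source:
  - type: csv
    name: input
    path: data/input.csv
sink:
  - type: csv
    name: output
    path: data/output.csv
transformer:
  - type: sql
    query: SELECT category, SUM(amount) AS total FROM input GROUP BY category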

Transformation Queries

Transformations are performed by SQLTransformer using Spark SQL queries defined in the configuration. These queries are executed on the data from the source before it is written to the sink. New transformers can be implemented by extending the Transformer abstract class; a transformer takes Spark DataFrames from sources, processes them, and sends DataFrames to sinks to be saved.
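
A minimal sketch of a custom transformer, assuming the Transformer abstract class exposes a single DataFrame-to-DataFrame method (the real signature may differ):

from pyspark.sql import DataFrame

class DeduplicateTransformer(Transformer):
    # Hypothetical transformer that drops exact duplicate rows.
    def transform(self, df: DataFrame) -> DataFrame:
        return df.dropDuplicates()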

Pipeline Execution

The package reads data from the specified sources, applies the configured SQL transformations, and writes the results to the specified sinks. Multiple sources and sinks can be configured in the same pipeline.
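
Conceptually, the execution loop resembles the following sketch (not the package's actual code; sources, transformers, and sinks stand for the objects built from recipe.yml):

for source in sources:
    df = source.read(spark)             # load a DataFrame from the source
    for transformer in transformers:
        df = transformer.transform(df)  # apply each configured transformation
    for sink in sinks:
        sink.write(df)                  # persist the result to every sink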

Setup

The project is built with Python 3.12.0 and Spark 3.5.0 (see requirements.txt for the remaining dependencies).
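
The package is published on PyPI and can be installed with pip:

pip install pysparkify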

Your environment must have the Spark (and, if applicable, Hadoop) libraries available on the path:

export SPARK_HOME=/path/to/your/spark
export HADOOP_HOME=/path/to/your/hadoop  # if applicable
export PATH=$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH
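
You can verify that Spark is available afterwards:

spark-submit --version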

Testing

Make sure the packages listed in the INSTALL_REQUIRES section of setup.py are installed in your Python environment. The package includes a PyTest-based test suite with cases covering each source, sink, and transformation. Run the tests with:

pytest test_app.py

or run the pipeline directly from inside the pysparkify folder:

python ./src/app.py --config ./config/recipe.yml 

Deployment

... Environment-specific documentation


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pysparkify-0.1.tar.gz (3.3 kB)


Built Distribution

pysparkify-0.1-py3-none-any.whl (3.1 kB)


File details

Details for the file pysparkify-0.1.tar.gz.

File metadata

  • Download URL: pysparkify-0.1.tar.gz
  • Upload date:
  • Size: 3.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for pysparkify-0.1.tar.gz

  • SHA256: f41db79abe554fb1727961deb2b8e9be578c758339678fceb1c0c0c18ba2deb2
  • MD5: bbe0da6220f3edf6aab34f7270b185c7
  • BLAKE2b-256: edc22dfd640c8a351e905a80b9feda9ad9acf578b2b0544f27477928052a0edf


File details

Details for the file pysparkify-0.1-py3-none-any.whl.

File metadata

  • Download URL: pysparkify-0.1-py3-none-any.whl
  • Upload date:
  • Size: 3.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.12.0

File hashes

Hashes for pysparkify-0.1-py3-none-any.whl

  • SHA256: 83fd2820fcc6a8f39a621d38c0ec71388887857bcde222bf66fc6fc926e7e482
  • MD5: fac756af02739d6625b214d191d4effe
  • BLAKE2b-256: d136f6ab90dbe8c493270b411e76f36dc4957da258e40ee8c4cc30b46640f74d

