Configurable data pipeline with Pyspark
Pyspark-config
Pyspark-Config is a Python module for data processing in Pyspark driven by a configuration file. It lets you build distributed data pipelines with configurable inputs, transformations and outputs.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Installation
To install the current release (Ubuntu and Windows):
$ pip install pyspark_config
Dependencies
- Python (>= 3.6)
- Pyspark (>= 2.4.5)
- PyYaml (>= 5.3.1)
- Dataclasses (>= 0.0.0)
Example
Given the yaml configuration file '../example.yaml':
input:
  sources:
    - type: 'Parquet'
      label: 'parquet'
      parquet_path: '../table.parquet'
transformations:
  - type: "Select"
    cols: ['A', 'B']
  - type: "Concatenate"
    cols: ['A', 'B']
    name: 'Concatenation_AB'
    delimiter: "-"
output:
  - type: 'Parquet'
    name: "example"
    path: "../outputs"
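Before running the pipeline, it can help to confirm that the file parses into the expected three sections. The following is a minimal sketch using PyYAML (listed under Dependencies) on its own, not pyspark_config. The nesting of the YAML text in `CONFIG_TEXT` is an assumption about how the example file is indented.

```python
import yaml  # PyYAML, listed under Dependencies

# Assumed indentation of '../example.yaml'; embedded here so the sketch is self-contained.
CONFIG_TEXT = """
input:
  sources:
    - type: 'Parquet'
      label: 'parquet'
      parquet_path: '../table.parquet'
transformations:
  - type: "Select"
    cols: ['A', 'B']
  - type: "Concatenate"
    cols: ['A', 'B']
    name: 'Concatenation_AB'
    delimiter: "-"
output:
  - type: 'Parquet'
    name: "example"
    path: "../outputs"
"""

# safe_load parses plain data only and never constructs arbitrary Python objects.
config = yaml.safe_load(CONFIG_TEXT)

# The three top-level sections named in the project description.
section_names = sorted(config)

# The transformation types, in the order they are listed in the file.
step_types = [step["type"] for step in config["transformations"]]

print(section_names)
print(step_types)
```

A check like this catches indentation mistakes in the YAML early, before a Spark session is started.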
With the input source saved in '../table.parquet', the following code can then be applied:
from pyspark_config import Config

# The wildcard imports are kept from the original example.
from pyspark_config.transformations.transformations import *
from pyspark_config.output import *
from pyspark_config.input import *

config_path = "../example.yaml"

configuration = Config()
configuration.load(config_path)  # read the YAML configuration
configuration.apply()            # run the configured pipeline
The output will then be saved in '../outputs/example.parquet'.
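To make the effect of the two configured transformations concrete, here is a plain-Python model of the example on a list of dictionaries. It is an illustration only, not how pyspark_config is implemented. The sample rows, the helper names (`select`, `concatenate`, `run`, `STEPS`) and the assumption that "Concatenate" appends a new column while keeping the existing ones are all hypothetical.

```python
# Plain-Python model of the example pipeline; NOT pyspark_config's implementation.
# Assumption: "Concatenate" adds a new column and keeps the existing ones.

def select(rows, cols):
    """Keep only the listed columns in each row."""
    return [{c: row[c] for c in cols} for row in rows]


def concatenate(rows, cols, name, delimiter):
    """Add column `name` holding the values of `cols` joined by `delimiter`."""
    return [
        {**row, name: delimiter.join(str(row[c]) for c in cols)}
        for row in rows
    ]


# Dispatch table keyed by the `type` field used in the YAML file.
STEPS = {"Select": select, "Concatenate": concatenate}


def run(rows, transformations):
    """Apply each configured step in order."""
    for step in transformations:
        # Every key except `type` is passed to the step as a keyword argument.
        params = {k: v for k, v in step.items() if k != "type"}
        rows = STEPS[step["type"]](rows, **params)
    return rows


# Hypothetical sample data standing in for '../table.parquet'.
table = [
    {"A": "a1", "B": "b1", "C": 1},
    {"A": "a2", "B": "b2", "C": 2},
]

# Mirrors the `transformations` section of the example configuration.
transformations = [
    {"type": "Select", "cols": ["A", "B"]},
    {"type": "Concatenate", "cols": ["A", "B"],
     "name": "Concatenation_AB", "delimiter": "-"},
]

result = run(table, transformations)
print(result)
```

The dispatch on the `type` field shows why the order of entries in the configuration matters: each step receives the rows produced by the step before it.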
Changelog
See the changelog for a history of notable changes to pyspark-config.
License
This project is distributed under the 3-Clause BSD license. See the LICENSE.md file for details.
Hashes for pyspark_config-0.0.2.7-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 3016d77cf7abbfa3ed0f9cef613e0a17a4a6001203f4db7a379b47e016cedfc5
MD5 | 0055985f0512fefa74f90127dc060384
BLAKE2b-256 | 83ce7918eab1091075dd8355d8adc521079826c0d602fda57cefd3aaf3ba530a