Configurable data pipeline with Pyspark
Pyspark-config
Pyspark-Config is a Python module for data processing in Pyspark driven by a configuration file. It lets you build distributed data pipelines with configurable inputs, transformations and outputs.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Installation
To install the current release (Ubuntu and Windows):
$ pip install pyspark_config
Dependencies
- Python (>= 3.6)
- Pyspark (>= 2.4.5)
- PyYaml (>= 5.3.1)
- Dataclasses (>= 0.0.0)
Example
Given the YAML configuration file '../example.yaml':
input:
  sources:
    - type: 'Parquet'
      label: 'parquet'
      parquet_path: '../table.parquet'
transformations:
  - type: "Select"
    cols: ['A', 'B']
  - type: "Concatenate"
    cols: ['A', 'B']
    name: 'Concatenation_AB'
    delimiter: "-"
output:
  - type: 'Parquet'
    name: "example"
    path: "../outputs"
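For orientation, a YAML parser such as PyYAML would read a file like this into nested dictionaries and lists. The sketch below writes that structure out by hand as a Python literal, assuming `sources` is nested under `input` and each `- type:` entry starts a new list item. The helper `summarize` is hypothetical and not part of pyspark-config.

```python
# Hand-written equivalent of the parsed example.yaml.
# Assumption: 'sources' is nested under 'input', and each '- type:' starts a list item.
config = {
    "input": {
        "sources": [
            {"type": "Parquet", "label": "parquet", "parquet_path": "../table.parquet"},
        ]
    },
    "transformations": [
        {"type": "Select", "cols": ["A", "B"]},
        {"type": "Concatenate", "cols": ["A", "B"], "name": "Concatenation_AB", "delimiter": "-"},
    ],
    "output": [
        {"type": "Parquet", "name": "example", "path": "../outputs"},
    ],
}


def summarize(cfg):
    """Hypothetical helper: list the step types of each section, in file order."""
    return {
        "sources": [s["type"] for s in cfg["input"]["sources"]],
        "transformations": [t["type"] for t in cfg["transformations"]],
        "outputs": [o["type"] for o in cfg["output"]],
    }


print(summarize(config))
```

Because the transformations form a list, their order in the file is preserved when the file is parsed.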
With the input source saved in '../table.parquet', the following code can then be applied:
from pyspark_config import Config
from pyspark_config.transformations.transformations import *
from pyspark_config.output import *
from pyspark_config.input import *

# Path to the YAML configuration file
config_path = "../example.yaml"

# Load the configuration, then run the configured pipeline
configuration = Config()
configuration.load(config_path)
configuration.apply()
The output will then be saved in '../outputs/example.parquet'.
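To make the effect of the two configured transformations concrete, here is a plain-Python sketch on in-memory rows. It assumes, from the names alone, that `Select` keeps only the listed columns and that `Concatenate` adds a new column joining the listed columns with the delimiter. The functions `select` and `concatenate` are illustrative stand-ins rather than pyspark-config's implementation, and the sample rows are made up.

```python
def select(rows, cols):
    """Keep only the given columns in each row (assumed meaning of 'Select')."""
    return [{c: row[c] for c in cols} for row in rows]


def concatenate(rows, cols, name, delimiter):
    """Add column `name` joining `cols` with `delimiter` (assumed meaning of 'Concatenate')."""
    return [
        {**row, name: delimiter.join(str(row[c]) for c in cols)}
        for row in rows
    ]


# Made-up sample rows standing in for the contents of table.parquet.
rows = [
    {"A": "x", "B": 1, "C": True},
    {"A": "y", "B": 2, "C": False},
]

# Apply the steps in the order they appear in the configuration.
result = concatenate(select(rows, ["A", "B"]), ["A", "B"], "Concatenation_AB", "-")
for row in result:
    print(row)
```

In PySpark terms this would roughly correspond to `DataFrame.select` followed by `pyspark.sql.functions.concat_ws`, though whether the library uses those calls internally is not stated here.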
Changelog
See the changelog for a history of notable changes to pyspark-config.
License
This project is distributed under the 3-Clause BSD license; see the LICENSE.md file for details.