Configurable data pipeline with Pyspark

Project description

Pyspark-config

Pyspark-Config is a Python module for data processing in Pyspark by means of a configuration file. It lets you build distributed data pipelines with configurable inputs, transformations and outputs.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Installation

To install the current release (Ubuntu and Windows):

$ pip install pyspark_config

Dependencies

  • Python (>= 3.6)
  • Pyspark (>= 2.4.5)
  • PyYaml (>= 5.3.1)
  • Dataclasses (>= 0.0.0)
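To check whether the installed versions satisfy the minimums listed above, a small script along these lines may help. It is only a sketch and not part of pyspark_config: the helper names (`version_tuple`, `meets_minimum`, `check_dependencies`) are made up for illustration, and it assumes Python 3.8 or newer, because `importlib.metadata` is not in the standard library before that.

```python
from importlib import metadata  # standard library since Python 3.8

# Minimum versions copied from the dependency list above.
MINIMUMS = {"pyspark": "2.4.5", "PyYAML": "5.3.1"}


def version_tuple(version):
    """Turn '2.4.5' into (2, 4, 5); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_dependencies(minimums=MINIMUMS):
    """Return {package name: status string} for each listed dependency."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = installed + " (ok)"
        else:
            report[name] = installed + " (needs >= " + minimum + ")"
    return report


if __name__ == "__main__":
    for name, status in check_dependencies().items():
        print(name, status)
```

The version comparison is deliberately simple (numeric parts only), so pre-release suffixes such as `rc1` are ignored.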

Example

Given the YAML configuration file '../example.yaml':

input:
  sources:
    - type: 'Parquet'
      label: 'parquet'
      parquet_path: '../table.parquet'

transformations:
  - type: "Select"
    cols: ['A', 'B']
  - type: "Concatenate"
    cols: ['A', 'B']
    name: 'Concatenation_AB'
    delimiter: "-"

output:
  - type: 'Parquet'
    name: "example"
    path: "../outputs"

With the input source saved in '../table.parquet', the following code can then be applied:

from pyspark_config import Config

from pyspark_config.transformations.transformations import *
from pyspark_config.output import *
from pyspark_config.input import *

# Load the pipeline definition from the YAML file
config_path = "../example.yaml"
configuration = Config()
configuration.load(config_path)

# Run the pipeline: read the input, apply the transformations, write the output
configuration.apply()

The output will then be saved in '../outputs/example.parquet'.
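
For comparison, the example configuration can be read as roughly the following hand-written Pyspark code. This is a sketch of the assumed semantics, not the library's actual implementation: it assumes that "Select" keeps only the listed columns, that "Concatenate" adds a new column joining them with the delimiter, and that the output file name is the configured `name` plus a `.parquet` suffix under `path`. The functions `output_location` and `run_pipeline` are hypothetical names used only here.

```python
import posixpath


def output_location(path, name):
    """Build the output location the way the example suggests: <path>/<name>.parquet."""
    return posixpath.join(path, name + ".parquet")


def run_pipeline(input_path, output_path):
    """Hand-written counterpart of the example configuration (assumed semantics)."""
    # Imported inside the function so the helper above works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pyspark-config-sketch").getOrCreate()

    # input: a single Parquet source
    df = spark.read.parquet(input_path)

    # transformation 1: Select, keep columns A and B
    df = df.select("A", "B")

    # transformation 2: Concatenate, join A and B with "-" into a new column
    df = df.withColumn("Concatenation_AB", F.concat_ws("-", F.col("A"), F.col("B")))

    # output: write the result as Parquet (fails if the target already exists)
    df.write.parquet(output_path)
    return df


# Example call, matching the paths used in the configuration above:
# run_pipeline("../table.parquet", output_location("../outputs", "example"))
```

The point of the configuration file is that changing the columns, delimiter or paths requires editing only the YAML, not code like the above.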

Changelog

See the changelog for a history of notable changes to pyspark-config.

License

This project is distributed under the 3-Clause BSD license; see the LICENSE.md file for details.
