A PySpark ETL Framework

Overview

PySetl is a framework focused on improving the readability and structure of PySpark ETL projects. It is designed to take advantage of Python's typing syntax to reduce runtime errors, both through linting tools and by verifying types at runtime, effectively enhancing the stability of large ETL pipelines.

To accomplish this, PySetl provides the following tools:

  • pysetl.config: Type-safe configuration.

  • pysetl.storage: Agnostic and extensible data sources connections.

  • pysetl.workflow: Pipeline management and dependency injection (see the sketch after this list).
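
To illustrate what pipeline management means here, below is a minimal, pure-Python sketch of the factory/pipeline pattern PySetl is built around. The class and method names are illustrative only, not PySetl's actual API; see the pysetl.workflow documentation for the real interfaces.

# Illustrative sketch only -- these names are NOT PySetl's API.
class CitiesFactory:
    """A unit of work: read raw data, process it, expose the result."""

    def read(self):
        # Pretend this comes from a pysetl.storage connector.
        self.raw = ["madrid", "paris"]
        return self

    def process(self):
        self.output = [city.title() for city in self.raw]
        return self

    def get(self):
        return self.output


class Pipeline:
    """Runs factories in order and collects their outputs.

    PySetl's real Pipeline additionally wires the output of earlier
    stages into later ones (dependency injection).
    """

    def __init__(self, factories):
        self.factories = factories

    def run(self):
        return [factory.read().process().get() for factory in self.factories]


print(Pipeline([CitiesFactory()]).run())  # [['Madrid', 'Paris']]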

PySetl is designed with Python typing syntax at its core, so we strongly recommend typedspark and pydantic for development.
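
As a rough illustration of what typed development with these libraries looks like, the following sketch defines a typedspark schema and a pydantic settings model. The Person schema and JobSettings class are hypothetical examples, not part of PySetl.

from pydantic import BaseModel
from pyspark.sql.types import LongType, StringType
from typedspark import Column, Schema


# Hypothetical typed DataFrame schema, checked by typedspark and by linters.
class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]


# Hypothetical job settings validated by pydantic at construction time.
class JobSettings(BaseModel):
    input_path: str
    output_path: str
    partitions: int = 8


settings = JobSettings(input_path="s3://raw/people", output_path="s3://clean/people")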

Why use PySetl?

  • Model complex data pipelines.

  • Reduce risks in production with type-safe development.

  • Improve large project structure and readability.

Installation

PySetl is available on PyPI:

pip install pysetl

PySetl doesn’t list pyspark as a dependency, since most environments come with their own Spark installation. Nevertheless, you can install pyspark alongside PySetl by running:

pip install "pysetl[pyspark]"

Acknowledgments

PySetl is a port of SETL. We fully acknowledge that this package is heavily inspired by the work of the SETL team; we have simply adapted their design to work in Python.
