A PySpark ETL Framework

PySetl

Overview

PySetl is a framework that improves the readability and structure of PySpark ETL projects. It builds on Python's type annotations so that linters and static type checkers can catch errors before a job runs, and it also verifies types at runtime, which makes large ETL pipelines more stable.

To accomplish this, PySetl provides three modules:

  • pysetl.config: Type-safe configuration.
  • pysetl.storage: Backend-agnostic, extensible data source connections.
  • pysetl.workflow: Pipeline management and dependency injection.

PySetl is designed with Python typing syntax at its core. We strongly suggest using typedspark and pydantic for development.
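To illustrate what type-safe configuration means in practice, here is a minimal standard-library sketch. It is not the pysetl.config API: the CsvSettings class and its fields are hypothetical and only show the idea of checking annotated fields when a configuration object is created.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass(frozen=True)
class CsvSettings:
    """Hypothetical settings object; not part of pysetl.config."""

    path: str
    header: bool = True
    delimiter: str = ","

    def __post_init__(self) -> None:
        # Compare each field's runtime value against its annotation.
        hints = get_type_hints(type(self))
        for field in fields(self):
            value = getattr(self, field.name)
            expected = hints[field.name]
            if not isinstance(value, expected):
                raise TypeError(
                    f"{field.name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )


# A valid configuration is accepted.
settings = CsvSettings(path="/tmp/citizens.csv")

# A wrongly typed value fails at construction time, not in the middle of a job.
try:
    CsvSettings(path="/tmp/citizens.csv", header="yes")
except TypeError as error:
    message = str(error)
```

A static checker such as mypy would already flag header="yes" in this sketch; the runtime check covers values that only become known when the job starts, for example ones loaded from a file.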

Why use PySetl?

  • Model complex data pipelines.
  • Reduce production risk with type-safe development.
  • Improve large project structure and readability.

Quick Start

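from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType
from pysetl.config import CsvConfig
from pysetl.workflow import Factory, Stage, Pipeline
from typedspark import DataSet, Schema, Column, create_partially_filled_dataset

# Reuse the active Spark session or start a local one
spark = SparkSession.builder.getOrCreate()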

# Define your data schema
class Citizen(Schema):
    name: Column[StringType]
    age: Column[IntegerType]
    city: Column[StringType]

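# Create a factory that produces a typed Citizen dataset
class CitizensFactory(Factory[DataSet[Citizen]]):
    def read(self):
        # Build a small in-memory dataset instead of reading from storage
        self.citizens = create_partially_filled_dataset(
            spark, Citizen,
            [{Citizen.name: "Alice", Citizen.age: 30, Citizen.city: "NYC"}]
        )
        return self

    def process(self):
        # No transformation in this example
        return self

    def write(self):
        # Nothing is persisted in this example
        return self

    def get(self):
        # Return the dataset this factory produces
        return self.citizens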

# Build and run pipeline
stage = Stage().add_factory_from_type(CitizensFactory)
pipeline = Pipeline().add_stage(stage).run()
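The factory in the Quick Start exposes four methods: read, process, write, and get. The sketch below uses only the standard library to show how such a lifecycle can be chained. It is not PySetl's implementation: SketchFactory, AdultNames, and run_factory are hypothetical names, and the call order read, process, write, get is an assumption based on the method names.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

T = TypeVar("T")


class SketchFactory(ABC, Generic[T]):
    """Hypothetical stand-in for a factory; not PySetl's Factory class."""

    @abstractmethod
    def read(self) -> "SketchFactory[T]": ...

    @abstractmethod
    def process(self) -> "SketchFactory[T]": ...

    @abstractmethod
    def write(self) -> "SketchFactory[T]": ...

    @abstractmethod
    def get(self) -> T: ...


def run_factory(factory: SketchFactory[T]) -> T:
    # Each step returns the factory itself, so the calls can be chained.
    return factory.read().process().write().get()


class AdultNames(SketchFactory[List[str]]):
    def __init__(self) -> None:
        self.rows: List[dict] = []
        self.names: List[str] = []
        self.calls: List[str] = []  # records the order of lifecycle calls

    def read(self) -> "AdultNames":
        self.rows = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 12}]
        self.calls.append("read")
        return self

    def process(self) -> "AdultNames":
        self.names = [row["name"] for row in self.rows if row["age"] >= 18]
        self.calls.append("process")
        return self

    def write(self) -> "AdultNames":
        # Nothing is persisted in this sketch.
        self.calls.append("write")
        return self

    def get(self) -> List[str]:
        return self.names


factory = AdultNames()
result = run_factory(factory)
```

Returning self from read, process, and write is what keeps the chain in run_factory readable; only get returns the typed output.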

Installation

PySetl is available on PyPI:

pip install pysetl

Optional Dependencies

PySetl provides several optional dependencies for different use cases:

  • PySpark: For local development (most production environments come with their own Spark distribution)

    pip install "pysetl[pyspark]"
    
  • Documentation: For building documentation locally

    pip install "pysetl[docs]"
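Because PySpark is an optional extra, a script may want to check for it before importing Spark-dependent code. This is a small sketch using only the standard library; the helper name spark_is_installed is hypothetical.

```python
from importlib.util import find_spec


def spark_is_installed() -> bool:
    """Return True when the pyspark package can be found on the import path."""
    return find_spec("pyspark") is not None


if not spark_is_installed():
    print('PySpark not found; for local development try: pip install "pysetl[pyspark]"')
```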
    

Documentation

Development

git clone https://github.com/JhossePaul/pysetl.git
cd pysetl
hatch env show  # Shows available environments and scripts
hatch shell
pre-commit install

Development Commands

  • Type checking: hatch run type
  • Lint code: hatch run lint
  • Format code: hatch run format
  • Run tests (default environment only): hatch test
  • Run the full test matrix: hatch test --all
  • Run the full test matrix with coverage: hatch test --cover --all
  • Build documentation: hatch run docs:docs
  • Serve documentation: hatch run docs:serve
  • Security checks: hatch run security:all

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

PySetl is a Python port of SETL. This package is heavily inspired by the work of the SETL team, and we gratefully acknowledge it; we adapted their design to work in Python.

Supported Python Versions

PySetl supports Python 3.9, 3.10, 3.11, 3.12, and 3.13. The typing system and all features are compatible across these versions. Recent updates have improved compatibility with Python 3.9, especially regarding advanced typing and generics.
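A project that depends on PySetl could guard against unsupported interpreters at startup. The following is a minimal sketch; the version bounds come from the list above, and the helper name is_supported is hypothetical.

```python
import sys
from typing import Tuple

# Bounds taken from the supported versions listed above: 3.9 through 3.13.
MIN_SUPPORTED = (3, 9)
MAX_SUPPORTED = (3, 13)


def is_supported(version: Tuple[int, int] = tuple(sys.version_info[:2])) -> bool:
    """Return True when a (major, minor) version falls in the supported range."""
    return MIN_SUPPORTED <= tuple(version) <= MAX_SUPPORTED


if not is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is outside the supported range")
```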

Download files

Source Distribution

pysetl-1.2.1.tar.gz (46.3 kB)

Uploaded Source

Built Distribution

pysetl-1.2.1-py3-none-any.whl (73.2 kB)

Uploaded Python 3

File details

Details for the file pysetl-1.2.1.tar.gz.

File metadata

  • Download URL: pysetl-1.2.1.tar.gz
  • Upload date:
  • Size: 46.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.28.1

File hashes

Hashes for pysetl-1.2.1.tar.gz
Algorithm Hash digest
SHA256 cbdc4187000395ee2ce2258eda3307e2d10dbb7099c5d66ba7f9abf79514e952
MD5 cb44a507c57e1305c7ef190228f313bd
BLAKE2b-256 8c8935b73fa31006a893a4dd50d7dd7d3a77ebd87cec7bc999e48bbac9e94267
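The published digests can be used to check a downloaded archive. This is a minimal sketch with hashlib, assuming the source distribution was saved under its original file name in the current directory; the helper names sha256_of and matches_published_hash are hypothetical.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for pysetl-1.2.1.tar.gz.
EXPECTED_SHA256 = "cbdc4187000395ee2ce2258eda3307e2d10dbb7099c5d66ba7f9abf79514e952"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest equals the expected digest."""
    return sha256_of(path) == expected.lower()


archive = Path("pysetl-1.2.1.tar.gz")  # assumed download location
if archive.exists():
    print("hash matches:", matches_published_hash(archive))
```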

File details

Details for the file pysetl-1.2.1-py3-none-any.whl.

File metadata

  • Download URL: pysetl-1.2.1-py3-none-any.whl
  • Upload date:
  • Size: 73.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.28.1

File hashes

Hashes for pysetl-1.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c7614f4904b8c5db594996421ce73414ca546206de53fc6bb6c5509f34451773
MD5 b17244caaa3cf287ba73f6fba9248297
BLAKE2b-256 dfa9179fad9460018b22385702209dcba66650c3d29b0c34194cbd5b959115e0
