A PySpark ETL Framework
Overview
PySetl is a framework for improving the readability and structure of PySpark ETL projects. It builds on Python's typing syntax to reduce runtime errors in two ways: linting tools can check types statically, and types are verified again at runtime. Both help stabilize large ETL pipelines.
To accomplish this, we provide some tools:
- pysetl.config: Type-safe configuration.
- pysetl.storage: Agnostic and extensible data source connections.
- pysetl.workflow: Pipeline management and dependency injection.
PySetl is designed with Python typing syntax at its core. We strongly suggest using typedspark and pydantic for development.
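To make the idea of runtime type verification concrete, here is a minimal sketch that uses only the standard library. It is not PySetl's implementation and not pydantic's API: `CsvSettings` and `try_build` are hypothetical names invented for this illustration. In a real project, pydantic would normally take the place of the hand-written check.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass(frozen=True)
class CsvSettings:
    """Hypothetical configuration object; not part of PySetl."""

    path: str
    header: bool = True
    delimiter: str = ","

    def __post_init__(self) -> None:
        # Compare every field value against its annotated type at construction time.
        for name, expected in get_type_hints(type(self)).items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )


def try_build(**kwargs):
    """Return a CsvSettings instance, or the error message if validation fails."""
    try:
        return CsvSettings(**kwargs)
    except TypeError as error:
        return str(error)


valid = try_build(path="data.csv")
invalid = try_build(path="data.csv", header="yes")  # wrong type for `header`
```

A linter such as mypy would flag `header="yes"` before the code runs; the runtime check covers values that only appear at execution time, for example settings loaded from a file.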
Why use PySetl?
- Model complex data pipelines.
- Reduce risk in production with type-safe development.
- Improve large project structure and readability.
Quick Start
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType
from typedspark import Column, DataSet, Schema, create_partially_filled_dataset

from pysetl.config import CsvConfig
from pysetl.workflow import Factory, Stage, Pipeline

# The snippet needs a Spark session; a local one is assumed here
spark = SparkSession.builder.getOrCreate()


# Define your data schema
class Citizen(Schema):
    name: Column[StringType]
    age: Column[IntegerType]
    city: Column[StringType]


# Create a factory
class CitizensFactory(Factory[DataSet[Citizen]]):
    def read(self):
        # Build a one-row dataset that conforms to the Citizen schema
        self.citizens = create_partially_filled_dataset(
            spark, Citizen,
            [{Citizen.name: "Alice", Citizen.age: 30, Citizen.city: "NYC"}]
        )
        return self

    def process(self): return self

    def write(self): return self

    def get(self): return self.citizens


# Build and run pipeline
stage = Stage().add_factory_from_type(CitizensFactory)
pipeline = Pipeline().add_stage(stage).run()
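The factory above exposes four methods, three of which return `self` so that calls can be chained. The sketch below illustrates that lifecycle in plain Python, without Spark. It assumes the methods run in the order read, process, write, get, which the Quick Start suggests but does not state. `UppercaseNames` and `run_factory` are hypothetical names for this illustration and are not part of PySetl.

```python
class UppercaseNames:
    """Hypothetical factory-like class; it mimics the method shape, not PySetl's Factory."""

    def __init__(self):
        self.calls = []  # records the order in which the methods run
        self.names = []

    def read(self):
        # Load the input (hard-coded here instead of a real data source).
        self.calls.append("read")
        self.names = ["alice", "bob"]
        return self

    def process(self):
        # Transform the data loaded by read().
        self.calls.append("process")
        self.names = [name.upper() for name in self.names]
        return self

    def write(self):
        # A real factory would persist its output here; this sketch does nothing.
        self.calls.append("write")
        return self

    def get(self):
        # Hand the result to whoever depends on this factory.
        self.calls.append("get")
        return self.names


def run_factory(factory):
    """Assumed execution order: read, then process, then write, then get."""
    return factory.read().process().write().get()


factory = UppercaseNames()
result = run_factory(factory)
```

Returning `self` from the first three methods is what makes the chained call in `run_factory` possible; only `get` returns the produced data.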
Installation
PySetl is available on PyPI:
pip install pysetl
Optional Dependencies
PySetl provides several optional dependencies for different use cases:
- PySpark: For local development (most production environments come with their own Spark distribution)
  pip install "pysetl[pyspark]"
- Documentation: For building documentation locally
  pip install "pysetl[docs]"
Documentation
Development
git clone https://github.com/JhossePaul/pysetl.git
cd pysetl
hatch env show # Shows available environments and scripts
hatch shell
pre-commit install
Development Commands
- Type checking:
  hatch run type
- Lint code:
  hatch run lint
- Format code:
  hatch run format
- Run tests (default environment only):
  hatch test
- Run all test matrix:
  hatch test --all
- Run tests with coverage (all matrix):
  hatch test --cover --all
- Build documentation:
  hatch run docs:docs
- Serve documentation:
  hatch run docs:serve
- Security checks:
  hatch run security:all
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Acknowledgments
PySetl is a port of SETL. We want to fully acknowledge that this package is heavily inspired by the work of the SETL team; we adapted their design to work in Python.
Supported Python Versions
PySetl supports Python 3.9, 3.10, 3.11, 3.12, and 3.13. The typing system and all features are compatible across these versions. Recent updates have improved compatibility with Python 3.9, especially regarding advanced typing and generics.
File details
Details for the file pysetl-1.2.1.tar.gz.
File metadata
- Download URL: pysetl-1.2.1.tar.gz
- Upload date:
- Size: 46.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cbdc4187000395ee2ce2258eda3307e2d10dbb7099c5d66ba7f9abf79514e952 |
| MD5 | cb44a507c57e1305c7ef190228f313bd |
| BLAKE2b-256 | 8c8935b73fa31006a893a4dd50d7dd7d3a77ebd87cec7bc999e48bbac9e94267 |
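A published digest is only useful if it is compared against the downloaded file. The following sketch shows one way to do that with Python's standard `hashlib` module. The helper names `sha256_hex` and `matches_published_digest` are hypothetical, and the sample file is created on the fly so that the block runs without any download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_hex(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Return True when the file's SHA256 digest equals the published value."""
    return sha256_hex(path) == expected_hex.strip().lower()


# Self-check on a throwaway file, so no real download is required.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"pysetl")
    expected = hashlib.sha256(b"pysetl").hexdigest()
    sample_ok = matches_published_digest(sample, expected)
    sample_bad = matches_published_digest(sample, "0" * 64)
```

To check the source distribution, one would pass the local path of pysetl-1.2.1.tar.gz together with the SHA256 value listed above.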
File details
Details for the file pysetl-1.2.1-py3-none-any.whl.
File metadata
- Download URL: pysetl-1.2.1-py3-none-any.whl
- Upload date:
- Size: 73.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-httpx/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c7614f4904b8c5db594996421ce73414ca546206de53fc6bb6c5509f34451773 |
| MD5 | b17244caaa3cf287ba73f6fba9248297 |
| BLAKE2b-256 | dfa9179fad9460018b22385702209dcba66650c3d29b0c34194cbd5b959115e0 |