Typedspark: column-wise type annotations for pyspark DataFrames
We love Spark! But in production code we're wary when we see:
    from pyspark.sql import DataFrame

    def foo(df: DataFrame) -> DataFrame:
        # do stuff
        return df
Because… how do we know which columns are supposed to be in df?
Using typedspark, we can be more explicit about what these data should look like.
    from typedspark import Column, DataSet, Schema
    from pyspark.sql.types import LongType, StringType

    class Person(Schema):
        id: Column[LongType]
        name: Column[StringType]
        age: Column[LongType]

    def foo(df: DataSet[Person]) -> DataSet[Person]:
        # do stuff
        return df
The advantages include:
- Improved readability of the code
- Typechecking, both during runtime and linting
- Auto-complete of column names
- Easy refactoring of column names
- Easier unit testing through the generation of empty DataSets based on their schemas
- Improved documentation of tables
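To make the runtime typechecking point concrete, here is a minimal, standard-library-only sketch of the underlying idea: column annotations on a class can be read back and compared against the columns a table actually has. This is not typedspark's implementation. `Column`, `Schema`, `LongType` and `StringType` below are simplified stand-ins for the real classes, and `expected_columns` and `schema_mismatches` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, Generic, List, TypeVar, get_args, get_type_hints

T = TypeVar("T")


class LongType:
    """Stand-in for pyspark.sql.types.LongType (assumption, not the real class)."""


class StringType:
    """Stand-in for pyspark.sql.types.StringType (assumption, not the real class)."""


class Column(Generic[T]):
    """Stand-in for typedspark.Column (assumption, not the real class)."""


class Schema:
    """Stand-in for typedspark.Schema (assumption, not the real class)."""

    @classmethod
    def expected_columns(cls) -> Dict[str, str]:
        # Hypothetical helper: map each annotated column to its declared type name.
        hints = get_type_hints(cls)
        return {name: get_args(hint)[0].__name__ for name, hint in hints.items()}


class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]


def schema_mismatches(schema: type, actual: Dict[str, str]) -> List[str]:
    """Compare actual {column name: type name} pairs against the schema."""
    expected = schema.expected_columns()
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"column {name}: expected {dtype}, got {actual[name]}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


good = {"id": "LongType", "name": "StringType", "age": "LongType"}
bad = {"id": "LongType", "name": "LongType", "height": "LongType"}

print(schema_mismatches(Person, good))  # -> []
for problem in schema_mismatches(Person, bad):
    print(problem)
```

Because the column names live in one class definition, a linter or IDE can also see them, which is what makes auto-complete and renaming across a codebase practical.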
Documentation
Please see our documentation on readthedocs.
Installation
You can install typedspark from PyPI by running:

    pip install typedspark
By default, typedspark does not list pyspark as a dependency, since many platforms (e.g. Databricks) come with pyspark preinstalled. If you want to install typedspark with pyspark, you can run:

    pip install "typedspark[pyspark]"
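If you are unsure which of the two commands applies to your environment, one option is to check whether pyspark is already importable. The snippet below is a small sketch using only the standard library; `suggested_install_command` is a hypothetical helper name, not part of typedspark.

```python
import importlib.util


def suggested_install_command() -> str:
    """Pick a pip command based on whether pyspark is already importable."""
    # find_spec returns None when no top-level module with that name is found.
    if importlib.util.find_spec("pyspark") is not None:
        return "pip install typedspark"
    return 'pip install "typedspark[pyspark]"'


print(suggested_install_command())
```

Run it with the same Python interpreter that will later import typedspark, since a different environment may give a different answer.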
Demo videos
IDE demo
https://github.com/kaiko-ai/typedspark/assets/47976799/e6f7fa9c-6d14-4f68-baba-fe3c22f75b67
You can find the corresponding code here.
Jupyter / Databricks notebooks demo
https://github.com/kaiko-ai/typedspark/assets/47976799/39e157c3-6db0-436a-9e72-44b2062df808
You can find the corresponding code here.
FAQ
I found a bug! What should I do?
Great! Please make an issue and we'll look into it.
I have a great idea to improve typedspark! How can we make this work?
Awesome, please make an issue and let us know!
Hashes for typedspark-1.4.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 471c27bccc71d095b60c57d37ee75cf5c43a4cc171146473890985dce7ec3cf3
MD5 | 97ce88c8706f25203913e4e4fca4f59b
BLAKE2b-256 | 5d4c09d6d1a02e2680eaa2ba26019c39c0614e4dfb0a60c6079a3be8b6086901