Column-wise type annotations for pyspark DataFrames
Typedspark: column-wise type annotations for pyspark DataFrames
We love Spark! But in production code we're wary when we see:
from pyspark.sql import DataFrame

def foo(df: DataFrame) -> DataFrame:
    # do stuff
    return df
Because… How do we know which columns are supposed to be in df?
Using typedspark, we can be more explicit about what these data should look like.
from typedspark import Column, DataSet, Schema
from pyspark.sql.types import LongType, StringType

class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]

def foo(df: DataSet[Person]) -> DataSet[Person]:
    # do stuff
    return df
The advantages include:
- Improved readability of the code
- Type checking, both at runtime and during linting
- Auto-complete of column names
- Easy refactoring of column names
- Easier unit testing through the generation of empty DataSets based on their schemas
- Improved documentation of tables
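To give a feel for what a runtime check against a schema involves, here is a minimal plain-Python sketch. It uses neither typedspark nor pyspark, and it is not how typedspark is implemented: `PersonColumns`, `declared_columns`, `missing_columns` and `unexpected_columns` are hypothetical names made up for this illustration, and Python built-in types stand in for Spark column types. It only shows the general idea of comparing the columns declared on a class with the columns actually present.

```python
from typing import Iterable, List, get_type_hints


class PersonColumns:
    """Hypothetical stand-in for a schema: one annotation per expected column."""

    id: int
    name: str
    age: int


def declared_columns(schema: type) -> List[str]:
    """Return the column names declared on the class, in definition order."""
    return list(get_type_hints(schema))


def missing_columns(schema: type, actual: Iterable[str]) -> List[str]:
    """Columns the schema declares but the incoming data lacks."""
    actual_set = set(actual)
    return [c for c in declared_columns(schema) if c not in actual_set]


def unexpected_columns(schema: type, actual: Iterable[str]) -> List[str]:
    """Columns present in the incoming data that the schema does not declare."""
    declared = set(declared_columns(schema))
    return [c for c in actual if c not in declared]


# Assumed example input: column names of some incoming table.
incoming = ["id", "name", "birth_year"]
print(missing_columns(PersonColumns, incoming))     # ['age']
print(unexpected_columns(PersonColumns, incoming))  # ['birth_year']
```

Because the expected columns live in one class, a mismatch such as the one above can be reported where the data enters a function rather than surfacing later as a failed column lookup.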
Documentation
Please see our documentation on readthedocs.
Installation
You can install typedspark from PyPI by running:
pip install typedspark
By default, typedspark does not list pyspark as a dependency, since many platforms (e.g. Databricks) come with pyspark preinstalled. If you want to install typedspark with pyspark, you can run:
pip install "typedspark[pyspark]"
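If you are unsure whether your environment already ships pyspark, a small check along the following lines may help you pick between the two commands. This is only a sketch using the standard library: `install_command` is a hypothetical helper written for this example, not part of typedspark.

```python
import importlib.util


def install_command(module_name: str = "pyspark") -> str:
    """Suggest a pip command depending on whether `module_name` is importable."""
    if importlib.util.find_spec(module_name) is not None:
        # The module is already available (e.g. preinstalled by the platform).
        return "pip install typedspark"
    # The module is not available, so request the optional extra as well.
    return 'pip install "typedspark[pyspark]"'


print(install_command())
```

The helper only reports whether the module can be found in the current Python environment. It does not check that the installed pyspark version is one that typedspark supports.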
Demo videos
IDE demo
https://github.com/kaiko-ai/typedspark/assets/47976799/e6f7fa9c-6d14-4f68-baba-fe3c22f75b67
You can find the corresponding code here.
Jupyter / Databricks notebooks demo
https://github.com/kaiko-ai/typedspark/assets/47976799/39e157c3-6db0-436a-9e72-44b2062df808
You can find the corresponding code here.
FAQ
I found a bug! What should I do?
Great! Please open an issue and we'll look into it.
I have a great idea to improve typedspark! How can we make this work?
Awesome, please open an issue and let us know!
Built Distribution
Hashes for typedspark-1.3.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 24a456d4b6326922e79d4dbb53642b4bbac80e09a791817a0ba034bb64bc1c5e |
MD5 | 2290a8d2bef5178c743bba2031c3c2de |
BLAKE2b-256 | ca0d042ebea3a85b0caadaa4302f5ae1c125bcdf3a3e8f2b608d6d6521d521f0 |