Bringing type-checking and schema validation to PySpark DataFrames.
📚 pyspark-datasets 📐
pyspark-datasets is a Python package for typed dataframes in PySpark.
One aim of this project is to give developers type safety similar to the Dataset API in Scala Spark.
Installation
pip install pyspark-datasets
At least Python 3.10 is required, with 3.12 or above recommended. # TODO spark version
TypedDataFrame
pyspark_datasets.TypedDataFrame is a subclass of pyspark.sql.DataFrame, built on sparkdantic, which automatically derives its Spark schema from its annotated fields. For example:
```python
from typing import Annotated

from pyspark.sql import Column
from pyspark_datasets import TypedDataFrame

class Person(TypedDataFrame):
    name: Annotated[Column, str]
    age: Annotated[Column, int]
```

```python
>>> Person.model_spark_schema()
StructType([StructField('name', StringType(), True), StructField('age', LongType(), True)])
```
Any type supported by sparkdantic can be inferred, provided it is annotated as an Annotated[pyspark.sql.Column, ...] field. This includes nested dataclasses, pydantic models, lists, mappings, datetimes, union and optional types, and literals, among other things. Unlike sparkdantic classes, this class also subclasses the DataFrame class and thus retains all the underlying methods for transformations.
An existing 'untyped' spark dataframe may be converted to a typed dataframe via the constructor:
```python
from pyspark.sql import DataFrame

df: DataFrame = ...
people = Person(df)
```
The constructor now verifies, before any Spark action is called, that the dataframe conforms to the schema expected by Person. Schema verification can be skipped entirely (verify_schema=False), or relaxed to allow missing or extra fields in the dataframe (allow_missing_fields=True and allow_extra_fields=True respectively).
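The shape of that check can be sketched in plain Python, with no Spark session required. The function and flag names below mirror the constructor options described above, but the body is an illustration of the idea, not the package's actual implementation:

```python
def verify_fields(
    expected: set[str],
    actual: set[str],
    allow_missing_fields: bool = False,
    allow_extra_fields: bool = False,
) -> None:
    """Sketch of the schema-membership check a typed dataframe could perform."""
    missing = expected - actual
    extra = actual - expected
    if missing and not allow_missing_fields:
        raise ValueError(f"dataframe is missing expected columns: {sorted(missing)}")
    if extra and not allow_extra_fields:
        raise ValueError(f"dataframe has unexpected columns: {sorted(extra)}")

# Matches the Person schema exactly: passes silently.
verify_fields({"name", "age"}, {"name", "age"})

# An extra column is accepted only when explicitly allowed.
verify_fields({"name", "age"}, {"name", "age", "email"}, allow_extra_fields=True)
```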
Additionally, IDEs recognise that TypedDataFrame instances have attributes named after their columns, and highlight the ones defined as fields. In VSCode, for example, the attributes become Ctrl+Clickable, jumping to the class definition. In the equivalent sparkdantic model, fields would type-check as their Python types rather than as Column.
colname and LazyColumn
pyspark_datasets.colname is the inverse of pyspark.sql.functions.col: where col takes a string and produces a pyspark Column, colname retrieves the column name from a pyspark Column.
This can be used to increase code maintainability: in conjunction with a TypedDataFrame, column names can be referenced as symbols rather than strings, without maintaining a second boilerplate container of column-name constants. If a typo is made in a string column name, type-checkers cannot catch it, and renaming a field in Person does not cause a type-check failure. Referencing the class fields and recovering their names with colname removes both problems.
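A Spark-free sketch of the difference. The Column and colname stand-ins below are minimal mocks defined here purely for illustration, not the package's implementations:

```python
class Column:
    """Minimal stand-in for pyspark.sql.Column (illustration only)."""
    def __init__(self, name: str) -> None:
        self._name = name

def colname(col: Column) -> str:
    """Stand-in for pyspark_datasets.colname: recover a column's name."""
    return col._name

class Person:
    """Stand-in for the Person example: fields are Column objects."""
    name = Column("name")
    age = Column("age")

# String-based: a typo such as "nmae" passes type-checking and only fails
# at runtime, and renaming the column leaves stale strings behind.
selected_by_string = ["name", "age"]

# Symbol-based: a typo such as Person.nmae is caught by the type-checker
# (and raises AttributeError), and renaming the field updates every usage.
selected_by_field = [colname(Person.name), colname(Person.age)]

assert selected_by_field == selected_by_string
```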
A quirk of pyspark is that creating Column instances requires an active Spark context. TypedDataFrame circumvents this by exposing pyspark_datasets.columns.LazyColumn instances as column attributes instead, which do not require a Spark session when retrieving their column name with colname, but subclass and behave like pyspark.sql.Column instances in every other way. This allows schema validation code to sit outside scopes with active Spark contexts.
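The deferral pattern behind this can be sketched without a Spark session. The mock below only shows the idea (store the name eagerly, materialize a real Column on first use); it is not the package's implementation and does not subclass the real Column:

```python
class LazyColumn:
    """Sketch: holds a column name; would build a real pyspark Column lazily."""

    def __init__(self, name: str) -> None:
        self._name = name          # available with no Spark session
        self._materialized = None  # real pyspark Column, created on demand

    @property
    def name(self) -> str:
        # colname() can read this without touching a Spark context.
        return self._name

    def _to_spark(self):
        # Hypothetical: only here is pyspark.sql.functions.col called,
        # which is what requires an active session.
        if self._materialized is None:
            from pyspark.sql.functions import col  # deferred import
            self._materialized = col(self._name)
        return self._materialized

age = LazyColumn("age")
print(age.name)  # accessible with no Spark session running
```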
Example use case: .applyInPandas, .mapInPandas, and similar PySpark functions
TODO
Col
Col allows defining aliases for column fields when specifying a column annotation for TypedDataFrame, and separates the annotations TypedDataFrame consumes from any others the user may be using.
```python
class Items(TypedDataFrame):
    # aliased column
    _count_: Annotated[Column, Col(int, alias="count")]
    # annotated with an alias and other metadata
    name: Annotated[Column, Col(str), "<other-annotations>", ...]
```
As in the above example, aliases are most useful for columns whose name would override a DataFrame method (in this case, .count()). If this field were named simply count, the field would be removed when defining Items and a warning raised to the user.
Development
This repo uses uv to manage the environment. Run uv sync to install locally and uv build to build from source. Additional checks performed by CI:
```shell
pytest --cov=pyspark_datasets tests/ --cov-report=html  # Run unit tests
mypy pyspark_datasets                                   # Type-check code
ruff check                                              # Lint code
ruff format                                             # Autoformat code
pdoc pyspark_datasets -f --html -o ./docs               # Build API reference
```
Contributing
All contributions are welcome, especially because data engineering is not my main specialism. This project was inspired by repeatedly debugging preventable column name and type issues in ML preprocessing pipelines. Feel free to reach out through GitHub or my LinkedIn: https://www.linkedin.com/in/aaronzolnailucas/ if you are interested.