

PySpark Types

pyspark_types is a Python library that provides a simple way to map Python dataclasses and Pydantic models to PySpark StructTypes.

Usage

Pydantic

PySparkBaseModel is a base class for PySpark models that provides methods for converting between PySpark Rows and Pydantic models.

Here's an example of a Pydantic model that will be used to create a PySpark DataFrame:

from pyspark_types.auxiliary import BoundDecimal
from pyspark_types.pydantic import PySparkBaseModel


class Person(PySparkBaseModel):
    name: str
    age: int
    addresses: dict[str, str]
    salary: BoundDecimal

To create a PySpark DataFrame from a list of Person models, we can use the PySparkBaseModel.create_spark_dataframe() method:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# create a list of Pydantic models
data = [
    Person(
        name="Alice",
        age=25,
        addresses={"home": "123 Main St", "work": "456 Pine St"},
        salary=BoundDecimal("5000.00", precision=10, scale=2),
    ),
    Person(
        name="Bob",
        age=30,
        addresses={"home": "789 Elm St", "work": "321 Oak St"},
        salary=BoundDecimal("6000.50", precision=10, scale=2),
    ),
]

# create a PySpark DataFrame from the list of Pydantic models
df = Person.create_spark_dataframe(data, spark)

# show the contents of the DataFrame
df.show()

Output:

+---+-----+--------------------+-------+
|age| name|           addresses| salary|
+---+-----+--------------------+-------+
| 25|Alice|[home -> 123 Main...|5000.00|
| 30|  Bob|[home -> 789 Elm ...|6000.50|
+---+-----+--------------------+-------+

The PySparkBaseModel.create_spark_dataframe() method converts the list of Pydantic models to a list of dictionaries, then creates a PySpark DataFrame from those dictionaries and the schema generated from the Pydantic model.
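Conceptually, this is roughly equivalent to the following manual steps (a sketch of the idea, not the library's actual implementation):

# Build the DataFrame by hand: models -> dicts, plus the generated schema
# (the schema() helper is described below).
rows = [person.dict() for person in data]
schema = PySparkBaseModel.schema(Person)
df = spark.createDataFrame(rows, schema=schema)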

You can also generate a schema based on a Pydantic model by calling the PySparkBaseModel.schema() method:

schema = PySparkBaseModel.schema(Person)

This creates a PySpark schema for the Person Pydantic model.

Note that if you have custom types, such as BoundDecimal, you will need to add support for them in PySparkBaseModel. For example, you can modify the PySparkBaseModel.dict() method to extract BoundDecimal values when mapping to DecimalType.
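For example, a subclass might override dict() along these lines (a minimal sketch; it assumes str() on a BoundDecimal yields a plain decimal string, which this page does not guarantee):

from decimal import Decimal

from pyspark_types.auxiliary import BoundDecimal
from pyspark_types.pydantic import PySparkBaseModel


class DecimalAwareModel(PySparkBaseModel):
    def dict(self, **kwargs):
        # Unwrap BoundDecimal values into plain Decimals so that Spark's
        # DecimalType accepts them (assumption: str(BoundDecimal) is a
        # parseable decimal string).
        data = super().dict(**kwargs)
        return {
            key: Decimal(str(value)) if isinstance(value, BoundDecimal) else value
            for key, value in data.items()
        }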

Dataclasses

To use pyspark_types, you first need to define a Python data class with the fields you want to map to PySpark. For example:

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    is_student: bool

To map this data class to a PySpark StructType, you can use the map_dataclass_to_struct() function:

from pyspark_types import map_dataclass_to_struct

person_struct = map_dataclass_to_struct(Person)

This will return a PySpark StructType that corresponds to the Person data class.
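For the Person data class above, the returned schema would look something like this (the exact integer type and the default nullable flags are assumptions about the library's mapping, shown only for illustration):

from pyspark.sql.types import (
    BooleanType,
    LongType,
    StringType,
    StructField,
    StructType,
)

# Hypothetical expected shape; int might map to LongType or IntegerType,
# and the default nullability may differ.
expected = StructType([
    StructField("name", StringType(), False),
    StructField("age", LongType(), False),
    StructField("is_student", BooleanType(), False),
])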

You can also use the apply_nullability() function to set the nullable flag for a given PySpark DataType:

from pyspark.sql.types import StringType
from pyspark_types import apply_nullability

nullable_string_type = apply_nullability(StringType(), True)

This will return a new PySpark StringType with the nullable flag set to True.
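Where this typically matters is with Optional fields. As a sketch of the idea (the Optional-to-nullable mapping is an assumption about the library's behavior, not something documented on this page):

from dataclasses import dataclass
from typing import Optional

from pyspark_types import map_dataclass_to_struct


@dataclass
class Employee:
    name: str
    nickname: Optional[str]  # presumably mapped to a nullable field


employee_struct = map_dataclass_to_struct(Employee)
print(employee_struct)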



Download files

Download the file for your platform.

Source Distribution

pyspark_types-0.0.2.tar.gz (4.9 kB)

Uploaded Source

Built Distribution

pyspark_types-0.0.2-py3-none-any.whl (6.4 kB)

Uploaded Python 3

File details

Details for the file pyspark_types-0.0.2.tar.gz.

File metadata

  • Download URL: pyspark_types-0.0.2.tar.gz
  • Upload date:
  • Size: 4.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for pyspark_types-0.0.2.tar.gz

SHA256:      62302cfa38cf74c695ba492c4515011b9017333e65843cb6e3195ab651aa76d8
MD5:         0cba129e3bcc02e2feea76baa334274f
BLAKE2b-256: 845b803ad6a8a1ed3f028cfb1e0be5800050fee3bfc20eaa14752a9b12ddc42a


File details

Details for the file pyspark_types-0.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for pyspark_types-0.0.2-py3-none-any.whl

SHA256:      4aeac0c726b9e1cd8ef8e0c98297dbe5ef4000fa5c91d4fd7a38a4ae1c25ce01
MD5:         823f93e4cfe3153cc1e60efeddf0381e
BLAKE2b-256: b69b3f598b34b77682e61508deab9c2464ac23a2da4125f4aaebd5df78aeb0cf

