
`pyspark_types` is a Python library that provides a simple way to map Python dataclasses to PySpark StructTypes


PySpark Types

pyspark_types is a Python library that provides a simple way to map Python dataclasses and Pydantic models to PySpark StructTypes.

Usage

Pydantic

PySparkBaseModel is a base class for PySpark models that provides methods for converting between PySpark Rows and Pydantic models.

Here's an example of a Pydantic model that will be used to create a PySpark DataFrame:

from pyspark_types.auxiliary import BoundDecimal
from pyspark_types.pydantic import PySparkBaseModel


class Person(PySparkBaseModel):
    name: str
    age: int
    addresses: dict[str, str]
    salary: BoundDecimal

To create a PySpark DataFrame from a list of Person Pydantic models, use the PySparkBaseModel.create_spark_dataframe() method:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# create a list of Pydantic models
data = [
    Person(
        name="Alice",
        age=25,
        addresses={"home": "123 Main St", "work": "456 Pine St"},
        salary=BoundDecimal("5000.00", precision=10, scale=2),
    ),
    Person(
        name="Bob",
        age=30,
        addresses={"home": "789 Elm St", "work": "321 Oak St"},
        salary=BoundDecimal("6000.50", precision=10, scale=2),
    ),
]

# create a PySpark DataFrame from the list of Pydantic models
df = Person.create_spark_dataframe(data, spark)

# show the contents of the DataFrame
df.show()

Output:

+---+-----+--------------------+-------+
|age| name|           addresses| salary|
+---+-----+--------------------+-------+
| 25|Alice|[home -> 123 Main...|5000.00|
| 30|  Bob|[home -> 789 Elm ...|6000.50|
+---+-----+--------------------+-------+

The PySparkBaseModel.create_spark_dataframe() method first converts the list of Pydantic models to a list of dictionaries, then builds a PySpark DataFrame from those dictionaries using the schema generated from the Pydantic model.

You can also generate a schema based on a Pydantic model by calling the PySparkBaseModel.schema() method:

schema = PySparkBaseModel.schema(Person)

This creates a PySpark schema for the Person Pydantic model.

Note that if you have custom types, such as BoundDecimal, you will need to add support for them in PySparkBaseModel. For example, you can modify the PySparkBaseModel.dict() method to extract BoundDecimal values when mapping to DecimalType.
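The unwrapping step described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: BoundDecimalSketch, unwrap_custom_values() and decimal_ddl() are hypothetical names, and the default precision and scale are assumptions taken from the example values earlier in this section.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BoundDecimalSketch:
    """Hypothetical stand-in for BoundDecimal: a Decimal plus its bounds."""

    value: Decimal
    precision: int = 10  # assumed default, for illustration only
    scale: int = 2  # assumed default, for illustration only


def unwrap_custom_values(record: dict) -> dict:
    """Replace wrapper objects with the plain values they carry."""
    return {
        key: item.value if isinstance(item, BoundDecimalSketch) else item
        for key, item in record.items()
    }


def decimal_ddl(item: BoundDecimalSketch) -> str:
    """Spark SQL type name that carries the wrapper's precision and scale."""
    return f"DECIMAL({item.precision},{item.scale})"


salary = BoundDecimalSketch(Decimal("5000.00"))
record = {"name": "Alice", "age": 25, "salary": salary}

plain_record = unwrap_custom_values(record)  # salary is now a plain Decimal
salary_type = decimal_ddl(salary)  # "DECIMAL(10,2)"
print(plain_record, salary_type)
```

The point of the sketch is the split: the wrapper contributes precision and scale to the schema, while the row data itself only needs the plain Decimal.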

Dataclasses

To use pyspark_types with dataclasses, first define a Python dataclass with the fields you want to map to PySpark. For example:

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    is_student: bool

To map this dataclass to a PySpark StructType, use the map_dataclass_to_struct() function:

from pyspark_types import map_dataclass_to_struct

person_struct = map_dataclass_to_struct(Person)

This returns a PySpark StructType that corresponds to the Person dataclass.
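To make the idea of such a mapping concrete, here is a standard-library sketch that walks a dataclass's annotations and emits a Spark DDL schema string. It is not how map_dataclass_to_struct() is implemented; dataclass_to_ddl() and the PYTHON_TO_SPARK_DDL table are hypothetical, and the choice of BIGINT for int is an assumption (it mirrors how PySpark infers Python ints as LongType).

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

# Assumed mapping from Python types to Spark SQL type names (illustrative only).
PYTHON_TO_SPARK_DDL = {
    str: "STRING",
    int: "BIGINT",
    float: "DOUBLE",
    bool: "BOOLEAN",
}


@dataclass
class Person:
    name: str
    age: int
    is_student: bool


def dataclass_to_ddl(cls) -> str:
    """Build a Spark DDL schema string from a dataclass's annotations."""
    hints = get_type_hints(cls)  # also resolves string annotations
    columns = []
    for field in fields(cls):
        python_type = hints[field.name]
        if python_type not in PYTHON_TO_SPARK_DDL:
            raise TypeError(
                f"Unsupported type for field {field.name!r}: {python_type!r}"
            )
        columns.append(f"{field.name} {PYTHON_TO_SPARK_DDL[python_type]}")
    return ", ".join(columns)


person_ddl = dataclass_to_ddl(Person)
print(person_ddl)  # name STRING, age BIGINT, is_student BOOLEAN
```

A DDL string of this form is something PySpark generally accepts wherever a schema is expected, which makes it a convenient way to inspect a mapping without starting a Spark session.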

You can also use the apply_nullability() function to set the nullable flag for a given PySpark DataType:

from pyspark.sql.types import StringType
from pyspark_types import apply_nullability

nullable_string_type = apply_nullability(StringType(), True)

This will return a new PySpark StringType with the nullable flag set to True.
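The nullable flag has to be decided somewhere before it is applied. One common convention, assumed here rather than taken from the pyspark_types source, is to treat Optional[...] annotations as nullable. The helper split_optional() below is hypothetical and only handles typing.Optional / typing.Union annotations.

```python
from typing import Optional, Union, get_args, get_origin


def split_optional(annotation):
    """Return (inner_type, nullable) for an annotation such as Optional[str]."""
    if get_origin(annotation) is Union:
        all_args = get_args(annotation)
        non_none = [arg for arg in all_args if arg is not type(None)]
        # Optional[X] is Union[X, None]: exactly one real type plus NoneType.
        if len(all_args) == 2 and len(non_none) == 1:
            return non_none[0], True
    return annotation, False


optional_result = split_optional(Optional[str])  # (str, True)
plain_result = split_optional(int)  # (int, False)
print(optional_result, plain_result)
```

The returned pair could then feed a call such as apply_nullability(spark_type, nullable) once the inner Python type has been mapped to a PySpark type.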

