PySpark Types
pyspark_types is a Python library that provides a simple way to map Python dataclasses to PySpark StructTypes.
Usage
Pydantic
PySparkBaseModel is a base class for PySpark models that provides methods for converting between PySpark Rows and Pydantic models.
Here's an example of a Pydantic model that will be used to create a PySpark DataFrame:
from pyspark_types.auxiliary import BoundDecimal
from pyspark_types.pydantic import PySparkBaseModel


class Person(PySparkBaseModel):
    name: str
    age: int
    addresses: dict[str, str]
    salary: BoundDecimal
To create a PySpark DataFrame from a list of Person Pydantic models, use the PySparkBaseModel.create_spark_dataframe() method:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# create a list of Pydantic models
data = [
    Person(
        name="Alice",
        age=25,
        addresses={"home": "123 Main St", "work": "456 Pine St"},
        salary=BoundDecimal("5000.00", precision=10, scale=2),
    ),
    Person(
        name="Bob",
        age=30,
        addresses={"home": "789 Elm St", "work": "321 Oak St"},
        salary=BoundDecimal("6000.50", precision=10, scale=2),
    ),
]

# create a PySpark DataFrame from the list of Pydantic models
df = Person.create_spark_dataframe(data, spark)

# show the contents of the DataFrame
df.show()
Output:
+---+-----+--------------------+-------+
|age| name|           addresses| salary|
+---+-----+--------------------+-------+
| 25|Alice|[home -> 123 Main...|5000.00|
| 30|  Bob|[home -> 789 Elm ...|6000.50|
+---+-----+--------------------+-------+
The PySparkBaseModel.create_spark_dataframe() method works in two steps: it converts the list of Pydantic models to a list of dictionaries, then creates a PySpark DataFrame from those dictionaries using the schema generated from the Pydantic model.
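The first of those steps can be sketched without Spark or Pydantic installed. This is a minimal illustration, not the library's implementation: a standard-library dataclass stands in for the Pydantic model, and the names PersonRecord and to_row_dicts are hypothetical.

```python
from dataclasses import asdict, dataclass
from decimal import Decimal


@dataclass
class PersonRecord:
    """Hypothetical stand-in for the Person model; not part of pyspark_types."""

    name: str
    age: int
    addresses: dict[str, str]
    salary: Decimal


def to_row_dicts(records):
    """Convert model instances to plain dictionaries, one per row."""
    return [asdict(record) for record in records]


rows = to_row_dicts(
    [
        PersonRecord("Alice", 25, {"home": "123 Main St"}, Decimal("5000.00")),
        PersonRecord("Bob", 30, {"home": "789 Elm St"}, Decimal("6000.50")),
    ]
)
```

A list of dictionaries shaped like rows, paired with an explicit schema, is the kind of input that spark.createDataFrame() accepts.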
You can also generate a schema based on a Pydantic model by calling the PySparkBaseModel.schema() method:
schema = PySparkBaseModel.schema(Person)
This creates a PySpark schema for the Person Pydantic model.
Note that if you have custom types, such as BoundDecimal, you will need to add support for them in PySparkBaseModel. For example, you can modify the PySparkBaseModel.dict() method to extract BoundDecimal values when mapping to DecimalType.
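As an illustration of what extracting a custom decimal value might involve, the sketch below defines a hypothetical wrapper, BoundedDecimalSketch, and a hypothetical helper, unwrap_values. Neither is the library's API, and BoundDecimal itself may be structured differently.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BoundedDecimalSketch:
    """Hypothetical wrapper pairing a Decimal with a precision and scale."""

    value: Decimal
    precision: int
    scale: int


def unwrap_values(row):
    """Replace wrapper instances in a row dictionary with plain Decimals."""
    return {
        key: item.value if isinstance(item, BoundedDecimalSketch) else item
        for key, item in row.items()
    }


row = {"name": "Alice", "salary": BoundedDecimalSketch(Decimal("5000.00"), 10, 2)}
plain_row = unwrap_values(row)
```

The precision and scale kept on the wrapper are the two arguments a PySpark DecimalType(precision, scale) column needs, which is why a wrapper of this kind carries them alongside the value.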
Dataclasses
To use pyspark_types, you first need to define a Python data class with the fields you want to map to PySpark. For example:
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int
    is_student: bool
To map this data class to a PySpark StructType, you can use the map_dataclass_to_struct() function:
from pyspark_types import map_dataclass_to_struct
person_struct = map_dataclass_to_struct(Person)
This will return a PySpark StructType that corresponds to the Person data class.
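To make the mapping idea concrete, here is a sketch that walks a dataclass's fields and builds a Spark DDL-style schema string. It avoids importing PySpark so that it runs anywhere. The helper dataclass_to_ddl and the type table are assumptions made for illustration, and the types chosen may not match what map_dataclass_to_struct() produces.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

# Assumed mapping from Python types to Spark SQL type names.
SPARK_TYPE_NAMES = {str: "STRING", int: "BIGINT", bool: "BOOLEAN", float: "DOUBLE"}


@dataclass
class Person:
    name: str
    age: int
    is_student: bool


def dataclass_to_ddl(cls):
    """Build a 'name TYPE, ...' schema string from a dataclass's fields."""
    hints = get_type_hints(cls)
    return ", ".join(
        f"{field.name} {SPARK_TYPE_NAMES[hints[field.name]]}" for field in fields(cls)
    )


ddl = dataclass_to_ddl(Person)
```

A string in this form can be used where PySpark accepts a DDL schema, for example as the schema argument of spark.createDataFrame().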
You can also use the apply_nullability() function to set the nullable flag for a given PySpark DataType:
from pyspark.sql.types import StringType
from pyspark_types import apply_nullability
nullable_string_type = apply_nullability(StringType(), True)
This will return a new PySpark StringType with the nullable flag set to True.