Convert between Polars schemas and PySpark schemas
Charmander
Cross-platform Handling of Array, Recursive, Mapping, And Nested Data Exchange Runtime
Convert between Polars schemas and PySpark schemas with ease.
Charmander provides simple, bidirectional conversion functions to transform schemas between Polars and PySpark, supporting all complex types including nested structures, arrays, and maps.
Installation
pip install charmander
Requirements
- Python >= 3.8
- polars >= 0.19.0
- pyspark >= 3.0.0
Quick Start
Converting Polars Schema to PySpark
import polars as pl
from charmander import to_pyspark_schema
# Define a Polars schema
polars_schema = {
    "name": pl.String,
    "age": pl.Int32,
    "score": pl.Float64,
    "tags": pl.List(pl.String),
}
# Convert to PySpark schema
pyspark_schema = to_pyspark_schema(polars_schema)
print(pyspark_schema)
# StructType([StructField('name', StringType(), True),
# StructField('age', IntegerType(), True),
# StructField('score', DoubleType(), True),
# StructField('tags', ArrayType(StringType(), True), True)])
Converting PySpark Schema to Polars
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
from charmander import to_polars_schema
# Define a PySpark schema
pyspark_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("score", DoubleType()),
    StructField("tags", ArrayType(StringType())),
])
# Convert to Polars schema
polars_schema = to_polars_schema(pyspark_schema)
print(polars_schema)
# {'name': <class 'polars.datatypes.String'>, 'age': <class 'polars.datatypes.Int32'>, ...}
Features
- Bidirectional Conversion: Convert schemas in both directions
- Comprehensive Type Support: Supports all primitive and complex types
- Nested Structures: Handles deeply nested structs, arrays, and maps
- Type Safety: Clear error messages for unsupported types
- Simple API: Functional, stateless functions
Supported Types
Primitive Types
| Polars | PySpark |
|---|---|
| Int8 | ByteType |
| Int16 | ShortType |
| Int32 | IntegerType |
| Int64 | LongType |
| UInt8 | ShortType |
| UInt16 | IntegerType |
| UInt32 | LongType |
| Float32 | FloatType |
| Float64 | DoubleType |
| Boolean | BooleanType |
| String / Utf8 | StringType |
| Date | DateType |
| Datetime | TimestampType |
| Decimal | DecimalType |
| Binary | BinaryType |
| Null | NullType |
| Categorical | StringType |
| Enum | StringType |
| Int128 | DecimalType |
PySpark Types:
| PySpark | Polars |
|---|---|
| ByteType | Int8 |
| ShortType | Int32 |
| IntegerType | Int32 |
| LongType | Int64 |
| FloatType | Float32 |
| DoubleType | Float64 |
| BooleanType | Boolean |
| StringType | String |
| VarcharType | String |
| CharType | String |
| DateType | Date |
| TimestampType | Datetime |
| TimestampNTZType | Datetime |
| DecimalType | Decimal |
| BinaryType | Binary |
| NullType | Null |
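The two tables are not exact inverses of each other, so some types change when a schema goes from Polars to PySpark and back. The sketch below copies the rows above into plain dictionaries of type names (strings, not real Polars or PySpark objects) and lists the names that do not survive a round trip. The dictionaries and the round_trip helper are illustrative only and are not part of charmander.

```python
# Type names copied from the tables above, as plain strings.
POLARS_TO_PYSPARK = {
    "Int8": "ByteType",
    "Int16": "ShortType",
    "Int32": "IntegerType",
    "Int64": "LongType",
    "UInt8": "ShortType",
    "UInt16": "IntegerType",
    "UInt32": "LongType",
    "Float32": "FloatType",
    "Float64": "DoubleType",
    "Boolean": "BooleanType",
    "String": "StringType",
    "Date": "DateType",
    "Datetime": "TimestampType",
    "Decimal": "DecimalType",
    "Binary": "BinaryType",
    "Null": "NullType",
    "Categorical": "StringType",
    "Enum": "StringType",
    "Int128": "DecimalType",
}

PYSPARK_TO_POLARS = {
    "ByteType": "Int8",
    "ShortType": "Int32",
    "IntegerType": "Int32",
    "LongType": "Int64",
    "FloatType": "Float32",
    "DoubleType": "Float64",
    "BooleanType": "Boolean",
    "StringType": "String",
    "VarcharType": "String",
    "CharType": "String",
    "DateType": "Date",
    "TimestampType": "Datetime",
    "TimestampNTZType": "Datetime",
    "DecimalType": "Decimal",
    "BinaryType": "Binary",
    "NullType": "Null",
}


def round_trip(polars_type: str) -> str:
    """Map a Polars type name to PySpark and back, following the tables."""
    return PYSPARK_TO_POLARS[POLARS_TO_PYSPARK[polars_type]]


# Names whose round trip yields a different Polars type name.
changed = sorted(name for name in POLARS_TO_PYSPARK if round_trip(name) != name)
print(changed)
```

Unsigned integers, Categorical, Enum, and Int128 come back as a different type because several Polars types share one PySpark target. According to the tables as written, Int16 also changes, since ShortType is listed as mapping to Int32.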
Complex Types
- Arrays/Lists: Fully supported with nested arrays
- Structs: Fully supported with nested structs
- Maps: PySpark MapType converts to Polars Struct (with key and value fields)
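A minimal sketch of how a recursive converter could treat nested types, assuming a toy representation in which primitives are strings and complex types are tuples such as ("array", inner) or ("map", key, value). The convert_to_polars_like function and the tuple encoding are hypothetical; they only illustrate the map-to-struct rule described above. charmander itself works on real PySpark and Polars type objects and may be implemented differently.

```python
def convert_to_polars_like(spark_type):
    """Recursively rewrite a toy PySpark-style type into a toy Polars-style type."""
    if isinstance(spark_type, str):
        # Primitive names pass through unchanged in this toy model.
        return spark_type

    kind = spark_type[0]
    if kind == "array":
        # Arrays become lists; the element type is converted recursively.
        return ("list", convert_to_polars_like(spark_type[1]))
    if kind == "struct":
        # Structs keep their field names; each field type is converted recursively.
        fields = [(name, convert_to_polars_like(t)) for name, t in spark_type[1]]
        return ("struct", fields)
    if kind == "map":
        # Maps have no direct counterpart here, so they become a struct
        # with "key" and "value" fields.
        _, key_type, value_type = spark_type
        return (
            "struct",
            [
                ("key", convert_to_polars_like(key_type)),
                ("value", convert_to_polars_like(value_type)),
            ],
        )
    raise TypeError(f"Unsupported toy type: {spark_type!r}")


# An array of maps from string to array of long.
nested = ("array", ("map", "string", ("array", "long")))
result = convert_to_polars_like(nested)
print(result)
```

Handling each container by recursing into its element types is what lets arbitrarily deep nesting work with one small function.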
Limitations
Type Conversions with Information Loss
Some type conversions result in information loss or semantic changes:
- UInt64 → LongType: PySpark doesn't support unsigned 64-bit integers, so UInt64 maps to signed LongType. Values greater than 2^63 - 1 exceed the LongType range and may cause issues.
- Duration → StringType: Polars Duration types are converted to PySpark StringType as PySpark doesn't have a native duration type. The semantic meaning is lost.
- Time → TimestampType: Polars Time types are converted to PySpark TimestampType, which may not be the ideal representation.
- Decimal precision/scale: When converting Polars Decimal to PySpark DecimalType, default precision (10) and scale (0) are used. Precision and scale information is not preserved when converting from PySpark to Polars.
- MapType → Struct: PySpark MapType is converted to a Polars Struct with key and value fields. This changes the data structure from a map to a struct representation.
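To make the UInt64 limitation concrete, the sketch below uses only Python's standard struct module to show why values above 2^63 - 1 are a problem for a signed 64-bit type. The fits_in_long and reinterpret_as_signed helpers are hypothetical names for this illustration. The code shows generic two's-complement reinterpretation; it does not show what charmander or Spark actually does with such values, which may be to raise an error instead.

```python
import struct

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def fits_in_long(value: int) -> bool:
    """Return True if the value is representable as a signed 64-bit integer."""
    return -(2**63) <= value <= INT64_MAX


def reinterpret_as_signed(value: int) -> int:
    """Reinterpret the 8 bytes of an unsigned 64-bit integer as a signed one."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


big = 2**63  # a valid unsigned 64-bit value, one past INT64_MAX
print(fits_in_long(big))
print(reinterpret_as_signed(big))
```

A range check like fits_in_long on the actual column values, before data is moved, is one way to detect the problem early.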
Nullability
- Polars → PySpark: All fields are created with nullable=True, as Polars schemas don't explicitly track nullability at the schema definition level.
- PySpark → Polars: The nullable attribute from PySpark StructField is not preserved, as Polars schemas don't track nullability per field. All Polars fields can contain nulls by default.
Input Validation
Charmander validates schemas before conversion:
- Duplicate field names: Raises SchemaError if duplicate field names are detected
- Empty field names: Raises SchemaError if any field name is an empty string
- Invalid field types: Raises SchemaError if field types are None
- Invalid field name types: Raises SchemaError if field names are not strings
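The four checks above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not charmander's implementation: the SchemaError class here is a local stand-in, validate_fields is a hypothetical helper, and the input is a list of (name, dtype) pairs rather than a dict, because a Python dict cannot hold duplicate keys.

```python
class SchemaError(ValueError):
    """Local stand-in for charmander's SchemaError (hypothetical definition)."""


def validate_fields(fields):
    """Check a list of (name, dtype) pairs against the four rules above."""
    seen = set()
    for name, dtype in fields:
        if not isinstance(name, str):
            raise SchemaError(
                f"Field name must be a string, got {type(name).__name__}"
            )
        if name == "":
            raise SchemaError("Field name must not be empty")
        if name in seen:
            raise SchemaError(f"Duplicate field name: {name!r}")
        if dtype is None:
            raise SchemaError(f"Field {name!r} has no type")
        seen.add(name)


# A valid schema passes silently.
validate_fields([("name", "String"), ("age", "Int32")])

# A duplicate name is reported with a descriptive message.
try:
    validate_fields([("name", "String"), ("name", "Int32")])
except SchemaError as e:
    print(f"Invalid schema: {e}")
```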
Datetime Timezone Handling
- Polars Datetime types can have timezone information (e.g., pl.Datetime(time_unit="ms", time_zone="UTC"))
- When converting to PySpark TimestampType, timezone information is not preserved
- TimestampNTZType (PySpark 3.4+) is converted to Polars Datetime without timezone information
- The timezone metadata is lost in conversion, but the timestamp value is preserved
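As an analogy for "metadata lost, value preserved", the sketch below uses only Python's standard datetime module. It stores a timezone-aware value as microseconds since the epoch, which drops the zone label, and then reads it back in UTC. This is an illustration of the general idea at the value level and makes no claim about how charmander, Polars, or Spark store timestamps internally.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# A timezone-aware value at UTC+02:00.
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))

# Keep only the instant, as integer microseconds since the epoch.
# The zone label is not part of this number.
epoch_us = (aware - EPOCH) // timedelta(microseconds=1)

# Read the instant back; with no stored zone, UTC is used here as a default.
restored = EPOCH + timedelta(microseconds=epoch_us)

print(restored == aware)                          # True: same instant
print(restored.utcoffset() == aware.utcoffset())  # False: the zone label differs
```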
Advanced Examples
Nested Structures
import polars as pl
from charmander import to_pyspark_schema
# Define a nested Polars schema
polars_schema = {
    "user": pl.Struct([
        pl.Field("name", pl.String),
        pl.Field("address", pl.Struct([
            pl.Field("street", pl.String),
            pl.Field("city", pl.String),
            pl.Field("zip", pl.Int32),
        ])),
    ]),
}
pyspark_schema = to_pyspark_schema(polars_schema)
Arrays with Nested Types
import polars as pl
from charmander import to_pyspark_schema
# Nested arrays
polars_schema = {
    "matrix": pl.List(pl.List(pl.Float64)),
    "tags": pl.List(pl.String),
}
pyspark_schema = to_pyspark_schema(polars_schema)
Round-Trip Conversion
import polars as pl
from charmander import to_pyspark_schema, to_polars_schema
# Start with Polars schema
original = {
    "name": pl.String,
    "age": pl.Int32,
    "scores": pl.List(pl.Float64),
}
# Convert to PySpark and back
pyspark = to_pyspark_schema(original)
converted_back = to_polars_schema(pyspark)
# Verify types match
assert converted_back["name"] == original["name"]
assert converted_back["age"] == original["age"]
Error Handling
Charmander provides clear error messages through custom exceptions:
import polars as pl
from charmander import ConversionError, SchemaError, UnsupportedTypeError, to_pyspark_schema
# A schema with an empty field name, which input validation rejects
invalid_schema = {"": pl.String}
try:
    schema = to_pyspark_schema(invalid_schema)
except SchemaError as e:
    print(f"Invalid schema: {e}")
except UnsupportedTypeError as e:
    print(f"Unsupported type: {e}")
except ConversionError as e:
    print(f"Conversion error: {e}")
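The order of the except clauses matters if the exceptions form a hierarchy. The sketch below assumes, without confirmation from the charmander source, that ConversionError is the base class of SchemaError and UnsupportedTypeError. The class definitions and the classify helper are hypothetical stand-ins defined locally, not imports from the package. Under that assumption, the specific handlers have to come before the general one, as in the example above.

```python
class ConversionError(Exception):
    """Assumed base class (local stand-in, not imported from charmander)."""


class SchemaError(ConversionError):
    """Assumed subclass for structurally invalid schemas."""


class UnsupportedTypeError(ConversionError):
    """Assumed subclass for types that cannot be converted."""


def classify(exc: Exception) -> str:
    """Route an exception through handlers ordered from specific to general."""
    try:
        raise exc
    except SchemaError:
        return "invalid schema"
    except UnsupportedTypeError:
        return "unsupported type"
    except ConversionError:
        # Catches the base class and any subclass not handled above.
        return "conversion error"


print(classify(SchemaError("duplicate field")))
print(classify(UnsupportedTypeError("unknown dtype")))
print(classify(ConversionError("generic failure")))
```

If the general handler came first, it would also catch the two subclasses and the more specific messages would never be printed.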
API Reference
to_pyspark_schema(polars_schema)
Convert a Polars schema to a PySpark StructType.
Parameters:
- polars_schema (dict or pl.Schema): Polars schema as a dictionary mapping field names to types, or a polars.Schema object
Returns:
- pyspark.sql.types.StructType: PySpark schema
Raises:
- SchemaError: If the schema structure is invalid
- UnsupportedTypeError: If a type cannot be converted
to_polars_schema(pyspark_schema)
Convert a PySpark StructType to a Polars schema dictionary.
Parameters:
- pyspark_schema (pyspark.sql.types.StructType): PySpark schema
Returns:
- dict: Dictionary mapping field names to Polars types
Raises:
- SchemaError: If the schema structure is invalid
- UnsupportedTypeError: If a type cannot be converted
Development
Running Tests
pip install -e ".[dev]"
pytest
Project Structure
charmander/
├── charmander/
│ ├── __init__.py # Public API
│ ├── converters.py # Core conversion functions
│ ├── type_mappings.py # Type mapping dictionaries
│ └── errors.py # Custom exceptions
├── tests/
│ ├── test_converters.py # Conversion tests
│ └── test_type_mappings.py # Type mapping tests
└── pyproject.toml # Package configuration
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Inspiration
This project is inspired by poldantic, which provides similar functionality for converting between Pydantic models and Polars schemas.