Convert between Polars schemas and PySpark schemas

Charmander

Cross-platform Handling of Array, Recursive, Mapping, And Nested Data Exchange Runtime

Convert between Polars schemas and PySpark schemas with ease.

Charmander provides simple functions for converting schemas in both directions between Polars and PySpark, including nested structs, arrays, and maps.

Installation

pip install charmander

Requirements

  • Python >= 3.8
  • polars >= 0.19.0
  • pyspark >= 3.0.0

Quick Start

Converting Polars Schema to PySpark

import polars as pl
from charmander import to_pyspark_schema

# Define a Polars schema
polars_schema = {
    "name": pl.String,
    "age": pl.Int32,
    "score": pl.Float64,
    "tags": pl.List(pl.String),
}

# Convert to PySpark schema
pyspark_schema = to_pyspark_schema(polars_schema)
print(pyspark_schema)
# StructType([StructField('name', StringType(), True),
#             StructField('age', IntegerType(), True),
#             StructField('score', DoubleType(), True),
#             StructField('tags', ArrayType(StringType(), True), True)])

Converting PySpark Schema to Polars

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
from charmander import to_polars_schema

# Define a PySpark schema
pyspark_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("score", DoubleType()),
    StructField("tags", ArrayType(StringType())),
])

# Convert to Polars schema
polars_schema = to_polars_schema(pyspark_schema)
print(polars_schema)
# Prints a dict mapping field names to Polars types, e.g.
# {'name': String, 'age': Int32, 'score': Float64, 'tags': List(String)}

Features

  • Bidirectional Conversion: Convert schemas in both directions
  • Comprehensive Type Support: Covers the primitive and complex types listed under Supported Types
  • Nested Structures: Handles deeply nested structs, arrays, and maps
  • Type Safety: Clear error messages for unsupported types
  • Simple API: Two stateless conversion functions

Supported Types

Primitive Types

Polars           PySpark
Int8             ByteType
Int16            ShortType
Int32            IntegerType
Int64            LongType
UInt8            ShortType
UInt16           IntegerType
UInt32           LongType
Float32          FloatType
Float64          DoubleType
Boolean          BooleanType
String / Utf8    StringType
Date             DateType
Datetime         TimestampType
Decimal          DecimalType
Binary           BinaryType
Null             NullType
Categorical      StringType
Enum             StringType
Int128           DecimalType

PySpark Types:

PySpark            Polars
ByteType           Int8
ShortType          Int32
IntegerType        Int32
LongType           Int64
FloatType          Float32
DoubleType         Float64
BooleanType        Boolean
StringType         String
VarcharType        String
CharType           String
DateType           Date
TimestampType      Datetime
TimestampNTZType   Datetime
DecimalType        Decimal
BinaryType         Binary
NullType           Null
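
The two tables are not exact inverses, so some Polars types come back changed after a round trip. The sketch below copies the tables into plain dictionaries, with strings standing in for the real type classes. It does not import or call Charmander, and the names POLARS_TO_PYSPARK, PYSPARK_TO_POLARS, and round_trip are made up for illustration. It lists the Polars types that do not survive Polars → PySpark → Polars according to the tables as written.

```python
# Type names as plain strings, copied from the two tables above.
# Illustrative stand-ins only; these are not Charmander's internal mappings.
POLARS_TO_PYSPARK = {
    "Int8": "ByteType", "Int16": "ShortType", "Int32": "IntegerType",
    "Int64": "LongType", "UInt8": "ShortType", "UInt16": "IntegerType",
    "UInt32": "LongType", "Float32": "FloatType", "Float64": "DoubleType",
    "Boolean": "BooleanType", "String": "StringType", "Date": "DateType",
    "Datetime": "TimestampType", "Decimal": "DecimalType",
    "Binary": "BinaryType", "Null": "NullType",
    "Categorical": "StringType", "Enum": "StringType", "Int128": "DecimalType",
}
PYSPARK_TO_POLARS = {
    "ByteType": "Int8", "ShortType": "Int32", "IntegerType": "Int32",
    "LongType": "Int64", "FloatType": "Float32", "DoubleType": "Float64",
    "BooleanType": "Boolean", "StringType": "String", "VarcharType": "String",
    "CharType": "String", "DateType": "Date", "TimestampType": "Datetime",
    "TimestampNTZType": "Datetime", "DecimalType": "Decimal",
    "BinaryType": "Binary", "NullType": "Null",
}


def round_trip(polars_type: str) -> str:
    """Map a Polars type name to PySpark and back using the tables."""
    return PYSPARK_TO_POLARS[POLARS_TO_PYSPARK[polars_type]]


# Types whose name changes after Polars -> PySpark -> Polars.
changed = sorted(t for t in POLARS_TO_PYSPARK if round_trip(t) != t)
print(changed)
# ['Categorical', 'Enum', 'Int128', 'Int16', 'UInt16', 'UInt32', 'UInt8']
```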

Complex Types

  • Arrays/Lists: Fully supported with nested arrays
  • Structs: Fully supported with nested structs
  • Maps: PySpark MapType converts to Polars Struct (with key and value fields)
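
Nested types lend themselves to recursion: look up the leaf types in a table and rebuild the containers around them. The following is a toy sketch of that idea, not Charmander's implementation. The tuple encoding ("list", inner) and ("struct", {...}), the LEAF_TYPES table, and the function name convert are all invented for illustration, and maps are left out.

```python
# A few leaf types, written as strings so the sketch needs no third-party imports.
LEAF_TYPES = {"String": "StringType", "Int32": "IntegerType", "Float64": "DoubleType"}


def convert(dtype):
    """Recursively translate a toy Polars-like type into a toy PySpark-like type."""
    if isinstance(dtype, str):
        # Leaf type: a simple table lookup.
        try:
            return LEAF_TYPES[dtype]
        except KeyError:
            raise ValueError(f"unsupported type: {dtype}") from None
    kind, inner = dtype
    if kind == "list":
        # Convert the element type, then wrap it again.
        return ("ArrayType", convert(inner))
    if kind == "struct":
        # Convert each field's type, keeping field names and order.
        return ("StructType", {name: convert(t) for name, t in inner.items()})
    raise ValueError(f"unsupported container: {kind}")


matrix = ("list", ("list", "Float64"))
user = ("struct", {"name": "String", "address": ("struct", {"zip": "Int32"})})

print(convert(matrix))
# ('ArrayType', ('ArrayType', 'DoubleType'))
print(convert(user))
# ('StructType', {'name': 'StringType', 'address': ('StructType', {'zip': 'IntegerType'})})
```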

Limitations

Type Conversions with Information Loss

Some type conversions result in information loss or semantic changes:

  • UInt64 → LongType: PySpark has no unsigned 64-bit integer type, so UInt64 maps to the signed LongType. Values greater than 2^63 - 1 are outside the LongType range.

  • Duration → StringType: Polars Duration types are converted to PySpark StringType as PySpark doesn't have a native duration type. The semantic meaning is lost.

  • Time → TimestampType: Polars Time types are converted to PySpark TimestampType. A Time is a time of day with no date, whereas a timestamp includes a date, so the mapping is only approximate.

  • Decimal precision/scale: When converting Polars Decimal to PySpark DecimalType, default precision (10) and scale (0) are used. Precision and scale information is not preserved when converting from PySpark to Polars.

  • MapType → Struct: PySpark MapType is converted to a Polars Struct with key and value fields. This changes the data structure from a map to a struct representation.
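
The UInt64 case can be checked with plain integer arithmetic. The sketch below uses only the standard library. fits_in_long and reinterpret_as_signed are hypothetical helpers, and the struct call only shows how the same 64 bits read as a signed value. It makes no claim about what Charmander or Spark does with out-of-range data.

```python
import struct

INT64_MAX = 2**63 - 1   # largest value a signed 64-bit integer can hold
UINT64_MAX = 2**64 - 1  # largest value an unsigned 64-bit integer can hold


def fits_in_long(value: int) -> bool:
    """Return True if an unsigned 64-bit value is also a valid signed 64-bit value."""
    return 0 <= value <= INT64_MAX


def reinterpret_as_signed(value: int) -> int:
    """Read the 8 bytes of an unsigned 64-bit value as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


print(fits_in_long(INT64_MAX))            # True
print(fits_in_long(INT64_MAX + 1))        # False
print(reinterpret_as_signed(UINT64_MAX))  # -1
```

Running a check like fits_in_long on a column's maximum before moving data is one way to spot values that would not fit.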

Nullability

  • Polars → PySpark: All fields are created with nullable=True, as Polars schemas don't explicitly track nullability at the schema definition level.

  • PySpark → Polars: The nullable attribute from PySpark StructField is not preserved, as Polars schemas don't track nullability per field. All Polars fields can contain nulls by default.
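
The effect on a round trip can be pictured with a toy model. Here a PySpark-style field is a (name, dtype, nullable) record and a Polars-style schema is a plain name-to-type dict. Both are simplified stand-ins invented for this sketch, not the real classes, and the function names are hypothetical.

```python
from typing import NamedTuple


class SparkField(NamedTuple):
    """Toy stand-in for a PySpark StructField."""
    name: str
    dtype: str
    nullable: bool = True


def to_polars_like(fields):
    # A name -> type mapping has nowhere to store the nullable flag.
    return {f.name: f.dtype for f in fields}


def to_spark_like(schema):
    # With no flag to read, every field comes back as nullable=True.
    return [SparkField(name, dtype, True) for name, dtype in schema.items()]


original = [SparkField("id", "long", nullable=False), SparkField("name", "string")]
restored = to_spark_like(to_polars_like(original))

print(original[0].nullable)  # False
print(restored[0].nullable)  # True
```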

Input Validation

Charmander validates schemas before conversion:

  • Duplicate field names: Raises SchemaError if duplicate field names are detected
  • Empty field names: Raises SchemaError if any field name is an empty string
  • Invalid field types: Raises SchemaError if field types are None
  • Invalid field name types: Raises SchemaError if field names are not strings
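
The four checks can be sketched in a few lines. This is an illustration of the rules listed above, not Charmander's code. SchemaError is redefined locally so the snippet runs on its own, validate_fields is a hypothetical helper, the error messages are invented, and fields are passed as (name, dtype) pairs so that duplicate names can be expressed.

```python
class SchemaError(ValueError):
    """Local stand-in for charmander.SchemaError."""


def validate_fields(fields):
    """Check a sequence of (name, dtype) pairs against the four rules."""
    seen = set()
    for name, dtype in fields:
        if not isinstance(name, str):
            raise SchemaError(f"field name must be a string, got {type(name).__name__}")
        if name == "":
            raise SchemaError("field name must not be empty")
        if name in seen:
            raise SchemaError(f"duplicate field name: {name!r}")
        if dtype is None:
            raise SchemaError(f"field {name!r} has no type")
        seen.add(name)


# A valid schema passes silently.
validate_fields([("name", "String"), ("age", "Int32")])

# One invalid schema per rule.
bad_schemas = [
    [("a", "Int32"), ("a", "String")],  # duplicate name
    [("", "Int32")],                    # empty name
    [("a", None)],                      # missing type
    [(1, "Int32")],                     # non-string name
]
messages = []
for bad in bad_schemas:
    try:
        validate_fields(bad)
    except SchemaError as exc:
        messages.append(str(exc))
print("\n".join(messages))
```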

Datetime Timezone Handling

  • Polars Datetime types can have timezone information (e.g., pl.Datetime(time_unit="ms", time_zone="UTC"))
  • When converting to PySpark TimestampType, timezone information is not preserved
  • TimestampNTZType (PySpark 3.4+) is converted to Polars Datetime without timezone information
  • The timezone metadata is lost in conversion, but the timestamp value is preserved
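
The difference between losing the timezone label and losing the timestamp value can be shown with the standard library alone. This is an analogy using Python's datetime module, not a demonstration of Charmander or Spark behaviour. Schema conversion changes types, not data.

```python
from datetime import datetime, timedelta, timezone

# A timestamp labelled with a +02:00 offset.
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))

# Normalise to UTC, then drop the zone label.
naive_utc = aware.astimezone(timezone.utc).replace(tzinfo=None)

print(naive_utc)         # 2024-01-01 10:00:00
print(naive_utc.tzinfo)  # None

# Re-attaching UTC recovers the same instant, but the original +02:00 label is gone.
same_instant = naive_utc.replace(tzinfo=timezone.utc) == aware
print(same_instant)      # True
```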

Advanced Examples

Nested Structures

import polars as pl
from charmander import to_pyspark_schema

# Define a nested Polars schema
polars_schema = {
    "user": pl.Struct([
        pl.Field("name", pl.String),
        pl.Field("address", pl.Struct([
            pl.Field("street", pl.String),
            pl.Field("city", pl.String),
            pl.Field("zip", pl.Int32),
        ])),
    ]),
}

pyspark_schema = to_pyspark_schema(polars_schema)

Arrays with Nested Types

import polars as pl
from charmander import to_pyspark_schema

# Nested arrays
polars_schema = {
    "matrix": pl.List(pl.List(pl.Float64)),
    "tags": pl.List(pl.String),
}

pyspark_schema = to_pyspark_schema(polars_schema)

Round-Trip Conversion

import polars as pl
from charmander import to_pyspark_schema, to_polars_schema

# Start with Polars schema
original = {
    "name": pl.String,
    "age": pl.Int32,
    "scores": pl.List(pl.Float64),
}

# Convert to PySpark and back
pyspark = to_pyspark_schema(original)
converted_back = to_polars_schema(pyspark)

# Verify types match
assert converted_back["name"] == original["name"]
assert converted_back["age"] == original["age"]

Error Handling

Charmander provides clear error messages through custom exceptions:

import polars as pl
from charmander import ConversionError, SchemaError, UnsupportedTypeError, to_pyspark_schema

# An empty field name fails input validation
invalid_schema = {"": pl.String}

try:
    schema = to_pyspark_schema(invalid_schema)
except SchemaError as e:
    print(f"Invalid schema: {e}")
except UnsupportedTypeError as e:
    print(f"Unsupported type: {e}")
except ConversionError as e:
    print(f"Conversion error: {e}")
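
The order of the except clauses matters if the specific exceptions share a common base class. This README does not state the class hierarchy. The sketch below assumes, purely for illustration, that SchemaError and UnsupportedTypeError subclass ConversionError, and defines local stand-in classes so it runs without Charmander installed. classify is a hypothetical helper.

```python
# Assumed hierarchy (not confirmed by the README): two specific errors under one base.
class ConversionError(Exception):
    pass


class SchemaError(ConversionError):
    pass


class UnsupportedTypeError(ConversionError):
    pass


def classify(exc: Exception) -> str:
    """Raise exc and report which except clause catches it."""
    try:
        raise exc
    except SchemaError:
        return "schema"
    except UnsupportedTypeError:
        return "unsupported"
    except ConversionError:
        # Catch-all for the family; it must come last or it would shadow the others.
        return "conversion"


print(classify(SchemaError("empty field name")))     # schema
print(classify(UnsupportedTypeError("Duration")))    # unsupported
print(classify(ConversionError("something else")))   # conversion
```

Under that assumption, code that only needs one handler can catch ConversionError alone.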

API Reference

to_pyspark_schema(polars_schema)

Convert a Polars schema to a PySpark StructType.

Parameters:

  • polars_schema (dict or pl.Schema): Polars schema as a dictionary mapping field names to types, or a polars.Schema object

Returns:

  • pyspark.sql.types.StructType: PySpark schema

Raises:

  • SchemaError: If the schema structure is invalid
  • UnsupportedTypeError: If a type cannot be converted

to_polars_schema(pyspark_schema)

Convert a PySpark StructType to a Polars schema dictionary.

Parameters:

  • pyspark_schema (pyspark.sql.types.StructType): PySpark schema

Returns:

  • dict: Dictionary mapping field names to Polars types

Raises:

  • SchemaError: If the schema structure is invalid
  • UnsupportedTypeError: If a type cannot be converted

Development

Running Tests

pip install -e ".[dev]"
pytest

Project Structure

charmander/
├── charmander/
│   ├── __init__.py          # Public API
│   ├── converters.py         # Core conversion functions
│   ├── type_mappings.py      # Type mapping dictionaries
│   └── errors.py             # Custom exceptions
├── tests/
│   ├── test_converters.py    # Conversion tests
│   └── test_type_mappings.py # Type mapping tests
└── pyproject.toml            # Package configuration

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Inspiration

This project is inspired by poldantic, which provides similar functionality for converting between Pydantic models and Polars schemas.
