Convert between Polars schemas and PySpark schemas

Charmander

Cross-platform Handling of Array, Recursive, Mapping, And Nested Data Exchange Runtime

Convert between Polars schemas and PySpark schemas with ease.

Charmander provides simple, bidirectional conversion functions for transforming schemas between Polars and PySpark. It supports complex types, including nested structs, arrays, and maps; conversions that lose information are listed under Limitations.

Installation

pip install charmander

Requirements

  • Python >= 3.8
  • polars >= 0.19.0
  • pyspark >= 3.0.0

Quick Start

Converting Polars Schema to PySpark

Charmander supports three Polars schema formats - use whichever is most convenient:

import polars as pl
from charmander import to_pyspark_schema

# Format 1: Dictionary
polars_schema_dict = {
    "name": pl.String,
    "age": pl.Int32,
    "score": pl.Float64,
    "tags": pl.List(pl.String),
}

# Format 2: pl.Schema object
polars_schema_schema = pl.Schema({
    "name": pl.String,
    "age": pl.Int32,
    "score": pl.Float64,
    "tags": pl.List(pl.String),
})

# Format 3: List of tuples
polars_schema_list = [
    ("name", pl.String),
    ("age", pl.Int32),
    ("score", pl.Float64),
    ("tags", pl.List(pl.String)),
]

# All three formats work identically!
pyspark_schema = to_pyspark_schema(polars_schema_dict)
# or: to_pyspark_schema(polars_schema_schema)
# or: to_pyspark_schema(polars_schema_list)

print(pyspark_schema)
# StructType([StructField('name', StringType(), True),
#             StructField('age', IntegerType(), True),
#             StructField('score', DoubleType(), True),
#             StructField('tags', ArrayType(StringType(), True), True)])
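One way to accept all three formats is to normalize them to a list of (name, type) pairs before converting. The sketch below is an illustration only, not Charmander's actual implementation: normalize_schema is a hypothetical helper, and plain strings stand in for Polars types so that it runs without Polars installed. It assumes that a pl.Schema can be treated as a mapping of names to types, like a dict.

```python
from collections.abc import Mapping
from typing import Any, List, Tuple


def normalize_schema(schema: Any) -> List[Tuple[str, Any]]:
    """Return the schema as an ordered list of (field_name, type) pairs.

    Hypothetical helper: mappings contribute their items, while any other
    iterable is expected to yield (field_name, type) pairs already.
    """
    if isinstance(schema, Mapping):
        return list(schema.items())
    return [(name, dtype) for name, dtype in schema]


# Strings stand in for Polars types such as pl.String and pl.Int32.
as_dict = {"name": "String", "age": "Int32"}
as_pairs = [("name", "String"), ("age", "Int32")]

# Both inputs normalize to the same ordered list of pairs.
assert normalize_schema(as_dict) == normalize_schema(as_pairs)
```

Normalizing first means the conversion logic only has to handle one shape of input, whichever format the caller passed.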

Converting PySpark Schema to Polars

import polars as pl
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
from charmander import to_polars_schema

# Define a PySpark schema
pyspark_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("score", DoubleType()),
    StructField("tags", ArrayType(StringType())),
])

# Convert to Polars schema
polars_schema = to_polars_schema(pyspark_schema)
print(polars_schema)
# Schema({'name': <class 'polars.datatypes.String'>, 'age': <class 'polars.datatypes.Int32'>, ...})

# Use directly with Polars DataFrame
df = pl.DataFrame({}, schema=polars_schema)

Features

  • Bidirectional Conversion: Convert schemas in both directions (Polars ↔ PySpark)
  • Multiple Schema Formats: Supports pl.Schema, dict[str, pl.DataType], and Iterable[tuple[str, pl.DataType]] formats
  • Native Polars Integration: Returns pl.Schema objects from to_polars_schema for seamless DataFrame integration
  • Comprehensive Type Support: Supports the primitive and complex types listed under Supported Types
  • Nested Structures: Handles deeply nested structs, arrays, and maps
  • Type Safety: Clear error messages for unsupported types
  • Simple API: Functional, stateless functions - easy to use and understand

Supported Types

Primitive Types

Polars          PySpark
Int8            ByteType
Int16           ShortType
Int32           IntegerType
Int64           LongType
UInt8           ShortType
UInt16          IntegerType
UInt32          LongType
Float32         FloatType
Float64         DoubleType
Boolean         BooleanType
String / Utf8   StringType
Date            DateType
Datetime        TimestampType
Decimal         DecimalType
Binary          BinaryType
Null            NullType
Categorical     StringType
Enum            StringType
Int128          DecimalType

PySpark Types:

PySpark           Polars
ByteType          Int8
ShortType         Int32
IntegerType       Int32
LongType          Int64
FloatType         Float32
DoubleType        Float64
BooleanType       Boolean
StringType        String
VarcharType       String
CharType          String
DateType          Date
TimestampType     Datetime
TimestampNTZType  Datetime
DecimalType       Decimal
BinaryType        Binary
NullType          Null
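Because several Polars types share a PySpark target, not every type survives a Polars → PySpark → Polars round trip. The sketch below copies the two tables above into plain dictionaries of type names (strings, not real Polars or PySpark objects) and lists the types that come back different. It reflects the tables as printed here, not an inspection of Charmander's code.

```python
# Type names copied from the two tables above, as plain strings.
POLARS_TO_PYSPARK = {
    "Int8": "ByteType", "Int16": "ShortType", "Int32": "IntegerType",
    "Int64": "LongType", "UInt8": "ShortType", "UInt16": "IntegerType",
    "UInt32": "LongType", "Float32": "FloatType", "Float64": "DoubleType",
    "Boolean": "BooleanType", "String": "StringType", "Date": "DateType",
    "Datetime": "TimestampType", "Decimal": "DecimalType",
    "Binary": "BinaryType", "Null": "NullType",
    "Categorical": "StringType", "Enum": "StringType", "Int128": "DecimalType",
}
PYSPARK_TO_POLARS = {
    "ByteType": "Int8", "ShortType": "Int32", "IntegerType": "Int32",
    "LongType": "Int64", "FloatType": "Float32", "DoubleType": "Float64",
    "BooleanType": "Boolean", "StringType": "String", "VarcharType": "String",
    "CharType": "String", "DateType": "Date", "TimestampType": "Datetime",
    "TimestampNTZType": "Datetime", "DecimalType": "Decimal",
    "BinaryType": "Binary", "NullType": "Null",
}


def round_trip(polars_type: str) -> str:
    """Map a Polars type name to PySpark and back, using the tables only."""
    return PYSPARK_TO_POLARS[POLARS_TO_PYSPARK[polars_type]]


# Types whose name changes after the round trip.
changed = {t: round_trip(t) for t in POLARS_TO_PYSPARK if round_trip(t) != t}

for original, result in changed.items():
    print(f"{original} -> {POLARS_TO_PYSPARK[original]} -> {result}")
```

With the tables as printed, the unsigned integers, Int16, Categorical, Enum, and Int128 come back as a different type; Int16 is in the list because ShortType is shown mapping to Int32.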

Complex Types

  • Arrays/Lists: Fully supported with nested arrays
  • Structs: Fully supported with nested structs
  • Maps: PySpark MapType converts to Polars Struct (with key and value fields)
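Nested types can be handled by recursing into each container and rewriting maps on the way. The sketch below illustrates that idea on a made-up tuple notation; it is not Charmander's implementation, the notation and the spark_to_polars_like helper are hypothetical, and primitive names are left untranslated to keep the example short.

```python
from typing import Any

# Stand-in type notation (hypothetical, for illustration only):
#   "String"                          primitive type name
#   ("array", element)                array / list
#   ("struct", ((name, type), ...))   struct
#   ("map", key, value)               map


def spark_to_polars_like(dtype: Any) -> Any:
    """Recursively rewrite maps as key/value structs; keep everything else."""
    if isinstance(dtype, str):
        return dtype  # primitives are passed through unchanged in this sketch
    kind = dtype[0]
    if kind == "array":
        return ("array", spark_to_polars_like(dtype[1]))
    if kind == "struct":
        return ("struct", tuple((n, spark_to_polars_like(t)) for n, t in dtype[1]))
    if kind == "map":
        return ("struct", (("key", spark_to_polars_like(dtype[1])),
                           ("value", spark_to_polars_like(dtype[2]))))
    raise ValueError(f"unknown type: {dtype!r}")


# An array of maps whose values are structs.
nested = ("array", ("map", "String", ("struct", (("zip", "Int32"),))))
converted = spark_to_polars_like(nested)
print(converted)
```

The map becomes a struct with key and value fields at whatever depth it appears, while the surrounding array and the inner struct keep their shape.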

Limitations

Type Conversions with Information Loss

Some type conversions result in information loss or semantic changes:

  • UInt64 → LongType: PySpark doesn't support unsigned 64-bit integers, so UInt64 maps to signed LongType. Values greater than 2^63 - 1 do not fit in a signed 64-bit integer and cannot be represented in LongType.

  • Duration → StringType: Polars Duration types are converted to PySpark StringType rather than a native interval type (DayTimeIntervalType exists only in PySpark 3.2+), so the semantic meaning is lost.

  • Time → TimestampType: Polars Time types are converted to PySpark TimestampType. A timestamp carries a date component that a time-of-day value does not have, so the representation is not exact.

  • Decimal precision/scale: When converting Polars Decimal to PySpark DecimalType, default precision (10) and scale (0) are used. Precision and scale information is not preserved when converting from PySpark to Polars.

  • MapType → Struct: PySpark MapType is converted to a Polars Struct with key and value fields. This changes the data structure from a map to a struct representation.
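The UInt64 limitation comes down to integer ranges. The sketch below uses plain Python integers (no Polars or PySpark) to show which values of an unsigned 64-bit column would not fit in a signed 64-bit one; check_uint64_column is a hypothetical helper for pre-checking data, not part of Charmander.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1   # largest value a signed 64-bit integer can hold
UINT64_MAX = 2**64 - 1  # largest value an unsigned 64-bit integer can hold


def fits_in_signed_64(value: int) -> bool:
    """True if the value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def check_uint64_column(values):
    """Return the values that would not fit in a signed 64-bit column."""
    return [v for v in values if not fits_in_signed_64(v)]


sample = [0, INT64_MAX, INT64_MAX + 1, UINT64_MAX]
too_large = check_uint64_column(sample)
print(too_large)
```

Running a check like this on the data before moving a UInt64 column to PySpark shows whether the signed target type is safe for that dataset.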

Nullability

  • Polars → PySpark: All fields are created with nullable=True, as Polars schemas don't explicitly track nullability at the schema definition level.

  • PySpark → Polars: The nullable attribute from PySpark StructField is not preserved, as Polars schemas don't track nullability per field. All Polars fields can contain nulls by default.
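The effect on nullability can be sketched with stand-in types. The example below is not Charmander code: SparkLikeField, to_polars_like, and to_spark_like are hypothetical, and type names are plain strings that are not translated. It only shows why a nullable=False flag cannot survive a trip through a name-to-type mapping.

```python
from typing import Dict, List, NamedTuple


class SparkLikeField(NamedTuple):
    """Stand-in for a PySpark StructField: name, type name, nullable flag."""
    name: str
    dtype: str
    nullable: bool = True


def to_polars_like(fields: List[SparkLikeField]) -> Dict[str, str]:
    # A name -> type mapping has no slot for the nullable flag, so it is dropped.
    return {f.name: f.dtype for f in fields}


def to_spark_like(schema: Dict[str, str]) -> List[SparkLikeField]:
    # Every field comes back with nullable=True, as described above.
    return [SparkLikeField(name, dtype, True) for name, dtype in schema.items()]


original = [
    SparkLikeField("id", "LongType", nullable=False),
    SparkLikeField("name", "StringType", nullable=True),
]
round_tripped = to_spark_like(to_polars_like(original))
print(round_tripped)
```

If non-nullable fields matter downstream, the nullable flags need to be recorded separately and reapplied to the PySpark schema after conversion.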

Input Validation

Charmander validates schemas before conversion:

  • Duplicate field names: Raises SchemaError if duplicate field names are detected
  • Empty field names: Raises SchemaError if any field name is an empty string
  • Invalid field types: Raises SchemaError if field types are None
  • Invalid field name types: Raises SchemaError if field names are not strings
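The four checks above can be sketched in a few lines of plain Python. This is an illustration of the rules, not Charmander's implementation: the SchemaError class is a local stand-in, validate_fields and error_for are hypothetical helpers, and the error messages are invented for the example.

```python
class SchemaError(ValueError):
    """Local stand-in for charmander.SchemaError so the sketch runs alone."""


def validate_fields(fields):
    """Raise SchemaError on the first invalid (field_name, type) pair."""
    seen = set()
    for index, (name, dtype) in enumerate(fields):
        if not isinstance(name, str):
            raise SchemaError(f"Field name at index {index} is not a string: {name!r}")
        if name == "":
            raise SchemaError(f"Field name at index {index} is empty")
        if dtype is None:
            raise SchemaError(f"Field {name!r} has no type")
        if name in seen:
            raise SchemaError(f"Duplicate field name found: {name!r}")
        seen.add(name)


def error_for(fields):
    """Return the validation error message, or None if the fields are valid."""
    try:
        validate_fields(fields)
    except SchemaError as exc:
        return str(exc)
    return None


print(error_for([("name", "String"), ("name", "Int32")]))
print(error_for([("name", "String"), ("age", "Int32")]))
```

Validating before conversion means a malformed schema fails with one clear error instead of producing a partially converted result.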

Datetime Timezone Handling

  • Polars Datetime types can have timezone information (e.g., pl.Datetime(time_unit="ms", time_zone="UTC"))
  • When converting to PySpark TimestampType, timezone information is not preserved
  • TimestampNTZType (PySpark 3.4+) is converted to Polars Datetime without timezone information
  • The timezone metadata is lost in conversion, but the timestamp value is preserved

Advanced Examples

Nested Structures

import polars as pl
from charmander import to_pyspark_schema

# Define a nested Polars schema
polars_schema = {
    "user": pl.Struct([
        pl.Field("name", pl.String),
        pl.Field("address", pl.Struct([
            pl.Field("street", pl.String),
            pl.Field("city", pl.String),
            pl.Field("zip", pl.Int32),
        ])),
    ]),
}

pyspark_schema = to_pyspark_schema(polars_schema)

Arrays with Nested Types

import polars as pl
from charmander import to_pyspark_schema

# Nested arrays
polars_schema = {
    "matrix": pl.List(pl.List(pl.Float64)),
    "tags": pl.List(pl.String),
}

pyspark_schema = to_pyspark_schema(polars_schema)

Round-Trip Conversion

import polars as pl
from charmander import to_pyspark_schema, to_polars_schema

# Start with Polars schema (any format works)
original = {
    "name": pl.String,
    "age": pl.Int32,
    "scores": pl.List(pl.Float64),
}

# Convert to PySpark and back
pyspark = to_pyspark_schema(original)
converted_back = to_polars_schema(pyspark)  # Returns pl.Schema

# Verify types match (pl.Schema supports dict-like access)
assert converted_back["name"] == original["name"]
assert converted_back["age"] == original["age"]
assert isinstance(converted_back, pl.Schema)

Error Handling

Charmander provides clear error messages through custom exceptions. All exceptions inherit from ConversionError, so you can catch all conversion errors at once or handle them individually:

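import polars as pl
from charmander import to_pyspark_schema, ConversionError, UnsupportedTypeError, SchemaError

# Placeholder input for the examples below: duplicate field names raise SchemaError
invalid_schema = [("name", pl.String), ("name", pl.Int32)]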

# Example 1: Handle specific error types
try:
    schema = to_pyspark_schema(invalid_schema)
except SchemaError as e:
    print(f"Invalid schema structure: {e}")
    # Handles: duplicate field names, empty field names, invalid field types, etc.
except UnsupportedTypeError as e:
    print(f"Unsupported type: {e}")
    # Handles: types that cannot be converted between Polars and PySpark
except ConversionError as e:
    print(f"General conversion error: {e}")
    # Catches all conversion-related errors (base class)

# Example 2: Catch all conversion errors
try:
    schema = to_pyspark_schema(invalid_schema)
except ConversionError as e:
    print(f"Conversion failed: {e}")
    # This will catch SchemaError, UnsupportedTypeError, and any future error types

# Example 3: Common error scenarios
try:
    # Invalid iterable format
    schema = to_pyspark_schema([("name", pl.String), "invalid"])
except SchemaError as e:
    print(f"Schema validation failed: {e}")
    # Output: "Invalid schema format: <class 'list'>. Expected iterable of (field_name, type) tuples. Item at index 1 is not a tuple: 'invalid'"

try:
    # Duplicate field names
    schema = to_pyspark_schema([("name", pl.String), ("name", pl.Int32)])
except SchemaError as e:
    print(f"Duplicate field: {e}")
    # Output: "Invalid schema format: <class 'list'>. Duplicate field name found: 'name'"

try:
    # Unsupported type (some_unsupported_type is a placeholder for a type Charmander cannot convert)
    schema = to_pyspark_schema({"field": some_unsupported_type})
except UnsupportedTypeError as e:
    print(f"Unsupported type: {e}")
    # Output includes list of supported types
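The order of the except clauses in Example 1 matters: Python uses the first clause that matches, so the subclasses must come before the base class. The sketch below reproduces the hierarchy with local stand-in classes so that it runs without Charmander installed. That SchemaError and UnsupportedTypeError derive directly from ConversionError is an assumption; the text above only says that all exceptions inherit from it.

```python
class ConversionError(Exception):
    """Stand-in base class for all conversion errors."""


class SchemaError(ConversionError):
    """Stand-in for errors about the schema structure."""


class UnsupportedTypeError(ConversionError):
    """Stand-in for errors about types that cannot be converted."""


def classify(exc: Exception) -> str:
    """Report which except clause handles the given exception."""
    try:
        raise exc
    except SchemaError:
        return "schema"
    except UnsupportedTypeError:
        return "unsupported type"
    except ConversionError:
        # Reached only for conversion errors not matched by a clause above.
        return "other conversion error"


print(classify(SchemaError("duplicate field")))
print(classify(UnsupportedTypeError("no mapping")))
print(classify(ConversionError("something else")))
```

If the ConversionError clause were listed first, it would handle every case and the two specific clauses would never run.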

API Reference

to_pyspark_schema(polars_schema)

Convert a Polars schema to a PySpark StructType.

Parameters:

  • polars_schema: Polars schema in any supported format:
    • pl.Schema object
    • dict[str, pl.DataType]: Dictionary mapping field names to types
    • Iterable[tuple[str, pl.DataType]]: Iterable of (field_name, type) tuples (e.g., list or tuple of tuples)

Returns:

  • pyspark.sql.types.StructType: PySpark schema

Raises:

  • SchemaError: If the schema structure is invalid
  • UnsupportedTypeError: If a type cannot be converted

Example:

import polars as pl
from charmander import to_pyspark_schema

# All three formats work:
schema1 = {"name": pl.String, "age": pl.Int32}
schema2 = pl.Schema({"name": pl.String, "age": pl.Int32})
schema3 = [("name", pl.String), ("age", pl.Int32)]

pyspark_schema = to_pyspark_schema(schema1)  # or schema2, or schema3

to_polars_schema(pyspark_schema)

Convert a PySpark StructType to a Polars schema.

Parameters:

  • pyspark_schema (pyspark.sql.types.StructType): PySpark schema

Returns:

  • pl.Schema: Polars Schema object mapping field names to Polars types

Raises:

  • SchemaError: If the schema structure is invalid
  • UnsupportedTypeError: If a type cannot be converted

Example:

import polars as pl
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from charmander import to_polars_schema

pyspark_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

polars_schema = to_polars_schema(pyspark_schema)
# Returns pl.Schema object - use directly with Polars DataFrames
df = pl.DataFrame({}, schema=polars_schema)

Development

Running Tests

pip install -e ".[dev]"
pytest

Project Structure

charmander/
├── charmander/
│   ├── __init__.py          # Public API
│   ├── converters.py         # Core conversion functions
│   ├── type_mappings.py      # Type mapping dictionaries
│   └── errors.py             # Custom exceptions
├── tests/
│   ├── test_converters.py    # Conversion tests
│   └── test_type_mappings.py # Type mapping tests
└── pyproject.toml            # Package configuration

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Inspiration

This project is inspired by poldantic, which provides similar functionality for converting between Pydantic models and Polars schemas.
