Skip to main content

Conversion between PySpark and Polars DataFrames

Project description

sparkpolars

sparkpolars is a lightweight library designed for seamless conversions between Apache Spark and Polars without unnecessary dependencies. (Dependencies are only required when explicitly requested.)

Installation

pip install sparkpolars
# or
conda install skandev::sparkpolars

Requirements

  • Python ≥ 3.10
  • Apache Spark ≥ 3.3.0 (must be pre-installed)
  • Polars ≥ 1.0 (must be pre-installed)
  • Pyspark must also be installed if you plan to use this library

Why Does This Library Exist?

The Problem

Typical conversions between Spark and Polars often involve an intermediate Pandas step:

# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark

The Solution

sparkpolars eliminates unnecessary dependencies like pandas and pyarrow by leveraging native functions such as .collect() and schema interpretation.

Key Benefits

  • 🚀 No extra dependencies – No need for Pandas or PyArrow
  • Reliable handling of complex types – Provides better consistency for MapType, StructType, and nested ArrayType, where existing conversion methods can be unreliable

Features

  • Convert a Spark DataFrame to a Polars DataFrame or LazyFrame
  • Ensures schema consistency: preserves LongType as Int64 instead of mistakenly converting to Int32
  • Three conversion modes: NATIVE, ARROW, PANDAS
  • NATIVE mode properly converts MapType, StructType, and nested ArrayType
  • ARROW and PANDAS modes may have limitations with complex types
  • Configurable conversion settings for Polars list(struct) to Spark MapType
  • Timezone and time unit customization for Polars Datetime

Usage

0. Supercharge Polars and Spark DataFrame

In your __init__.py file at the root project you can do the following for ease of use

from sparkpolars import toPolars, to_spark
from pyspark.sql import DataFrame as SparkDataFrame
from polars import DataFrame as PolarsDataFrame, LazyFrame as PolarsLazyFrame

__all__ = [
    "toPolars",
    "to_spark",
]

SparkDataFrame.toPolars = toPolars
PolarsDataFrame.to_spark = to_spark
PolarsLazyFrame.to_spark = to_spark

1. From Spark to Polars DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars()

2. From Spark to Polars LazyFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars(lazy=True)

3. From Polars DataFrame to Spark

from pyspark.sql import SparkSession
from polars import DataFrame

spark = SparkSession.builder.appName("example").getOrCreate()

df = DataFrame({"a": [1], "b": [2]})  # It can also be a LazyDataFrame

spark_df = df.to_spark(spark=spark)
# or
spark_df = df.to_spark()  # It will try to get the Spark ActiveSession

4. Using Specific Mode

from sparkpolars import ModeMethod

spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)

polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)

5. Using Config

from sparkpolars import Config

conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # Specify columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf)

polars_df = df.toPolars(config=conf)

Known Limitations

JVM Timezone Discrepancy

Spark timestamps are collected via the JVM, which may differ from Spark’s timezone settings. If issues arise, verify the JVM timezone.

Memory Constraints

Collecting large datasets into memory can exceed available driver memory, leading to failures. (as for pandas/arrow)

Handling MapType:

From Spark to Polars

If you have in Spark:

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}

Then it will become in Polars:

Type: {"example": List(Struct("key": String, "value": Int32))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

From Polars to Spark

If you have in Polars:

Type: {"example": List(Struct("key": String, "value": Int32))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

Then it will become in Spark without specifying any config (Default Behavior):

Type: StructField("example", ArrayType(StructType(StructField("key", StringType())), StructField("value", IntegerType())))

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

If you want this data to be converted to MapType:

from sparkpolars import Config
conf = Config(
    map_elements=["example"]
)

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}

License

  • MIT License

Contribution

  • Create an associated issue, or assign yourself to an existing issue
  • Fork the project
  • Install all the dependencies pip install ".[dev,lint,test]
  • Install pre-commit file pre-commit install
  • Develop your feature
  • Unit-test your feature
  • Create a Pull request

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sparkpolars-0.1.1rc11.tar.gz (29.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sparkpolars-0.1.1rc11-py3-none-any.whl (21.0 kB view details)

Uploaded Python 3

File details

Details for the file sparkpolars-0.1.1rc11.tar.gz.

File metadata

  • Download URL: sparkpolars-0.1.1rc11.tar.gz
  • Upload date:
  • Size: 29.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for sparkpolars-0.1.1rc11.tar.gz
Algorithm Hash digest
SHA256 e647285a7a2fb359daf63e4f7547e5c8973c0d50a807bd9302fb34138ce18188
MD5 cc3d6405bb8dbefbce109c3299db4f7c
BLAKE2b-256 bcaab5f745abb7f6b584a7597154202b27d626444723d44fe0de8e0058d74d8b

See more details on using hashes here.

File details

Details for the file sparkpolars-0.1.1rc11-py3-none-any.whl.

File metadata

File hashes

Hashes for sparkpolars-0.1.1rc11-py3-none-any.whl
Algorithm Hash digest
SHA256 fa35a91a7b6cbb6734cadf6de5d7b8f2aa567eee2d9732a376076298b79e5526
MD5 55fdcfb32257bdee199a6cbaba98638f
BLAKE2b-256 346d2b92547403f10352468b2f1adad028a819c3b2b758d056dd862eb6b5965f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page