Skip to main content

Conversion between PySpark and Polars DataFrames

Project description

sparkpolars

sparkpolars is a lightweight library designed for seamless conversions between Apache Spark and Polars without unnecessary dependencies. (Dependencies are only required when explicitly requested.)

Installation

pip install sparkpolars  # Waiting for the first release

Requirements

  • Python ≥ 3.10
  • Apache Spark ≥ 3.3.0 (must be pre-installed)
  • Polars ≥ 1.0 (must be pre-installed)
  • Pyspark must also be installed if you plan to use this library

Why Does This Library Exist?

The Problem

Typical conversions between Spark and Polars often involve an intermediate Pandas step:

# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark

The Solution

sparkpolars eliminates unnecessary dependencies like pandas and pyarrow by leveraging native functions such as .collect() and schema interpretation.

Key Benefits

  • 🚀 No extra dependencies – No need for Pandas or PyArrow
  • Reliable handling of complex types – Provides better consistency for MapType, StructType, and nested ArrayType, where existing conversion methods can be unreliable

Features

  • Convert a Spark DataFrame to a Polars DataFrame or LazyFrame
  • Ensures schema consistency: preserves LongType as Int64 instead of mistakenly converting to Int32
  • Three conversion modes: NATIVE, ARROW, PANDAS
  • NATIVE mode properly converts MapType, StructType, and nested ArrayType
  • ARROW and PANDAS modes may have limitations with complex types
  • Configurable conversion settings for Polars list(struct) to Spark MapType
  • Timezone and time unit customization for Polars Datetime

Usage

1. From Spark to Polars DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars()

2. From Spark to Polars LazyFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars(lazy=True)

3. From Polars DataFrame to Spark

from pyspark.sql import SparkSession
from polars import DataFrame

spark = SparkSession.builder.appName("example").getOrCreate()

df = DataFrame({"a": [1], "b": [2]})  # It can also be a LazyDataFrame

spark_df = df.to_spark(spark=spark)
# or 
spark_df = df.to_spark()  # It will try to get the Spark ActiveSession

4. Using Specific Mode

from sparkpolars import ModeMethod

spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)

polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)

5. Using Config

from sparkpolars import Config

conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # Specify columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf)

polars_df = df.toPolars(config=conf)

Known Limitations

JVM Timezone Discrepancy

Spark timestamps are collected via the JVM, which may differ from Spark’s timezone settings. If issues arise, verify the JVM timezone.

Memory Constraints

Collecting large datasets into memory can exceed available driver memory, leading to failures. (as for pandas/arrow)

Handling MapType:

From Spark to Polars

If you have in Spark:

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}

Then it will become in Polars:

Type: {"example": List(Struct("key": String, "value": Int32))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

From Polars to Spark

If you have in Polars:

Type: {"example": List(Struct("key": String, "value": Int32))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

Then it will become in Spark without specifying any config (Default Behavior):

Type: StructField("example", ArrayType(StructType(StructField("key", StringType())), StructField("value", IntegerType())))

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

If you want this data to be converted to MapType:

from sparkpolars import Config
conf = Config(
    map_elements=["example"]
)

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}

License

  • pending

Contribution

  • pending

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sparkpolars-0.0.3.tar.gz (18.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sparkpolars-0.0.3-py3-none-any.whl (11.2 kB view details)

Uploaded Python 3

File details

Details for the file sparkpolars-0.0.3.tar.gz.

File metadata

  • Download URL: sparkpolars-0.0.3.tar.gz
  • Upload date:
  • Size: 18.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for sparkpolars-0.0.3.tar.gz
Algorithm Hash digest
SHA256 8cfcd193566cafcaf81f7e0c8a97d55c9cd572922f108916386b3a9a2dda0cc0
MD5 d473e25df8e82da9baeafb53d12a1c12
BLAKE2b-256 8760ed2990a7725d8993e8da7b668e6488aec8260b66a1840e15bc6ea0597590

See more details on using hashes here.

File details

Details for the file sparkpolars-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: sparkpolars-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 11.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for sparkpolars-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 227ca4294a2eee9f1a43d1d5f936b5fee5226f7def31b2b64e336b4be1b495fa
MD5 23c9bb1a3f7c728f912641eac577266d
BLAKE2b-256 96276b2101b8b2dade05dc4287b1b9464bb87894aae0e70ed46077697df24920

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page