Conversion between PySpark and Polars DataFrames

sparkpolars

sparkpolars is a lightweight library for converting between Apache Spark and Polars DataFrames without pulling in unnecessary dependencies. Optional dependencies are required only when you explicitly ask for them, for example by choosing the PANDAS or ARROW conversion mode.

Installation

pip install sparkpolars
# or
conda install skandev::sparkpolars

Requirements

  • Python ≥ 3.10
  • Apache Spark ≥ 3.3.0 (must be pre-installed)
  • Polars ≥ 1.0 (must be pre-installed)
  • PySpark must also be installed to use this library

Why Does This Library Exist?

The Problem

Typical conversions between Spark and Polars often involve an intermediate Pandas step:

# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark

The Solution

sparkpolars avoids the Pandas and PyArrow dependencies by relying on native functions such as Spark's .collect() and by interpreting the DataFrame schema directly.
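
The idea of working from collected rows plus a schema, rather than going through Pandas, can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's actual code: `rows_to_columns` is a hypothetical helper, and the literal rows and field names stand in for what `.collect()` and a DataFrame schema would provide.

```python
def rows_to_columns(rows, field_names):
    """Transpose row tuples into a column-oriented dict keyed by field name."""
    columns = {name: [] for name in field_names}
    for row in rows:
        if len(row) != len(field_names):
            raise ValueError("row length does not match the schema")
        for name, value in zip(field_names, row):
            columns[name].append(value)
    return columns


# Stand-ins for collected rows and for the schema's field names (assumed data).
rows = [(1, "x"), (2, "y")]
columns = rows_to_columns(rows, ["a", "b"])
print(columns)
```

A column-oriented dict of this shape is the same kind of input the Polars `DataFrame` constructor accepts in the usage examples.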

Key Benefits

  • 🚀 No extra dependencies – No need for Pandas or PyArrow
  • Reliable handling of complex types – Provides better consistency for MapType, StructType, and nested ArrayType, where existing conversion methods can be unreliable

Features

  • Convert a Spark DataFrame to a Polars DataFrame or LazyFrame
  • Convert a Polars DataFrame or LazyFrame to a Spark DataFrame
  • Ensures schema consistency: preserves LongType as Int64 instead of mistakenly converting to Int32
  • Three conversion modes: NATIVE, ARROW, PANDAS
  • NATIVE mode properly converts MapType, StructType, and nested ArrayType
  • ARROW and PANDAS modes may have limitations with complex types
  • Configurable conversion settings for Polars list(struct) to Spark MapType
  • Timezone and time unit customization for Polars Datetime
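
To see why it matters that LongType maps to Int64 rather than Int32, the following standard-library sketch checks whether a value just above the 32-bit signed range can be packed as a 32-bit or a 64-bit integer. The helper `fits` is hypothetical and exists only for this illustration; it does not use Spark or Polars.

```python
import struct

INT32_MAX = 2**31 - 1
big = INT32_MAX + 1  # fits a 64-bit long, but is one past the 32-bit maximum


def fits(fmt, value):
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


fits_int32 = fits("<i", big)  # "<i" is a 32-bit signed integer
fits_int64 = fits("<q", big)  # "<q" is a 64-bit signed integer
print(fits_int32, fits_int64)
```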

Usage

1. From Spark to Polars DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars()

2. From Spark to Polars LazyFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars(lazy=True)

3. From Polars DataFrame to Spark

from pyspark.sql import SparkSession
from polars import DataFrame

spark = SparkSession.builder.appName("example").getOrCreate()

df = DataFrame({"a": [1], "b": [2]})  # It can also be a LazyFrame

spark_df = df.to_spark(spark=spark)
# or
spark_df = df.to_spark()  # Falls back to the active SparkSession

4. Using a Specific Mode

from sparkpolars import ModeMethod

# Polars -> Spark (df is a Polars DataFrame or LazyFrame)
spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)

# Spark -> Polars (df is a Spark DataFrame)
polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)
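
Since the PANDAS and ARROW modes presumably need their respective packages, one possible way to choose a mode is to check what is installed and fall back to NATIVE otherwise. The sketch below is an assumption-laden illustration: `pick_mode`, `is_installed`, and the `MODE_REQUIREMENTS` mapping are hypothetical and not part of sparkpolars, and the mode names are handled as plain strings.

```python
from importlib.util import find_spec

# Assumed mapping from mode name to the extra package that mode would need.
MODE_REQUIREMENTS = {"NATIVE": None, "ARROW": "pyarrow", "PANDAS": "pandas"}


def is_installed(package):
    """Return True if the top-level package can be found for import."""
    return find_spec(package) is not None


def pick_mode(preferred, installed=is_installed):
    """Return the preferred mode name, or "NATIVE" if its package is missing."""
    required = MODE_REQUIREMENTS[preferred]
    if required is None or installed(required):
        return preferred
    return "NATIVE"


print(pick_mode("ARROW"))
```

The `installed` parameter is injectable so the fallback logic can be exercised without actually installing or removing packages.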

5. Using Config

from sparkpolars import Config

conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # Specify columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf)

polars_df = df.toPolars(config=conf)
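
A Polars Datetime stores each timestamp as an integer count of ticks since the Unix epoch, and the time unit decides how large a tick is. The standard-library sketch below shows the same instant expressed in "ms", "us", and "ns". The helper `epoch_value` is hypothetical and only illustrates the effect of the unit; it is not how sparkpolars performs the conversion.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_value(ts, time_unit="us"):
    """Return ts as an integer count of time_unit ticks since the Unix epoch."""
    micros = (ts - EPOCH) // timedelta(microseconds=1)
    if time_unit == "us":
        return micros
    if time_unit == "ms":
        return micros // 1_000  # sub-millisecond digits are dropped
    if time_unit == "ns":
        return micros * 1_000
    raise ValueError(f"unsupported time unit: {time_unit!r}")


ts = datetime(2024, 1, 1, 0, 0, 0, 123456, tzinfo=timezone.utc)
values = {unit: epoch_value(ts, unit) for unit in ("ms", "us", "ns")}
print(values)
```

Choosing "ms" loses the microsecond digits of the value, which is worth keeping in mind when a round trip back to Spark is expected to preserve them.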

Known Limitations

JVM Timezone Discrepancy

Spark timestamps are collected via the JVM, whose default timezone may differ from Spark’s session timezone setting. If collected timestamps look shifted, verify the JVM timezone.
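
The effect of such a mismatch can be illustrated with the standard library alone: the same instant yields different wall-clock values depending on which timezone renders it. The two timezones below are arbitrary stand-ins chosen for the example (assumptions, not values read from Spark or the JVM).

```python
from datetime import datetime, timedelta, timezone

# One fixed instant in time.
instant = datetime(2024, 1, 1, 0, 30, tzinfo=timezone.utc)

session_tz = timezone.utc               # stand-in for the Spark session timezone
jvm_tz = timezone(timedelta(hours=-5))  # stand-in for the JVM default timezone

# Render the instant in each timezone, then drop the timezone information,
# which is what produces ambiguous "naive" timestamps.
session_wall = instant.astimezone(session_tz).replace(tzinfo=None)
jvm_wall = instant.astimezone(jvm_tz).replace(tzinfo=None)

print(session_wall)
print(jvm_wall)
print(session_wall - jvm_wall)
```

In this example the two naive values are five hours apart and even fall on different calendar days, which is the kind of shift to look for when timestamps seem wrong after collection.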

Memory Constraints

Collecting a large dataset loads it entirely into driver memory; if it exceeds the memory available, the conversion fails. The same limitation applies to the Pandas and Arrow conversion paths.

Handling MapType

From Spark to Polars

Given the following column in Spark:

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}

It becomes the following in Polars:

Type: {"example": List(Struct("key": String, "value": Int32))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
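
The reshaping described above, from a map to a list of key/value structs, can be written in a few lines of plain Python. `map_to_entries` is a hypothetical helper used only to show the data layout; it is not the library's implementation.

```python
def map_to_entries(mapping):
    """Turn a dict into a list of {"key": ..., "value": ...} structs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


entries = map_to_entries({"a": 1, "b": 2})
print(entries)
```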

From Polars to Spark

Given the following column in Polars:

Type: {"example": List(Struct("key": String, "value": Int32))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

By default, without any config, it becomes the following in Spark:

Type: StructField("example", ArrayType(StructType([StructField("key", StringType()), StructField("value", IntegerType())])))

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

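To convert this data to MapType instead, list the column in map_elements:

from sparkpolars import Config

conf = Config(
    map_elements=["example"],  # convert this list(struct) column to MapType
)
spark_df = df.to_spark(config=conf)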

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}
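
The reverse reshaping, from a list of key/value structs back to a map, can be sketched the same way. `entries_to_map` is a hypothetical helper, and rejecting duplicate keys is a design choice made for this sketch rather than documented sparkpolars behavior.

```python
def entries_to_map(entries):
    """Turn a list of {"key": ..., "value": ...} structs into a dict."""
    result = {}
    for entry in entries:
        key = entry["key"]
        if key in result:
            # A map cannot hold the same key twice, so fail loudly here
            # instead of silently keeping one of the values.
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = entry["value"]
    return result


mapping = entries_to_map([{"key": "a", "value": 1}, {"key": "b", "value": 2}])
print(mapping)
```

This also shows why the conversion is opt-in per column: a list of structs may legitimately contain repeated keys, in which case it cannot be represented as a map.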

License

  • pending

Contribution

  • pending
