Conversion between PySpark and Polars DataFrames

sparkpolars

sparkpolars is a lightweight library for converting between Apache Spark and Polars DataFrames without unnecessary dependencies. (Optional dependencies are required only when explicitly requested.)

Installation

pip install sparkpolars
# or
conda install skandev::sparkpolars

Requirements

  • Python ≥ 3.10
  • Apache Spark ≥ 3.3.0 (must be pre-installed)
  • Polars ≥ 1.0 (must be pre-installed)
  • PySpark must also be installed to use this library

Why Does This Library Exist?

The Problem

Typical conversions between Spark and Polars often involve an intermediate Pandas step:

# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark

The Solution

sparkpolars removes the need for intermediate dependencies such as Pandas and PyArrow by relying on native functions such as .collect() and on schema interpretation.
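To illustrate the general idea of building columns from collected rows plus a schema, here is a minimal pure-Python sketch. It is not the library's implementation: `rows_to_columns` and the `(name, type)` schema format are hypothetical stand-ins for what `.collect()` and a Spark schema would provide.

```python
from typing import Any


def rows_to_columns(
    rows: list[tuple], schema: list[tuple[str, type]]
) -> dict[str, list[Any]]:
    """Pivot row tuples into a dict of columns, casting values per the schema."""
    columns: dict[str, list[Any]] = {name: [] for name, _ in schema}
    for row in rows:
        for (name, py_type), value in zip(schema, row):
            # Keep nulls as None; otherwise cast to the declared type.
            columns[name].append(None if value is None else py_type(value))
    return columns


rows = [(1, "x"), (2, None)]       # hypothetical stand-in for df.collect()
schema = [("a", int), ("b", str)]  # hypothetical stand-in for df.schema
columns = rows_to_columns(rows, schema)
```

A column-oriented dict of this shape is the kind of input a columnar DataFrame constructor can consume directly, which is why no Pandas step is needed in principle.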

Key Benefits

  • 🚀 No extra dependencies – No need for Pandas or PyArrow
  • Reliable handling of complex types – Provides better consistency for MapType, StructType, and nested ArrayType, where existing conversion methods can be unreliable

Features

  • Convert a Spark DataFrame to a Polars DataFrame or LazyFrame
  • Ensures schema consistency: preserves LongType as Int64 instead of mistakenly converting to Int32
  • Three conversion modes: NATIVE, ARROW, PANDAS
  • NATIVE mode properly converts MapType, StructType, and nested ArrayType
  • ARROW and PANDAS modes may have limitations with complex types
  • Configurable conversion settings for Polars list(struct) to Spark MapType
  • Timezone and time unit customization for Polars Datetime
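The Int64 point above matters because Spark's LongType is a 64-bit signed integer, so narrowing it to 32 bits can lose data. The following sketch uses only the standard-library `struct` module (not sparkpolars) to show a value that fits in 64 bits but not in 32.

```python
import struct

big = 2**31  # one more than the largest signed 32-bit integer

# A signed 64-bit slot ("q") holds the value without loss.
packed64 = struct.pack("<q", big)
roundtrip = struct.unpack("<q", packed64)[0]

# A signed 32-bit slot ("i") cannot represent it and raises struct.error.
try:
    struct.pack("<i", big)
    fits_in_int32 = True
except struct.error:
    fits_in_int32 = False
```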

Usage

0. Supercharge Polars and Spark DataFrame

In the __init__.py file at the root of your project, you can attach the conversion functions to the DataFrame classes as methods for ease of use:

from sparkpolars import toPolars, to_spark
from pyspark.sql import DataFrame as SparkDataFrame
from polars import DataFrame as PolarsDataFrame, LazyFrame as PolarsLazyFrame

__all__ = [
    "toPolars",
    "to_spark",
]

SparkDataFrame.toPolars = toPolars
PolarsDataFrame.to_spark = to_spark
PolarsLazyFrame.to_spark = to_spark
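The assignments above rely on a general Python mechanism: a plain function assigned to a class attribute becomes a method, and the instance is passed as its first argument. The sketch below shows the mechanism with hypothetical stand-ins (`Frame`, `to_row_count`). It assumes that `toPolars` and `to_spark` accept the DataFrame as their first positional argument, which the assignments above imply.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame class."""

    def __init__(self, rows):
        self.rows = rows


def to_row_count(frame):
    """Hypothetical stand-in for a conversion function taking the frame first."""
    return len(frame.rows)


# Assigning the function to the class exposes it as a bound method.
Frame.to_row_count = to_row_count

count = Frame([(1, 2), (3, 4)]).to_row_count()
```

Because the attribute is set on the class itself, the method is available on every instance in the process, including instances created before the assignment.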

1. From Spark to Polars DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars()

2. From Spark to Polars LazyFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars(lazy=True)

3. From Polars DataFrame to Spark

from pyspark.sql import SparkSession
from polars import DataFrame

spark = SparkSession.builder.appName("example").getOrCreate()

df = DataFrame({"a": [1], "b": [2]})  # It can also be a LazyFrame

spark_df = df.to_spark(spark=spark)
# or
spark_df = df.to_spark()  # Falls back to the active SparkSession

4. Using Specific Mode

from sparkpolars import ModeMethod

# Here df is a Polars DataFrame or LazyFrame
spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)

# Here df is a Spark DataFrame
polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)

5. Using Config

from sparkpolars import Config

conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # Specify columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf)  # df is a Polars DataFrame or LazyFrame

polars_df = df.toPolars(config=conf)  # df is a Spark DataFrame
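To make the effect of time_unit concrete: a Polars Datetime stores an integer count of units since the Unix epoch, so a coarser unit drops sub-unit precision. The sketch below uses only the standard library. `to_epoch_units` is a hypothetical helper for illustration, not part of sparkpolars.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
UNITS_PER_SECOND = {"ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def to_epoch_units(ts: datetime, time_unit: str) -> int:
    """Return the integer count of `time_unit` since the epoch (floored)."""
    delta = ts - EPOCH
    total_us = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return total_us * UNITS_PER_SECOND[time_unit] // 1_000_000


ts = datetime(2024, 1, 1, 0, 0, 0, 123456, tzinfo=timezone.utc)
as_ms = to_epoch_units(ts, "ms")  # the trailing 456 microseconds are dropped
as_us = to_epoch_units(ts, "us")  # microseconds are preserved
```

Choosing "ms" is therefore lossy for timestamps that carry microsecond detail, while "ns" only widens the integer without adding information that a Python datetime can hold.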

Known Limitations

JVM Timezone Discrepancy

Spark timestamps are collected via the JVM, whose timezone may differ from Spark’s configured timezone. If timestamps appear shifted after conversion, verify the JVM timezone.
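A mismatch of this kind typically shows up as timestamps shifted by the offset between the two zones. The standard-library sketch below reproduces the symptom with a hypothetical +02:00 zone standing in for the JVM default. It does not call Spark. In a real setup, the values to compare would be Spark's `spark.sql.session.timeZone` setting and the JVM's `user.timezone` property.

```python
from datetime import datetime, timedelta, timezone

# One absolute instant, expressed in UTC.
instant = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Wall-clock reading if the instant is rendered in UTC.
utc_wall = instant.replace(tzinfo=None)

# Wall-clock reading if the same instant is rendered in a +02:00 zone
# (hypothetical stand-in for a JVM default timezone).
jvm_zone = timezone(timedelta(hours=2))
jvm_wall = instant.astimezone(jvm_zone).replace(tzinfo=None)

# Once the zone information is dropped, the two readings disagree.
skew = jvm_wall - utc_wall
```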

Memory Constraints

Collecting large datasets into memory can exceed the available driver memory and cause failures. The same constraint applies to the Pandas and Arrow conversion paths.

Handling MapType

From Spark to Polars

If you have in Spark:

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}

In Polars it becomes:

Type: {"example": List(Struct({"key": String, "value": Int32}))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
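The reshaping described above can be sketched in plain Python: each map entry becomes a struct with a "key" field and a "value" field. `map_to_key_value_list` is a hypothetical helper for illustration, not the library's internal function.

```python
def map_to_key_value_list(mapping: dict) -> list[dict]:
    """Turn a map into a list of {"key": ..., "value": ...} structs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


entries = map_to_key_value_list({"a": 1, "b": 2})
```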

From Polars to Spark

If you have in Polars:

Type: {"example": List(Struct({"key": String, "value": Int32}))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

In Spark, without any config (the default behavior), it becomes:

Type: StructField("example", ArrayType(StructType([StructField("key", StringType()), StructField("value", IntegerType())])))

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

If you want this data to be converted to MapType:

from sparkpolars import Config
conf = Config(
    map_elements=["example"]
)

With this config passed to to_spark, the column becomes in Spark:

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}
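Conceptually, the MapType conversion folds the list of key/value structs back into a map. The plain-Python sketch below shows that reshaping. `key_value_list_to_map` is a hypothetical helper, not the library's implementation. In this sketch a repeated key keeps the last value, which is ordinary Python dict behavior and not a statement about how sparkpolars treats duplicates.

```python
def key_value_list_to_map(entries: list[dict]) -> dict:
    """Fold a list of {"key": ..., "value": ...} structs into a map."""
    return {entry["key"]: entry["value"] for entry in entries}


entries = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
mapping = key_value_list_to_map(entries)
```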

License

  • MIT License

Contribution

  • Create an associated issue, or assign yourself to an existing issue
  • Fork the project
  • Install all the dependencies: pip install ".[dev,lint,test]"
  • Install the pre-commit hooks: pre-commit install
  • Develop your feature
  • Unit-test your feature
  • Create a Pull request
