Conversion between PySpark and Polars DataFrames

sparkpolars

sparkpolars is a lightweight library designed for seamless conversions between Apache Spark and Polars without unnecessary dependencies. (Optional dependencies such as Pandas and PyArrow are needed only when the corresponding conversion mode is explicitly requested.)

Installation

pip install sparkpolars  # Waiting for the first release

Requirements

Python ≥ 3.10
Apache Spark ≥ 3.3.0 (must be pre-installed)
Polars ≥ 1.0 (must be pre-installed)
PySpark must also be installed if you plan to use this library

Why Does This Library Exist?

The Problem

Typical conversions between Spark and Polars often involve an intermediate Pandas step:

# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark

The Solution

sparkpolars eliminates unnecessary dependencies like pandas and pyarrow by leveraging native functions such as .collect() and schema interpretation.
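To make the idea concrete, here is a minimal, library-independent sketch of the row-to-column pivot that a collect-based conversion has to perform. Plain tuples stand in for the rows returned by .collect(), and rows_to_columns is a hypothetical helper written for this illustration, not part of sparkpolars.

```python
from typing import Any


def rows_to_columns(rows: list[tuple[Any, ...]], names: list[str]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into a column-oriented mapping."""
    columns: dict[str, list[Any]] = {name: [] for name in names}
    for row in rows:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} values, got {len(row)}")
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# Plain tuples stand in for collected Spark rows (assumption for this sketch).
collected = [(1, 2), (3, 4)]
columns = rows_to_columns(collected, ["a", "b"])
print(columns)  # {'a': [1, 3], 'b': [2, 4]}
```

A column-oriented mapping like this, paired with an explicit schema, is the kind of input a Polars DataFrame can be built from.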

Key Benefits

🚀 No extra dependencies – No need for Pandas or PyArrow
✅ Reliable handling of complex types – Provides better consistency for MapType, StructType, and nested ArrayType, where existing conversion methods can be unreliable

Features

Convert a Spark DataFrame to a Polars DataFrame or LazyFrame
Ensures schema consistency: preserves LongType as Int64 instead of mistakenly converting to Int32
Three conversion modes: NATIVE, ARROW, PANDAS
NATIVE mode properly converts MapType, StructType, and nested ArrayType
ARROW and PANDAS modes may have limitations with complex types
Configurable conversion settings for Polars list(struct) to Spark MapType
Timezone and time unit customization for Polars Datetime
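
The schema-consistency point can be illustrated with a simple lookup from Spark type names to Polars dtype names. The table below is an assumption made for this sketch (types are written as strings so it runs without Spark or Polars installed); it is not sparkpolars' internal mapping, and map_schema is a hypothetical helper.

```python
# Illustrative mapping from Spark type names to Polars dtype names
# (an assumption for this sketch, not the library's internal table).
SPARK_TO_POLARS = {
    "ByteType": "Int8",
    "ShortType": "Int16",
    "IntegerType": "Int32",
    "LongType": "Int64",  # a 64-bit integer stays 64-bit
    "FloatType": "Float32",
    "DoubleType": "Float64",
    "StringType": "String",
    "BooleanType": "Boolean",
}


def map_schema(spark_schema: dict[str, str]) -> dict[str, str]:
    """Translate {column: Spark type name} into {column: Polars dtype name}."""
    unknown = sorted({t for t in spark_schema.values() if t not in SPARK_TO_POLARS})
    if unknown:
        raise KeyError(f"no mapping for Spark types: {unknown}")
    return {col: SPARK_TO_POLARS[t] for col, t in spark_schema.items()}


print(map_schema({"id": "LongType", "count": "IntegerType"}))
# {'id': 'Int64', 'count': 'Int32'}
```

Failing loudly on an unmapped type, rather than guessing a default, is one way to avoid silent narrowing such as LongType ending up as Int32.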

Usage

1. From Spark to Polars DataFrame

from pyspark.sql import SparkSession
import sparkpolars  # assumption: importing the package is what makes toPolars() available

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars()

2. From Spark to Polars LazyFrame

from pyspark.sql import SparkSession
import sparkpolars  # assumption: importing the package is what makes toPolars() available

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.createDataFrame([(1, 2)], ["a", "b"])

polars_df = df.toPolars(lazy=True)

3. From Polars DataFrame to Spark

from pyspark.sql import SparkSession
from polars import DataFrame
import sparkpolars  # assumption: importing the package is what makes to_spark() available

spark = SparkSession.builder.appName("example").getOrCreate()

df = DataFrame({"a": [1], "b": [2]})  # It can also be a LazyFrame

spark_df = df.to_spark(spark=spark)
# or
spark_df = df.to_spark()  # Falls back to the active SparkSession

4. Using Specific Mode

from sparkpolars import ModeMethod

# Polars -> Spark (here df is a Polars DataFrame)
spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)

# Spark -> Polars (here df is a Spark DataFrame)
polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)

5. Using Config

from sparkpolars import Config

conf = Config(
    map_elements=["column_should_be_converted_to_map_type", ...],  # Specify columns to convert to MapType
    time_unit="ms",  # Literal["ns", "us", "ms"], defaults to "us"
)
# Polars -> Spark (here df is a Polars DataFrame)
spark_df = df.to_spark(config=conf)

# Spark -> Polars (here df is a Spark DataFrame)
polars_df = df.toPolars(config=conf)
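
To show what the time_unit setting means in practice, the sketch below expresses one instant as an integer count since the Unix epoch in each of the three units. It uses only the standard library; to_epoch is a hypothetical helper for illustration and says nothing about how sparkpolars implements the option.

```python
from datetime import datetime, timezone

# Ticks per second for each unit accepted by time_unit.
TICKS_PER_SECOND = {"ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch(value: datetime, time_unit: str = "us") -> int:
    """Return whole ticks since the Unix epoch in the requested unit."""
    delta = value - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    # Integer arithmetic avoids float rounding; "ms" drops sub-millisecond digits.
    return micros * TICKS_PER_SECOND[time_unit] // 1_000_000


ts = datetime(2025, 2, 11, 12, 0, 0, 123456, tzinfo=timezone.utc)
for unit in ("ms", "us", "ns"):
    print(unit, to_epoch(ts, unit))
```

Choosing "ms" loses the microsecond digits, while "ns" adds no information beyond what a microsecond-precision source already carries.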

Known Limitations

JVM Timezone Discrepancy

Spark timestamps are collected through the JVM, whose default timezone may differ from Spark's session timezone setting. If converted timestamps look shifted, check the JVM timezone.
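
The effect can be reproduced without Spark: one instant rendered under two different zone offsets gives two different wall-clock values once the zone information is dropped. In this sketch the fixed offsets are hypothetical stand-ins for the JVM default zone and the Spark session timezone.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for the two settings (hypothetical values).
jvm_zone = timezone(timedelta(hours=2))  # stand-in for the JVM default zone
session_zone = timezone.utc              # stand-in for the Spark session timezone

instant = datetime(2025, 2, 11, 12, 0, tzinfo=timezone.utc)

as_seen_by_jvm = instant.astimezone(jvm_zone).replace(tzinfo=None)
as_seen_by_session = instant.astimezone(session_zone).replace(tzinfo=None)

# Same instant, different wall-clock values once the zone is dropped.
print(as_seen_by_jvm)                       # 2025-02-11 14:00:00
print(as_seen_by_session)                   # 2025-02-11 12:00:00
print(as_seen_by_jvm - as_seen_by_session)  # 2:00:00
```

In a real setup the two values would come from the JVM default zone (for example the user.timezone system property) and Spark's spark.sql.session.timeZone setting; aligning them is the usual way to remove this kind of shift.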

Memory Constraints

Collecting a large dataset pulls all of it into driver memory and can exceed what is available, causing the conversion to fail. The same constraint applies to the Pandas and Arrow paths.
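
A rough pre-check before collecting can be sketched as follows. Everything here is an assumption for illustration: fits_in_driver is a hypothetical helper, the average row size is something you would have to estimate yourself, and the overhead factor is a guess at how much larger Python objects are than the raw data.

```python
def fits_in_driver(
    row_count: int,
    avg_row_bytes: int,
    driver_memory_bytes: int,
    overhead_factor: float = 3.0,  # assumed multiplier for Python object overhead
) -> bool:
    """Return True if the estimated collected size fits in driver memory."""
    estimated_bytes = row_count * avg_row_bytes * overhead_factor
    return estimated_bytes <= driver_memory_bytes


GIB = 1024 ** 3
# 10 million rows of roughly 100 bytes each, checked against two driver sizes.
print(fits_in_driver(10_000_000, 100, 2 * GIB))  # False
print(fits_in_driver(10_000_000, 100, 8 * GIB))  # True
```

The row count could come from the Spark DataFrame's count() method; when the check fails, filtering or aggregating in Spark before converting is the safer route.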

Handling `MapType`:

From Spark to Polars

Given this column in Spark:

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}

In Polars it becomes:

Type: {"example": List(Struct({"key": String, "value": Int32}))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
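
The reshaping of a single map value into a list of key/value structs can be written in a few lines of plain Python. This is a sketch of the data transformation only; map_to_entries is a hypothetical name, not a sparkpolars function.

```python
from typing import Any, Optional


def map_to_entries(mapping: Optional[dict[Any, Any]]) -> Optional[list[dict[str, Any]]]:
    """Turn a map value into a list of {"key", "value"} structs; None stays None."""
    if mapping is None:
        return None
    return [{"key": k, "value": v} for k, v in mapping.items()]


print(map_to_entries({"a": 1, "b": 2}))
# [{'key': 'a', 'value': 1}, {'key': 'b', 'value': 2}]
```

A list of structs is used because Polars has no dedicated map dtype, so the key/value pairs are kept as ordinary nested data.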

From Polars to Spark

Given this column in Polars:

Type: {"example": List(Struct({"key": String, "value": Int32}))}

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

Without any config (the default behavior), it becomes in Spark:

Type: StructField("example", ArrayType(StructType([StructField("key", StringType()), StructField("value", IntegerType())])))

Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]

To convert this data to MapType instead, list the column in map_elements:

from sparkpolars import Config

conf = Config(
    map_elements=["example"]
)
spark_df = df.to_spark(config=conf)

The result in Spark is then:

Type: StructField("example", MapType(StringType(), IntegerType()))

Data: {"a": 1, "b": 2}
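
The opt-in nature of map_elements can be sketched in plain Python: only the columns named in the list are folded from key/value structs into a map, and every other column is left untouched. Both helpers below are hypothetical and only illustrate the data transformation.

```python
from typing import Any, Optional


def entries_to_map(entries: Optional[list[dict[str, Any]]]) -> Optional[dict[Any, Any]]:
    """Fold a list of {"key", "value"} structs into a dict; None stays None."""
    if entries is None:
        return None
    result: dict[Any, Any] = {}
    for entry in entries:
        # In this sketch a later duplicate key overwrites an earlier one.
        result[entry["key"]] = entry["value"]
    return result


def convert_row(row: dict[str, Any], map_elements: list[str]) -> dict[str, Any]:
    """Apply entries_to_map only to the columns listed in map_elements."""
    return {
        col: entries_to_map(value) if col in map_elements else value
        for col, value in row.items()
    }


row = {
    "example": [{"key": "a", "value": 1}, {"key": "b", "value": 2}],
    "other": [{"key": "x", "value": 9}],
}
print(convert_row(row, ["example"]))
# {'example': {'a': 1, 'b': 2}, 'other': [{'key': 'x', 'value': 9}]}
```

The duplicate-key rule here is a choice made for the sketch; how sparkpolars treats repeated keys is not stated in this description.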

License

pending

Contribution

pending

Release history

0.1.1rc18 (pre-release), Jul 15, 2025
0.1.1rc17 (pre-release), Jul 15, 2025
0.1.1rc16 (pre-release), Jul 15, 2025
0.1.1rc15 (pre-release), Jul 15, 2025
0.1.1rc14 (pre-release), Jul 15, 2025
0.1.1rc13 (pre-release), Jul 15, 2025
0.1.1rc12 (pre-release), Jul 15, 2025
0.1.1rc11 (pre-release), Jul 14, 2025
0.1.1rc10 (pre-release), Jul 11, 2025
0.1.1rc9 (pre-release), Jul 11, 2025
0.1.1rc8 (pre-release), Jul 10, 2025
0.1.1rc7 (pre-release), Jul 10, 2025
0.1.1rc6 (pre-release), Jul 10, 2025
0.1.0, Feb 14, 2025
0.0.10, Feb 13, 2025
0.0.9, Feb 12, 2025
0.0.8, Feb 12, 2025
0.0.7, Feb 12, 2025
0.0.5, Feb 12, 2025
0.0.4 (this version), Feb 11, 2025
0.0.3, Feb 11, 2025
0.0.2, Feb 11, 2025
0.0.1, Feb 10, 2025

Download files

Source Distribution

sparkpolars-0.0.4.tar.gz (19.0 kB)

Uploaded Feb 11, 2025 Source

Built Distribution

sparkpolars-0.0.4-py3-none-any.whl (11.2 kB)

Uploaded Feb 11, 2025 Python 3

File details

Details for the file sparkpolars-0.0.4.tar.gz.

File metadata

Download URL: sparkpolars-0.0.4.tar.gz
Upload date: Feb 11, 2025
Size: 19.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for sparkpolars-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`fe08ec59d86206e3026a634551670b2c3bdff2f190f85a13d5f26dd1039ecb3f`
MD5	`94c3d62dd31a8f3b863e0d0ceb6b6cce`
BLAKE2b-256	`6662a9c5aefbc038337b9c8ab0a06947f1ae9ccdc3d626224d1c58eead684586`

File details

Details for the file sparkpolars-0.0.4-py3-none-any.whl.

File metadata

Download URL: sparkpolars-0.0.4-py3-none-any.whl
Upload date: Feb 11, 2025
Size: 11.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for sparkpolars-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`89f0ab586a63ec8d114cdb49b95389600be1d6d0a9286ec693125b270849f00e`
MD5	`78dacf3aa455999fd534b50e935008f2`
BLAKE2b-256	`e42fca6b65c798ee0eee98ee19c99e2ec00b91951fed9ee21d514383c8d83715`