Conversion between PySpark and Polars DataFrames
sparkpolars
sparkpolars is a lightweight library designed for seamless conversions between Apache Spark and Polars without unnecessary dependencies. (Dependencies are only required when explicitly requested.)
Installation
pip install sparkpolars
# or
conda install skandev::sparkpolars
Requirements
- Python ≥ 3.10
- Apache Spark ≥ 3.3.0 (must be pre-installed)
- Polars ≥ 1.0 (must be pre-installed)
- PySpark must also be installed if you plan to use this library
Why Does This Library Exist?
The Problem
Typical conversions between Spark and Polars often involve an intermediate Pandas step:
# Traditional approach:
# Spark -> Pandas -> Polars
# or
# Polars -> Pandas -> Spark
The Solution
sparkpolars eliminates unnecessary dependencies like pandas and pyarrow by leveraging native functions such as .collect() and schema interpretation.
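As a rough illustration of what "collect plus schema interpretation" can mean, the sketch below pivots collected rows into columns and translates a schema without touching Pandas or PyArrow. It is plain Python and not sparkpolars' actual implementation: the `rows_to_columns` helper and the `SPARK_TO_POLARS` mapping are hypothetical names, and the rows stand in for what `.collect()` would return.

```python
# Illustrative sketch only: NOT sparkpolars' actual implementation.
# `rows` stands in for the result of Spark's .collect(); `schema` is a list of
# (column name, Spark type name) pairs.

# Hypothetical mapping from Spark type names to Polars dtype names.
SPARK_TO_POLARS = {
    "long": "Int64",
    "integer": "Int32",
    "string": "String",
    "double": "Float64",
}


def rows_to_columns(rows, schema):
    """Pivot row tuples into a column dict and translate the schema."""
    names = [name for name, _ in schema]
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    dtypes = {name: SPARK_TO_POLARS[spark_type] for name, spark_type in schema}
    return columns, dtypes


rows = [(1, "x"), (2, "y")]
schema = [("a", "long"), ("b", "string")]
columns, dtypes = rows_to_columns(rows, schema)
print(columns)  # {'a': [1, 2], 'b': ['x', 'y']}
print(dtypes)   # {'a': 'Int64', 'b': 'String'}
```

Because the dtype comes from the declared schema rather than from inspecting the values, a `long` column maps to `Int64` even when every value would fit in 32 bits.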
Key Benefits
- 🚀 No extra dependencies – No need for Pandas or PyArrow
- ✅ Reliable handling of complex types – Provides better consistency for MapType, StructType, and nested ArrayType, where existing conversion methods can be unreliable
Features
- Convert a Spark DataFrame to a Polars DataFrame or LazyFrame
- Ensures schema consistency: preserves LongType as Int64 instead of mistakenly converting to Int32
- Three conversion modes: NATIVE, ARROW, PANDAS
- NATIVE mode properly converts MapType, StructType, and nested ArrayType
- ARROW and PANDAS modes may have limitations with complex types
- Configurable conversion settings for Polars list(struct) to Spark MapType
- Timezone and time unit customization for Polars Datetime
Usage
0. Supercharge Polars and Spark DataFrames
In the __init__.py file at the root of your project, you can attach the converters as methods for ease of use:
from sparkpolars import toPolars, to_spark
from pyspark.sql import DataFrame as SparkDataFrame
from polars import DataFrame as PolarsDataFrame, LazyFrame as PolarsLazyFrame
__all__ = [
"toPolars",
"to_spark",
]
SparkDataFrame.toPolars = toPolars
PolarsDataFrame.to_spark = to_spark
PolarsLazyFrame.to_spark = to_spark
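The assignments above work because a plain function stored on a class becomes a method whose first argument is bound to the instance. A minimal sketch of that mechanism, using stand-in names (`FakeSparkDataFrame` and `to_columns` are hypothetical and not part of sparkpolars or PySpark):

```python
# Stand-in class: a hypothetical placeholder for a Spark DataFrame.
class FakeSparkDataFrame:
    def __init__(self, rows):
        self.rows = rows


def to_columns(self, lazy=False):
    """Hypothetical converter; `self` is the instance once the function is attached."""
    return {"rows": list(self.rows), "lazy": lazy}


# Same pattern as `SparkDataFrame.toPolars = toPolars`.
FakeSparkDataFrame.to_columns = to_columns

df = FakeSparkDataFrame([(1, 2)])
result = df.to_columns(lazy=True)
print(result)  # {'rows': [(1, 2)], 'lazy': True}
```

The attribute is added to the class for the whole process, so it is usually done once, at import time, which is why the project's root __init__.py is a convenient place.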
1. From Spark to Polars DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])
polars_df = df.toPolars()
2. From Spark to Polars LazyFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])
polars_df = df.toPolars(lazy=True)
3. From Polars DataFrame to Spark
from pyspark.sql import SparkSession
from polars import DataFrame
spark = SparkSession.builder.appName("example").getOrCreate()
df = DataFrame({"a": [1], "b": [2]}) # It can also be a LazyFrame
spark_df = df.to_spark(spark=spark)
# or
spark_df = df.to_spark() # It will try to get the Spark ActiveSession
4. Using Specific Mode
from sparkpolars import ModeMethod
# Polars -> Spark (here df is a Polars DataFrame or LazyFrame)
spark_df = df.to_spark(mode=ModeMethod.NATIVE)
spark_df = df.to_spark(mode=ModeMethod.PANDAS)
spark_df = df.to_spark(mode=ModeMethod.ARROW)
# Spark -> Polars (here df is a Spark DataFrame)
polars_df = df.toPolars(mode=ModeMethod.NATIVE)
polars_df = df.toPolars(mode=ModeMethod.PANDAS)
polars_df = df.toPolars(mode=ModeMethod.ARROW)
5. Using Config
from sparkpolars import Config
conf = Config(
map_elements=["column_should_be_converted_to_map_type", ...], # Specify columns to convert to MapType
time_unit="ms", # Literal["ns", "us", "ms"], defaults to "us"
)
spark_df = df.to_spark(config=conf) # df is a Polars DataFrame or LazyFrame
polars_df = df.toPolars(config=conf) # df is a Spark DataFrame
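To make the effect of time_unit concrete: a Polars Datetime is stored as an integer count of ticks since the Unix epoch, and the unit sets the tick size. The sketch below reproduces that arithmetic in plain Python; the `to_epoch` helper is a hypothetical name for illustration, not a sparkpolars or Polars function.

```python
from datetime import datetime, timezone

# Ticks per second for each supported unit.
UNITS_PER_SECOND = {"ms": 10**3, "us": 10**6, "ns": 10**9}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch(ts, time_unit="us"):
    """Integer count of `time_unit` ticks since the Unix epoch (hypothetical helper)."""
    delta = ts - EPOCH
    # Integer arithmetic avoids float rounding on large timestamps.
    micros = (delta.days * 86_400 + delta.seconds) * 10**6 + delta.microseconds
    return micros * UNITS_PER_SECOND[time_unit] // 10**6


ts = datetime(2024, 1, 1, 0, 0, 0, 123456, tzinfo=timezone.utc)
print(to_epoch(ts, "ms"))  # 1704067200123
print(to_epoch(ts, "us"))  # 1704067200123456
print(to_epoch(ts, "ns"))  # 1704067200123456000
```

Choosing "ms" drops the sub-millisecond digits, so a coarser unit is a lossy choice for timestamps that carry microsecond precision.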
Known Limitations
JVM Timezone Discrepancy
Spark timestamps are collected through the JVM, whose default timezone may differ from Spark's configured timezone settings. If collected timestamps appear shifted, verify the JVM timezone.
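The kind of shift this can produce is easy to reproduce in plain Python. The sketch below is a simplified illustration, not a model of Spark's internals: the two fixed offsets are assumed values chosen for the example, and it only shows that the same wall-clock reading denotes different instants under different timezones.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration only.
session_tz = timezone.utc                  # timezone Spark is configured with
jvm_tz = timezone(timedelta(hours=2))      # default timezone of the JVM

# A naive wall-clock reading with no timezone attached.
wall_clock = datetime(2024, 1, 1, 12, 0, 0)

# The same reading, interpreted under each timezone.
as_session = wall_clock.replace(tzinfo=session_tz)
as_jvm = wall_clock.replace(tzinfo=jvm_tz)

shift = as_session - as_jvm
print(shift)  # 2:00:00
```

A constant offset like this across every collected timestamp is a typical sign that the two timezones disagree.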
Memory Constraints
Collecting a large dataset pulls it entirely into driver memory, which can exceed what is available and cause failures. The same constraint applies to the Pandas and Arrow conversion paths.
Handling MapType:
From Spark to Polars
If you have in Spark:
Type: StructField("example", MapType(StringType(), IntegerType()))
Data: {"a": 1, "b": 2}
Then it will become in Polars:
Type: {"example": List(Struct("key": String, "value": Int32))}
Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
From Polars to Spark
If you have in Polars:
Type: {"example": List(Struct("key": String, "value": Int32))}
Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
Without any config (the default behavior), it will become in Spark:
Type: StructField("example", ArrayType(StructType([StructField("key", StringType()), StructField("value", IntegerType())])))
Data: [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
If you want this data to be converted to MapType:
from sparkpolars import Config
conf = Config(
map_elements=["example"]
)
spark_df = df.to_spark(config=conf)
Then it will become in Spark:
Type: StructField("example", MapType(StringType(), IntegerType()))
Data: {"a": 1, "b": 2}
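The two representations carry the same information, which is why the conversion can go either way. A minimal plain-Python sketch of the value-level round trip follows; `map_to_entries` and `entries_to_map` are hypothetical helper names used for illustration, not sparkpolars functions.

```python
def map_to_entries(mapping):
    """Map value -> list of key/value structs (the Polars-side shape)."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def entries_to_map(entries):
    """List of key/value structs -> map value (the Spark MapType shape)."""
    return {entry["key"]: entry["value"] for entry in entries}


spark_value = {"a": 1, "b": 2}
polars_value = map_to_entries(spark_value)
print(polars_value)  # [{'key': 'a', 'value': 1}, {'key': 'b', 'value': 2}]
print(entries_to_map(polars_value))  # {'a': 1, 'b': 2}
```

Going from a list of structs back to a map is only unambiguous when the column is known to hold key/value pairs, which is consistent with the map_elements setting being opt-in per column.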
License
- MIT License
Contribution
- Create an associated issue, or assign yourself to an existing issue
- Fork the project
- Install all the dependencies
pip install ".[dev,lint,test]"
- Install the pre-commit hooks
pre-commit install
- Develop your feature
- Unit-test your feature
- Create a Pull request