A utility package for converting between PySpark and Polars DataFrames
sparkpl
A lightweight, pandas-free Python package for seamless conversion between PySpark and Polars DataFrames.
Installation
```bash
pip install sparkpl
```
Features
- 🚀 Direct Arrow conversion - Uses native Arrow for maximum performance (Spark 4.0+)
- ⚡ Zero pandas dependency - Pure Polars ↔ Spark conversion
- 🔄 Bidirectional conversion - Seamless data exchange between frameworks
- 🛡️ Type preservation - Maintains data types during conversion
- 📊 Batch processing - Handles large datasets efficiently
- 🔍 Smart logging - Structured logging with loguru
- 🎯 Simple API - Both functional and class-based interfaces
- 💾 Minimal footprint - Lightweight with essential dependencies only
Quick Start
```python
import polars as pl
from pyspark.sql import SparkSession

from sparkpl.converter import spark_to_polars, polars_to_spark

# Initialize Spark
spark = SparkSession.builder.appName("example").getOrCreate()

# Create sample data
spark_df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Convert Spark → Polars
polars_df = spark_to_polars(spark_df)
print(polars_df)

# Convert Polars → Spark
spark_df_back = polars_to_spark(polars_df)
spark_df_back.show()
```
Advanced Usage
Class-based API
```python
from sparkpl.converter import DataFrameConverter

converter = DataFrameConverter(spark)

# With Arrow optimization (default)
polars_df = converter.spark_to_polars(spark_df, use_arrow=True)

# Native fallback for compatibility
polars_df = converter.spark_to_polars(spark_df, use_arrow=False)

# Batch processing for large datasets
polars_df = converter.spark_to_polars(large_spark_df, batch_size=100000)

# Register the result as a temporary view
spark_df = converter.polars_to_spark(polars_df, table_name="my_table")
```
Error Handling
```python
from sparkpl.converter import DataFrameConverterError

try:
    polars_df = spark_to_polars(spark_df)
except DataFrameConverterError as e:
    print(f"Conversion failed: {e}")
```
Logging Configuration
```python
from loguru import logger

# Configure structured logging
logger.add("sparkpl.log", rotation="10 MB", level="INFO")

# Conversions automatically log progress
polars_df = spark_to_polars(spark_df)  # logs conversion metrics
```
Performance
SparkPL automatically selects the optimal conversion method:
- Spark 4.0+: Direct Arrow conversion (`toArrow()` → `createDataFrame(arrow_table)`)
- Older versions: Native collection methods with fallback
- Large datasets: Automatic batching to manage memory
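The batching strategy can be sketched in plain Python. The names and chunking logic below are illustrative, not the package's internals; the idea is simply that a large result set is consumed in fixed-size slices rather than as one intermediate copy:

```python
from typing import Iterator, List, Sequence


def iter_batches(rows: Sequence, batch_size: int) -> Iterator[List]:
    """Yield fixed-size chunks of `rows` so a large result set never
    has to materialize as a single in-memory intermediate."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# Each chunk could then be converted and concatenated on the Polars side.
batches = list(iter_batches(range(250_000), batch_size=100_000))
# 250k rows at batch_size=100k → chunks of 100k, 100k, 50k
```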
Type Support
| Polars Type | Spark Type | Notes |
|---|---|---|
| `pl.Utf8` | `StringType` | |
| `pl.Int32` | `IntegerType` | |
| `pl.Int64` | `LongType` | |
| `pl.Float32` | `FloatType` | |
| `pl.Float64` | `DoubleType` | |
| `pl.Boolean` | `BooleanType` | |
| `pl.Date` | `DateType` | |
| `pl.Datetime` | `TimestampType` | |
| `pl.Binary` | `BinaryType` | |
| `pl.Time` | `StringType` | Converted to string |
| `pl.Duration` | `LongType` | Microseconds |
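The mapping above can be expressed as a plain lookup, which is handy for pre-checking a schema before converting. This dict is transcribed from the table, not imported from the package:

```python
# Polars → Spark type names, transcribed from the table above.
POLARS_TO_SPARK = {
    "Utf8": "StringType",
    "Int32": "IntegerType",
    "Int64": "LongType",
    "Float32": "FloatType",
    "Float64": "DoubleType",
    "Boolean": "BooleanType",
    "Date": "DateType",
    "Datetime": "TimestampType",
    "Binary": "BinaryType",
    "Time": "StringType",      # converted to string
    "Duration": "LongType",    # microseconds
}


def spark_type_for(polars_type_name: str) -> str:
    """Return the Spark type name a Polars type maps to, per the table."""
    return POLARS_TO_SPARK[polars_type_name]
```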
Requirements
- Python >=3.8
- polars >=0.18.0
- pyspark >=3.0.0
- pyarrow >=5.0.0
- loguru >=0.6.0
API Reference
Functions
- `spark_to_polars(spark_df, **kwargs)` - Convert a Spark DataFrame to Polars
- `polars_to_spark(polars_df, **kwargs)` - Convert a Polars DataFrame to Spark
DataFrameConverter Class
- `spark_to_polars(spark_df, use_arrow=True, batch_size=None)`
- `polars_to_spark(polars_df, use_arrow=True, table_name=None)`
- `validate_conversion(original_df, converted_df, check_data=False)`
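A check like `validate_conversion` might compare row and column counts across the two frameworks. The sketch below is hedged and duck-typed (a Spark-style `count()`/`columns` on one side, a Polars-style `height`/`columns` on the other); the package's actual checks may differ:

```python
def validate_conversion(original_df, converted_df, check_data=False):
    """Sketch: verify that a converted frame matches the original's shape.

    Assumes `original_df` exposes Spark-style `count()` and `columns`,
    and `converted_df` exposes Polars-style `height` and `columns`.
    """
    rows_match = original_df.count() == converted_df.height
    cols_match = list(original_df.columns) == list(converted_df.columns)
    ok = rows_match and cols_match
    if check_data and ok:
        # A full data check would collect and compare row values here.
        pass
    return ok


# Demo with minimal stand-ins for the two frame types:
class _SparkLike:
    columns = ["id", "name"]
    def count(self):
        return 2

class _PolarsLike:
    columns = ["id", "name"]
    height = 2

assert validate_conversion(_SparkLike(), _PolarsLike())
```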
Why No Pandas?
SparkPL eliminates pandas dependency for:
- Reduced footprint - Fewer dependencies to manage
- Better performance - Direct conversion without intermediate steps
- Simplified deployment - No pandas version conflicts
- Pure workflow - Stay within Polars/Spark ecosystem
Examples
Basic Conversion
```python
# Sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
spark_df = spark.createDataFrame(data, ["name", "age"])

# Convert and process
polars_df = spark_to_polars(spark_df)
filtered = polars_df.filter(pl.col("age") > 28)
result_spark = polars_to_spark(filtered)
```
Working with Large Data
```python
# Process a large dataset in chunks
converter = DataFrameConverter(spark)
large_polars = converter.spark_to_polars(
    huge_spark_df,
    batch_size=50000,  # process 50k rows at a time
)
```
Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/my-feature`
3. Make changes and add tests
4. Commit: `git commit -am 'Add feature'`
5. Push: `git push origin feature/my-feature`
6. Open a pull request
Development Setup
```bash
git clone https://github.com/yourusername/sparkpl.git
cd sparkpl
pip install -e ".[dev]"
pytest tests/
```
License
MIT License - see LICENSE file.
Support
- Issues: GitHub Issues
- Documentation: Coming soon
- Community: Discussions welcome
Built with ❤️ for the Python data community.
File details
Details for the file sparkpl-2.0.1.tar.gz.
File metadata
- Download URL: sparkpl-2.0.1.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `d6f9fcdad0da8ef490576ad76fb1cd558a07d0bb6f5009218d08df5586d5434a` |
| MD5 | `1d4e1b6743c92b7861b636814031b3d6` |
| BLAKE2b-256 | `b8aa9bd97f55cf2c71dc1c151ad1f4c8ab8d99c48a18713e5b510c9bd80370f0` |
File details
Details for the file sparkpl-2.0.1-py3-none-any.whl.
File metadata
- Download URL: sparkpl-2.0.1-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f44d1946cd9c3ddb9ecc99b6579af5c3f6b8ec0d67890392e2753f58a68a163e` |
| MD5 | `73727ad9b2ac40c17c3e33c5dbacd25b` |
| BLAKE2b-256 | `ad0a4b58241a4c681f7b87613875ce1034251442ce899f9650229d4b34654404` |