
sparkpl

A lightweight, pandas-free Python package for seamless conversion between PySpark and Polars DataFrames.

Installation

pip install sparkpl

Features

  • 🚀 Direct Arrow conversion - Uses native Arrow for maximum performance (Spark 4.0+)
  • ⚡ Zero pandas dependency - Pure Polars ↔ Spark conversion
  • 🔄 Bidirectional conversion - Seamless data exchange between frameworks
  • 🛡️ Type preservation - Maintains data types during conversion
  • 📊 Batch processing - Handles large datasets efficiently
  • 🔍 Smart logging - Structured logging with loguru
  • 🎯 Simple API - Both functional and class-based interfaces
  • 💾 Minimal footprint - Lightweight with essential dependencies only

Quick Start

import polars as pl
from pyspark.sql import SparkSession
from sparkpl.converter import spark_to_polars, polars_to_spark

# Initialize Spark
spark = SparkSession.builder.appName("example").getOrCreate()

# Create sample data
spark_df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Convert Spark → Polars
polars_df = spark_to_polars(spark_df)
print(polars_df)

# Convert Polars → Spark
spark_df_back = polars_to_spark(polars_df)
spark_df_back.show()

Advanced Usage

Class-based API

from sparkpl.converter import DataFrameConverter

converter = DataFrameConverter(spark)

# With Arrow optimization (default)
polars_df = converter.spark_to_polars(spark_df, use_arrow=True)

# Native fallback for compatibility
polars_df = converter.spark_to_polars(spark_df, use_arrow=False)

# Batch processing for large datasets
polars_df = converter.spark_to_polars(large_spark_df, batch_size=100000)

# Create temporary view
spark_df = converter.polars_to_spark(polars_df, table_name="my_table")

Error Handling

from sparkpl.converter import DataFrameConverterError

try:
    polars_df = spark_to_polars(spark_df)
except DataFrameConverterError as e:
    print(f"Conversion failed: {e}")

Logging Configuration

from loguru import logger

# Configure structured logging
logger.add("sparkpl.log", rotation="10 MB", level="INFO")

# Conversions automatically log progress
polars_df = spark_to_polars(spark_df)  # Logs conversion metrics

Performance

SparkPL automatically selects the optimal conversion method:

  • Spark 4.0+: Direct Arrow conversion (toArrow() / createDataFrame(arrow_table))
  • Older versions: Native collection methods with fallback
  • Large datasets: Automatic batching to manage memory
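
To make the selection logic above concrete, here is a hypothetical sketch of how a version- and size-based dispatch could look. The helper name `choose_conversion_method` and the threshold are illustrative assumptions, not part of sparkpl's actual API:

```python
# Illustrative sketch only: sparkpl's internal dispatch may differ.

def choose_conversion_method(spark_version: str, row_count: int,
                             batch_threshold: int = 1_000_000) -> str:
    """Pick a conversion strategy from the Spark version and dataset size."""
    major = int(spark_version.split(".")[0])
    if row_count > batch_threshold:
        return "batched"   # large datasets: chunked collection to bound memory
    if major >= 4:
        return "arrow"     # Spark 4.0+: direct toArrow() path
    return "native"        # older Spark: row-based fallback

print(choose_conversion_method("4.0.0", 1_000))      # arrow
print(choose_conversion_method("3.5.1", 1_000))      # native
print(choose_conversion_method("4.0.0", 5_000_000))  # batched
```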

Type Support

Polars Type    Spark Type      Notes
pl.Utf8        StringType
pl.Int32       IntegerType
pl.Int64       LongType
pl.Float32     FloatType
pl.Float64     DoubleType
pl.Boolean     BooleanType
pl.Date        DateType
pl.Datetime    TimestampType
pl.Binary      BinaryType
pl.Time        StringType      Converted to string
pl.Duration    LongType        Stored as microseconds
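
For reference, the table above can be expressed as a plain Python mapping of type names. This is purely illustrative (string names only); sparkpl's internal mapping works on actual type objects and may differ:

```python
# Name-level view of the Polars -> Spark type table (illustrative only).
POLARS_TO_SPARK = {
    "Utf8": "StringType",
    "Int32": "IntegerType",
    "Int64": "LongType",
    "Float32": "FloatType",
    "Float64": "DoubleType",
    "Boolean": "BooleanType",
    "Date": "DateType",
    "Datetime": "TimestampType",
    "Binary": "BinaryType",
    "Time": "StringType",      # converted to string
    "Duration": "LongType",    # stored as microseconds
}

print(POLARS_TO_SPARK["Int64"])     # LongType
print(POLARS_TO_SPARK["Duration"])  # LongType
```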

Requirements

  • Python >=3.8
  • polars >=0.18.0
  • pyspark >=3.0.0
  • pyarrow >=5.0.0
  • loguru >=0.6.0

API Reference

Functions

  • spark_to_polars(spark_df, **kwargs) - Convert Spark DataFrame to Polars
  • polars_to_spark(polars_df, **kwargs) - Convert Polars DataFrame to Spark

DataFrameConverter Class

  • spark_to_polars(spark_df, use_arrow=True, batch_size=None)
  • polars_to_spark(polars_df, use_arrow=True, table_name=None)
  • validate_conversion(original_df, converted_df, check_data=False)

Why No Pandas?

SparkPL eliminates pandas dependency for:

  • Reduced footprint - Fewer dependencies to manage
  • Better performance - Direct conversion without intermediate steps
  • Simplified deployment - No pandas version conflicts
  • Pure workflow - Stay within Polars/Spark ecosystem

Examples

Basic Conversion

# Sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
spark_df = spark.createDataFrame(data, ["name", "age"])

# Convert and process
polars_df = spark_to_polars(spark_df)
filtered = polars_df.filter(pl.col("age") > 28)
result_spark = polars_to_spark(filtered)

Working with Large Data

# Process large dataset in chunks
converter = DataFrameConverter(spark)
large_polars = converter.spark_to_polars(
    huge_spark_df, 
    batch_size=50000  # Process 50k rows at a time
)
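
Conceptually, `batch_size` bounds memory by moving rows in fixed-size chunks rather than all at once. A pure-Python sketch of that idea (sparkpl's real batching operates on Spark partitions and Arrow record batches, not Python lists):

```python
# Conceptual illustration of batch_size: yield fixed-size chunks of rows.

def iter_batches(rows, batch_size):
    """Yield successive chunks of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(10))
batches = list(iter_batches(rows, 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```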

Contributing

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/my-feature
  3. Make changes with tests
  4. Commit: git commit -am 'Add feature'
  5. Push: git push origin feature/my-feature
  6. Create pull request

Development Setup

git clone https://github.com/yourusername/sparkpl.git
cd sparkpl
pip install -e ".[dev]"
pytest tests/

License

MIT License - see LICENSE file.

Support

  • Issues: GitHub Issues
  • Documentation: Coming soon
  • Community: Discussions welcome

Built with ❤️ for the Python data community.
