Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API

pyspark-data-sources

This repository showcases custom Spark data sources built using the new Python Data Source API for the upcoming Apache Spark 4.0 release. For an in-depth understanding of the API, please refer to the API source code.

Installation

pip install pyspark-data-sources

Usage

Note: The following code currently works only with the Apache Spark master branch.

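from pyspark.sql import SparkSession
from pyspark_datasources.github import GithubDataSource

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Register the data source
spark.dataSource.register(GithubDataSource)

# Load data for the apache/spark repository and display it
spark.read.format("github").load("apache/spark").show()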

Contributing

We welcome and appreciate any contributions to enhance and expand the custom data sources. If you're interested in contributing:

  • Add New Data Sources: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
  • Suggest Enhancements: If you have ideas to improve a data source or the API, we'd love to hear them!
  • Report Bugs: Found something that doesn't work as expected? Let us know by opening an issue.
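For anyone considering a new data source, here is a minimal sketch of what one might look like with the Python Data Source API. The names CounterDataSource, CounterReader, the "counter" format name, and the "count" option are hypothetical and are not part of this package. The fallback base classes in the except branch are also an assumption, included only so the reader logic can be tried without Spark 4.0 installed.

```python
try:
    # Available in Apache Spark 4.0 and later
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Stand-in base classes (assumption) so the sketch runs without Spark
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class CounterDataSource(DataSource):
    """Hypothetical example source that emits (id, squared) rows."""

    @classmethod
    def name(cls):
        # Short name used in spark.read.format(...)
        return "counter"

    def schema(self):
        # Schema expressed as a DDL string
        return "id INT, squared INT"

    def reader(self, schema):
        return CounterReader(self.options)


class CounterReader(DataSourceReader):
    def __init__(self, options):
        # Options arrive as strings; "count" is a hypothetical option
        self.count = int(options.get("count", "3"))

    def read(self, partition):
        # Yield one tuple per row, matching the schema above
        for i in range(self.count):
            yield (i, i * i)
```

With a Spark session that supports the API, such a class would be registered the same way as in the Usage section, via spark.dataSource.register(CounterDataSource), and then read with spark.read.format("counter").load().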

Need help or have questions? Don't hesitate to open a new issue, and we'll do our best to assist you.
