Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API
pyspark-data-sources
This repository showcases custom Spark data sources built with the new Python Data Source API introduced in the upcoming Apache Spark 4.0 release. For an in-depth understanding of the API, please refer to the API source code. Note that this repository is for demonstration purposes only and is not intended for production use. Contributions and feedback are welcome to help improve the examples.
Installation
pip install pyspark-data-sources[all]
Usage
Install the PySpark 4.0 preview release: https://pypi.org/project/pyspark/4.0.0.dev1/
pip install "pyspark[connect]==4.0.0.dev1"
Or use Databricks Runtime 15.2 or above.
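The examples below assume an active SparkSession named spark. Outside Databricks, a minimal local session can be created like this (the app name is illustrative, not required by the package):
from pyspark.sql import SparkSession

# Start a local Spark session; any app name works here.
spark = SparkSession.builder.appName("pyspark-data-sources-demo").getOrCreate()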
Batch Data Sources
from pyspark_datasources.github import GithubDataSource
# Register the data source
spark.dataSource.register(GithubDataSource)
spark.read.format("github").load("apache/spark").show()
See more here: https://allisonwang-db.github.io/pyspark-data-sources/.
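To build your own batch data source, subclass DataSource and DataSourceReader from pyspark.sql.datasource. The sketch below is a minimal, illustrative example; the FakeDataSource name, schema, and rows are made up for demonstration and are not part of this package.
from pyspark.sql.datasource import DataSource, DataSourceReader

class FakeDataSource(DataSource):
    @classmethod
    def name(cls):
        return "fake"  # short name used with spark.read.format(...)

    def schema(self):
        return "name string, value int"  # DDL-formatted schema

    def reader(self, schema):
        return FakeDataSourceReader()

class FakeDataSourceReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples matching the schema above.
        yield ("alice", 1)
        yield ("bob", 2)

spark.dataSource.register(FakeDataSource)
spark.read.format("fake").load().show()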
Streaming Data Sources
from pyspark_datasources.weather import WeatherDataSource
# Register the data source
spark.dataSource.register(WeatherDataSource)
# Get an API key from tomorrow.io
api_key = "<your-api-key>"
sites = """[
(37.7749, -122.4194), # San Francisco
(40.7128, -74.0060), # New York City
]"""
# Ingest the weather data.
df = (
spark.readStream.format("weather")
.option("locations", sites)
.option("apikey", api_key)
.load()
)
# Write to the console.
df.writeStream.format("console").trigger(availableNow=True).start()
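Writing to the console is handy for quick checks. To persist the stream instead, you can write to a table with a checkpoint; in this sketch the checkpoint path and table name are placeholders, not names defined by this project.
query = (
    df.writeStream
    .option("checkpointLocation", "/tmp/weather_checkpoint")  # placeholder path
    .trigger(availableNow=True)
    .toTable("weather_events")  # placeholder table name
)
query.awaitTermination()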
Contributing
We welcome and appreciate any contributions to enhance and expand the custom data sources. If you're interested in contributing:
- Add New Data Sources: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
- Suggest Enhancements: If you have ideas to improve a data source or the API, we'd love to hear them!
- Report Bugs: Found something that doesn't work as expected? Let us know by opening an issue.
Need help or have questions? Don't hesitate to open a new issue, and we'll do our best to assist you.
Development
Set up the development environment with Poetry:
poetry shell
Build and serve the docs locally:
mkdocs serve
Download files
Source Distribution
Built Distribution
File details
Details for the file pyspark_data_sources-0.1.5.tar.gz.
File metadata
- Download URL: pyspark_data_sources-0.1.5.tar.gz
- Upload date:
- Size: 14.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.3 Darwin/23.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | e8d4328dd4a407312bfa4eb0992277a424f21a57acb12f9450cbcfe6d37af233
MD5 | 5f77ea83527ad350280a01dff2542de0
BLAKE2b-256 | ec44f9683d07fb56e58698551b6b8dbc2aa5e9fcc237958cf7355106046cf017
File details
Details for the file pyspark_data_sources-0.1.5-py3-none-any.whl.
File metadata
- Download URL: pyspark_data_sources-0.1.5-py3-none-any.whl
- Upload date:
- Size: 16.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.3 Darwin/23.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | cdf544e82ae17ec5a0fec17fdf468e5c989b3e59957ffb62c850b66b43942217
MD5 | 5adab239a67ba506aaedba0c02cdc8b3
BLAKE2b-256 | 9c4e450dd4b21041508fe9479c390b5a6495fe24ab2ca120a76b2eb9a333003d