Custom data sources for reading and writing data in Apache Spark, built with the Python Data Source API

Project description

pyspark-data-sources


This repository showcases custom Spark data sources built with the new Python Data Source API for the upcoming Apache Spark 4.0 release. For an in-depth understanding of the API, please refer to the API source code. Note that this repository is for demonstration purposes only and is not intended for production use. Contributions and feedback are welcome to help improve the examples.
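
To illustrate the shape of the API, here is a minimal sketch of a custom batch data source (the class names and schema are illustrative, not taken from this repository):

from pyspark.sql.datasource import DataSource, DataSourceReader

class GreetingDataSource(DataSource):
    """A toy data source that returns a fixed set of rows."""

    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "greeting"

    def schema(self):
        return "name string, value int"

    def reader(self, schema):
        return GreetingReader()

class GreetingReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples matching the declared schema.
        yield ("hello", 1)
        yield ("world", 2)

After spark.dataSource.register(GreetingDataSource), the source can be read with spark.read.format("greeting").load().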

Installation

pip install "pyspark-data-sources[all]"

Usage

Install the pyspark 4.0 preview version: https://pypi.org/project/pyspark/4.0.0.dev1/

pip install "pyspark[connect]==4.0.0.dev1"

Or use Databricks Runtime 15.2 or above.
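
The examples below assume a SparkSession named spark. In environments that do not provide one automatically (unlike the PySpark shell or Databricks notebooks), it can be created in the usual way:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName("pyspark-data-sources-demo").getOrCreate()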

Batch Data Sources

from pyspark_datasources.github import GithubDataSource

# Register the data source
spark.dataSource.register(GithubDataSource)

spark.read.format("github").load("apache/spark").show()
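
The result of load() is an ordinary DataFrame, so the usual operations apply; for example, to inspect the schema the data source exposes:

# Inspect the schema exposed by the GitHub data source.
df = spark.read.format("github").load("apache/spark")
df.printSchema()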

See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

Streaming Data Sources

from pyspark_datasources.weather import WeatherDataSource

# Register the data source
spark.dataSource.register(WeatherDataSource)

# Get an API key from tomorrow.io
api_key = "<your-api-key>"
sites = """[
    (37.7749, -122.4194),    # San Francisco
    (40.7128, -74.0060),     # New York City
]"""

# Ingest the weather data.
df = (
    spark.readStream.format("weather")
        .option("locations", sites)
        .option("apikey", api_key)
        .load()
)

# Write to the console.
df.writeStream.format("console").trigger(availableNow=True).start()
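
For a more durable pipeline than the console sink, the same stream can be written to files with a checkpoint for fault tolerance (the paths below are illustrative):

# Write to Parquet files with a checkpoint location (paths are illustrative).
(
    df.writeStream.format("parquet")
        .option("path", "/tmp/weather/data")
        .option("checkpointLocation", "/tmp/weather/checkpoint")
        .trigger(availableNow=True)
        .start()
)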

Contributing

We welcome and appreciate any contributions to enhance and expand the custom data sources. If you're interested in contributing:

  • Add New Data Sources: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
  • Suggest Enhancements: If you have ideas to improve a data source or the API, we'd love to hear them!
  • Report Bugs: Found something that doesn't work as expected? Let us know by opening an issue.

Need help or have questions? Don't hesitate to open a new issue, and we'll do our best to assist you.

Development

Set up and activate the development environment with Poetry:

poetry install
poetry shell

Build docs

mkdocs serve
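
mkdocs serve starts a local preview server with live reload. To generate the static site instead (standard MkDocs usage, not specific to this project):

mkdocs build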

Download files

Download the file for your platform.

Source Distribution

pyspark_data_sources-0.1.5.tar.gz (14.0 kB)


Built Distribution

pyspark_data_sources-0.1.5-py3-none-any.whl (16.1 kB)


File details

Details for the file pyspark_data_sources-0.1.5.tar.gz.

File metadata

  • Download URL: pyspark_data_sources-0.1.5.tar.gz
  • Upload date:
  • Size: 14.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.12.3 Darwin/23.6.0

File hashes

Hashes for pyspark_data_sources-0.1.5.tar.gz:

  • SHA256: e8d4328dd4a407312bfa4eb0992277a424f21a57acb12f9450cbcfe6d37af233
  • MD5: 5f77ea83527ad350280a01dff2542de0
  • BLAKE2b-256: ec44f9683d07fb56e58698551b6b8dbc2aa5e9fcc237958cf7355106046cf017


File details

Details for the file pyspark_data_sources-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: pyspark_data_sources-0.1.5-py3-none-any.whl
  • Size: 16.1 kB
  • Tags: Python 3

File hashes

Hashes for pyspark_data_sources-0.1.5-py3-none-any.whl:

  • SHA256: cdf544e82ae17ec5a0fec17fdf468e5c989b3e59957ffb62c850b66b43942217
  • MD5: 5adab239a67ba506aaedba0c02cdc8b3
  • BLAKE2b-256: 9c4e450dd4b21041508fe9479c390b5a6495fe24ab2ca120a76b2eb9a333003d

