Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API

PySpark Data Sources

This repository showcases custom Spark data sources built with the Python Data Source API introduced in Apache Spark 4.0. For an in-depth understanding of the API, please refer to the API source code. Note: this repository is for demonstration only and is not intended for production use. Contributions and feedback that help improve the examples are welcome.

Installation

pip install pyspark-data-sources

Usage

Make sure you have pyspark >= 4.0.0 installed.

pip install pyspark

Alternatively, use Databricks Runtime 15.4 LTS or later, or Databricks Serverless.

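from pyspark.sql import SparkSession
from pyspark_datasources.fake import FakeDataSource

# Reuse the active session (for example in a notebook) or create a new one
spark = SparkSession.builder.getOrCreate()

# Register the data source
spark.dataSource.register(FakeDataSource)

# Batch read: generate fake rows and print them
spark.read.format("fake").load().show()

# For streaming data generation
spark.readStream.format("fake").load().writeStream.format("console").start()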
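To give a feel for what sits behind a registered source such as `fake`, here is a minimal sketch of a custom batch data source written against the Python Data Source API. The names `CounterDataSource`, `CounterReader`, the `counter` short name, and the `count` option are hypothetical and are not part of this package. The fallback base classes are only an assumption made so that the reader logic can be exercised without Spark installed; with pyspark >= 4.0.0 the real base classes are used.

```python
try:
    # Real base classes, available in pyspark >= 4.0.0
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Hypothetical stand-ins so the sketch can run without Spark installed
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class CounterReader(DataSourceReader):
    """Yields `count` rows of (id, label); `count` is a made-up option."""

    def __init__(self, options):
        self.count = int(options.get("count", 3))

    def read(self, partition):
        # Each yielded tuple must match the schema declared by the data source
        for i in range(self.count):
            yield (i, f"row-{i}")


class CounterDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used in spark.read.format(...)
        return "counter"

    def schema(self):
        # Schema expressed as a DDL string
        return "id int, label string"

    def reader(self, schema):
        return CounterReader(self.options)
```

With a Spark session available, such a class would be registered the same way as `FakeDataSource`, via `spark.dataSource.register(CounterDataSource)`, and then read with `spark.read.format("counter").load()`.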

Example Data Sources

Data Source	Short Name	Description	Dependencies
GithubDataSource	`github`	Read pull requests from a GitHub repository	None
FakeDataSource	`fake`	Generate fake data using the `Faker` library	`faker`
StockDataSource	`stock`	Read stock data from Alpha Vantage	None
GoogleSheetsDataSource	`googlesheets`	Read a table from a public Google Sheet	None
KaggleDataSource	`kaggle`	Read datasets from Kaggle	`kagglehub`, `pandas`
SimpleJsonDataSource	`simplejson`	Write JSON data to Databricks DBFS	`databricks-sdk`
OpenSkyDataSource	`opensky`	Read flight data from the OpenSky Network	None
SalesforceDataSource	`pyspark.datasource.salesforce`	Streaming datasource for writing data to Salesforce	`simple-salesforce`

See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

Official Data Sources

For production use, consider these official data source implementations built with the Python Data Source API:

Data Source	Repository	Description	Features
HuggingFace Datasets	@huggingface/pyspark_huggingface	Production-ready Spark Data Source for 🤗 Hugging Face Datasets	• Stream datasets as Spark DataFrames • Select subsets/splits with filters • Authentication support • Save DataFrames to Hugging Face

Data Source Naming Convention

When creating custom data sources using the Python Data Source API, follow these naming conventions for the short name, that is, the string returned by the data source's name() method and passed to format():

Recommended Approach

Use the system name directly: Use lowercase system names like huggingface, opensky, googlesheets, etc.
This provides clear, intuitive naming that matches the service being integrated

Conflict Resolution

If there's a naming conflict: Use the format pyspark.datasource.<system_name>
Example: pyspark.datasource.salesforce if "salesforce" conflicts with existing naming

Examples from this repository:

# Direct system naming (preferred)
spark.read.format("github").load()       # GithubDataSource
spark.read.format("googlesheets").load() # GoogleSheetsDataSource  
spark.read.format("opensky").load()      # OpenSkyDataSource

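# Namespaced format (only when the plain name conflicts).
# Illustrative: in this repository OpenSkyDataSource is registered as "opensky";
# the only namespaced source here is "pyspark.datasource.salesforce".
spark.read.format("pyspark.datasource.opensky").load()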
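The convention above can be summarized as a small helper. This is only a sketch: `choose_short_name` and the `taken` collection of already-used names are hypothetical and are not part of this package or of the Spark API.

```python
def choose_short_name(system_name, taken):
    """Return a short name following the convention described above.

    `taken` is a hypothetical collection of names already in use.
    """
    name = system_name.strip().lower()
    if name in taken:
        # Conflict: fall back to the namespaced form
        return f"pyspark.datasource.{name}"
    # No conflict: use the lowercase system name directly (preferred)
    return name
```

For example, `choose_short_name("OpenSky", set())` gives `opensky`, while `choose_short_name("salesforce", {"salesforce"})` gives `pyspark.datasource.salesforce`.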

Contributing

We welcome and appreciate any contributions that enhance and expand the custom data sources:

Add New Data Sources: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
Suggest Enhancements: If you have ideas to improve a data source or the API, we'd love to hear them!
Report Bugs: Found something that doesn't work as expected? Let us know by opening an issue.

Development

Environment Setup

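poetry install
# In Poetry 2.x, `poetry env activate` only prints the activation command,
# so evaluate its output to actually activate the environment
eval $(poetry env activate)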

Build Docs

mkdocs serve

Code Formatting

This project uses Ruff for code formatting and linting.

# Format code
poetry run ruff format .

# Run linter
poetry run ruff check .

# Run linter with auto-fix
poetry run ruff check . --fix

Release history

0.1.11

Jan 26, 2026

This version

0.1.10

Aug 7, 2025

0.1.9

Aug 1, 2025

0.1.8

Jul 22, 2025

0.1.7

Jul 21, 2025

0.1.6

Jun 4, 2025

0.1.5

Sep 5, 2024

0.1.4

Jun 7, 2024

0.1.4a0 pre-release

Jun 5, 2024

0.1.2

Feb 15, 2024

0.1.1

Feb 14, 2024

0.1.0

Feb 13, 2024

Download files

Source Distribution

pyspark_data_sources-0.1.10.tar.gz (29.7 kB view details)

Uploaded Aug 7, 2025 Source

Built Distribution

pyspark_data_sources-0.1.10-py3-none-any.whl (34.9 kB view details)

Uploaded Aug 7, 2025 Python 3

File details

Details for the file pyspark_data_sources-0.1.10.tar.gz.

File metadata

Download URL: pyspark_data_sources-0.1.10.tar.gz
Upload date: Aug 7, 2025
Size: 29.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.13.3 Darwin/24.5.0

File hashes

Hashes for pyspark_data_sources-0.1.10.tar.gz
Algorithm	Hash digest
SHA256	`7c1c43506f37fa7bded8089e0f5c408437077e2440df1bd999a7f37f8af8231f`
MD5	`3ee5f9e2a0b0cdf50c45d46d44cc0a20`
BLAKE2b-256	`b14edcbad7598090270a539bda09553778e20ce03bade92ad2398a600aa1a761`

File details

Details for the file pyspark_data_sources-0.1.10-py3-none-any.whl.

File metadata

Download URL: pyspark_data_sources-0.1.10-py3-none-any.whl
Upload date: Aug 7, 2025
Size: 34.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.13.3 Darwin/24.5.0

File hashes

Hashes for pyspark_data_sources-0.1.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`883a3d7c31b35f6c905128378e6901be0b16230b44d8842179f6fe7ea0a2dd68`
MD5	`e3f4e9d79b9e7e9b3ebdcef424d19fa0`
BLAKE2b-256	`53a310f0500d610883799653c7f1abb23168a9de710f5c9deb3233861543ff64`
