A DataSource for reading and writing HuggingFace Datasets in Spark
Spark Data Source for Hugging Face Datasets
A Spark Data Source for accessing 🤗 Hugging Face Datasets:
- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fully distributed
- Authentication via huggingface-cli login or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3
Installation
pip install pyspark_huggingface
Usage
Load a dataset (here stanfordnlp/imdb):
import pyspark_huggingface
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
Save to Hugging Face:
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
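To avoid hard-coding a token in a script, the token option can be filled from the environment. The following is a minimal sketch, not part of pyspark_huggingface: the helper names writer_options and save_to_hub are hypothetical, and it assumes the token is stored in the HF_TOKEN environment variable (the name used by Hugging Face Hub tooling). When the variable is unset, no token option is passed and the write relies on the huggingface-cli login credentials.

```python
import os


def writer_options(environ=None):
    """Return writer options; includes "token" only if HF_TOKEN is set.

    `environ` is any mapping (defaults to os.environ) so the helper
    can be exercised without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN")
    return {"token": token} if token else {}


def save_to_hub(df, repo_id, environ=None):
    """Write a Spark DataFrame to the given Hugging Face repository.

    Hypothetical wrapper around the documented
    df.write.format("huggingface") call.
    """
    (
        df.write.format("huggingface")
        .options(**writer_options(environ))
        .save(repo_id)
    )
```

With an existing DataFrame df, save_to_hub(df, "username/my_dataset") behaves like the first example above when HF_TOKEN is unset, and like the second when it is set.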
Advanced
Select a split:
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
Select a subset/config:
test_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
Filter columns and rows (especially efficient for Parquet datasets):
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
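The filters and columns values above are strings that look like Python literals. The following is a minimal sketch that builds those strings from ordinary Python objects instead of hand-writing the quotes. It assumes the data source accepts any string that parses as the same Python literal, which the quoting in the example suggests but which is not confirmed here. The helper names build_read_options and read_filtered are hypothetical and not part of pyspark_huggingface.

```python
import ast


def build_read_options(columns=None, filters=None):
    """Return string-valued reader options for the "huggingface" source.

    columns: iterable of column names to keep.
    filters: iterable of (column, operator, value) triples.
    """
    options = {}
    if columns is not None:
        options["columns"] = repr(list(columns))
    if filters is not None:
        options["filters"] = repr([tuple(f) for f in filters])
    return options


def read_filtered(spark, repo_id, columns=None, filters=None):
    """Load a dataset with projection and predicate options applied.

    `spark` is an existing SparkSession.
    """
    return (
        spark.read.format("huggingface")
        .options(**build_read_options(columns, filters))
        .load(repo_id)
    )


opts = build_read_options(
    columns=["text", "language_score"],
    filters=[("language_score", ">", 0.99)],
)
# Round-trip check: the generated strings parse back to the inputs.
assert ast.literal_eval(opts["columns"]) == ["text", "language_score"]
assert ast.literal_eval(opts["filters"]) == [("language_score", ">", 0.99)]
```

With a SparkSession named spark, read_filtered(spark, "HuggingFaceFW/fineweb-edu", columns=["text", "language_score"], filters=[("language_score", ">", 0.99)]) is intended to be equivalent to the example above.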
Backport
While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.
Importing pyspark_huggingface patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
The import is only necessary on Spark 3.x to enable the backport.
Spark 4 automatically imports pyspark_huggingface as soon as it is installed, and registers the "huggingface" data source.
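For code that has to run on both Spark 3.x and Spark 4, the rule above can be made explicit. The following is a minimal sketch, with hypothetical helper names, that decides from a PySpark version string whether the explicit import is needed. It assumes the version string starts with a numeric major version, as in "3.5.1" or "4.0.0".

```python
def needs_explicit_import(spark_version: str) -> bool:
    """Return True when the backport import is required (Spark 3.x)."""
    major = int(spark_version.split(".")[0])
    return major < 4


def enable_huggingface_source(spark_version: str) -> None:
    """Import pyspark_huggingface only when the backport is needed."""
    if needs_explicit_import(spark_version):
        # Importing has the side effect of patching the PySpark
        # reader and writer, as described above.
        import pyspark_huggingface  # noqa: F401
```

A typical call would pass pyspark.__version__, for example enable_huggingface_source(pyspark.__version__) before the first spark.read.format("huggingface").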