Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing 🤗 Hugging Face Datasets and 🤗 storage buckets:

  • Stream datasets from Hugging Face as Spark DataFrames
  • Select subsets and splits, apply projection and predicate filters
  • Save Spark DataFrames as Parquet files to Hugging Face
  • Fast deduped uploads
  • Fully distributed
  • Authentication via huggingface-cli login or tokens
  • Compatible with Spark 4 (with auto-import)
  • Backport for Spark 3.5, 3.4 and 3.3

Installation

pip install pyspark_huggingface

Usage with dataset repositories

Load a dataset (here stanfordnlp/imdb):

import pyspark_huggingface
df = spark.read.format("huggingface").load("stanfordnlp/imdb")

Save to Hugging Face:

# Login with huggingface-cli login
df.write.format("huggingface").mode("overwrite").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").mode("overwrite").save("username/my_dataset")

Usage with storage buckets

Load data from a storage bucket:

import pyspark_huggingface
df = spark.read.format("huggingface").option("data_dir", "data").load("buckets/username/bucket_name")

Save to Hugging Face:

# Login with huggingface-cli login
df.write.format("huggingface").option("data_dir", "data").mode("overwrite").save("buckets/username/bucket_name")
# Or pass a token manually
df.write.format("huggingface").option("data_dir", "data").option("token", "hf_xxx").mode("overwrite").save("buckets/username/bucket_name")

Bucket support requires datasets>=4.8.4 and huggingface_hub>=1.10.1.
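
To verify which versions are installed, here is a minimal check using only the standard library:

# Check installed versions before using storage buckets.
from importlib.metadata import version

print("datasets:", version("datasets"))
print("huggingface_hub:", version("huggingface_hub"))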

Advanced

Select a split:

test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)

Select a subset/config:

sample_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)

Specify data_files or data_dir:

one_file_df = (
    spark.read.format("huggingface")
    .option("data_files", "sample/10BT/000_00000.parquet")
    .load("HuggingFaceFW/fineweb-edu")
)
multiple_files_df = (
    spark.read.format("huggingface")
    .option("data_files", '["sample/10BT/000_00000.parquet", "sample/10BT/001_00000.parquet"]')
    .load("HuggingFaceFW/fineweb-edu")
)
glob_df = (
    spark.read.format("huggingface")
    .option("data_files", "sample/10BT/*.parquet")
    .load("HuggingFaceFW/fineweb-edu")
)
dir_df = (
    spark.read.format("huggingface")
    .option("data_dir", "sample/10BT")
    .load("HuggingFaceFW/fineweb-edu")
)

Filter columns and rows (especially efficient for Parquet datasets):

filtered_df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
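
The same result can also be expressed with standard DataFrame operations. A sketch, with the caveat that, depending on your Spark version, the projection and predicate may be applied after loading rather than pushed down to the source, so the option-based form above can be more efficient:

# Standard DataFrame operations instead of the reader options.
# Note: pushdown behavior depends on the Spark version; the "filters" and
# "columns" options above are applied at the source.
filtered_df = (
    spark.read.format("huggingface")
    .load("HuggingFaceFW/fineweb-edu")
    .select("text", "language_score")
    .filter("language_score > 0.99")
)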

Fast deduped uploads

Hugging Face uses Xet, a deduplication-based storage system that enables fast deduped uploads.

Uploads are faster on Xet than on traditional remote storage because duplicate data is uploaded only once: if some or all of the data already exists in other files on Xet, it is not uploaded again, saving bandwidth and speeding up uploads. Deduplication for Parquet files is enabled through Content Defined Chunking (CDC).
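
For example, re-saving a dataset after appending a few rows mostly reuses chunks already stored on Xet. A minimal sketch (new_rows_df is a hypothetical DataFrame holding the added rows):

# First upload: all chunks are new and are transferred.
df.write.format("huggingface").mode("overwrite").save("username/my_dataset")

# Re-upload after a small change: unchanged Parquet chunks are deduplicated,
# so only the new data is actually transferred.
df_updated = df.unionByName(new_rows_df)  # new_rows_df is hypothetical
df_updated.write.format("huggingface").mode("overwrite").save("username/my_dataset")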

Thanks to Parquet CDC and Xet deduplication, saving a dataset to Hugging Face is typically faster than saving it to traditional remote storage.

For more information, see https://huggingface.co/blog/parquet-cdc.

Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing pyspark_huggingface patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)

The import is only necessary on Spark 3.x to enable the backport. Spark 4 automatically imports pyspark_huggingface as soon as it is installed, and registers the "huggingface" data source.
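
For example, on Spark 4 the first read example works as-is, without the explicit import:

# Spark 4: the "huggingface" data source is registered automatically.
df = spark.read.format("huggingface").load("stanfordnlp/imdb")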

Development

Install uv if it is not already installed.

Then, from the project root directory, sync the dependencies and run the tests:

uv sync
uv run pytest
