A DataSource for reading and writing HuggingFace Datasets in Spark

Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing 🤗 Hugging Face Datasets:

  • Stream datasets from Hugging Face as Spark DataFrames
  • Select subsets and splits, apply projection and predicate filters
  • Save Spark DataFrames as Parquet files to Hugging Face
  • Fully distributed
  • Authentication via huggingface-cli login or tokens
  • Compatible with Spark 4 (with auto-import)
  • Backport for Spark 3.5, 3.4 and 3.3

Installation

pip install pyspark_huggingface

Usage

Load a dataset (here stanfordnlp/imdb):

import pyspark_huggingface
df = spark.read.format("huggingface").load("stanfordnlp/imdb")

Save to Hugging Face:

# Log in beforehand with: huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
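
To keep tokens out of source code, one option is to read the token from the environment. The following is a minimal sketch, not part of pyspark_huggingface: `save_to_hub` is a hypothetical helper, and it assumes the token, if any, is stored in the `HF_TOKEN` environment variable. When no token is found, it relies on the credentials cached by `huggingface-cli login`.

```python
import os


def save_to_hub(df, repo_id, token=None):
    """Write a Spark DataFrame to the Hugging Face repo `repo_id`.

    Hypothetical helper: uses the explicit `token` if given, otherwise the
    HF_TOKEN environment variable (an assumption of this sketch), otherwise
    no token option at all (cached CLI login).
    """
    token = token or os.environ.get("HF_TOKEN")
    writer = df.write.format("huggingface")
    if token:
        # Only set the option when a token is actually available.
        writer = writer.option("token", token)
    writer.save(repo_id)
```

Usage would then be `save_to_hub(df, "username/my_dataset")`.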

Advanced

Select a split:

test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)

Select a subset/config:

df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
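
The `split` and `config` options can be wrapped in a small function so that callers set only what they need. This is a sketch under one assumption: that both options may be set on the same reader, which the examples above do not show explicitly. `read_hf` is a hypothetical name, not part of the package.

```python
def read_hf(spark, repo_id, config=None, split=None):
    """Load a Hugging Face dataset as a Spark DataFrame.

    Hypothetical helper: sets the "config" and "split" reader options
    only when a value is provided.
    """
    reader = spark.read.format("huggingface")
    if config is not None:
        reader = reader.option("config", config)
    if split is not None:
        reader = reader.option("split", split)
    return reader.load(repo_id)
```

For example, `read_hf(spark, "stanfordnlp/imdb", split="test")` mirrors the split example above.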

Filter columns and rows (especially efficient for Parquet datasets):

df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
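
Both options are passed as strings, so hand-writing them is error-prone when the filter values come from variables. The sketch below builds the strings from Python objects. It assumes that the `filters` value is parsed as a Python literal (a list of tuples, as the example string suggests) and that the `columns` value is a JSON-style list; `filters_option` and `columns_option` are hypothetical helper names.

```python
import json


def filters_option(filters):
    """Serialize [(column, operator, value), ...] to a Python-literal string."""
    # repr() of a list of tuples produces text such as
    # "[('language_score', '>', 0.99)]" (assumed to be accepted by the reader).
    return repr([tuple(f) for f in filters])


def columns_option(columns):
    """Serialize a list of column names to a JSON-style list string."""
    return json.dumps(list(columns))


min_score = 0.99
filters_str = filters_option([("language_score", ">", min_score)])
columns_str = columns_option(["text", "language_score"])
```

The resulting strings can be passed as `.option("filters", filters_str)` and `.option("columns", columns_str)`.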

Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing pyspark_huggingface patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)

The import is only necessary on Spark 3.x to enable the backport. Spark 4 automatically imports pyspark_huggingface as soon as it is installed, and registers the "huggingface" data source.
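
Code that must run on both Spark 3.x and Spark 4 can simply keep the explicit import. For cases where the import should happen only when needed, the following sketch checks the PySpark major version first. Both function names are hypothetical, and the version check is an illustration rather than something the package provides.

```python
def needs_backport_import(pyspark_version):
    """Return True if the explicit import is needed (PySpark major version < 4)."""
    major = int(pyspark_version.split(".")[0])
    return major < 4


def enable_huggingface_source():
    """Import pyspark_huggingface only on Spark 3.x, where the backport is required."""
    import pyspark

    if needs_backport_import(pyspark.__version__):
        import pyspark_huggingface  # noqa: F401  (imported for its side effect)
```

Calling `enable_huggingface_source()` once, before the first `spark.read.format("huggingface")`, would be enough under these assumptions.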
