Dagster Integration with HF Datasets

Dagster-HF-Datasets

Overview

Dagster-HF-Datasets integrates Hugging Face datasets with Dagster for building reproducible, observable data pipelines. Load datasets directly as Dagster assets, apply transformations, and publish results back to the Hub.

Features

  • Hugging Face dataset assets — Load any HF dataset as a Dagster asset with automatic metadata.
  • Streaming support — Efficiently handle large datasets with runtime-only streaming mode.
  • Parquet persistence — Auto-save datasets to disk for caching and versioning.
  • Metadata & lineage — Rich metadata for observability and data lineage tracking.
  • Multi-asset pipelines — Create split-aware assets from datasets with multiple splits.
  • Hub publishing — Push processed datasets back to the Hugging Face Hub with dataset cards.

Installation

pip install dagster-hf-datasets
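
To confirm the install from Python, one option is to query the distribution metadata with the standard library. This is a minimal sketch: the helper name installed_version is hypothetical and is not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name taken from the pip command above.
    found = installed_version("dagster-hf-datasets")
    print(found if found is not None else "dagster-hf-datasets is not installed")
```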

Development Install:

git clone https://github.com/dagster-io/dagster.git

cd dagster/libraries/dagster-hf-datasets

pip install -e .

Examples

Basic Asset Pipeline

Materialize a Hugging Face dataset as a Dagster asset. See examples/basic_asset_pipeline.py, which demonstrates:

  • Dataset materialization with hf_dataset_asset
  • Parquet persistence via HFParquetIOManager
  • Automatic metadata enrichment
  • Hugging Face Hub observability
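
As a rough illustration of what metadata enrichment can mean, the plain-Python sketch below summarizes a small in-memory dataset as a JSON-friendly dict. The function name describe_records and the chosen fields (row count, column names, content fingerprint) are assumptions made for illustration; the metadata this package actually records is not specified here. In Dagster, a dict like this could be attached to a materialization, for example through the metadata argument of MaterializeResult.

```python
import hashlib
import json
from typing import Any, Dict, List


def describe_records(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize a small in-memory dataset as a JSON-friendly metadata dict."""
    # Collect every column name that appears in at least one record.
    columns = sorted({name for record in records for name in record})
    # Hash a canonical JSON encoding so identical content gives an identical fingerprint.
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return {
        "num_rows": len(records),
        "columns": columns,
        "fingerprint": hashlib.sha256(payload).hexdigest()[:16],
    }


if __name__ == "__main__":
    rows = [{"text": "good movie", "label": 1}, {"text": "bad movie", "label": 0}]
    print(describe_records(rows))
```

The fingerprint depends on row order, so a reordered dataset counts as a new version; sort the rows first if order should not matter.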

Multi-Asset Streaming Pipeline

Process large datasets with runtime-only streaming ingestion. See examples/multi_asset_pipeline.py, which demonstrates:

  • Streaming dataset loading with load_dataset(..., streaming=True)
  • Deterministic sampling of IterableDatasets
  • Metadata extraction from streaming sources
  • Conversion to persistent materialized artifacts
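
Deterministic sampling of a stream can be sketched without any dataset library. The example below keeps a record when a stable hash of its key falls under a threshold, so the same records are selected on every run without holding the stream in memory. The names stable_bucket and sample_stream and the "id" key field are hypothetical, and this is not necessarily how the example file samples. With the datasets library, IterableDataset.take and a seeded IterableDataset.shuffle are the usual building blocks.

```python
import hashlib
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Optional


def stable_bucket(key: str, buckets: int = 100) -> int:
    """Map a key to a bucket in [0, buckets) using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def sample_stream(
    records: Iterable[Dict[str, Any]],
    key_field: str,
    percent: int,
    limit: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Lazily keep roughly `percent` of records, stopping after `limit` if given."""
    kept = (r for r in records if stable_bucket(str(r[key_field])) < percent)
    # islice with stop=None passes every kept record through.
    return islice(kept, limit)


if __name__ == "__main__":
    stream = ({"id": i, "text": f"row {i}"} for i in range(10_000))
    sample = list(sample_stream(stream, key_field="id", percent=5, limit=20))
    print(len(sample), "records sampled")
```

Python's built-in hash() is salted per process for strings, which is why the sketch uses hashlib instead.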

Complete Dataset Pipeline

Build a full pipeline that cleans, transforms, and publishes a dataset. See examples/multi_asset_pipeline.py, which demonstrates:

  • Deduplication and filtering of raw data
  • Text normalization and formatting
  • Multi-step lineage-aware transformations
  • Hugging Face Hub dataset publishing
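
The cleaning steps listed above (normalization, filtering, deduplication) can be sketched in plain Python. The names normalize_text and clean_records, the "text" field, and the specific rules (NFKC normalization, whitespace collapsing, case-insensitive exact-match deduplication) are assumptions chosen for illustration, not the rules used by the example file.

```python
import unicodedata
from typing import Any, Dict, Iterable, Iterator, Set


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def clean_records(
    records: Iterable[Dict[str, Any]],
    text_field: str = "text",
    min_chars: int = 1,
) -> Iterator[Dict[str, Any]]:
    """Yield normalized records, dropping short texts and repeated texts."""
    seen: Set[str] = set()
    for record in records:
        text = normalize_text(str(record.get(text_field, "")))
        if len(text) < min_chars:
            continue  # filter: too short after normalization
        key = text.casefold()
        if key in seen:
            continue  # deduplicate: same text seen earlier, ignoring case
        seen.add(key)
        yield {**record, text_field: text}


if __name__ == "__main__":
    raw = [{"text": "  Hello   world "}, {"text": "hello world"}, {"text": "   "}]
    print(list(clean_records(raw)))
```

Normalizing before deduplicating matters: the first two records above differ only in case and spacing, so they compare as duplicates only once both are normalized.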

Documentation

  • Usage Guide — Quick start, configuration, publishing datasets to Hugging Face Hub, and metadata/lineage tracking
  • API Reference — Complete API documentation for HuggingFaceResource, asset decorators, and the IO manager

Development

Test

make test

Build

make build
