Dagster Integration with HF Datasets
Dagster-HF-Datasets
Overview
Dagster-HF-Datasets integrates Hugging Face datasets with Dagster for building reproducible, observable data pipelines. Load datasets directly as Dagster assets, apply transformations, and publish results back to the Hub.
Features
- Hugging Face dataset assets — Load any HF dataset as a Dagster asset with automatic metadata.
- Streaming support — Efficiently handle large datasets with runtime-only streaming mode.
- Parquet persistence — Auto-save datasets to disk for caching and versioning.
- Metadata & lineage — Rich metadata for observability and data lineage tracking.
- Multi-asset pipelines — Create split-aware assets from datasets with multiple splits.
- Hub publishing — Push processed datasets back to the Hugging Face Hub with dataset cards.
Installation
pip install dagster-hf-datasets
Development Install:
git clone https://github.com/dagster-io/dagster.git
cd libraries/dagster-hf-datasets
pip install -e .
Examples
Basic Asset Pipeline
Get started with a simple example of materializing a Hugging Face dataset as a Dagster asset:
See examples/basic_asset_pipeline.py
- Dataset materialization with hf_dataset_asset
- Parquet persistence via HFParquetIOManager
- Automatic metadata enrichment
- Hugging Face Hub observability
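To make "metadata enrichment" concrete, here is a minimal sketch of the kind of summary that could be attached to an asset materialization. It is plain Python and does not use this package's API: `summarize_rows` is a hypothetical helper, and the actual metadata keys emitted by hf_dataset_asset may differ.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_rows(rows: Iterable[Mapping[str, Any]]) -> dict[str, Any]:
    """Collect the row count, column names, and observed value types per column.

    Hypothetical helper for illustration; not part of dagster-hf-datasets.
    """
    num_rows = 0
    column_types: dict[str, Counter] = {}
    for row in rows:
        num_rows += 1
        for column, value in row.items():
            # Count how often each Python type appears in each column.
            column_types.setdefault(column, Counter())[type(value).__name__] += 1
    return {
        "num_rows": num_rows,
        "columns": sorted(column_types),
        "types": {name: dict(counts) for name, counts in sorted(column_types.items())},
    }


# Stand-in rows; a real pipeline would iterate over a loaded dataset instead.
rows = [
    {"text": "hello", "label": 0},
    {"text": "world", "label": 1},
    {"text": None, "label": 1},
]
summary = summarize_rows(rows)
print(summary)
```

In a Dagster asset, a dictionary like `summary` could be passed along as materialization metadata so that it shows up next to each run in the UI.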
Multi-Asset Streaming Pipeline
Process large datasets efficiently with runtime-only streaming ingestion:
See examples/multi_asset_pipeline.py
- Streaming dataset loading with load_dataset(..., streaming=True)
- Deterministic sampling of IterableDatasets
- Metadata extraction from streaming sources
- Conversion to persistent materialized artifacts
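One way to sample a stream deterministically is seeded reservoir sampling, sketched below. This is an illustration of the general technique, not necessarily what examples/multi_asset_pipeline.py does. `reservoir_sample` and `fake_stream` are hypothetical names, and the generator stands in for a streaming dataset that yields rows lazily.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> list[T]:
    """Return up to k items drawn uniformly from a stream of unknown length.

    A fixed seed plus a fixed stream order gives the same sample on every run.
    """
    rng = random.Random(seed)
    reservoir: list[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing entry with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


def fake_stream(n: int) -> Iterator[dict]:
    """Stand-in for a streaming dataset: yields rows one at a time."""
    for i in range(n):
        yield {"id": i, "text": f"row {i}"}


sample_a = reservoir_sample(fake_stream(1000), k=5, seed=42)
sample_b = reservoir_sample(fake_stream(1000), k=5, seed=42)
short_sample = reservoir_sample(fake_stream(3), k=5, seed=42)
```

The sample is held in memory as a plain list, so it can then be written out as a persistent artifact, while the full stream is never materialized.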
Complete Dataset Pipeline
Build production-grade data pipelines with dataset cleaning, transformation and publishing:
See examples/multi_asset_pipeline.py
- Deduplication and filtering of raw data
- Text normalization and formatting
- Multi-step lineage-aware transformations
- Hugging Face Hub dataset publishing
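The cleaning steps listed above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the package's implementation: `normalize_text` and `clean_rows` are hypothetical helpers, and the `text` column name and the `min_chars` threshold are made up for the example.

```python
import re
import unicodedata
from typing import Any, Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return _WHITESPACE.sub(" ", text).strip()


def clean_rows(
    rows: Iterable[dict[str, Any]],
    text_key: str = "text",
    min_chars: int = 1,
) -> Iterator[dict[str, Any]]:
    """Normalize text, drop rows that are too short, and drop duplicates."""
    seen: set[str] = set()
    for row in rows:
        text = normalize_text(row.get(text_key) or "")
        if len(text) < min_chars:
            continue  # filter: missing or too-short text
        key = text.casefold()
        if key in seen:
            continue  # dedupe: same text after normalization, ignoring case
        seen.add(key)
        yield {**row, text_key: text}


raw = [
    {"id": 1, "text": "  Hello   World "},
    {"id": 2, "text": "hello world"},   # duplicate of id 1 after cleaning
    {"id": 3, "text": ""},              # dropped: empty
    {"id": 4, "text": "Cafe\u0301"},    # decomposed accent
    {"id": 5, "text": "Caf\u00e9"},     # same word, precomposed accent
    {"id": 6, "text": None},            # dropped: missing
]
cleaned = list(clean_rows(raw))
print(cleaned)
```

Normalizing before deduplicating matters here: rows 4 and 5 only compare equal once both are in NFKC form. In a multi-step pipeline, each stage could be its own asset so that the lineage between raw and cleaned data stays visible.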
Documentation
- Usage Guide — Quick start, configuration, publishing datasets to Hugging Face Hub, and metadata/lineage tracking
- API Reference — Complete API documentation for HuggingFaceResource, asset decorators, and the IO manager
Development
Test
make test
Build
make build