Fast parallel upload of large datasets to the Hugging Face Datasets Hub.
Project description
HF-fastup
Pushes a HF dataset to the HF Hub as a Parquet dataset, which allows streaming. The dataset is split into shards that are uploaded in parallel. It is useful for large datasets, for example, those with embedded data.
Usage
Make sure `hf_transfer` is installed and `HF_HUB_ENABLE_HF_TRANSFER` is set to `1`.
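Because `HF_HUB_ENABLE_HF_TRANSFER` is typically read when `huggingface_hub` is first imported, one way to enable it from Python is to set the environment variable before any Hugging Face import (a minimal sketch; `export HF_HUB_ENABLE_HF_TRANSFER=1` in the shell works equally well):

```python
import os

# Must be set before importing huggingface_hub / datasets,
# otherwise the hf_transfer backend may stay disabled.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
```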
```python
import hffastup
import datasets

datasets.logging.set_verbosity_info()

# load any HF dataset
dataset = datasets.load_dataset("my_large_dataset.py")

hffastup.upload_to_hf_hub(dataset, "Org/repo")  # upload to the HF Hub
hffastup.push_dataset_card(dataset, "Org/repo")  # build a dataset card and push it to the HF Hub
```
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
hf-fastup-0.0.7.tar.gz (5.9 kB)
Built Distribution
Hashes for hf_fastup-0.0.7-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 861a57cc1b690de39ffdbdda1d77b3c3f28beb180a7e88df560cf5e51eb87e6f
MD5 | 10ea5a9042bb85627075d5084cfef120
BLAKE2b-256 | 2466c04abf09fa7a2945f83d449b7a51bc6658852563af91e77787ef24604287