Upload large datasets to the HuggingFace Datasets hub quickly and in parallel.

Project description

HF-fastup

Pushes an HF dataset to the HF hub as a Parquet dataset, which allows streaming. The dataset is split into shards that are uploaded in parallel. This is useful for large datasets, for example ones with embedded data.
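The shard-then-upload-in-parallel idea can be sketched in plain Python. This is only an illustration of the general pattern, not HF-fastup's actual implementation: the names shard_bounds, process_shards and fake_upload are hypothetical, and fake_upload stands in for whatever step would write a shard to Parquet and send it to the hub.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_bounds(num_rows, num_shards):
    """Split range(num_rows) into num_shards contiguous (start, stop) pairs."""
    base, extra = divmod(num_rows, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def process_shards(num_rows, num_shards, handle_shard, max_workers=4):
    """Call handle_shard(index, start, stop) for every shard in a thread pool.

    Results are returned in shard order, not completion order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(handle_shard, i, start, stop)
            for i, (start, stop) in enumerate(shard_bounds(num_rows, num_shards))
        ]
        return [f.result() for f in futures]


def fake_upload(index, start, stop):
    # Hypothetical placeholder: a real handler would serialize rows
    # start..stop to a Parquet file and upload that file.
    return f"shard-{index:05d}.parquet", stop - start


results = process_shards(num_rows=10, num_shards=3, handle_shard=fake_upload)
```

Threads suit this pattern because uploads are I/O-bound; the shard count and worker count here are arbitrary values chosen for the example.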

Usage

Make sure hf_transfer is installed and the HF_HUB_ENABLE_HF_TRANSFER environment variable is set to 1.
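One possible way to meet both requirements from inside a script is sketched below. The helper check_hf_transfer is a hypothetical name, not part of HF-fastup, and it only tests for the literal value "1" mentioned above. The variable is set before any Hugging Face library is imported, on the assumption that the flag is read when the library is first loaded.

```python
import importlib.util
import os


def check_hf_transfer(env=None):
    """Return (installed, enabled) for the hf_transfer setup.

    Hypothetical helper: `installed` is True when the hf_transfer package
    can be found, `enabled` is True when the flag equals "1".
    """
    env = os.environ if env is None else env
    installed = importlib.util.find_spec("hf_transfer") is not None
    enabled = env.get("HF_HUB_ENABLE_HF_TRANSFER") == "1"
    return installed, enabled


# Set the flag first, then import datasets / hffastup afterwards.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

installed, enabled = check_hf_transfer()
if not installed:
    print("hf_transfer not found; install it with: pip install hf_transfer")
```

Exporting the variable in the shell before starting Python has the same effect and avoids any dependence on import order.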

import hffastup
import datasets
datasets.logging.set_verbosity_info()

# load any HF dataset
dataset = datasets.load_dataset("my_large_dataset.py")

hffastup.upload_to_hf_hub(dataset, "Org/repo") # upload to HF Hub
hffastup.push_dataset_card(dataset, "Org/repo") # Makes a dataset card and pushes it to HF Hub

Download files

Source Distribution

hf-fastup-0.0.6.tar.gz (5.9 kB)

Uploaded Source

Built Distribution

hf_fastup-0.0.6-py3-none-any.whl (6.2 kB)

Uploaded Python 3

File details

Details for the file hf-fastup-0.0.6.tar.gz.

File metadata

  • Download URL: hf-fastup-0.0.6.tar.gz
  • Size: 5.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for hf-fastup-0.0.6.tar.gz
Algorithm Hash digest
SHA256 ee40e1d31ab514cbbb1367dd4770a759cfa0992469d69788d010569f1fa72250
MD5 136fb1f2f57edcd2e2a5b1f6e976afc7
BLAKE2b-256 cb8c9c8c578bc87ee6401de98b89170e2bee59d03fc4d4b179fca107037d9aee

File details

Details for the file hf_fastup-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: hf_fastup-0.0.6-py3-none-any.whl
  • Size: 6.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for hf_fastup-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 0339e59a622e98a17245e82298abc0a2f0b7e41804ec8952f68119b1499bd188
MD5 fd073783697d8eb5307e22d79196379c
BLAKE2b-256 8b14649e5d82dd648c267ac795598d13248c028922e292f5b8530ffc57eef8b0