Fast, parallel upload of large datasets to the Hugging Face Datasets Hub.

Project description

HF-fastup

Pushes a Hugging Face (HF) dataset to the HF Hub as a Parquet dataset, which allows streaming. The dataset is split into shards, and the shards are uploaded in parallel. This is useful for large datasets, for example those with embedded data.
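The shard-then-upload idea can be sketched with the standard library alone. This is not the hffastup implementation: `shard_bounds`, `upload_shard`, `upload_in_parallel`, and the shard file-name pattern are hypothetical names chosen for illustration, and the "upload" is a stub that only reports what it would send.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_bounds(num_rows, num_shards):
    """Split range(num_rows) into num_shards contiguous (start, stop) slices."""
    base, extra = divmod(num_rows, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def upload_shard(rows, index, num_shards):
    """Hypothetical stand-in for 'write this shard to Parquet and upload it'."""
    name = f"train-{index:05d}-of-{num_shards:05d}.parquet"
    return name, len(rows)


def upload_in_parallel(rows, num_shards, max_workers=4):
    """Submit one upload task per shard and collect results in shard order."""
    bounds = shard_bounds(len(rows), num_shards)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_shard, rows[start:stop], i, num_shards)
            for i, (start, stop) in enumerate(bounds)
        ]
        return [f.result() for f in futures]
```

Threads suit this kind of work because uploads are I/O-bound; a real uploader would replace the stub with the Parquet write and the network call.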

Usage

Make sure hf_transfer is installed and the HF_HUB_ENABLE_HF_TRANSFER environment variable is set to 1.
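One way to satisfy this prerequisite from inside a script is sketched here. The helper name `hf_transfer_ready` is an assumption made for illustration and is not part of hffastup. The variable is set before any Hugging Face import because huggingface_hub is generally understood to read the flag when it is first imported, so setting it first is the safe order.

```python
import importlib.util
import os

# Set the flag before importing datasets / huggingface_hub / hffastup.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


def hf_transfer_ready():
    """Return True if hf_transfer is importable and the flag is set to "1"."""
    installed = importlib.util.find_spec("hf_transfer") is not None
    enabled = os.environ.get("HF_HUB_ENABLE_HF_TRANSFER") == "1"
    return installed and enabled


if not hf_transfer_ready():
    print("hf_transfer is not installed; try: pip install hf_transfer")
```

Exporting the variable in the shell before launching Python works just as well.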

import hffastup
import datasets
datasets.logging.set_verbosity_info()

# load any HF dataset
dataset = datasets.load_dataset("my_large_dataset.py")

hffastup.upload_to_hf_hub(dataset, "Org/repo")  # upload the dataset to the HF Hub
hffastup.push_dataset_card(dataset, "Org/repo")  # create a dataset card and push it to the HF Hub

Download files

Source Distribution

hf-fastup-0.0.1.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

hf_fastup-0.0.1-py3-none-any.whl (5.8 kB)

Uploaded Python 3

File details

Details for the file hf-fastup-0.0.1.tar.gz.

File metadata

  • Download URL: hf-fastup-0.0.1.tar.gz
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf-fastup-0.0.1.tar.gz
Algorithm Hash digest
SHA256 5e23b064ea6374f0f98348947861a1743f39c411db54e4d316eea0b70b693ebe
MD5 9c75c836ecb12c8bfa70407d79f7bfaf
BLAKE2b-256 3b0931b5bd7755e279f8874ee49d4d77c175a7f0ec6da18d30196f60fd40f0be

File details

Details for the file hf_fastup-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: hf_fastup-0.0.1-py3-none-any.whl
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf_fastup-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e8e9d43e49e66259eee72611269d3fba0f37f583be78b1411bd3e178c8b691e8
MD5 636523323c58041c7ff5f69203084f3a
BLAKE2b-256 45fedd04202c5bee710102f4166eefa64a9d819683b25f7044d302571a448916
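
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only the standard library; the helper name `file_sha256` and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the wheel was downloaded to the current directory):
# expected = "e8e9d43e49e66259eee72611269d3fba0f37f583be78b1411bd3e178c8b691e8"
# assert file_sha256("hf_fastup-0.0.1-py3-none-any.whl") == expected
```

pip can perform the same check automatically when a requirements file pins hashes and is installed with `--require-hashes`.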
