Fast, parallel upload of large datasets to the HuggingFace Datasets hub.

Project description

HF-fastup

Pushes a HF dataset to the HF Hub as a Parquet dataset, which allows streaming. The dataset is split into shards that are uploaded in parallel. It is useful for large datasets, for example those with embedded data.
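The shard-then-upload-in-parallel idea can be sketched with the standard library alone. This is only an illustration of the general pattern, not hffastup's actual implementation: `shard_bounds` and `upload_shard` are hypothetical names, and `upload_shard` is a placeholder that returns a string where a real version would write a Parquet file and upload it.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_bounds(num_rows, num_shards):
    """Split range(num_rows) into contiguous, near-equal (start, stop) slices."""
    base, extra = divmod(num_rows, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards get one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def upload_shard(index, rows):
    """Hypothetical placeholder: a real version would write and upload a shard."""
    return f"shard-{index:05d}: {len(rows)} rows"


rows = list(range(10))  # stand-in for a dataset
bounds = shard_bounds(len(rows), num_shards=3)

# Submit one task per shard; collecting results in submission order
# keeps them aligned with the shard indices.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [
        pool.submit(upload_shard, i, rows[start:stop])
        for i, (start, stop) in enumerate(bounds)
    ]
    results = [future.result() for future in futures]
```

Threads suit this pattern because uploads are I/O-bound; the shard count and worker count here are arbitrary choices for the example.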

Usage

Make sure hf_transfer is installed and HF_HUB_ENABLE_HF_TRANSFER is set to 1.
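One way to satisfy this from Python is sketched below. It assumes the variable is set before `huggingface_hub` (or any package that imports it) is loaded, since the flag is read at import time; the `hf_transfer_available` name is purely illustrative.

```python
import importlib.util
import os

# Set the flag before huggingface_hub (or anything importing it) is loaded.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Check that the hf_transfer package can be found, without importing it.
hf_transfer_available = importlib.util.find_spec("hf_transfer") is not None
if not hf_transfer_available:
    print("hf_transfer not found; install it with: pip install hf_transfer")
```

Exporting the variable in the shell before starting Python achieves the same thing and avoids any import-order concerns.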

import hffastup
import datasets
datasets.logging.set_verbosity_info()

# load any HF dataset
dataset = datasets.load_dataset("my_large_dataset.py")

hffastup.upload_to_hf_hub(dataset, "Org/repo")  # upload the dataset to the HF Hub
hffastup.push_dataset_card(dataset, "Org/repo")  # create a dataset card and push it to the HF Hub

Download files

Source Distribution

hf-fastup-0.0.2.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

hf_fastup-0.0.2-py3-none-any.whl (5.8 kB)

Uploaded Python 3

File details

Details for the file hf-fastup-0.0.2.tar.gz.

File metadata

  • Download URL: hf-fastup-0.0.2.tar.gz
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf-fastup-0.0.2.tar.gz
  • SHA256: 36cc30ca10356da362306176e01f59a069150c2799de5f8ed09633974b2dd718
  • MD5: a4ab7d3eb394c763b9657d0bf11667a5
  • BLAKE2b-256: e841cf6e9bfa696fac8167359ee8becc81ed5727534250ca01f17b9155f6f449

File details

Details for the file hf_fastup-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: hf_fastup-0.0.2-py3-none-any.whl
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf_fastup-0.0.2-py3-none-any.whl
  • SHA256: 1c3b0667b7cbe630c5acbd18e7e3f24a1a5d3503ab2fda77fcdd52b1a243ac1d
  • MD5: 3b7890f2be92ad0043a1e88b5e4fd13c
  • BLAKE2b-256: 91f90534ad154df83b0938539922567860c36a61c579e71652a74aeeff43dcd9
