Fast, parallel upload of large datasets to the Hugging Face Datasets Hub.

Project description

HF-fastup

Pushes a HF dataset to the HF Hub as a Parquet dataset, which allows streaming. The dataset is split into shards, and the shards are uploaded in parallel. This is useful for large datasets, for example datasets with embedded data.
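
The shard-then-upload-in-parallel idea can be sketched as follows. This is a minimal illustration under stated assumptions, not hf-fastup's actual implementation: `shard_ranges`, `shard_name` and `process_in_parallel` are hypothetical helper names, the file-name pattern is an assumption modeled on the naming commonly seen in Parquet repos on the Hub, and `handle_shard` stands in for whatever writes one shard to Parquet and uploads it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def shard_ranges(num_rows: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split [0, num_rows) into contiguous, near-equal (start, stop) ranges."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(num_rows, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards each take one additional row.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def shard_name(split: str, index: int, num_shards: int) -> str:
    """Build a shard file name (assumed pattern, e.g. train-00000-of-00004.parquet)."""
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"


def process_in_parallel(
    num_rows: int,
    num_shards: int,
    handle_shard: Callable[[str, int, int], T],
    split: str = "train",
    max_workers: int = 4,
) -> List[T]:
    """Call handle_shard(name, start, stop) for every shard on a thread pool."""
    jobs = [
        (shard_name(split, i, num_shards), start, stop)
        for i, (start, stop) in enumerate(shard_ranges(num_rows, num_shards))
    ]

    def run(job: Tuple[str, int, int]) -> T:
        return handle_shard(*job)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the jobs.
        return list(pool.map(run, jobs))
```

In a real upload, `handle_shard` would write rows `start` to `stop` of the dataset to a Parquet file and push that file to the Hub. Threads suit this step because the work is dominated by network I/O.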

Usage

Make sure hf_transfer is installed and HF_HUB_ENABLE_HF_TRANSFER is set to 1.
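
A minimal sketch of that setup check, assuming the flag is read when huggingface_hub is first imported, so it is safest to set it before any Hugging Face import. `enable_hf_transfer` is a hypothetical helper, not part of hf-fastup.

```python
import importlib.util
import os


def enable_hf_transfer() -> bool:
    """Set HF_HUB_ENABLE_HF_TRANSFER=1 if the hf_transfer package is installed.

    Returns True when the flag was set, False when hf_transfer is missing.
    """
    if importlib.util.find_spec("hf_transfer") is None:
        # Leave the environment untouched when the package is not installed.
        return False
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    return True


if __name__ == "__main__":
    if not enable_hf_transfer():
        print("hf_transfer is not installed; try: pip install hf_transfer")
```

The same setup can be done from the shell, by running `pip install hf_transfer` and then `export HF_HUB_ENABLE_HF_TRANSFER=1` before starting Python.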

import datasets
import hffastup

# log at INFO level so the datasets library reports what it is doing
datasets.logging.set_verbosity_info()

# load any HF dataset
dataset = datasets.load_dataset("my_large_dataset.py")

hffastup.upload_to_hf_hub(dataset, "Org/repo")  # upload the dataset to the HF Hub
hffastup.push_dataset_card(dataset, "Org/repo")  # make a dataset card and push it to the HF Hub

Download files

Source Distribution

hf-fastup-0.0.7.tar.gz (5.9 kB view details)

Uploaded Source

Built Distribution

hf_fastup-0.0.7-py3-none-any.whl (6.2 kB view details)

Uploaded Python 3

File details

Details for the file hf-fastup-0.0.7.tar.gz.

File metadata

  • Download URL: hf-fastup-0.0.7.tar.gz
  • Size: 5.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for hf-fastup-0.0.7.tar.gz
Algorithm    Hash digest
SHA256       fda4046498680ab173ed5147d847b85657434331900e787d9da73f297c3bca10
MD5          7abaa48912c08f4419535fce1fd85d33
BLAKE2b-256  eca539d0568aae1a34384011294f041d4e1b967d1c97f2f65f5734f2c58d5bac

File details

Details for the file hf_fastup-0.0.7-py3-none-any.whl.

File metadata

  • Download URL: hf_fastup-0.0.7-py3-none-any.whl
  • Size: 6.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.8

File hashes

Hashes for hf_fastup-0.0.7-py3-none-any.whl
Algorithm    Hash digest
SHA256       861a57cc1b690de39ffdbdda1d77b3c3f28beb180a7e88df560cf5e51eb87e6f
MD5          10ea5a9042bb85627075d5084cfef120
BLAKE2b-256  2466c04abf09fa7a2945f83d449b7a51bc6658852563af91e77787ef24604287
