Fast, parallel upload of large datasets to the HuggingFace Datasets hub.

Project description

HF-fastup

Pushes a HF dataset to the HF Hub as a Parquet dataset, which allows streaming. The dataset is split into shards, and the shards are uploaded in parallel. This is useful for large datasets, for example datasets with embedded data.
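The shard-then-upload idea can be illustrated with a small standard-library sketch. It is not the package's actual implementation: `shard_bounds`, `upload_in_parallel`, and `fake_upload` are hypothetical names, and the upload step is a stand-in that returns only the shard index and row count where a real uploader would write a Parquet file and push it.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_bounds(num_rows, num_shards):
    """Split range(num_rows) into contiguous, near-equal (start, stop) pairs."""
    base, extra = divmod(num_rows, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def upload_in_parallel(rows, num_shards, upload_shard, max_workers=4):
    """Call upload_shard(index, shard_rows) for each shard on a thread pool.

    Results are returned in shard order, regardless of completion order.
    """
    bounds = shard_bounds(len(rows), num_shards)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_shard, i, rows[start:stop])
            for i, (start, stop) in enumerate(bounds)
        ]
        return [f.result() for f in futures]


def fake_upload(index, shard_rows):
    # Hypothetical stand-in for "write shard to Parquet and upload it".
    return index, len(shard_rows)


if __name__ == "__main__":
    print(upload_in_parallel(list(range(10)), 3, fake_upload))
```

Threads suit this kind of work because uploads are I/O-bound; the number of shards and workers would be tuned to the dataset size and available bandwidth.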

Usage

Make sure the hf_transfer package is installed and the HF_HUB_ENABLE_HF_TRANSFER environment variable is set to 1.

import hffastup
import datasets
datasets.logging.set_verbosity_info()

# load any HF dataset
dataset = datasets.load_dataset("my_large_dataset.py")

hffastup.upload_to_hf_hub(dataset, "Org/repo") # upload to HF Hub
hffastup.push_dataset_card(dataset, "Org/repo") # Makes a dataset card and pushes it to HF Hub
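The prerequisite stated under Usage can be checked before starting a long upload. The following is a minimal sketch using only the standard library; `hf_transfer_problems` is a hypothetical helper, not part of hffastup, and it checks strictly for the value 1 as stated above.

```python
import importlib.util
import os


def hf_transfer_problems(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # find_spec returns None when the top-level module cannot be located.
    if importlib.util.find_spec("hf_transfer") is None:
        problems.append("hf_transfer is not installed (pip install hf_transfer)")
    if env.get("HF_HUB_ENABLE_HF_TRANSFER") != "1":
        problems.append("HF_HUB_ENABLE_HF_TRANSFER is not set to 1")
    return problems


if __name__ == "__main__":
    for problem in hf_transfer_problems():
        print("warning:", problem)
```

Setting the variable in the shell before launching Python (for example `export HF_HUB_ENABLE_HF_TRANSFER=1`) is the safer route, since the setting is likely read when the Hub client library is first imported.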

Download files

Source Distribution

hf-fastup-0.0.3.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

hf_fastup-0.0.3-py3-none-any.whl (5.8 kB)

Uploaded Python 3

File details

Details for the file hf-fastup-0.0.3.tar.gz.

File metadata

  • Download URL: hf-fastup-0.0.3.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf-fastup-0.0.3.tar.gz
Algorithm Hash digest
SHA256 ad6f93504bab77012a801d006fb19ca7f8c17f9daa665640573568406af343e6
MD5 147a6aabb88d8ca1b174bb527dcaff50
BLAKE2b-256 56115bac02bd3be768cf63cb9ecb150f386b6f5886711cb98aad3f730ea411be

File details

Details for the file hf_fastup-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: hf_fastup-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf_fastup-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 410f3f5c3bce6f2e734e9af94b2587fba46e644e385c2a24bde2188fd90bbdcf
MD5 f834c16453195c1336cbc7b4311459bb
BLAKE2b-256 b935f9b8064957bf5a34cc82513aae905f27ab47a1d6d1eb38ccad6b151db835
