Fast, parallel upload of large datasets to the HuggingFace Datasets hub.

Project description

HF-fastup

Pushes a HF dataset to the HF Hub as a Parquet dataset, which allows streaming. The dataset is split into shards that are uploaded in parallel. It is useful for large datasets, for example those with embedded data.
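The shard-then-upload-in-parallel idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not HF-fastup's actual implementation: shard_ranges, shard_filename, upload_shard and upload_in_parallel are hypothetical names, the file naming pattern is assumed, and upload_shard only stands in for the real "write a Parquet file and push it" step.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(num_rows: int, num_shards: int) -> list[tuple[int, int]]:
    """Split [0, num_rows) into num_shards contiguous, near-equal row ranges."""
    base, extra = divmod(num_rows, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def shard_filename(split: str, index: int, num_shards: int) -> str:
    # Assumed naming pattern, chosen for illustration only.
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"


def upload_shard(job):
    """Hypothetical stand-in for writing rows[start:stop] to Parquet and uploading it."""
    index, (start, stop), name = job
    return name, stop - start  # report the file name and its row count


def upload_in_parallel(num_rows, num_shards, split="train", max_workers=4):
    """Build one job per shard and run the jobs on a thread pool."""
    ranges = shard_ranges(num_rows, num_shards)
    jobs = [(i, r, shard_filename(split, i, num_shards)) for i, r in enumerate(ranges)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves job order in its results.
        return list(pool.map(upload_shard, jobs))
```

Threads suit this pattern because uploads are I/O-bound; each shard is an independent file, so a failed shard could be retried without redoing the others.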

Usage

Make sure hf_transfer is installed and HF_HUB_ENABLE_HF_TRANSFER is set to 1.
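One way to satisfy both requirements from inside a script is sketched below. It assumes the flag has to be in the environment before huggingface_hub (or datasets, which imports it) is loaded, since to my understanding the value is read at import time; the variable name hf_transfer_available is only illustrative.

```python
import importlib.util
import os

# Set the flag first, before any import of huggingface_hub or datasets.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Check that the hf_transfer package can be found without importing it.
hf_transfer_available = importlib.util.find_spec("hf_transfer") is not None
if not hf_transfer_available:
    print("hf_transfer is not installed; try: pip install hf_transfer")
```

Exporting the variable in the shell before starting Python has the same effect and avoids any dependence on import order.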

import hffastup
import datasets
datasets.logging.set_verbosity_info()

# load any HF dataset
dataset = datasets.load_dataset("my_large_dataset.py")

# upload the dataset to the HF Hub
hffastup.upload_to_hf_hub(dataset, "Org/repo")
# create a dataset card and push it to the HF Hub
hffastup.push_dataset_card(dataset, "Org/repo")

Download files

Source Distribution

hf-fastup-0.0.4.tar.gz (5.5 kB view details)

Uploaded Source

Built Distribution

hf_fastup-0.0.4-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file hf-fastup-0.0.4.tar.gz.

File metadata

  • Download URL: hf-fastup-0.0.4.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf-fastup-0.0.4.tar.gz
Algorithm Hash digest
SHA256 4df84c86f96370ec861a037979b27dc47747517e6c32950efd874551d722e88e
MD5 024ca69c5135e18afff2651917c910f9
BLAKE2b-256 8d60362a495456d571471203714a33240a8611348b75424756f53d280aa0862f

File details

Details for the file hf_fastup-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: hf_fastup-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for hf_fastup-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 6d79e7a7a6c6000f614daf5f5887c4a7fe24ba787d2d867c69479c1d8e827cf3
MD5 c07cebd6e7af5fb4994df6da565098d2
BLAKE2b-256 98043cf62e816560c8e8f4608c1fc63cd2220ccd89311dd80a015446979f340b
