
Load Pools CLI

A command-line tool for uploading HuggingFace datasets and GitHub repositories to Load S3 and tagging them for use with Load's data pools.

Features

  • GitHub repository upload: Clone and upload entire repos with file-by-file tagging
  • HuggingFace integration: Upload datasets by path
  • Automatic tagging: Data-Protocol, Path, Filename, and Content-Type to make data discoverable
  • Query support: All uploads are queryable via s3-agent's tag query API

Installation

pip install load-pools

Prerequisites

  • A Load Network account API key from cloud.load.network.
  • A HuggingFace API token, needed for private datasets and to avoid anonymous rate limits; tokens are free for registered HuggingFace users.

Usage

Upload a GitHub Repository

Upload all files from a GitHub repository, tagging each file with its location in the repository's folder structure:

load-pools create --github https://github.com/owner/repo --auth YOUR_LOAD_API_KEY

What happens:

  1. Repository is cloned to a temporary directory
  2. Each file is uploaded individually to s3-agent
  3. Files are tagged with:
    • Data-Protocol: "owner/repo"
    • Path: "folder/subfolder" (relative path from repo root)
    • Filename: "file.ext"
    • Content-Type: "mime/type"

Example tags for file images/grayscale/9582.png:

[
  {"key": "Data-Protocol", "value": "owner/repo"},
  {"key": "Path", "value": "images/grayscale"},
  {"key": "Filename", "value": "9582.png"},
  {"key": "Content-Type", "value": "image/png"}
]
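The tag scheme above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual implementation; the function name build_tags is hypothetical:

```python
import mimetypes
import posixpath

def build_tags(repo_slug, rel_path):
    """Derive s3-agent tags for one repo file, mirroring the scheme above.

    repo_slug: "owner/repo"; rel_path: path relative to the repo root.
    For a root-level file the Path tag is the empty string.
    """
    folder, filename = posixpath.split(rel_path)
    content_type, _ = mimetypes.guess_type(filename)
    return [
        {"key": "Data-Protocol", "value": repo_slug},
        {"key": "Path", "value": folder},
        {"key": "Filename", "value": filename},
        {"key": "Content-Type", "value": content_type or "application/octet-stream"},
    ]

tags = build_tags("owner/repo", "images/grayscale/9582.png")
print(tags)
```

Running this reproduces the example tag list shown above.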

Upload a HuggingFace Dataset

Upload a HuggingFace dataset by its slug. Dataset tables are uploaded row by row, and each row is tagged with metadata.

load-pools create --hugging-face username/dataset-name --auth YOUR_LOAD_API_KEY 

For private datasets or to bypass anonymous rate limits, pass --hf-auth <YOUR_TOKEN>.

What happens:

  1. HuggingFace dataset is downloaded
  2. Table rows are extracted into individual dataitems
  3. Rows are uploaded to s3-agent with metadata tags
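The row extraction in steps 2–3 can be sketched roughly as follows. This is a simplified illustration under assumptions: serialize_rows and the exact tag values used for dataset rows are hypothetical, not the tool's real internals:

```python
import json

def serialize_rows(dataset_slug, split, rows):
    """Turn table rows into individual JSON dataitems with metadata tags.

    dataset_slug: "username/dataset-name"; rows: an iterable of dicts.
    """
    items = []
    for i, row in enumerate(rows):
        tags = [
            {"key": "Data-Protocol", "value": dataset_slug},
            {"key": "Path", "value": split},
            {"key": "Filename", "value": f"row-{i}.json"},
            {"key": "Content-Type", "value": "application/json"},
        ]
        # Each row becomes one uploadable dataitem: a JSON body plus its tags.
        items.append({"body": json.dumps(row), "tags": tags})
    return items

items = serialize_rows("username/dataset-name", "train",
                       [{"text": "hello"}, {"text": "world"}])
print(len(items))
```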

Command Options

load-pools create [OPTIONS]

Options:
  --github TEXT          GitHub repository URL
  --hugging-face TEXT    HuggingFace repository slug (user/repo)
  --auth TEXT            Load account API key [required]
  --hf-auth TEXT         HuggingFace API token (for private datasets or to bypass anonymous rate limits)
  -v, --verbose          Show detailed upload progress
  --help                 Show help message

Querying Uploaded Data

After uploading, you can query your data using the s3-agent tags API.

Query all files from a GitHub repository:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"}
    ]
  }'

Query files in a specific folder:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Path", "value": "images/grayscale"}
    ]
  }'

Query by filename:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Filename", "value": "9582.png"}
    ]
  }'

Query by content type:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Content-Type", "value": "image/png"}
    ]
  }'
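The same queries can be issued from Python with the standard library. A minimal sketch, assuming only the endpoint shown in the curl examples above and that the response body is JSON:

```python
import json
import urllib.request

AGENT_URL = "https://load-s3-agent.load.network/tags/query"

def build_query(filters):
    """Encode a list of {"key": ..., "value": ...} filters as the request body."""
    return json.dumps({"filters": filters}).encode()

def query_tags(filters, url=AGENT_URL):
    """POST a tag-filter query to the s3-agent and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=build_query(filters),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example: all PNG files from a repo (a network call, so left commented out):
png_filters = [
    {"key": "Data-Protocol", "value": "owner/repo"},
    {"key": "Content-Type", "value": "image/png"},
]
# results = query_tags(png_filters)
```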

Examples

Upload a dataset repository:

load-pools create \
  --github https://github.com/username/my-dataset \
  --auth load_acc_xxxxxxxxxxxxx

Upload with verbose output:

load-pools create \
  --github https://github.com/ml-datasets/images \
  --auth load_acc_xxxxxxxxxxxxx \
  --verbose

Upload a HuggingFace dataset:

load-pools create \
  --hugging-face openai/graphwalks \
  --auth load_acc_xxxxxxxxxxxxx

License

MIT License - see LICENSE file for details.

