Load Pools CLI

A command-line tool for uploading HuggingFace datasets and GitHub repositories to Load S3 and tagging them for use with Load's data pools.

Features

  • GitHub repository upload: Clone and upload entire repos with file-by-file tagging
  • HuggingFace integration: Upload datasets by path
  • Automatic tagging: files are tagged with Data-Protocol, Path, Filename, and Content-Type to make data discoverable
  • Query support: All uploads are queryable via s3-agent's tag query API

Installation

Prerequisites

You need a Load Network account API key, available from cloud.load.network. HuggingFace operations may additionally require a HuggingFace API token, which is free for registered HuggingFace users.

Usage

Upload a GitHub Repository

Upload all files from a GitHub repository with proper folder structure tagging:

load-pools create --github https://github.com/owner/repo --auth YOUR_LOAD_API_KEY

What happens:

  1. Repository is cloned to a temporary directory
  2. Each file is uploaded individually to s3-agent
  3. Files are tagged with:
    • Data-Protocol: "owner/repo"
    • Path: "folder/subfolder" (relative path from repo root)
    • Filename: "file.ext"
    • Content-Type: "mime/type"

Example tags for file images/grayscale/9582.png:

[
  {"key": "Data-Protocol", "value": "owner/repo"},
  {"key": "Path", "value": "images/grayscale"},
  {"key": "Filename", "value": "9582.png"},
  {"key": "Content-Type", "value": "image/png"}
]
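The tag layout above can be sketched as a small helper. This is an illustration of the tagging scheme, not the tool's actual implementation; the function name `build_tags` is hypothetical.

```python
import mimetypes
import os

def build_tags(repo_slug, rel_path):
    """Sketch: derive the four tags described above from a repo slug
    and a file path relative to the repo root."""
    folder, filename = os.path.split(rel_path)
    content_type, _ = mimetypes.guess_type(filename)
    return [
        {"key": "Data-Protocol", "value": repo_slug},
        {"key": "Path", "value": folder},
        {"key": "Filename", "value": filename},
        # Fall back to a generic type when the extension is unknown
        {"key": "Content-Type", "value": content_type or "application/octet-stream"},
    ]

print(build_tags("owner/repo", "images/grayscale/9582.png"))
```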

Upload a HuggingFace Dataset

Upload a HuggingFace dataset by its slug. Tables are uploaded row by row, with each row tagged.

load-pools create --hugging-face username/dataset-name --auth YOUR_LOAD_API_KEY 

For private datasets or to bypass anonymous rate limits, pass --hf-auth <YOUR_TOKEN>.

What happens:

  1. HuggingFace dataset is downloaded
  2. Table rows are extracted into individual dataitems
  3. Rows are uploaded to s3-agent with metadata tags
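The row-splitting step can be sketched as follows. The tool itself downloads the data via HuggingFace; here a plain list of dicts stands in for a downloaded table, and the per-row tag keys mirror the GitHub scheme (the exact keys used for dataset rows are not documented on this page, so treat them as assumptions).

```python
import json

def rows_to_dataitems(dataset_slug, table_name, rows):
    """Sketch: turn table rows into individually taggable JSON dataitems."""
    items = []
    for i, row in enumerate(rows):
        items.append({
            "body": json.dumps(row),  # each row becomes its own dataitem
            "tags": [
                {"key": "Data-Protocol", "value": dataset_slug},
                {"key": "Path", "value": table_name},           # assumption: table/split name as Path
                {"key": "Filename", "value": f"row-{i}.json"},  # assumption: synthetic filename
                {"key": "Content-Type", "value": "application/json"},
            ],
        })
    return items

rows = [{"text": "hello"}, {"text": "world"}]
print(len(rows_to_dataitems("username/dataset-name", "train", rows)))
```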

Command Options

load-pools create [OPTIONS]

Options:
  --github TEXT          GitHub repository URL
  --hugging-face TEXT    HuggingFace repository slug (user/repo)
  --hf-auth TEXT         HuggingFace API token (private datasets, higher rate limits)
  --auth TEXT            Load account API key [required]
  -v, --verbose          Show detailed upload progress
  --help                 Show help message

Querying Uploaded Data

After uploading, you can query your data using the s3-agent tags API.

Query all files from a GitHub repository:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"}
    ]
  }'

Query files in a specific folder:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Path", "value": "images/grayscale"}
    ]
  }'

Query by filename:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Filename", "value": "9582.png"}
    ]
  }'

Query by content type:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Content-Type", "value": "image/png"}
    ]
  }'
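All the curl payloads above share one shape: a list of {key, value} filters combined in a single query. A small helper for building those bodies programmatically can be sketched like this (the helper name is hypothetical; only the payload construction is shown, not the HTTP call):

```python
import json

# Endpoint shown in the curl examples above
QUERY_URL = "https://load-s3-agent.load.network/tags/query"

def tag_query(**filters):
    """Build the JSON body for the s3-agent tag query API.
    Keyword names use '_' in place of '-' (e.g. Data_Protocol)."""
    return json.dumps({
        "filters": [
            {"key": k.replace("_", "-"), "value": v} for k, v in filters.items()
        ]
    })

# Equivalent to the "files in a specific folder" query above:
print(tag_query(Data_Protocol="owner/repo", Path="images/grayscale"))
```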

Examples

Upload a dataset repository:

load-pools create \
  --github https://github.com/username/my-dataset \
  --auth load_acc_xxxxxxxxxxxxx

Upload with verbose output:

load-pools create \
  --github https://github.com/ml-datasets/images \
  --auth load_acc_xxxxxxxxxxxxx \
  --verbose

Upload a HuggingFace dataset:

load-pools create \
  --hugging-face openai/graphwalks \
  --auth load_acc_xxxxxxxxxxxxx

License

MIT License - see LICENSE file for details.


Download files

  • Source distribution: load_pools-0.1.16.tar.gz (10.1 kB)
  • Built distribution: load_pools-0.1.16-py3-none-any.whl (10.6 kB, Python 3 wheel)
