
Load Pools CLI

A command-line tool for uploading HuggingFace datasets and GitHub repositories to Load S3 and tagging them for use with Load's data pools.

Features

  • GitHub repository upload: Clone and upload entire repos with file-by-file tagging
  • HuggingFace integration: Upload datasets by path
  • Automatic tagging: Data-Protocol, Path, Filename, and Content-Type to make data discoverable
  • Query support: All uploads are queryable via s3-agent's tag query API

Installation

Install from PyPI:

pip install load-pools

Prerequisites

  • A Load Network account API key from cloud.load.network
  • A HuggingFace API token, required for some HuggingFace calls (free for registered HuggingFace users)

Usage

Upload a GitHub Repository

Upload all files from a GitHub repository with proper folder structure tagging:

load-pools create --github https://github.com/owner/repo --auth YOUR_LOAD_API_KEY

What happens:

  1. Repository is cloned to a temporary directory
  2. Each file is uploaded individually to s3-agent
  3. Files are tagged with:
    • Data-Protocol: "owner/repo"
    • Path: "folder/subfolder" (relative path from repo root)
    • Filename: "file.ext"
    • Content-Type: "mime/type"

Example tags for file images/grayscale/9582.png:

[
  {"key": "Data-Protocol", "value": "owner/repo"},
  {"key": "Path", "value": "images/grayscale"},
  {"key": "Filename", "value": "9582.png"},
  {"key": "Content-Type", "value": "image/png"}
]
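The tagging scheme can be sketched in Python, using the standard library's mimetypes to infer Content-Type. This is an illustrative reimplementation, not the CLI's actual code (build_tags is a hypothetical helper, and root-level files come out with Path "." here, which the real tool may handle differently):

```python
import mimetypes
from pathlib import PurePosixPath

def build_tags(repo: str, rel_path: str) -> list[dict]:
    """Derive the four tags for one file, given its repo-relative path."""
    p = PurePosixPath(rel_path)
    # Fall back to a generic type when the extension is unknown.
    content_type = mimetypes.guess_type(p.name)[0] or "application/octet-stream"
    return [
        {"key": "Data-Protocol", "value": repo},
        {"key": "Path", "value": str(p.parent)},  # "." for root-level files
        {"key": "Filename", "value": p.name},
        {"key": "Content-Type", "value": content_type},
    ]
```

Calling build_tags("owner/repo", "images/grayscale/9582.png") reproduces the tag list shown above.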

Upload a HuggingFace Dataset

Upload a HuggingFace dataset by its slug. Dataset tables are uploaded row by row, and each row is tagged with metadata.

load-pools create --hugging-face username/dataset-name --auth YOUR_LOAD_API_KEY 

For private datasets or to bypass anonymous rate limits, pass --hf-auth <YOUR_TOKEN>:

load-pools create --hugging-face username/dataset-name --auth YOUR_LOAD_API_KEY --hf-auth YOUR_HF_TOKEN

What happens:

  1. HuggingFace dataset is downloaded
  2. Table rows are extracted into individual dataitems
  3. Rows are uploaded to s3-agent with metadata tags
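Step 2 can be sketched as follows, assuming each row is serialized as JSON; the CLI's actual payload layout and per-row tags are not documented here, so rows_to_dataitems and its tag choices are illustrative:

```python
import json

def rows_to_dataitems(dataset: str, rows: list[dict]) -> list[dict]:
    """Turn table rows into individually uploadable, tagged dataitems."""
    items = []
    for i, row in enumerate(rows):
        items.append({
            # One dataitem per row, serialized as JSON.
            "body": json.dumps(row),
            "tags": [
                {"key": "Data-Protocol", "value": dataset},
                {"key": "Filename", "value": f"row-{i}.json"},
                {"key": "Content-Type", "value": "application/json"},
            ],
        })
    return items
```

Each resulting dataitem carries its own tag set, so individual rows stay queryable after upload.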

Command Options

load-pools create [OPTIONS]

Options:
  --github TEXT          GitHub repository URL
  --hugging-face TEXT    HuggingFace repository slug (user/repo)
  --auth TEXT            Load account API key [required]
  --hf-auth TEXT         HuggingFace API token, for private datasets or to bypass anonymous rate limits
  -v, --verbose          Show detailed upload progress
  --help                 Show help message

Querying Uploaded Data

After uploading, you can query your data using the s3-agent tags API.

Query all files from a GitHub repository:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"}
    ]
  }'

Query files in a specific folder:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Path", "value": "images/grayscale"}
    ]
  }'

Query by filename:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Filename", "value": "9582.png"}
    ]
  }'

Query by content type:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Content-Type", "value": "image/png"}
    ]
  }'
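The curl calls above all share one shape, so a small Python helper can build the request from a mapping of tag filters. tag_query is a hypothetical convenience, not part of the CLI; it only targets the documented /tags/query endpoint:

```python
import json
import urllib.request

QUERY_URL = "https://load-s3-agent.load.network/tags/query"

def tag_query(filters: dict) -> urllib.request.Request:
    """Build a POST to the s3-agent tag query API from {tag-key: value} pairs."""
    body = json.dumps(
        {"filters": [{"key": k, "value": v} for k, v in filters.items()]}
    ).encode("utf-8")
    return urllib.request.Request(
        QUERY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Equivalent to the "files in a specific folder" query above; send it with
# urllib.request.urlopen(req) when you actually want the results.
req = tag_query({"Data-Protocol": "owner/repo", "Path": "images/grayscale"})
```

Combining several filters in one mapping narrows the result set, exactly as in the multi-filter curl examples.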

Examples

Upload a dataset repository:

load-pools create \
  --github https://github.com/username/my-dataset \
  --auth load_acc_xxxxxxxxxxxxx

Upload with verbose output:

load-pools create \
  --github https://github.com/ml-datasets/images \
  --auth load_acc_xxxxxxxxxxxxx \
  --verbose

Upload a HuggingFace dataset:

load-pools create \
  --hugging-face openai/graphwalks \
  --auth load_acc_xxxxxxxxxxxxx

License

MIT License - see LICENSE file for details.
