

Load Pools CLI

A command-line tool for uploading HuggingFace datasets and GitHub repositories to Load S3 and tagging them for use with Load's data pools.

Features

  • GitHub repository upload: Clone and upload entire repos with file-by-file tagging
  • HuggingFace integration: Upload datasets by path
  • Automatic tagging: Data-Protocol, Path, Filename, and Content-Type to make data discoverable
  • Query support: All uploads are queryable via s3-agent's tag query API

Installation

Prerequisites

A Load Network account API key from cloud.load.network. HuggingFace operations may also require a HuggingFace API token, which is free for registered users.

Install the package from PyPI:

pip install load-pools

Usage

Upload a GitHub Repository

Upload all files from a GitHub repository with proper folder structure tagging:

load-pools create --github https://github.com/owner/repo --auth YOUR_LOAD_API_KEY

What happens:

  1. Repository is cloned to a temporary directory
  2. Each file is uploaded individually to s3-agent
  3. Files are tagged with:
    • Data-Protocol: "owner/repo"
    • Path: "folder/subfolder" (relative path from repo root)
    • Filename: "file.ext"
    • Content-Type: "mime/type"

Example tags for file images/grayscale/9582.png:

[
  {"key": "Data-Protocol", "value": "owner/repo"},
  {"key": "Path", "value": "images/grayscale"},
  {"key": "Filename", "value": "9582.png"},
  {"key": "Content-Type", "value": "image/png"}
]

Upload a HuggingFace Dataset

Upload a HuggingFace dataset by its slug. Tables are uploaded row by row, with each row tagged individually.

load-pools create --hugging-face username/dataset-name --auth YOUR_LOAD_API_KEY 

For private datasets or to bypass anonymous rate limits, pass --hf-auth <YOUR_TOKEN>.

What happens:

  1. HuggingFace dataset is downloaded
  2. Table rows are extracted into individual dataitems
  3. Rows are uploaded to s3-agent with metadata tags
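Steps 2 and 3 can be sketched as follows. This is a simplified illustration using plain dicts in place of a real HuggingFace table; `rows_to_dataitems` and the `Row-Index` tag name are assumptions for the sketch, not confirmed CLI internals:

```python
import json


def rows_to_dataitems(dataset_slug: str, rows: list[dict]) -> list[dict]:
    """Serialize each table row as a JSON dataitem with metadata tags.

    The real CLI reads rows via the HuggingFace `datasets` library;
    here rows are plain dicts for illustration.
    """
    items = []
    for i, row in enumerate(rows):
        items.append({
            "body": json.dumps(row),
            "tags": [
                {"key": "Data-Protocol", "value": dataset_slug},
                {"key": "Row-Index", "value": str(i)},  # assumed tag name
                {"key": "Content-Type", "value": "application/json"},
            ],
        })
    return items


items = rows_to_dataitems("username/dataset-name",
                          [{"text": "hello"}, {"text": "world"}])
```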

Command Options

load-pools create [OPTIONS]

Options:
  --github TEXT          GitHub repository URL
  --hugging-face TEXT    HuggingFace repository slug (user/repo)
  --hf-auth TEXT         HuggingFace API token (private datasets, higher rate limits)
  --auth TEXT            Load account API key [required]
  -v, --verbose          Show detailed upload progress
  --help                 Show help message

Querying Uploaded Data

After uploading, you can query your data using the s3-agent tags API.

Query all files from a GitHub repository:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"}
    ]
  }'

Query files in a specific folder:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Path", "value": "images/grayscale"}
    ]
  }'

Query by filename:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Filename", "value": "9582.png"}
    ]
  }'

Query by content type:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Content-Type", "value": "image/png"}
    ]
  }'
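The same queries can be issued from Python with only the standard library. The payload shape matches the curl examples above; `tag_query_payload` is a hypothetical convenience helper (kwarg underscores become hyphens in tag keys):

```python
import json
from urllib import request

AGENT_URL = "https://load-s3-agent.load.network/tags/query"


def tag_query_payload(**tags: str) -> str:
    """Build the JSON body for the s3-agent tag query endpoint."""
    return json.dumps({
        "filters": [{"key": k.replace("_", "-"), "value": v}
                    for k, v in tags.items()]
    })


# The "files in a specific folder" query as a ready-to-send request.
payload = tag_query_payload(Data_Protocol="owner/repo", Path="images/grayscale")
req = request.Request(AGENT_URL, data=payload.encode(),
                      headers={"Content-Type": "application/json"})
# Sending requires network access; uncomment to execute:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```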

Examples

Upload a dataset repository:

load-pools create \
  --github https://github.com/username/my-dataset \
  --auth load_acc_xxxxxxxxxxxxx

Upload with verbose output:

load-pools create \
  --github https://github.com/ml-datasets/images \
  --auth load_acc_xxxxxxxxxxxxx \
  --verbose

Upload a HuggingFace dataset:

load-pools create \
  --hugging-face openai/graphwalks \
  --auth load_acc_xxxxxxxxxxxxx

License

MIT License - see LICENSE file for details.

Download files


Source Distribution

load_pools-0.1.12.tar.gz (10.1 kB)


Built Distribution


load_pools-0.1.12-py3-none-any.whl (10.5 kB)


File details

Details for the file load_pools-0.1.12.tar.gz.

File metadata

  • Download URL: load_pools-0.1.12.tar.gz
  • Size: 10.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for load_pools-0.1.12.tar.gz:

  • SHA256: 743464fa187b0754637f648f7b1c4e49079d0b626a91fda8dd908669f0c28bd6
  • MD5: 83a7565a0f83b876b10edc4331f9ce0a
  • BLAKE2b-256: 0eef942119fae38aef70112d1de7a653d7f4903fa6975705aa9f910348460650


File details

Details for the file load_pools-0.1.12-py3-none-any.whl.

File metadata

  • Download URL: load_pools-0.1.12-py3-none-any.whl
  • Size: 10.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for load_pools-0.1.12-py3-none-any.whl:

  • SHA256: 9637ee97d7f90232e1524ebab2abd23488eba81d25a49f2d077110d526e7aea6
  • MD5: f018d5484e1b2d9eee4b121a157185e4
  • BLAKE2b-256: 4347a0891a55cac18f5fccaa386d5f5ee5b08caba3f57ddd97ab1481972a1ebb

