
Load Pools CLI

A command-line tool for uploading HuggingFace datasets and GitHub repositories to Load S3 and tagging them for use with Load's data pools.

Features

  • GitHub repository upload: Clone and upload entire repos with file-by-file tagging
  • HuggingFace integration: Upload datasets by path
  • Automatic tagging: Data-Protocol, Path, Filename, and Content-Type to make data discoverable
  • Query support: All uploads are queryable via s3-agent's tag query API

Installation

pip install load-pools

Prerequisites

A Load Network account API key, available from cloud.load.network. HuggingFace operations may require a HuggingFace API token, which registered users can generate for free from their HuggingFace account settings.

Usage

Upload a GitHub Repository

Upload all files from a GitHub repository, preserving the repository's folder structure in tags:

load-pools create --github https://github.com/owner/repo --auth YOUR_LOAD_API_KEY

What happens:

  1. Repository is cloned to a temporary directory
  2. Each file is uploaded individually to s3-agent
  3. Files are tagged with:
    • Data-Protocol: "owner/repo"
    • Path: "folder/subfolder" (relative path from repo root)
    • Filename: "file.ext"
    • Content-Type: "mime/type"

Example tags for file images/grayscale/9582.png:

[
  {"key": "Data-Protocol", "value": "owner/repo"},
  {"key": "Path", "value": "images/grayscale"},
  {"key": "Filename", "value": "9582.png"},
  {"key": "Content-Type", "value": "image/png"}
]
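The tag scheme above can be sketched in Python. This is an illustrative reconstruction, not the tool's actual code; it assumes tags are derived from each file's path relative to the repo root, with the MIME type guessed from the extension:

```python
import mimetypes
from pathlib import PurePosixPath

def build_tags(repo_slug: str, rel_path: str) -> list[dict]:
    """Build s3-agent tags for one file from its path relative to the repo root."""
    p = PurePosixPath(rel_path)
    content_type, _ = mimetypes.guess_type(p.name)
    return [
        {"key": "Data-Protocol", "value": repo_slug},
        {"key": "Path", "value": str(p.parent)},   # e.g. "images/grayscale"
        {"key": "Filename", "value": p.name},      # e.g. "9582.png"
        # Fall back to a generic type when the extension is unknown
        {"key": "Content-Type", "value": content_type or "application/octet-stream"},
    ]

tags = build_tags("owner/repo", "images/grayscale/9582.png")
```

Running `build_tags` on the example path reproduces the four tags shown above.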

Upload a HuggingFace Dataset

Upload a HuggingFace dataset by its hub slug. Dataset tables are uploaded row by row, with each row tagged for discovery.

load-pools create --hugging-face username/dataset-name --auth YOUR_LOAD_API_KEY

For private datasets or to bypass anonymous rate limits, pass --hf-auth <YOUR_TOKEN>.

What happens:

  1. HuggingFace dataset is downloaded
  2. Table rows are extracted into individual dataitems
  3. Rows are uploaded to s3-agent with metadata tags
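Step 2 can be sketched as follows. This is a hypothetical illustration of the row-to-dataitem conversion, not the tool's actual code; the `Row-Index` tag and the dataitem shape are assumptions:

```python
import json

def rows_to_dataitems(dataset_slug: str, rows: list[dict]) -> list[dict]:
    """Serialize each table row as a standalone JSON dataitem with metadata tags."""
    items = []
    for i, row in enumerate(rows):
        items.append({
            "body": json.dumps(row),  # one row becomes one uploadable payload
            "tags": [
                {"key": "Data-Protocol", "value": dataset_slug},
                {"key": "Row-Index", "value": str(i)},  # assumed tag name
                {"key": "Content-Type", "value": "application/json"},
            ],
        })
    return items

items = rows_to_dataitems("username/dataset-name", [{"text": "hello"}, {"text": "world"}])
```

Each resulting dataitem is then uploaded to s3-agent individually, so rows can be queried one at a time.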

Command Options

load-pools create [OPTIONS]

Options:
  --github TEXT          GitHub repository URL
  --hugging-face TEXT    HuggingFace repository slug (user/repo)
  --hf-auth TEXT         HuggingFace API token (for private datasets or rate limits)
  --auth TEXT            Load account API key [required]
  -v, --verbose          Show detailed upload progress
  --help                 Show help message

Querying Uploaded Data

After uploading, you can query your data using the s3-agent tags API.

Query all files from a GitHub repository:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"}
    ]
  }'

Query files in a specific folder:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Path", "value": "images/grayscale"}
    ]
  }'

Query by filename:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Filename", "value": "9582.png"}
    ]
  }'

Query by content type:

curl -X POST https://load-s3-agent.load.network/tags/query \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      {"key": "Data-Protocol", "value": "owner/repo"},
      {"key": "Content-Type", "value": "image/png"}
    ]
  }'
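All four queries share the same payload shape, differing only in the filter list. A small helper (illustrative only; the endpoint and filter format are as shown in the curl examples above) can build the JSON body:

```python
import json

def tag_query(**filters: str) -> str:
    """Build the JSON body for a POST to the s3-agent /tags/query endpoint.

    Keyword names use '_' in place of '-', e.g. data_protocol -> Data-Protocol.
    """
    return json.dumps({
        "filters": [
            {"key": key.replace("_", "-").title(), "value": value}
            for key, value in filters.items()
        ]
    })

body = tag_query(data_protocol="owner/repo", content_type="image/png")
```

The returned string can be sent as the `-d` body of the curl calls above.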

Examples

Upload a dataset repository:

load-pools create \
  --github https://github.com/username/my-dataset \
  --auth load_acc_xxxxxxxxxxxxx

Upload with verbose output:

load-pools create \
  --github https://github.com/ml-datasets/images \
  --auth load_acc_xxxxxxxxxxxxx \
  --verbose

Upload a HuggingFace dataset:

load-pools create \
  --hugging-face openai/graphwalks \
  --auth load_acc_xxxxxxxxxxxxx

License

MIT License - see LICENSE file for details.

Download files

Source distribution: load_pools-0.1.14.tar.gz (10.1 kB)

Built distribution: load_pools-0.1.14-py3-none-any.whl (10.5 kB, Python 3)

File details

load_pools-0.1.14.tar.gz (source, 10.1 kB, uploaded via twine/6.2.0 on CPython/3.12.3, Trusted Publishing: no)

  • SHA256: ee4699da3de226f7a15ed36bdee8e47cab4423172bf1ed2ee5f5761f49ff12cc
  • MD5: 4691825160e759544194645465874a0c
  • BLAKE2b-256: 0716a8f5af692e4e79e1aff8b30d9099c18fff959ac704a81c7e97ab2b689ebb

load_pools-0.1.14-py3-none-any.whl (Python 3, 10.5 kB, uploaded via twine/6.2.0 on CPython/3.12.3, Trusted Publishing: no)

  • SHA256: 91cbec4e863e95f106703bc7fc5ba43358ab7fc6c5407a66a8feacfb34f870e5
  • MD5: 5c47ad69fc9eb746251a2c2e6ab4c269
  • BLAKE2b-256: a89ed2988d0b3bd04ae8073759e92f921cfa87e3ccd40adeea734f69d6ab5627
