Python SDK for the AI Factory Compute API

AI Factory SDK

Python SDK for the AI Factory Compute API — submit and manage HPC jobs from Python.

Features

  • Synchronous and asynchronous clients (AIFactoryClient, AsyncAIFactoryClient)
  • Typed request/response models with Pydantic validation
  • Job polling with configurable timeout and retry (client.wait())
  • Automatic retry on transient errors (429, 5xx)
  • PEP 561 compatible — full type annotation coverage
  • ai-factory CLI for shell workflows (ai-factory jobs list/get/submit-container/cancel)
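
The automatic retry on transient errors (429, 5xx) can be pictured with a short sketch. This is not the SDK's implementation: the names `is_transient` and `call_with_retry`, the attempt count, and the exponential backoff values are all illustrative assumptions.

```python
import time
from typing import Callable


def is_transient(status_code: int) -> bool:
    """Return True for the status codes the feature list calls transient."""
    return status_code == 429 or 500 <= status_code <= 599


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-transient status or attempts run out.

    `send` is a hypothetical callable that performs one HTTP request and
    returns its status code. The delay doubles after every failed attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = send()
    for attempt in range(1, max_attempts):
        if not is_transient(status):
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting.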

Installation

pip install ai-factory-sdk

Or with uv:

uv add ai-factory-sdk

Pre-release versions

Development builds published from the dev branch use PEP 440 pre-release suffixes (e.g., 0.2.0.dev1). Install them with:

pip install ai-factory-sdk --pre

Quick Start

from ai_factory.sdk import AIFactoryClient, JobRequest

# Credentials resolve from ~/.ai-factory/config.yaml, env vars, or constructor
# args (see "Configuration" below). Passed explicitly here for clarity:
with AIFactoryClient(
    api_key="dev-portal-api-key",
    slurm_token="slurm-jwt",
    slurm_user="jane",
) as client:
    # Submit a job
    resp = client.submit_job(
        JobRequest(name="hello", script="#!/bin/bash\necho Hello from SLURM")
    )
    print(f"Submitted job {resp.job_id}")

    # Wait for completion
    if resp.job_id is not None:
        detail = client.wait(str(resp.job_id), timeout=3600)
        print(f"Job finished with status: {detail.status}")

Async Usage

import asyncio
from ai_factory.sdk import AsyncAIFactoryClient, JobRequest

async def main():
    async with AsyncAIFactoryClient(
        api_key="dev-portal-api-key",
        slurm_token="slurm-jwt",
        slurm_user="jane",
    ) as client:
        resp = await client.submit_job(
            JobRequest(name="async-job", script="#!/bin/bash\nsleep 10 && echo done")
        )
        if resp.job_id is not None:
            detail = await client.wait(str(resp.job_id))
            print(detail.status)

asyncio.run(main())

Container Jobs

from ai_factory.sdk import AIFactoryClient, ContainerJobRequest

with AIFactoryClient(
    api_key="dev-portal-api-key",
    slurm_token="slurm-jwt",
    slurm_user="jane",
) as client:
    resp = client.submit_container(
        ContainerJobRequest(
            name="gpu-training",
            image="docker://nvcr.io/nvidia/pytorch:24.01-py3",
            container_command="python train.py",
            gres="gpu:a40:1",
            time_limit=120,
        )
    )

Configuration

The Compute API sits behind an APISIX gateway, so two distinct credentials are required (see onboarding & auth flow):

  • api_key — the Developer Portal API key. The SDK sends it as the apikey header so APISIX's key-auth plugin lets the request through.
  • slurm_token — the Slurm JWT (scontrol token). The Compute API forwards it to the upstream Slurm REST endpoints.

Credentials resolve from three sources, in priority order:

  1. Explicit constructor arguments: Client(api_key=..., slurm_token=..., slurm_user=...).
  2. Environment variables.
  3. YAML config file at ~/.ai-factory/config.yaml.

If a required value is missing from all three sources, Client() raises ValueError with a message listing all three options.
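
The priority order above can be sketched as a small lookup function. This is an illustration of the documented order only, not the SDK's code; `resolve_setting` and its parameters are hypothetical names, and the config file is assumed to be already parsed into a mapping.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    explicit: Optional[str],
    env_var: str,
    file_key: str,
    file_config: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Resolve one setting: constructor argument, then env var, then config file."""
    if explicit is not None:
        return explicit
    if env_var in environ:
        return environ[env_var]
    if file_key in file_config:
        return file_config[file_key]
    # Mirror the documented behaviour: name all three options in the error.
    raise ValueError(
        f"{name} is not set: pass {name}=..., export {env_var}, "
        f"or add '{file_key}' to ~/.ai-factory/config.yaml"
    )
```

For example, `resolve_setting("api_key", None, "AI_FACTORY_API_KEY", "api_key", {})` would return the value of the environment variable when it is exported, and raise ValueError otherwise.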

Parameter | Environment Variable | Config File Key | Default
base_url | AI_FACTORY_API_URL | api_url | https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1
api_key | AI_FACTORY_API_KEY | api_key | (required; Developer Portal key, sent as apikey)
slurm_token | AI_FACTORY_SLURM_TOKEN | slurm_token | (required; Slurm JWT, sent as X-SLURM-USER-TOKEN)
slurm_user | AI_FACTORY_SLURM_USER | slurm_user | (required)
timeout | - | - | 30.0 (HTTP timeout in seconds)

Config file

Example ~/.ai-factory/config.yaml:

api_url: "https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1"
api_key: "your-developer-portal-api-key"
slurm_token: "eyJhbGciOiJSUzI1NiIs..."   # scontrol token output
slurm_user: "jane.doe"

Secure the file so only your user can read it:

chmod 600 ~/.ai-factory/config.yaml

The SDK emits a UserWarning when Client() is constructed and the file is group- or world-accessible. A malformed or unreadable file raises ConfigFileError (a subclass of SDKError).
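
A permission check of this kind can be written with the standard library alone. The sketch below is an assumption about how such a check might look, not the SDK's own code; `warn_if_too_permissive` is a hypothetical name and the check relies on POSIX permission bits.

```python
import os
import stat
import warnings


def warn_if_too_permissive(path: str) -> bool:
    """Warn and return True if group or others have any access to `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # S_IRWXG and S_IRWXO cover read, write and execute for group and others.
    exposed = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
    if exposed:
        warnings.warn(
            f"{path} is accessible by group or others (mode {mode:o}); "
            "run chmod 600 on it",
            UserWarning,
            stacklevel=2,
        )
    return exposed
```

A file with mode 600 passes silently, while mode 644 triggers the warning.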

Command-Line Interface

Installing the SDK also registers the ai-factory console script for users who prefer shell workflows or want to drive the platform from bash:

# Set credentials once (or use ~/.ai-factory/config.yaml — same resolution chain as Client())
export AI_FACTORY_API_KEY="your-developer-portal-api-key"     # -> apikey header
export AI_FACTORY_SLURM_TOKEN="eyJhbGciOi..."                 # -> X-SLURM-USER-TOKEN
export AI_FACTORY_SLURM_USER="jane.doe"

ai-factory --version                    # print SDK version and exit
ai-factory jobs list                    # table output
ai-factory jobs list --json             # machine-readable
ai-factory jobs get 459381              # single job detail
ai-factory jobs submit-container \
    --name training-run \
    --image docker://nvcr.io/nvidia/pytorch:24.01-py3 \
    --command "python train.py" \
    --partition GPU-a100 \
    --gres gpu:a40:1 \
    --time-limit 120
ai-factory jobs cancel 459381

Every subcommand supports --help and --json. Errors map to distinct exit codes so shell scripts can branch on the failure mode (codes start at 10 so they do not collide with Click/Typer's argument-parse exit 2):

Exit code | Meaning
0 | success
2 | usage error (raised by Typer for unknown options/missing arguments)
10 | configuration error (missing credentials, bad config file)
11 | authentication failed (expired/invalid token)
12 | resource not found (e.g. unknown job ID)
13 | API error (server returned non-success status)
14 | other SDK error
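
When the CLI is driven from Python rather than bash, the same exit codes can be branched on via subprocess. The sketch below copies the table above into a dictionary; `run_cli` and `describe_exit` are hypothetical helper names, and `run_cli` assumes the ai-factory script is on PATH.

```python
import subprocess
from typing import Sequence

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "success",
    2: "usage error",
    10: "configuration error",
    11: "authentication failed",
    12: "resource not found",
    13: "API error",
    14: "other SDK error",
}


def describe_exit(code: int) -> str:
    """Translate a CLI exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_cli(args: Sequence[str]) -> tuple[int, str]:
    """Run `ai-factory <args>` and return (exit code, captured stdout)."""
    proc = subprocess.run(
        ["ai-factory", *args], capture_output=True, text=True, check=False
    )
    return proc.returncode, proc.stdout
```

A caller might run `code, out = run_cli(["jobs", "get", "459381", "--json"])` and refresh the Slurm token only when `code == 11`.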

The CLI shares the credential resolution chain with AIFactoryClient: explicit env vars take precedence over ~/.ai-factory/config.yaml.

There is no ai-factory jobs wait subcommand yet. Poll with jobs get --json in a script until .status is one of completed / errored / cancelled, or use the Python client.wait() method directly.
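
The polling approach described above might look like the following sketch. It assumes that jobs get --json prints a single JSON object with a top-level status field whose terminal values are spelled exactly as listed above; `fetch_job` and `wait_for_job` are hypothetical helpers, and the interval and timeout defaults are arbitrary.

```python
import json
import subprocess
import time
from typing import Callable

# Terminal states as named in the paragraph above.
TERMINAL_STATES = {"completed", "errored", "cancelled"}


def fetch_job(job_id: str) -> dict:
    """Run `ai-factory jobs get <id> --json` and parse its output."""
    proc = subprocess.run(
        ["ai-factory", "jobs", "get", job_id, "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval: float = 10.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        sleep(interval)
```

Keep the interval generous, since every poll is a real request through the gateway.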

API Reference

Clients

Class | Description
AIFactoryClient | Synchronous client (context manager)
AsyncAIFactoryClient | Asynchronous client (async context manager)

Methods

Method | Description
submit_job(request) | Submit a Slurm job script
submit_container(request) | Submit a containerised job
get_job(job_id) | Get job details by ID
list_jobs(...) | List jobs with optional filters and pagination
cancel_job(job_id) | Cancel a running or pending job
wait(job_id, ...) | Poll until the job reaches a terminal state

Request Models

Model | Fields
JobRequest | name, script, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error
ContainerJobRequest | name, image, container_command, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error

Response Models

Model | Fields
SubmitJobResponse | job_id, output_dir, logs_url
JobDetail | job_id, name, status, partition, nodes, exit_code, duration, start_time, end_time, submit_time, working_directory, standard_output, standard_error, gres, output_dir, logs_url
JobListItem | job_id, name, status, duration, start_time, end_time
JobList | jobs, total, limit, offset
CancelJobResponse | message
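
The jobs, total, limit and offset fields of JobList suggest offset-based pagination. The sketch below walks all pages using plain dictionaries as stand-ins for JobList. `iter_all_jobs` and `fetch_page` are hypothetical, and whether list_jobs() accepts limit and offset keyword arguments is an assumption, since its signature is not spelled out here.

```python
from typing import Callable, Iterator


def iter_all_jobs(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 50,
) -> Iterator[dict]:
    """Yield every job by requesting successive pages.

    `fetch_page(limit, offset)` must return a mapping shaped like JobList,
    with at least the keys "jobs" and "total".
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        jobs = page["jobs"]
        yield from jobs
        offset += len(jobs)
        # Stop on an empty page or once every reported job has been seen.
        if not jobs or offset >= page["total"]:
            break
```

With the real client, `fetch_page` could be a small adapter around list_jobs() once its parameters are confirmed.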

Exceptions

Exception | When
SDKError | Base for all SDK errors
APIError | Non-2xx HTTP response
AuthError | 401 or 403 response
NotFoundError | 404 response
WaitTimeoutError | wait() exceeded its deadline
ConfigFileError | ~/.ai-factory/config.yaml unreadable or malformed
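
When handling these exceptions, it is safest to check the most specific types first and fall back to SDKError last. The sketch below uses stand-in classes so that it runs without the SDK installed. Only the fact that SDKError is the common base comes from the table above; making AuthError and NotFoundError subclasses of APIError is an assumption, and `explain` is a hypothetical helper.

```python
class SDKError(Exception):
    """Stand-in for the SDK's base exception."""


class APIError(SDKError):
    """Stand-in: non-2xx HTTP response."""


class AuthError(APIError):
    """Stand-in: 401 or 403 (subclassing APIError is an assumption)."""


class NotFoundError(APIError):
    """Stand-in: 404 (subclassing APIError is an assumption)."""


class WaitTimeoutError(SDKError):
    """Stand-in: wait() exceeded its deadline."""


class ConfigFileError(SDKError):
    """Stand-in: config file unreadable or malformed."""


def explain(exc: SDKError) -> str:
    """Map an SDK exception to a short hint, most specific type first."""
    if isinstance(exc, ConfigFileError):
        return "fix or remove ~/.ai-factory/config.yaml"
    if isinstance(exc, AuthError):
        return "refresh the Slurm token or check the API key"
    if isinstance(exc, NotFoundError):
        return "check the job ID"
    if isinstance(exc, WaitTimeoutError):
        return "job has not finished; wait again or cancel it"
    if isinstance(exc, APIError):
        return "server returned a non-2xx response"
    return "unexpected SDK error"
```

The ordering works whether or not AuthError and NotFoundError actually derive from APIError, because the generic APIError check comes after them.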

Requirements

End-to-end verification

An SDK-driven Path 2 test lives at test/e2e/test_sdk_path2.py in the monorepo. It submits a real container job through the published-shape AIFactoryClient, polls until terminal state, and validates the JobDetail schema.

Run it locally against staging:

export COMPUTE_API_URL="https://aifactory-dev.ai-factory.datalab.tuwien.ac.at/compute-api/v1"
export APISIX_CI_API_KEY="your-developer-portal-api-key"    # sent as the apikey header
export SLURM_USERNAME="$(whoami)"
export SLURM_USER_TOKEN="$(scontrol token | cut -d= -f2)"   # Slurm JWT, rotates often

uv run pytest test/e2e/test_sdk_path2.py -v -s -m sdk_e2e
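
Because the Slurm JWT rotates often, a Python script may prefer to fetch it fresh instead of exporting it by hand. The sketch below is a Python counterpart of the scontrol token | cut -d= -f2 pipeline above. It assumes scontrol prints a line of the form SLURM_JWT=<token>; `parse_slurm_token` and `fetch_slurm_token` are hypothetical helpers.

```python
import subprocess


def parse_slurm_token(output: str) -> str:
    """Extract the JWT from `scontrol token` output (assumed SLURM_JWT=<token>)."""
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "SLURM_JWT" and value:
            return value
    raise ValueError("no SLURM_JWT=... line found in scontrol output")


def fetch_slurm_token() -> str:
    """Run `scontrol token` on a Slurm login node and return the JWT."""
    proc = subprocess.run(
        ["scontrol", "token"], capture_output=True, text=True, check=True
    )
    return parse_slurm_token(proc.stdout)
```

The returned value can then be passed as slurm_token to the client constructor or exported as SLURM_USER_TOKEN for the test run.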

In CI, trigger the manual sdk-e2e job in the post-deploy-verify stage. The job is intentionally manual because each run queues a real Slurm job.

License

MIT
