Python SDK for the AI Factory Compute API
AI Factory SDK
Python SDK for the AI Factory Compute API — submit and manage HPC jobs from Python.
Features
- Synchronous and asynchronous clients (AIFactoryClient, AsyncAIFactoryClient)
- Typed request/response models with Pydantic validation
- Job polling with configurable timeout and retry (client.wait())
- Automatic retry on transient errors (429, 5xx)
- PEP 561 compatible, with full type annotation coverage
- ai-factory CLI for shell workflows (ai-factory jobs list/get/submit-container/cancel)
Installation
pip install ai-factory-sdk
Or with uv:
uv add ai-factory-sdk
Pre-release versions
Development builds published from the dev branch use PEP 440 pre-release
suffixes (e.g., 0.2.0.dev1). Install them with:
pip install ai-factory-sdk --pre
Quick Start
from ai_factory.sdk import AIFactoryClient, JobRequest
# Credentials resolve from ~/.ai-factory/config.yaml, env vars, or constructor
# args (see "Configuration" below). Passed explicitly here for clarity:
with AIFactoryClient(
    api_key="dev-portal-api-key",
    slurm_token="slurm-jwt",
    slurm_user="jane",
) as client:
    # Submit a job
    resp = client.submit_job(
        JobRequest(name="hello", script="#!/bin/bash\necho Hello from SLURM")
    )
    print(f"Submitted job {resp.job_id}")
    # Wait for completion
    if resp.job_id is not None:
        detail = client.wait(str(resp.job_id), timeout=3600)
        print(f"Job finished with status: {detail.status}")
Async Usage
import asyncio
from ai_factory.sdk import AsyncAIFactoryClient, JobRequest
async def main():
    async with AsyncAIFactoryClient(
        api_key="dev-portal-api-key",
        slurm_token="slurm-jwt",
        slurm_user="jane",
    ) as client:
        resp = await client.submit_job(
            JobRequest(name="async-job", script="#!/bin/bash\nsleep 10 && echo done")
        )
        if resp.job_id is not None:
            detail = await client.wait(str(resp.job_id))
            print(detail.status)

asyncio.run(main())
Container Jobs
from ai_factory.sdk import AIFactoryClient, ContainerJobRequest
with AIFactoryClient(
    api_key="dev-portal-api-key",
    slurm_token="slurm-jwt",
    slurm_user="jane",
) as client:
    resp = client.submit_container(
        ContainerJobRequest(
            name="gpu-training",
            image="docker://nvcr.io/nvidia/pytorch:24.01-py3",
            container_command="python train.py",
            gres="gpu:a40:1",
            time_limit=120,
        )
    )
Configuration
The Compute API sits behind an APISIX gateway, so two distinct credentials are required (see onboarding & auth flow):
- api_key: the Developer Portal API key. The SDK sends it as the apikey header so APISIX's key-auth plugin lets the request through.
- slurm_token: the Slurm JWT (scontrol token). The Compute API forwards it to the upstream Slurm REST endpoints.
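As a rough illustration of the two-credential scheme, the sketch below builds the two headers named in this README (apikey and X-SLURM-USER-TOKEN). The helper build_auth_headers is hypothetical and not part of the SDK, which sets these headers itself; something like it is only useful when debugging the gateway with another HTTP tool.

```python
def build_auth_headers(api_key: str, slurm_token: str) -> dict[str, str]:
    """Return the two credential headers (hypothetical helper, not SDK API)."""
    if not api_key or not slurm_token:
        raise ValueError("both api_key and slurm_token are required")
    return {
        "apikey": api_key,                  # checked by APISIX's key-auth plugin
        "X-SLURM-USER-TOKEN": slurm_token,  # forwarded to the Slurm REST endpoints
    }


headers = build_auth_headers("your-developer-portal-api-key", "slurm-jwt")
print(sorted(headers))
```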
Credentials resolve from three sources, in priority order:
- Explicit constructor arguments: Client(api_key=..., slurm_token=..., slurm_user=...).
- Environment variables.
- YAML config file at ~/.ai-factory/config.yaml.
If a required value is missing from all three sources, Client() raises
ValueError with a message listing all three options.
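The priority order can be sketched as a small lookup chain. This is a minimal illustration under stated assumptions, not the SDK's actual code: resolve_setting is a hypothetical name, the config file is modelled as an already-parsed mapping, and an explicit None is assumed to mean "not provided".

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env_var: str,
    config: Mapping[str, str],
    config_key: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Pick the first available value: argument, then env var, then config."""
    if explicit is not None:
        return explicit
    if env_var in environ:
        return environ[env_var]
    if config_key in config:
        return config[config_key]
    raise ValueError(
        f"{config_key} is missing: pass it as an argument, "
        f"set {env_var}, or add {config_key} to the config file"
    )


# Stand-ins for the parsed config file and the process environment.
config = {"api_key": "from-file", "slurm_user": "jane.doe"}
env = {"AI_FACTORY_API_KEY": "from-env"}

print(resolve_setting(None, "AI_FACTORY_API_KEY", config, "api_key", env))        # from-env
print(resolve_setting(None, "AI_FACTORY_SLURM_USER", config, "slurm_user", env))  # jane.doe
```

Passing the environment as a parameter keeps the sketch easy to test without touching the real process environment.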
| Parameter | Environment Variable | Config File Key | Default |
|---|---|---|---|
| base_url | AI_FACTORY_API_URL | api_url | https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1 |
| api_key | AI_FACTORY_API_KEY | api_key | (required: Developer Portal key, sent as apikey) |
| slurm_token | AI_FACTORY_SLURM_TOKEN | slurm_token | (required: Slurm JWT, sent as X-SLURM-USER-TOKEN) |
| slurm_user | AI_FACTORY_SLURM_USER | slurm_user | (required) |
| timeout | (none) | (none) | 30.0 (HTTP timeout in seconds) |
Config file
Example ~/.ai-factory/config.yaml:
api_url: "https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1"
api_key: "your-developer-portal-api-key"
slurm_token: "eyJhbGciOiJSUzI1NiIs..." # scontrol token output
slurm_user: "jane.doe"
Secure the file so only your user can read it:
chmod 600 ~/.ai-factory/config.yaml
The SDK emits a UserWarning when a Client() is constructed if the file
is group- or world-accessible. A malformed or unreadable file raises
ConfigFileError (a subclass of SDKError).
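A permission check of this kind can be approximated with the standard library. The sketch below is an assumption about how such a check might look, not the SDK's implementation; warn_if_exposed is a hypothetical name, and the demo runs against a throwaway file on a POSIX system rather than the real config.

```python
import stat
import tempfile
import warnings
from pathlib import Path


def warn_if_exposed(path: Path) -> bool:
    """Warn and return True if group or others have any access to path."""
    mode = stat.S_IMODE(path.stat().st_mode)
    exposed = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
    if exposed:
        warnings.warn(f"{path} has mode {mode:o}; consider chmod 600", UserWarning)
    return exposed


# Demo on a temporary file: first world-readable, then owner-only.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.yaml"
    demo.write_text("api_key: example\n")
    demo.chmod(0o644)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence the warning for the demo
        loose = warn_if_exposed(demo)
    demo.chmod(0o600)
    tight = warn_if_exposed(demo)

print(loose, tight)
```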
Command-Line Interface
Installing the SDK also registers the ai-factory console script for users
who prefer shell workflows or want to drive the platform from bash:
# Set credentials once (or use ~/.ai-factory/config.yaml — same resolution chain as Client())
export AI_FACTORY_API_KEY="your-developer-portal-api-key" # -> apikey header
export AI_FACTORY_SLURM_TOKEN="eyJhbGciOi..." # -> X-SLURM-USER-TOKEN
export AI_FACTORY_SLURM_USER="jane.doe"
ai-factory --version # print SDK version and exit
ai-factory jobs list # table output
ai-factory jobs list --json # machine-readable
ai-factory jobs get 459381 # single job detail
ai-factory jobs submit-container \
--name training-run \
--image docker://nvcr.io/nvidia/pytorch:24.01-py3 \
--command "python train.py" \
--partition GPU-a100 \
--gres gpu:a40:1 \
--time-limit 120
ai-factory jobs cancel 459381
Every subcommand supports --help and --json. Errors map to distinct
exit codes so shell scripts can branch on the failure mode (codes start at
10 so they do not collide with Click/Typer's argument-parse exit 2):
| Exit code | Meaning |
|---|---|
| 0 | success |
| 2 | usage error (raised by Typer for unknown options/missing arguments) |
| 10 | configuration error (missing credentials, bad config file) |
| 11 | authentication failed (expired/invalid token) |
| 12 | resource not found (e.g. unknown job ID) |
| 13 | API error (server returned non-success status) |
| 14 | other SDK error |
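For callers that drive the CLI from Python rather than bash, the exit codes above can be mapped back to their meanings. This is a small sketch: describe_exit and run_cli are hypothetical helpers, and run_cli assumes the ai-factory console script is on PATH.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "success",
    2: "usage error",
    10: "configuration error",
    11: "authentication failed",
    12: "resource not found",
    13: "API error",
    14: "other SDK error",
}


def describe_exit(code: int) -> str:
    """Translate a CLI exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_cli(args: list[str]) -> tuple[int, str]:
    """Run `ai-factory` with args and return (exit code, stdout)."""
    proc = subprocess.run(["ai-factory", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout


# Example use (requires the CLI to be installed and credentials configured):
#   code, out = run_cli(["jobs", "get", "459381", "--json"])
#   if code != 0:
#       print(f"jobs get failed: {describe_exit(code)}")
print(describe_exit(12))  # resource not found
```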
The CLI shares the credential resolution chain with AIFactoryClient:
explicit env vars take precedence over ~/.ai-factory/config.yaml.
There is no ai-factory jobs wait subcommand yet. Poll with jobs get --json
in a script until .status is one of completed / errored / cancelled,
or use the Python client.wait() method directly.
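A polling loop along these lines might look like the sketch below. It assumes that jobs get --json prints a JSON object with a top-level status field and that the terminal values are spelled exactly as listed above; poll_until_done is a hypothetical helper, and the "pending" and "running" values in the demo are placeholders, not documented states.

```python
import json
import time
from typing import Callable

TERMINAL_STATES = {"completed", "errored", "cancelled"}


def poll_until_done(
    fetch_json: Callable[[], str],
    interval: float = 10.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_json until the job's status is terminal or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = json.loads(fetch_json())
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout}s")
        sleep(interval)


# Demo with canned responses instead of a real CLI call.
responses = iter([
    '{"status": "pending"}',
    '{"status": "running"}',
    '{"status": "completed", "exit_code": 0}',
])
result = poll_until_done(lambda: next(responses), interval=0, sleep=lambda s: None)
print(result["status"])  # completed
```

In real use, fetch_json could wrap subprocess.run(["ai-factory", "jobs", "get", "459381", "--json"], capture_output=True, text=True, check=True).stdout.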
API Reference
Clients
| Class | Description |
|---|---|
| AIFactoryClient | Synchronous client (context manager) |
| AsyncAIFactoryClient | Asynchronous client (async context manager) |
Methods
| Method | Description |
|---|---|
| submit_job(request) | Submit a Slurm job script |
| submit_container(request) | Submit a containerised job |
| get_job(job_id) | Get job details by ID |
| list_jobs(...) | List jobs with optional filters and pagination |
| cancel_job(job_id) | Cancel a running or pending job |
| wait(job_id, ...) | Poll until the job reaches a terminal state |
Request Models
| Model | Fields |
|---|---|
| JobRequest | name, script, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error |
| ContainerJobRequest | name, image, container_command, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error |
Response Models
| Model | Fields |
|---|---|
| SubmitJobResponse | job_id, output_dir, logs_url |
| JobDetail | job_id, name, status, partition, nodes, exit_code, duration, start_time, end_time, submit_time, working_directory, standard_output, standard_error, gres, output_dir, logs_url |
| JobListItem | job_id, name, status, duration, start_time, end_time |
| JobList | jobs, total, limit, offset |
| CancelJobResponse | message |
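The jobs, total, limit, and offset fields of JobList suggest offset-based pagination. The sketch below walks all pages under that assumption. The exact parameters of list_jobs are not shown in this README, so the page fetcher is passed in as a callable; iter_all_jobs and fake_fetch are hypothetical names, and pages are modelled as plain dicts rather than JobList objects.

```python
from typing import Callable, Iterator


def iter_all_jobs(
    fetch_page: Callable[[int, int], dict], page_size: int = 50
) -> Iterator[dict]:
    """Yield every job by requesting pages until `total` has been reached."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        jobs = page["jobs"]
        yield from jobs
        offset += len(jobs)
        # Stop on an empty page too, in case `total` changes between calls.
        if not jobs or offset >= page["total"]:
            return


# In-memory stand-in for the API: seven jobs served three at a time.
ALL_JOBS = [{"job_id": i} for i in range(1, 8)]


def fake_fetch(limit: int, offset: int) -> dict:
    return {
        "jobs": ALL_JOBS[offset:offset + limit],
        "total": len(ALL_JOBS),
        "limit": limit,
        "offset": offset,
    }


ids = [job["job_id"] for job in iter_all_jobs(fake_fetch, page_size=3)]
print(ids)  # [1, 2, 3, 4, 5, 6, 7]
```

With the real client, fetch_page would wrap client.list_jobs with whatever pagination arguments it accepts, and the loop would read attributes instead of dict keys.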
Exceptions
| Exception | When |
|---|---|
| SDKError | Base for all SDK errors |
| APIError | Non-2xx HTTP response |
| AuthError | 401 or 403 response |
| NotFoundError | 404 response |
| WaitTimeoutError | wait() exceeded its deadline |
| ConfigFileError | ~/.ai-factory/config.yaml unreadable or malformed |
Requirements
License