AI Factory SDK

Python SDK for the AI Factory Compute API — submit and manage HPC jobs from Python.

Features

  • Synchronous and asynchronous clients (AIFactoryClient, AsyncAIFactoryClient)
  • Typed request/response models with Pydantic validation
  • Job polling with configurable timeout and retry (client.wait())
  • Automatic retry on transient errors (429, 5xx)
  • PEP 561 compatible — full type annotation coverage
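To make the "automatic retry on transient errors" feature concrete, here is a minimal sketch of retry with exponential backoff. It is not the SDK's implementation: `call_with_retry`, `is_transient`, the `send` callable, and the attempt and delay defaults are all hypothetical names and values chosen for illustration. Only the 429/5xx rule comes from the feature list.

```python
import random
import time
from typing import Callable, Tuple


def is_transient(status: int) -> bool:
    """Return True for the status codes the feature list calls transient."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-transient status or attempts run out.

    send is a hypothetical callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        status, body = send()
        # Stop on success, on a non-retryable error, or on the last attempt.
        if not is_transient(status) or attempt == max_attempts:
            return status, body
        # Exponential backoff (0.5s, 1s, 2s, ...) plus up to 10% jitter.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay / 10))
        attempt += 1
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays.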

Installation

pip install ai-factory-sdk

Or with uv:

uv add ai-factory-sdk

Pre-release versions

Development builds published from the dev branch use PEP 440 pre-release suffixes (e.g., 0.2.0.dev1). Install them with:

pip install ai-factory-sdk --pre

Quick Start

from ai_factory.sdk import AIFactoryClient, JobRequest

# Credentials resolve from ~/.ai-factory/config.yaml, env vars, or constructor
# args (see "Configuration" below). Passed explicitly here for clarity:
with AIFactoryClient(token="...", slurm_user="jane") as client:
    # Submit a job
    resp = client.submit_job(
        JobRequest(name="hello", script="#!/bin/bash\necho Hello from SLURM")
    )
    print(f"Submitted job {resp.job_id}")

    # Wait for completion
    if resp.job_id is not None:
        detail = client.wait(str(resp.job_id), timeout=3600)
        print(f"Job finished with status: {detail.status}")

Async Usage

import asyncio
from ai_factory.sdk import AsyncAIFactoryClient, JobRequest

async def main():
    async with AsyncAIFactoryClient(token="...", slurm_user="jane") as client:
        resp = await client.submit_job(
            JobRequest(name="async-job", script="#!/bin/bash\nsleep 10 && echo done")
        )
        if resp.job_id is not None:
            detail = await client.wait(str(resp.job_id))
            print(detail.status)

asyncio.run(main())

Container Jobs

from ai_factory.sdk import AIFactoryClient, ContainerJobRequest

with AIFactoryClient(token="...", slurm_user="jane") as client:
    resp = client.submit_container(
        ContainerJobRequest(
            name="gpu-training",
            image="docker://nvcr.io/nvidia/pytorch:24.01-py3",
            container_command="python train.py",
            gres="gpu:a40:1",
            time_limit=120,
        )
    )

Configuration

Credentials resolve from three sources, in priority order:

  1. Explicit constructor arguments: Client(token=..., slurm_user=...).
  2. Environment variables.
  3. YAML config file at ~/.ai-factory/config.yaml.

If a required value is missing from all three sources, Client() raises ValueError with a message listing all three options.

Parameter    Environment Variable    Config File Key   Default
base_url     AI_FACTORY_API_URL      api_url           https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1
token        AI_FACTORY_API_KEY      api_key           (required)
slurm_user   AI_FACTORY_SLURM_USER   slurm_user        (required)
timeout      -                       -                 30.0 (HTTP timeout in seconds)
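The priority order described above (constructor arguments, then environment variables, then the config file) can be sketched as follows. This is an illustration under stated assumptions, not the SDK's code: `resolve_settings` and `SOURCES` are hypothetical names, the config file is represented as an already-parsed dict, and the error message wording is invented. The variable names, keys, and default URL are taken from the table.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1"

# (parameter, environment variable, config file key), as listed in the table.
SOURCES = [
    ("base_url", "AI_FACTORY_API_URL", "api_url"),
    ("token", "AI_FACTORY_API_KEY", "api_key"),
    ("slurm_user", "AI_FACTORY_SLURM_USER", "slurm_user"),
]


def resolve_settings(
    explicit: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
    file_config: Optional[Mapping[str, str]] = None,
) -> dict:
    """Resolve each setting from explicit args, then env vars, then the file."""
    env = os.environ if env is None else env
    file_config = {} if file_config is None else file_config
    resolved = {}
    for param, env_var, file_key in SOURCES:
        # The first non-empty source wins.
        value = explicit.get(param) or env.get(env_var) or file_config.get(file_key)
        if not value and param == "base_url":
            value = DEFAULT_BASE_URL  # only base_url has a default
        if not value:
            raise ValueError(
                f"{param} is required: pass {param}=..., set {env_var}, "
                f"or add '{file_key}' to ~/.ai-factory/config.yaml"
            )
        resolved[param] = value
    return resolved
```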

Config file

Example ~/.ai-factory/config.yaml:

api_url: "https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1"
api_key: "eyJhbGciOiJSUzI1NiIs..."
slurm_user: "jane.doe"

Secure the file so only your user can read it:

chmod 600 ~/.ai-factory/config.yaml

When a client is constructed, the SDK emits a UserWarning if the file is group- or world-accessible. A malformed or unreadable file raises ConfigFileError (a subclass of SDKError).
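For readers who want to check the file themselves, the sketch below shows one way to detect a group- or world-accessible file on a POSIX system. The function names are hypothetical and the SDK's own check may be implemented differently. Only the warning category (UserWarning) and the chmod 600 recommendation come from the text above.

```python
import stat
import warnings
from pathlib import Path


def is_group_or_world_accessible(path: Path) -> bool:
    """Return True if any group or other permission bit is set (POSIX only)."""
    mode = path.stat().st_mode
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def warn_if_exposed(path: Path) -> bool:
    """Emit a UserWarning for an over-permissive file; return whether it warned."""
    if path.exists() and is_group_or_world_accessible(path):
        warnings.warn(
            f"{path} is accessible by group or others; run: chmod 600 {path}",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling `warn_if_exposed(Path.home() / ".ai-factory" / "config.yaml")` returns False without warning when the file is missing or has mode 600.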

API Reference

Clients

Class                  Description
AIFactoryClient        Synchronous client (context manager)
AsyncAIFactoryClient   Asynchronous client (async context manager)

Methods

Method                      Description
submit_job(request)         Submit a Slurm job script
submit_container(request)   Submit a containerised job
get_job(job_id)             Get job details by ID
list_jobs(...)              List jobs with optional filters and pagination
cancel_job(job_id)          Cancel a running or pending job
wait(job_id, ...)           Poll until the job reaches a terminal state
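The polling behaviour of wait() can be pictured with the sketch below. It is a generic poll-until-terminal loop, not the SDK's implementation: `poll_until_done`, `PollTimeout` (a stand-in for WaitTimeoutError), the `get_status` callable, the 5-second interval, and the set of terminal states are all assumptions made for illustration.

```python
import time
from typing import Callable

# Assumed terminal Slurm states; the SDK's actual list may differ.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


class PollTimeout(Exception):
    """Stand-in for the SDK's WaitTimeoutError."""


def poll_until_done(
    get_status: Callable[[], str],
    timeout: float = 3600.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() every `interval` seconds until a terminal state."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Give up once the deadline has passed without a terminal state.
        if clock() >= deadline:
            raise PollTimeout(f"job still {status} after {timeout} seconds")
        sleep(interval)
```

With the real client, `get_status` could be something like `lambda: client.get_job(job_id).status`, assuming the status field holds one of the strings above.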

Request Models

Model                 Fields
JobRequest            name, script, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error
ContainerJobRequest   name, image, container_command, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error

Response Models

Model               Fields
SubmitJobResponse   job_id, output_dir, logs_url
JobDetail           job_id, name, status, partition, nodes, exit_code, duration, start_time, end_time, submit_time, working_directory, standard_output, standard_error, gres, output_dir, logs_url
JobListItem         job_id, name, status, duration, start_time, end_time
JobList             jobs, total, limit, offset
CancelJobResponse   message
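The jobs, total, limit, and offset fields of JobList suggest offset-based pagination. The sketch below walks all pages using a generic `fetch_page(limit, offset)` callable that returns a dict shaped like JobList. Both `iter_all_jobs` and `fetch_page` are hypothetical, and whether list_jobs() actually accepts limit and offset arguments is an assumption that the method table does not confirm.

```python
from typing import Callable, Iterator


def iter_all_jobs(
    fetch_page: Callable[[int, int], dict],
    limit: int = 50,
) -> Iterator[dict]:
    """Yield every job by requesting pages until `total` is reached."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        jobs = page["jobs"]
        yield from jobs
        offset += len(jobs)
        # Stop on an empty page (defensive) or once all items were seen.
        if not jobs or offset >= page["total"]:
            return
```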

Exceptions

Exception          When
SDKError           Base for all SDK errors
APIError           Non-2xx HTTP response
AuthError          401 or 403 response
NotFoundError      404 response
WaitTimeoutError   wait() exceeded its deadline
ConfigFileError    ~/.ai-factory/config.yaml unreadable or malformed

Requirements

License

MIT
