AI Factory SDK

Python SDK for the AI Factory Compute API — submit and manage HPC jobs from Python.

Features

  • Synchronous and asynchronous clients (AIFactoryClient, AsyncAIFactoryClient)
  • Typed request/response models with Pydantic validation
  • Job polling with configurable timeout and retry (client.wait())
  • Automatic retry on transient errors (429, 5xx)
  • PEP 561 compatible — full type annotation coverage
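
The automatic retry on 429 and 5xx responses can be pictured with a small, generic sketch. This is not the SDK's internal code: HTTPStatusError, is_transient, and call_with_retry are hypothetical names, and the attempt count and backoff delays are assumptions chosen for illustration.

```python
import random
import time


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP failure carrying a status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_transient(status_code):
    # 429 (rate limited) and any 5xx response are treated as retryable.
    return status_code == 429 or 500 <= status_code < 600


def call_with_retry(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except HTTPStatusError as exc:
            # Give up on non-transient codes or when attempts are exhausted.
            if not is_transient(exc.status_code) or attempt == max_attempts:
                raise
            # Delay doubles on each attempt, plus up to 10% random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


# Demo: a call that fails twice with 503 and then succeeds.
attempts = []


def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise HTTPStatusError(503)
    return "ok"


result = call_with_retry(flaky_call, sleep=lambda seconds: None)
```

Passing sleep as a parameter keeps the sketch easy to exercise without real delays. A 4xx code other than 429 is re-raised immediately, since repeating the same request would not help.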

Installation

pip install ai-factory-sdk

Or with uv:

uv add ai-factory-sdk

Pre-release versions

Development builds published from the dev branch use PEP 440 developmental-release suffixes (e.g., 0.2.0.dev1), which pip treats as pre-releases and skips by default. Install them with:

pip install ai-factory-sdk --pre

Quick Start

from ai_factory.sdk import AIFactoryClient, JobRequest

# Credentials resolve from ~/.ai-factory/config.yaml, env vars, or constructor
# args (see "Configuration" below). Passed explicitly here for clarity:
with AIFactoryClient(token="...", slurm_user="jane") as client:
    # Submit a job
    resp = client.submit_job(
        JobRequest(name="hello", script="#!/bin/bash\necho Hello from SLURM")
    )
    print(f"Submitted job {resp.job_id}")

    # Wait for completion
    if resp.job_id is not None:
        detail = client.wait(str(resp.job_id), timeout=3600)
        print(f"Job finished with status: {detail.status}")

Async Usage

import asyncio
from ai_factory.sdk import AsyncAIFactoryClient, JobRequest

async def main():
    async with AsyncAIFactoryClient(token="...", slurm_user="jane") as client:
        resp = await client.submit_job(
            JobRequest(name="async-job", script="#!/bin/bash\nsleep 10 && echo done")
        )
        if resp.job_id is not None:
            detail = await client.wait(str(resp.job_id))
            print(detail.status)

asyncio.run(main())

Container Jobs

from ai_factory.sdk import AIFactoryClient, ContainerJobRequest

with AIFactoryClient(token="...", slurm_user="jane") as client:
    resp = client.submit_container(
        ContainerJobRequest(
            name="gpu-training",
            image="docker://nvcr.io/nvidia/pytorch:24.01-py3",
            container_command="python train.py",
            gres="gpu:a40:1",
            time_limit=120,
        )
    )
    print(f"Submitted container job {resp.job_id}")

Configuration

Credentials resolve from three sources, in priority order:

  1. Explicit constructor arguments: Client(token=..., slurm_user=...).
  2. Environment variables.
  3. YAML config file at ~/.ai-factory/config.yaml.

If a required value is missing from all three sources, the client constructor raises ValueError with a message listing all three options.

Parameter   Environment Variable   Config File Key  Default
base_url    AI_FACTORY_API_URL     api_url          https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1
token       AI_FACTORY_API_KEY     api_key          (required)
slurm_user  AI_FACTORY_SLURM_USER  slurm_user       (required)
timeout     (none)                 (none)           30.0 (HTTP timeout in seconds)
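
The priority order can be sketched as a small lookup function. This is a minimal illustration, not the SDK's actual resolver: resolve_setting and its parameters are hypothetical, and a plain dict stands in for the parsed YAML file.

```python
import os

DEFAULT_BASE_URL = "https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1"


def resolve_setting(name, explicit, env_var, file_config, file_key,
                    default=None, env=None):
    """Return the first value found: argument, then environment, then file."""
    env = os.environ if env is None else env
    if explicit is not None:
        return explicit
    if env.get(env_var):
        return env[env_var]
    if file_key in file_config:
        return file_config[file_key]
    if default is not None:
        return default
    raise ValueError(
        f"Missing {name!r}: pass {name}=..., set {env_var}, "
        f"or add {file_key!r} to ~/.ai-factory/config.yaml"
    )


# Stand-ins for the process environment and the parsed config file.
demo_env = {"AI_FACTORY_SLURM_USER": "jane.env"}
demo_file = {"api_key": "token-from-file", "slurm_user": "jane.doe"}

token = resolve_setting("token", None, "AI_FACTORY_API_KEY",
                        demo_file, "api_key", env=demo_env)
user = resolve_setting("slurm_user", None, "AI_FACTORY_SLURM_USER",
                       demo_file, "slurm_user", env=demo_env)
base_url = resolve_setting("base_url", None, "AI_FACTORY_API_URL",
                           demo_file, "api_url",
                           default=DEFAULT_BASE_URL, env=demo_env)
```

Here token comes from the file, user comes from the environment because it outranks the file, and base_url falls back to the documented default.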

Config file

Example ~/.ai-factory/config.yaml:

api_url: "https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1"
api_key: "eyJhbGciOiJSUzI1NiIs..."
slurm_user: "jane.doe"

Secure the file so only your user can read it:

chmod 600 ~/.ai-factory/config.yaml

The SDK emits a UserWarning when a client is constructed if the file is group- or world-accessible. A malformed or unreadable file raises ConfigFileError (a subclass of SDKError).
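
A permission check of this kind can be written with the standard library alone. The sketch below shows the general technique on POSIX systems and is not the SDK's own implementation; warn_if_too_permissive is a hypothetical helper, and the demo uses a temporary file instead of the real config path.

```python
import os
import stat
import tempfile
import warnings


def warn_if_too_permissive(path):
    """Emit a UserWarning if group or others have any access to path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        warnings.warn(
            f"{path} has mode {mode:o}; consider 'chmod 600 {path}'",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Demo on a throwaway file: mode 644 triggers the warning, mode 600 does not.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

os.chmod(demo_path, 0o644)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loose = warn_if_too_permissive(demo_path)

os.chmod(demo_path, 0o600)
tight = warn_if_too_permissive(demo_path)
os.remove(demo_path)
```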

API Reference

Clients

Class                 Description
AIFactoryClient       Synchronous client (context manager)
AsyncAIFactoryClient  Asynchronous client (async context manager)

Methods

Method                     Description
submit_job(request)        Submit a Slurm job script
submit_container(request)  Submit a containerised job
get_job(job_id)            Get job details by ID
list_jobs(...)             List jobs with optional filters and pagination
cancel_job(job_id)         Cancel a running or pending job
wait(job_id, ...)          Poll until the job reaches a terminal state
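
To illustrate what polling until a terminal state involves, here is a generic loop. It is a sketch under assumptions, not the SDK's wait() implementation: the state names are borrowed from Slurm and may differ from what the API returns, PollTimeout is a local stand-in for WaitTimeoutError, and the interval parameter is hypothetical.

```python
import time

# Assumed terminal states, borrowed from Slurm's job state names.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


class PollTimeout(Exception):
    """Local stand-in for the SDK's WaitTimeoutError."""


def poll_until_done(get_status, timeout=3600.0, interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal state."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Stop if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise PollTimeout(f"job still {status} after {timeout}s")
        sleep(interval)


# Demo: a fake job that is pending, then running, then completed.
fake_states = iter(["PENDING", "RUNNING", "COMPLETED"])
final_status = poll_until_done(lambda: next(fake_states),
                               sleep=lambda seconds: None)
```

Using time.monotonic for the deadline keeps the loop unaffected by system clock adjustments.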

Request Models

Model                Fields
JobRequest           name, script, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error
ContainerJobRequest  name, image, container_command, partition, tasks, cpus_per_task, time_limit, gres, standard_output, standard_error

Response Models

Model              Fields
SubmitJobResponse  job_id, output_dir, logs_url
JobDetail          job_id, name, status, partition, nodes, exit_code, duration, start_time, end_time, submit_time, working_directory, standard_output, standard_error, gres, output_dir, logs_url
JobListItem        job_id, name, status, duration, start_time, end_time
JobList            jobs, total, limit, offset
CancelJobResponse  message
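
The jobs, total, limit, and offset fields on JobList suggest offset-based pagination. The sketch below walks all pages of such a result. It runs against a fake backend: Page is a local stand-in for JobList, iter_all_jobs is a hypothetical helper, and whether list_jobs accepts limit and offset keyword arguments is an assumption not confirmed here.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Local stand-in mirroring the JobList fields."""
    jobs: list = field(default_factory=list)
    total: int = 0
    limit: int = 0
    offset: int = 0


def iter_all_jobs(fetch_page, limit=50):
    """Yield every job by requesting pages until total is reached."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page.jobs
        offset += len(page.jobs)
        # An empty page guards against a total that never gets reached.
        if not page.jobs or offset >= page.total:
            break


# Fake backend holding seven jobs, served three per page.
ALL_JOBS = [f"job-{n}" for n in range(7)]


def fake_fetch(limit, offset):
    return Page(jobs=ALL_JOBS[offset:offset + limit],
                total=len(ALL_JOBS), limit=limit, offset=offset)


collected = list(iter_all_jobs(fake_fetch, limit=3))
```

With a real client, fetch_page would wrap the list call, provided it accepts equivalent parameters.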

Exceptions

Exception         When
SDKError          Base for all SDK errors
APIError          Non-2xx HTTP response
AuthError         401 or 403 response
NotFoundError     404 response
WaitTimeoutError  wait() exceeded its deadline
ConfigFileError   ~/.ai-factory/config.yaml unreadable or malformed
ConfigFileError ~/.ai-factory/config.yaml unreadable or malformed

Requirements

License

MIT
