
MLOps Python SDK for XCloud Service API


SDK

Software Development Kits for integrating with the XCloud Service API.

Note (SDK Support): SDKs provide type-safe, high-level interfaces for interacting with the platform API. They handle authentication, error handling, and request retries automatically.

Available SDKs

Python SDK

Installation

Install the Python SDK from PyPI with pip:

pip install mlops-python-sdk

Configuration

The SDK reads configuration from environment variables by default:

  • MLOPS_API_KEY: API key (required)
  • MLOPS_DOMAIN: API domain, e.g. localhost:8090 or https://example.com
  • MLOPS_API_PATH: API path prefix (default: /api/v1)
  • MLOPS_DEBUG: true|false (default: false)

Or configure in code:

from mlops import ConnectionConfig, Task

config = ConnectionConfig(
    api_key="xck_...",
    domain="https://example.com",
    api_path="/api/v1",
    debug=False,
)
task = Task(config=config)
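
For reference, the environment-variable defaults listed above can be sketched with the standard library alone. This is not the SDK's own loader: the `Settings` class, the `load_settings` helper, the requirement that MLOPS_DOMAIN be set, the `http://` scheme added to a bare host, and the way the domain and path are joined are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container; the SDK's real type is ConnectionConfig."""
    api_key: str
    base_url: str
    debug: bool


def load_settings(env=None) -> Settings:
    """Resolve MLOPS_* variables, applying the documented defaults."""
    env = os.environ if env is None else env

    api_key = env.get("MLOPS_API_KEY", "")
    if not api_key:
        raise ValueError("MLOPS_API_KEY is required")

    # Assumption: the domain has no default and must be provided.
    domain = env.get("MLOPS_DOMAIN", "")
    if not domain:
        raise ValueError("MLOPS_DOMAIN is not set")
    # Assumption: a bare host such as localhost:8090 is served over plain HTTP.
    if "://" not in domain:
        domain = "http://" + domain

    api_path = env.get("MLOPS_API_PATH", "/api/v1")
    debug = env.get("MLOPS_DEBUG", "false").strip().lower() == "true"

    # Join domain and path with exactly one slash between them.
    base_url = domain.rstrip("/") + "/" + api_path.strip("/")
    return Settings(api_key=api_key, base_url=base_url, debug=debug)
```

Passing a plain dict instead of `os.environ` makes the resolution easy to check in isolation, for example `load_settings({"MLOPS_API_KEY": "xck_...", "MLOPS_DOMAIN": "localhost:8090"})`.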

Usage

from mlops import Task
from mlops.api.client.models.task_status import TaskStatus
from pathlib import Path

# Initialize Task client (uses environment variables by default)
task = Task()

# Submit a GPU task
try:
    result = task.submit(
        name="gpu-task-from-sdk",
        image="/mnt/minio/images/01ai-registry.cn-shanghai.cr.aliyuncs.com+public+llamafactory+0.9.3.sqsh",
        entry_command="llamafactory-cli train /workspace/config/test_lora.yaml",
        resources={
            "partition": "gpu",
            "nodes": 2,
            "ntasks": 2,
            "cpus_per_task": 2,
            "memory": "4G",
            "time": "01:00:00",
            "gres": "gpu:nvidia_a10:1",
            "qos": "qos_xcloud",
        },
        cluster_name="slurm-cn",
        team_id=1,
        file_path="your file path",  # optional; supports .zip, .tar.gz, .tgz
    )

    if result is not None:
        print("==== gpu task submitted successfully ====")
        job_id = result.job_id
    else:
        print("==== gpu task submission failed ====")
except Exception as e:
    print("==== gpu task submission raised an error ====", e)

# Submit a CPU task, reading the entry command from a local script
try:
    entry_content = Path("entry.sh").read_text(encoding="utf-8")
    result = task.submit(
        name="cpu-task-from-sdk",
        image="docker://01ai-registry.cn-shanghai.cr.aliyuncs.com/01-ai/xcs/v2/alpine:3.23.0",
        entry_command=entry_content,
        resources={
            "partition": "cpu",
            "nodes": 1,
            "ntasks": 1,
            "cpus_per_task": 1,
            "memory": "1G",
            "time": "01:00:00",
            "qos": "qos_xcloud",
        },
        cluster_name="slurm-cn",
        team_id=1,
    )

    if result is not None:
        print("==== cpu task submitted successfully ====")
        job_id = result.job_id
    else:
        print("==== cpu task submission failed ====")
except Exception as e:
    print("==== cpu task submission raised an error ====", e)

# List tasks with filters
try:
    completed_tasks = task.list(
        status=TaskStatus.COMPLETED,
        cluster_name="slurm-cn",
        page=1,
        page_size=20
    )

    # Get task details
    if completed_tasks is not None and len(completed_tasks.tasks) > 0:
        print("==== completed_tasks number ====", len(completed_tasks.tasks))
        task_info = task.get(task_id=completed_tasks.tasks[0].job_id, cluster_name="slurm-cn")
        print("==== task_info ====", task_info)
    else:
        print("==== no completed tasks to get details ====")
except Exception as e:
    print("==== getting task details failed ====", e)


# Cancel a running task
try:
    running_tasks = task.list(
        status=TaskStatus.RUNNING,
        cluster_name="slurm-cn",
        page=1,
        page_size=20
    )
    if running_tasks is not None and len(running_tasks.tasks) > 0:
        print("==== running_tasks number ====", len(running_tasks.tasks))
        # Cancel a task
        result = task.cancel(task_id=running_tasks.tasks[0].job_id, cluster_name="slurm-cn")
        print("==== task cancelled ====", running_tasks.tasks[0].job_id, result)
    else:
        print("==== no running tasks to cancel ====")
except Exception as e:
    print("==== cancelling running task failed ====", e)


# Delete a task
try:
    completed_tasks = task.list(
        status=TaskStatus.COMPLETED,
        cluster_name="slurm-cn",
        page=1,
        page_size=20
    )
    if completed_tasks is not None and len(completed_tasks.tasks) > 0:
        print("==== completed_tasks number ====", len(completed_tasks.tasks))
        # Delete a task
        result = task.delete(task_id=completed_tasks.tasks[0].job_id, cluster_name="slurm-cn")
        print("==== task deleted ====", completed_tasks.tasks[0].job_id, result)
    else:
        print("==== no completed tasks to delete ====")
except Exception as e:
    print("==== deleting completed task failed ====", e)

Task Management Methods:

  • submit() - Submit a new task with container image and entry command
  • get() - Get task details by task ID
  • list() - List tasks with optional filters (status, cluster_name, team_id, user_id)
  • cancel() - Cancel a running task
  • delete() - Delete a task record
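
Since list() takes page and page_size, collecting every matching task means walking the pages. The sketch below shows one way to do that. It assumes, as in the usage example, that the result exposes a `.tasks` sequence, and it further assumes that a page shorter than `page_size` marks the end; the source does not document how the last page is signalled. `iter_tasks` and `fake_list` are hypothetical names, and `fake_list` only stands in for `task.list` so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterator


def iter_tasks(list_page: Callable[..., Any], page_size: int = 20, **filters: Any) -> Iterator[Any]:
    """Yield tasks from successive pages until a short or empty page is returned."""
    page = 1
    while True:
        result = list_page(page=page, page_size=page_size, **filters)
        tasks = [] if result is None else list(result.tasks)
        yield from tasks
        # Assumption: a page shorter than page_size means there are no more pages.
        if len(tasks) < page_size:
            return
        page += 1


def fake_list(page: int, page_size: int, **filters: Any) -> SimpleNamespace:
    """Stand-in for task.list: serves 45 fake tasks in pages."""
    job_ids = list(range(1, 46))
    start = (page - 1) * page_size
    chunk = job_ids[start:start + page_size]
    return SimpleNamespace(tasks=[SimpleNamespace(job_id=j) for j in chunk])


all_ids = [t.job_id for t in iter_tasks(fake_list, page_size=20, cluster_name="slurm-cn")]
print(len(all_ids))  # 45 (pages of 20, 20 and 5)
```

With the real client, `task.list` would be passed as `list_page`, along with whichever filters apply.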

Task Status Values:

from mlops.api.client.models.task_status import TaskStatus

TaskStatus.PENDING      # Task is pending
TaskStatus.QUEUED       # Task is queued
TaskStatus.RUNNING      # Task is running
TaskStatus.COMPLETED    # Task completed successfully
TaskStatus.SUCCEEDED    # Task succeeded
TaskStatus.FAILED       # Task failed
TaskStatus.CANCELLED    # Task was cancelled
TaskStatus.CREATED      # Task was created
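
A common use of these values is waiting for a task to finish. The sketch below polls until a final state is reached. The `Status` enum is a stand-in that mirrors the names above (its string values are assumed), and treating COMPLETED, SUCCEEDED, FAILED and CANCELLED as final is also an assumption. How the status is read from the result of get() is not shown in the source, so the caller supplies that as a function.

```python
import time
from enum import Enum
from typing import Callable


class Status(str, Enum):
    """Stand-in mirroring the TaskStatus names; the string values are assumed."""
    PENDING = "pending"
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"
    CREATED = "created"


# Assumption: these four states are final; the others may still change.
TERMINAL = {Status.COMPLETED, Status.SUCCEEDED, Status.FAILED, Status.CANCELLED}


def wait_for_terminal(
    get_status: Callable[[], Status],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Status:
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status.value} after {timeout} seconds")
        sleep(interval)
```

The `sleep` parameter exists so the loop can be exercised without real delays; in normal use the default `time.sleep` is enough.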

Error Handling:

from mlops.exceptions import (
    APIException,
    AuthenticationException,
    NotFoundException,
    RateLimitException,
    TimeoutException,
    InvalidArgumentException,
    NotEnoughSpaceException
)

# `task` is the Task client created in the Usage section.
try:
    result = task.submit(name="test", cluster_name="slurm-cn", entry_command="echo hello")
except AuthenticationException as e:
    print(f"Authentication failed: {e}")
except NotFoundException as e:
    print(f"Resource not found: {e}")
except APIException as e:
    print(f"API error: {e}")

Tip (Error Handling): SDKs automatically handle common errors and retry failed requests. Check SDK documentation for error handling best practices.
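
Because the SDK already retries internally, an outer retry loop is only useful when an application wants its own policy on top. The sketch below shows one such policy: exponential backoff on rate-limit and timeout errors only. The exception classes are local stand-ins named after those in `mlops.exceptions`; their inheritance from `APIException` is an assumption, and `call_with_retry` is a hypothetical helper, not part of the SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class APIException(Exception):
    """Stand-in for mlops.exceptions.APIException."""


class RateLimitException(APIException):
    """Stand-in; subclassing APIException is an assumption."""


class TimeoutException(APIException):
    """Stand-in; subclassing APIException is an assumption."""


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying transient errors with delays of base_delay * 2**n."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (RateLimitException, TimeoutException):
            # Give up after the last attempt and let the caller see the error.
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Errors that a retry cannot fix, such as authentication or invalid-argument failures, pass straight through to the caller.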

Features

  • Type-safe API clients
  • Automatic authentication
  • Error handling
  • Request retry logic
  • Response validation



