Manage GPU sessions on the Clouditia platform
Clouditia Manager SDK
Python SDK for managing remote GPU sessions on the Clouditia platform through the Computing API (sk_compute_ keys).
Installation
pip install clouditia-manager
# With S3 support (for saving outputs); the quotes keep shells such as zsh from expanding the brackets
pip install "clouditia-manager[s3]"
1. Configuration
from clouditia_manager import GPUManager
# Default configuration (URL: https://clouditia.com/jobs)
manager = GPUManager(api_key="sk_compute_xxxxx")
# Advanced configuration
manager = GPUManager(
    api_key="sk_compute_xxxxx",
    base_url="https://clouditia.com/jobs",  # API URL (default)
    timeout=120                             # Timeout in seconds (default: 60)
)
# Your API key is verified automatically
print(f"User: {manager.user['username']}")
print(f"Email: {manager.user['email']}")
Where do I find my API key? Go to clouditia.com/manage/api-keys/ to create an sk_compute_ key.
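Hard-coding the key in a script makes it easy to leak. A minimal sketch of reading it from the environment instead is shown here; the variable name CLOUDITIA_API_KEY and the helper load_api_key are assumptions made for this example and are not part of the SDK.

```python
import os

KEY_PREFIX = "sk_compute_"  # prefix of Computing API keys, as described above


def load_api_key(env=os.environ, var_name="CLOUDITIA_API_KEY"):
    """Return the Computing API key stored in `var_name`, or raise ValueError.

    CLOUDITIA_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise ValueError(f"{var_name} is not set")
    if not key.startswith(KEY_PREFIX):
        raise ValueError(f"{var_name} does not look like a {KEY_PREFIX} key")
    return key
```

With the variable exported in your shell, the manager could then be created with GPUManager(api_key=load_api_key()).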
2. Check available GPUs
Before launching a session, check the real-time GPU inventory:
inventory = manager.get_inventory()
if not inventory:
    print("No GPU currently available")
else:
    for gpu in inventory:
        print(f"{gpu.gpu_name} ({gpu.price_per_hour}EUR/h) : "
              f"{gpu.available} available, {gpu.on_demand} on-demand || "
              f"datacenter : {gpu.datacenter}, "
              f"[datacenter_id : {gpu.datacenter_id}]")
Example output:
NVIDIA RTX 3090 (1.0EUR/h) : 1 available, 0 on-demand || datacenter : France-Poissy, [datacenter_id : e7aabe3c-...]
NVIDIA RTX 3060 Ti (0.5EUR/h) : 1 available, 0 on-demand || datacenter : France-Poissy, [datacenter_id : e7aabe3c-...]
NVIDIA RTX 3060 Ti (0.5EUR/h) : 0 available, 4 on-demand || datacenter : France-vo8, [datacenter_id : 487754eb-...]
Understanding the statuses:
- available: GPUs on powered-on machines, ready immediately
- on_demand: GPUs on powered-off machines, startup takes about 2-5 minutes
- in_use: GPUs used by active sessions
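The statuses above can drive a simple selection rule. The sketch below picks the cheapest entry that can serve a session and, at equal price, prefers one that is available immediately over one that is on-demand. pick_gpu is a hypothetical helper, and the sample entries are stand-ins shaped like the example output, not real API results.

```python
from types import SimpleNamespace


def pick_gpu(inventory, allow_on_demand=True):
    """Return the cheapest inventory entry that can serve a session, or None.

    At equal price, entries with `available` GPUs win over `on_demand` ones
    because they start immediately.
    """
    candidates = [
        g for g in inventory
        if g.available > 0 or (allow_on_demand and g.on_demand > 0)
    ]
    if not candidates:
        return None
    # Sort key: price first, then False (available now) before True (on-demand only).
    return min(candidates, key=lambda g: (g.price_per_hour, g.available == 0))


# Stand-in data shaped like the example output (not real API output).
sample = [
    SimpleNamespace(gpu_type="nvidia-rtx-3090", price_per_hour=1.0, available=1, on_demand=0),
    SimpleNamespace(gpu_type="nvidia-rtx-3060-ti", price_per_hour=0.5, available=0, on_demand=4),
    SimpleNamespace(gpu_type="nvidia-rtx-3060-ti", price_per_hour=0.5, available=1, on_demand=0),
]
best = pick_gpu(sample)
print(best.gpu_type, best.available)
```

The gpu_type of the selected entry is the slug that a session request expects.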
Filter by datacenter
Use the datacenter_id (UUID) to restrict the inventory to a single datacenter:
inventory = manager.get_inventory(datacenter_id="487754eb-a676-4502-a0f4-21a88e52c25a")
for gpu in inventory:
    print(f"{gpu.gpu_name}: {gpu.available} available, {gpu.on_demand} on-demand")
List datacenters
datacenters = manager.list_datacenters()
for dc in datacenters:
    print(f"{dc.name} (datacenter_id={dc.datacenter_id}) , GPUs: {dc.gpu_count}")
# Example output:
# France-Poissy (datacenter_id=e7aabe3c-...) , GPUs: 2
# France-vo8 (datacenter_id=487754eb-...) , GPUs: 4
3. Create a GPU session
Once you have chosen a GPU, launch a session. By default the SDK waits until the session is ready:
session = manager.create_session(
    gpu_type="nvidia-rtx-3090",  # GPU type (slug from the inventory)
    vcpu=2,                      # Number of vCPUs
    ram=4,                       # RAM in GB
    storage=20                   # Storage in GB
)
print(f"Session ready: {session.name}")
print(f"VS Code URL: {session.url}")
print(f"Password: {session.password}")
Target a datacenter
Use the datacenter_id to launch the session in a specific datacenter:
session = manager.create_session(
    gpu_type="nvidia-rtx-3060-ti",
    vcpu=2, ram=4, storage=20,
    datacenter_id="487754eb-a676-4502-a0f4-21a88e52c25a"  # France-vo8
)
Progress tracking
The SDK automatically monitors each step of session creation (power on, deployment, etc.) and prints progress in real time. You do not need to set a timeout: the SDK detects actual errors and returns them immediately:
session = manager.create_session(
    gpu_type="nvidia-rtx-3090",
    vcpu=4, ram=16, storage=20,
    wait_ready=True,  # Wait until the session is ready (default: True)
    verbose=True      # Print the steps in real time (default: True)
)
# Typical output (on-demand node):
# Waiting for session f091ef1c to be ready...
# [powering_on] Powering on node...
# [waiting_nodes] Waiting for nodes...
# [deploying] Deploying GPU session...
# [waiting_ready] Waiting for pod ready...
#
# SESSION READY
# ...
# Silent mode (no waiting, no messages)
session = manager.create_session(
    gpu_type="nvidia-rtx-3090",
    vcpu=2, ram=4, storage=20,
    wait_ready=False,
    verbose=False
)
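With wait_ready=False the call returns before the session is usable, so the caller has to check readiness later. One possible way to do that is to poll get_session and look at the ready and status attributes of the returned session. wait_until_ready below is a hypothetical helper written for this sketch, and its polling interval and maximum wait are arbitrary choices, not SDK defaults.

```python
import time


def wait_until_ready(manager, session_id, poll_seconds=5.0, max_wait_seconds=600.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll manager.get_session(session_id) until its `ready` attribute is true.

    Raises RuntimeError if the session reports status "failed" and
    TimeoutError if `max_wait_seconds` elapses first.
    """
    deadline = clock() + max_wait_seconds
    while True:
        session = manager.get_session(session_id)
        if session.ready:
            return session
        if session.status == "failed":
            raise RuntimeError(f"session {session_id} failed to start")
        if clock() >= deadline:
            raise TimeoutError(f"session {session_id} not ready after {max_wait_seconds}s")
        sleep(poll_seconds)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting; in normal use they keep their defaults.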
4. Stop a session
# Standard stop (waits for the pod to be deleted)
result = manager.stop_session(session.short_id)  # e.g. "0e4c713a"
print(f"Session stopped: {result.name}")
# Silent mode
result = manager.stop_session(session.short_id, wait_stopped=False, verbose=False)
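Because a running session keeps accruing cost, it can help to guarantee that stop_session is called even when the work in between raises. A minimal sketch using a context manager follows. gpu_session is a hypothetical wrapper, not part of the SDK, and it assumes create_session returns a session object (it does not handle a queued request).

```python
from contextlib import contextmanager


@contextmanager
def gpu_session(manager, **create_kwargs):
    """Create a session, yield it, and always stop it afterwards."""
    session = manager.create_session(**create_kwargs)
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions, so the session is not left running.
        manager.stop_session(session.short_id)
```

It would be used as `with gpu_session(manager, gpu_type="nvidia-rtx-3090", vcpu=2, ram=4, storage=20) as session:` followed by the work to run.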
5. Run code in a session
To run code in an active session, generate an sk_live_ key and use the clouditia SDK.
Step 1: Install the execution SDK
pip install clouditia
Step 2: Generate an execution key
# Generate an sk_live_ key tied to the session
sdk_key = manager.generate_sdk_key(
    session_id=session.short_id,  # e.g. "0e4c713a"
    name="My Execution Key"
)
print(f"Key: {sdk_key}")  # sk_live_xxxxx...
Step 3: Run shell commands
from clouditia import GPUSession
session_live_gpu = GPUSession(api_key=sdk_key)
# Install system packages
result = session_live_gpu.shell("sudo apt update && sudo apt install -y ffmpeg")
print(result)
# Check the installation
result = session_live_gpu.shell("ffmpeg -version")
print(result)
# Install Python packages
result = session_live_gpu.shell("pip install transformers accelerate")
print(result)
# Check the installation
result = session_live_gpu.shell("python3 -c 'import transformers; print(transformers.__version__)'")
print(result)
# Run a script from the workspace
result = session_live_gpu.shell("cd /home/coder/workspace && python3 train.py")
print(result)
Step 4: Run Python code
# Run Python code directly (via GPUSession.run)
result = session_live_gpu.run("""
import torch
print(f'CUDA: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
a = torch.randn(1000, 1000, device='cuda')
b = torch.randn(1000, 1000, device='cuda')
c = torch.matmul(a, b)
print(f'Result: {c.shape}')
""")
print(result)
Alternative: Lambda GPU (without managing a session)
For a quick run without creating a session manually, use lambda_gpu() (see section 10).
6. Manage your sessions
List sessions
# All sessions
sessions = manager.list_sessions()
# Filter by status
running = manager.list_sessions(status="running")
stopped = manager.list_sessions(status="stopped")
for session in sessions:
    print(f"{session.name} ({session.short_id}): {session.status} - {session.gpu_type}")
Get the details of a session
session = manager.get_session("0e4c713a")
print(f"Name: {session.name}")
print(f"Status: {session.status}")
print(f"GPU: {session.gpu_type}")
print(f"URL: {session.url}")
print(f"Password: {session.password}")
Rename a session
session = manager.rename_session("0e4c713a", "mon-projet-ml-v1")
print(f"New name: {session.name}")
Cost and duration of a session
cost_info = manager.get_session_cost("0e4c713a")
print(f"Current cost: {cost_info['cost']} EUR")
print(f"Hourly rate: {cost_info['hourly_rate']} EUR/h")
print(f"Duration: {cost_info['duration_display']}")
Check the credit balance
balance = manager.get_balance()
print(f"Balance: {balance['balance']} {balance['currency']}")
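The balance and the hourly rate together give a rough idea of how long a session can keep running. The arithmetic is sketched below with illustrative numbers. hours_of_credit is a hypothetical helper, and it assumes both values are numeric and expressed in the same currency.

```python
def hours_of_credit(balance, hourly_rate):
    """Return how many hours `balance` covers at `hourly_rate` (same currency)."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    # A negative balance covers no time at all.
    return max(balance, 0.0) / hourly_rate


# Illustrative numbers, not real account data.
print(f"{hours_of_credit(12.5, 0.5):.1f} h")
```

If the API returns these values as strings, convert them first, for example hours_of_credit(float(balance['balance']), float(cost_info['hourly_rate'])).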
7. Multi-GPU sessions
Create a session with several GPUs, optionally of different types:
session = manager.create_session(
    gpus=[
        {'type': 'nvidia-rtx-3090', 'count': 1},
        {'type': 'nvidia-rtx-3060-ti', 'count': 1}
    ],
    vcpu=4,
    ram=16,
    storage=20
)
print(f"GPU Count: {session.gpu_count}")  # 2
print(f"GPUs: {session.gpus}")
GPU availability (allow_partial, auto_add_gpus)
If some requested GPUs are not available:
# Default: raises an error if any GPU is unavailable
session = manager.create_session(
    gpus=[
        {'type': 'nvidia-rtx-3090', 'count': 1},
        {'type': 'nvidia-rtx-4090', 'count': 1}  # If unavailable -> error
    ],
    vcpu=4, ram=16, storage=20
)
# InsufficientResourcesError: Some GPUs unavailable: nvidia-rtx-4090.
# Use allow_partial=True to create with available GPUs only.
# allow_partial=True: create immediately with available GPUs only
session = manager.create_session(
    gpus=[
        {'type': 'nvidia-rtx-3090', 'count': 1},
        {'type': 'nvidia-rtx-4090', 'count': 1}
    ],
    vcpu=4, ram=16, storage=20,
    allow_partial=True  # Starts with 3090 only, no error
)
# auto_add_gpus=True: start with available GPUs, then automatically
# add missing GPUs when they become available (checks every 30s)
session = manager.create_session(
    gpus=[
        {'type': 'nvidia-rtx-3090', 'count': 1},
        {'type': 'nvidia-rtx-4090', 'count': 1}
    ],
    vcpu=4, ram=16, storage=20,
    allow_partial=True,
    auto_add_gpus=True  # Background thread watches for nvidia-rtx-4090
)
# Output:
# Session created: f091ef1c
# ...
# SESSION READY (1 GPU)
# Auto-add enabled: watching for nvidia-rtx-4090
# ...
# GPU nvidia-rtx-4090 added to session f091ef1c
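Before sending a multi-GPU request, it can be useful to compare it with the inventory and see which types would fall short. The sketch below does that client-side. missing_gpus is a hypothetical helper, and treating on_demand GPUs as usable capacity is an assumption; the platform's own availability check may differ.

```python
from collections import Counter
from types import SimpleNamespace


def missing_gpus(requested, inventory, count_on_demand=True):
    """Return {gpu_type: shortfall} for requested GPUs the inventory cannot cover."""
    capacity = Counter()
    for entry in inventory:
        capacity[entry.gpu_type] += entry.available
        if count_on_demand:
            capacity[entry.gpu_type] += entry.on_demand
    wanted = Counter()
    for item in requested:
        wanted[item["type"]] += item["count"]
    # Keep only the types where the request exceeds the capacity.
    return {t: n - capacity[t] for t, n in wanted.items() if n > capacity[t]}


# Stand-in inventory (not real API output).
inventory = [SimpleNamespace(gpu_type="nvidia-rtx-3090", available=1, on_demand=0)]
requested = [
    {"type": "nvidia-rtx-3090", "count": 1},
    {"type": "nvidia-rtx-4090", "count": 1},
]
print(missing_gpus(requested, inventory))
```

An empty result suggests the full request can be served; a non-empty one is a hint to pass allow_partial=True or to queue the request.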
8. Automatic limits (auto-stop)
Set limits to stop a session automatically:
# Cost limit: auto-stop at 5 EUR
session = manager.create_session(
    gpu_type="nvidia-rtx-3090",
    vcpu=4, ram=16,
    cost_limit=5.0
)
# Duration limit: auto-stop after 2 hours
session = manager.create_session(
    gpu_type="nvidia-rtx-3090",
    vcpu=4, ram=16,
    duration_limit=7200  # seconds
)
# Both: the session stops as soon as either limit is reached
session = manager.create_session(
    gpu_type="nvidia-rtx-3090",
    vcpu=4, ram=16,
    cost_limit=10.0,
    duration_limit=3600
)
print(f"Auto-stop enabled: {session.auto_stop_enabled}")
print(f"Cost limit: {session.cost_limit} EUR")
print(f"Duration limit: {session.duration_limit} seconds")
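When both limits are set, which one triggers first depends on the hourly rate. The calculation is sketched below under the assumption that cost accrues linearly at the hourly rate. seconds_until_auto_stop is a hypothetical helper, not an SDK function.

```python
def seconds_until_auto_stop(hourly_rate, cost_limit=None, duration_limit=None):
    """Estimate seconds until the first limit is reached, or None if no limit is set.

    Assumes cost accrues linearly at `hourly_rate` (EUR/h).
    """
    limits = []
    if cost_limit is not None:
        if hourly_rate <= 0:
            raise ValueError("hourly_rate must be positive")
        # Time needed to spend cost_limit at the given rate, in seconds.
        limits.append(cost_limit / hourly_rate * 3600)
    if duration_limit is not None:
        limits.append(float(duration_limit))
    return min(limits) if limits else None


# 1.0 EUR/h (the RTX 3090 price in the example inventory), both limits set.
print(seconds_until_auto_stop(1.0, cost_limit=10.0, duration_limit=3600))
```

In this example the duration limit is reached long before the cost limit, so the 10 EUR cap acts only as a safety net.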
9. Queue
If the GPUs are not immediately available, place your request in the queue:
result = manager.create_session(
    gpu_type="nvidia-rtx-4090",
    vcpu=4, ram=16, storage=20,
    queue_if_unavailable=True
)
# A queued request comes back as a dict; otherwise the result is the session object
if isinstance(result, dict):
    if result.get('queued'):
        print(f"Queued! Position: #{result['position']}")
        print(f"Queue ID: {result['queue_id']}")
else:
    print(f"Session created: {result.name}")
Manage the queue
# List queued jobs
queue_jobs = manager.list_queue_jobs()
for job in queue_jobs:
    print(f"Position #{job.position}: {job.status_display}")
# Details of a job, with its history
result = manager.get_queue_job("a1b2c3d4", verbose=True)
# Cancel a job
manager.cancel_queue_job("a1b2c3d4")
10. Lambda GPU (serverless execution)
Run Python code directly on a remote GPU, without managing a session:
result = manager.lambda_gpu(
    script="""
import torch
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    a = torch.randn(5000, 5000, device='cuda')
    b = torch.randn(5000, 5000, device='cuda')
    c = torch.matmul(a, b)
    print(f"5000x5000 matmul OK, shape: {c.shape}")
""",
    gpu_type="nvidia-rtx-3060-ti",
    vcpu=2,
    ram=4,
    storage=20
)
print(f"Exit code: {result.exit_code}")
print(f"Output: {result.stdout}")
print(f"Cost: {result.cost} EUR")
print(f"Duration: {result.duration_seconds}s")
Lambda with a custom Docker environment
result = manager.lambda_gpu(
    script="import torch; print(torch.cuda.is_available())",
    gpu_type="nvidia-rtx-3090",
    environment_id="3a07d1e9-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    vcpu=8, ram=32, storage=100
)
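A remote script can fail without lambda_gpu itself raising, so it is worth checking the result before using its output. A small sketch relying only on the exit_code, stdout and stderr fields shown for the result object follows. require_success is a hypothetical helper introduced for this example.

```python
def require_success(result):
    """Return result.stdout, or raise RuntimeError carrying stderr on failure."""
    if result.exit_code != 0:
        detail = (result.stderr or "").strip() or "no stderr captured"
        raise RuntimeError(f"lambda run failed (exit code {result.exit_code}): {detail}")
    return result.stdout
```

Calling require_success(result) right after lambda_gpu turns a non-zero exit code into an exception that includes the remote error message.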
run_and_stop() (full session with S3 upload)
Creates a session, runs a script, uploads the results to S3, then stops the session:
result = manager.run_and_stop(
    script="python train.py --epochs 10",
    gpu_type="nvidia-rtx-3090",
    input_files=["train.py", "data.csv"],
    output_files=["model.pt", "logs/"],
    s3_bucket="my-bucket",
    s3_prefix="training-results/run-001/"
)
print(f"Success: {result.success}")
print(f"Cost: {result.cost} EUR")
11. S3 connection and saving outputs
Configure S3
s3 = manager.s3_connect(
    bucket="my-bucket",
    access_key="AWS_ACCESS_KEY_ID",
    secret_key="AWS_SECRET_ACCESS_KEY",
    endpoint="https://s3.amazonaws.com",  # Optional (default: AWS S3)
    region="us-east-1",                   # Optional
    prefix="lambda-outputs/"              # Optional
)
# Compatible with MinIO, Wasabi, DigitalOcean Spaces, etc.
Save results
import torch
# Save a PyTorch model
model = torch.nn.Linear(784, 10)
manager.lambda_output("model.pt", model.state_dict(), s3=s3)
# Save JSON metrics
manager.lambda_output("metrics.json", {"accuracy": 0.95, "loss": 0.05}, s3=s3)
# Upload an existing file
manager.lambda_output_file("/tmp/checkpoint.pt", s3=s3)
Supported formats: .pt/.pth (PyTorch), .npy/.npz (NumPy), .json, .pkl, bytes, strings.
12. Sessions resumed from a custom environment
When you resume a saved session, the workspace must be downloaded again from S3. The SDK handles this automatically:
session = manager.create_session(
    gpu_type="nvidia-rtx-3090",
    environment_id="3a07d1e9-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    vcpu=8, ram=32
)
# Displays a progress bar:
# Workspace restore [========............] 28.4% 3.12/10.98 GB ETA 124s
# Check the progress manually
session = manager.get_session("0e4c713a")
print(f"Ready: {session.ready}")
if session.workspace_sync and session.workspace_sync.get("in_progress"):
    print(f"Progress: {session.workspace_sync['pct']:.0f}%")
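When checking the progress manually, the pct value from workspace_sync can be rendered as a text bar for logs. The function below is only a sketch of one way to do that; render_progress is a hypothetical helper, and its linear fill is not necessarily how the SDK draws its own bar.

```python
def render_progress(pct, width=20):
    """Render `pct` (clamped to 0-100) as a fixed-width text bar with a percentage."""
    pct = min(max(float(pct), 0.0), 100.0)
    # Number of filled cells, rounded down.
    filled = int(width * pct / 100)
    return f"[{'=' * filled}{'.' * (width - filled)}] {pct:5.1f}%"


print(render_progress(28.4))
```

It could be called as render_progress(session.workspace_sync['pct']) inside a polling loop.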
13. Notifications and costs
Send an email
manager.send_email(
    subject="Training Complete",
    message="Accuracy: 95%, Model saved to S3"
)
Generate an SDK key (sk_live_)
sdk_key = manager.generate_sdk_key("0e4c713a", name="My SDK key")
print(f"SDK key: {sdk_key}")
# Use it with the clouditia SDK
from clouditia import GPUSession
gpu = GPUSession(api_key=sdk_key)
result = gpu.run("print('Hello GPU!')")
Cost of several sessions
# Specific sessions
costs = manager.get_sessions_cost(["0e4c713a", "f0b09214"])
print(f"Total cost: {costs['total_cost']} EUR")
# All active sessions
active_costs = manager.get_active_sessions_cost()
print(f"Active sessions: {active_costs['session_count']}")
print(f"Total cost: {active_costs['total_cost']} EUR")
Error handling
from clouditia_manager import (
    GPUManager,
    AuthenticationError,
    SessionNotFoundError,
    InsufficientResourcesError,
    APIError
)
try:
    manager = GPUManager(api_key="sk_compute_xxxxx")
    session = manager.create_session(gpu_type="nvidia-rtx-4090")
except AuthenticationError:
    print("Invalid API key")
except InsufficientResourcesError:
    print("No GPU available")
except SessionNotFoundError:
    print("Session not found")
except APIError as e:
    print(f"API error: {e}")
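Some of these errors are worth retrying (a GPU may free up a minute later) while others, such as an invalid key, are not. A generic retry sketch with exponential backoff follows. The retry helper and its default delays are illustrative choices, not SDK behaviour, and the exception class to retry on is passed in by the caller.

```python
import time


def retry(operation, retry_on, attempts=3, base_delay=30.0, sleep=time.sleep):
    """Call operation() up to `attempts` times, retrying only on `retry_on`.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised unchanged; other exceptions propagate at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example, retry(lambda: manager.create_session(gpu_type="nvidia-rtx-4090"), InsufficientResourcesError) would make up to three attempts, waiting 30 s and then 60 s between them.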
API reference
| Method | Description |
|---|---|
| GPUManager(api_key, base_url, timeout) | Initializes the SDK |
| Inventory | |
| get_inventory(datacenter_id) | Real-time GPU availability (available, on_demand, in_use) per datacenter |
| list_datacenters() | Lists the available datacenters |
| Sessions | |
| create_session(gpu_type, gpus, vcpu, ram, storage, datacenter_id, allow_partial, auto_add_gpus, ...) | Creates a GPU session |
| stop_session(session_id, ...) | Stops a session |
| get_session(session_id) | Details of a session |
| list_sessions(status) | Lists sessions |
| rename_session(session_id, new_name) | Renames a session |
| generate_sdk_key(session_id, name) | Generates an sk_live_ key for running code in a session |
| Lambda GPU | |
| lambda_gpu(script, gpu_type, ...) | Serverless execution on a GPU |
| run_and_stop(script, gpu_type, ...) | Full session with S3 upload |
| Queue | |
| list_queue_jobs(status) | Lists queued jobs |
| get_queue_job(queue_id, verbose) | Details of a queued job |
| cancel_queue_job(queue_id) | Cancels a queued job |
| S3 and outputs | |
| s3_connect(bucket, access_key, secret_key, ...) | S3 connection |
| lambda_output(filename, data, s3) | Saves an object to S3 |
| lambda_output_file(filepath, s3) | Uploads a file to S3 |
| Miscellaneous | |
| get_balance() | Credit balance |
| get_session_cost(session_id) | Cost of a session |
| get_session_duration(session_id) | Duration of a session |
| get_sessions_cost(session_ids) | Cost of several sessions |
| get_active_sessions_cost() | Cost of the active sessions |
| send_email(subject, message) | Sends a notification email |
Object attributes
GPUSession
| Attribute | Type | Description |
|---|---|---|
| id | str | Full UUID |
| short_id | str | Short ID (8 characters) |
| name | str | Session name |
| status | str | running, stopped, pending, failed |
| ready | bool | True if the session is fully usable |
| gpu_type | str | GPU type(s) |
| gpu_count | int | Total number of GPUs |
| gpus | list | List of GPU configs (multi-GPU) |
| vcpu | int | Number of vCPUs |
| ram | str | Allocated RAM |
| storage | str | Allocated storage |
| url | str | VS Code access URL |
| password | str | VS Code password |
| cost_limit | float | Cost limit (EUR) |
| duration_limit | int | Duration limit (seconds) |
| auto_stop_enabled | bool | Auto-stop enabled |
| estimated_ready_in_seconds | int | Estimated time until ready |
| workspace_sync | dict | Progress of the workspace restore |
GPUInventory
| Attribute | Type | Description |
|---|---|---|
| gpu_type | str | GPU slug (e.g. nvidia-rtx-3090) |
| gpu_name | str | Full name (e.g. NVIDIA RTX 3090) |
| available | int | GPUs ready immediately (nodes online) |
| on_demand | int | GPUs that can be started in about 2-5 min (nodes offline) |
| in_use | int | GPUs used by active sessions |
| total | int | available + on_demand + in_use |
| datacenter | str | Datacenter name |
| datacenter_code | str | Datacenter code |
| datacenter_id | str | Datacenter UUID (for filtering) |
| cluster_name | str | Kubernetes cluster name |
| price_per_hour | float | Price per hour (EUR) |
Datacenter
| Attribute | Type | Description |
|---|---|---|
| datacenter_id | str | Datacenter UUID |
| name | str | Datacenter name |
| is_primary | bool | Primary datacenter |
| gpu_count | int | Total number of GPUs |
LambdaResult
| Attribute | Type | Description |
|---|---|---|
| success | bool | True if exit_code == 0 |
| exit_code | int | Exit code |
| stdout | str | Standard output |
| stderr | str | Standard error |
| duration_seconds | float | Total duration |
| cost | float | Cost (EUR) |
| session_id | str | ID of the session used |
| output_files | list | Downloaded files |
| error | str | Error message |
QueueJob
| Attribute | Type | Description |
|---|---|---|
| queue_id | str | Job UUID |
| position | int | Position in the queue |
| status | str | pending, processing, completed, failed, cancelled |
| status_display | str | Status label |
| gpu_config | dict | Requested GPU configuration |
| attempt_count | int | Number of attempts |
| last_attempt_at | datetime | Last attempt |
| created_at | datetime | Creation date |
| created_session_id | str | ID of the created session (on success) |
License
MIT License
File details
Details for the file clouditia_manager-1.15.2.tar.gz.
File metadata
- Download URL: clouditia_manager-1.15.2.tar.gz
- Upload date:
- Size: 36.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1add0dae2dd86904e62165ce33f4fcae1b13fd666fee0a95997e666e9fca85fd |
| MD5 | 7f1691cecc1918ea463ad2e4a158b363 |
| BLAKE2b-256 | 968d81335d2ebc3db25919835a3391226f4f618811ca89715ec64b34123f1fa0 |
File details
Details for the file clouditia_manager-1.15.2-py3-none-any.whl.
File metadata
- Download URL: clouditia_manager-1.15.2-py3-none-any.whl
- Upload date:
- Size: 27.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 981a391ba83f52d0d6f104b1d7a7035da305947ad840f40c5c48eec6255ff2ef |
| MD5 | add229e10eebb86b9f7b7d4eb9cffa31 |
| BLAKE2b-256 | 29660a4c362195a5527e618d3cada2e31ef8af22911d710a522be1a971030f55 |