Hybrid GPU Inference Orchestrator — serverless for cold starts, spot for scale

Tuna

Spot GPUs are 3-5x cheaper than on-demand, but they take minutes to start and can be interrupted at any time. Serverless GPUs start in seconds and never get interrupted, but you pay a premium for that convenience. What if you didn't have to choose?

Tuna is a smart router that combines both behind a single OpenAI-compatible endpoint. It serves requests from serverless while spot instances boot up, shifts traffic to spot once ready, and falls back to serverless if spot gets preempted. You only pay for the compute you actually use — spot rates for steady traffic, serverless only during cold starts and failover.

Serverless: Modal, RunPod, Cloud Run, Baseten, Azure, Cerebrium

Spot: AWS, GCP

Note: Not all GPU types across all providers have been end-to-end tested yet. We are actively testing more combinations. If you run into issues with a specific GPU + provider pair, please open an issue.

Prerequisites

Note: By default, Tuna deploys both a serverless backend and a spot backend. Spot instances run on AWS via SkyPilot by default, so AWS credentials are required. To run spot instances on GCP instead, pass --spots-cloud gcp. To skip spot and the router entirely, pass --serverless-only; no spot cloud credentials are needed in that mode.

Quick Start

1. Install

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
pip install "tandemn-tuna[modal]"      # Modal as serverless provider
pip install "tandemn-tuna[cloudrun]"   # Cloud Run as serverless provider
pip install "tandemn-tuna[baseten]"    # Baseten as serverless provider
pip install "tandemn-tuna[azure]"      # Azure Container Apps as serverless provider
pip install "tandemn-tuna[cerebrium]"  # Cerebrium as serverless provider
pip install tandemn-tuna               # RunPod (no extra deps needed)
pip install "tandemn-tuna[all]"        # everything

This project is under active development and experimental. For the latest version, install from source:

git clone https://github.com/Tandemn-Labs/tandemn-tuna.git
cd tandemn-tuna
pip install -e ".[all]"

2. Set up spot GPU cloud (pick one)

AWS (default)

aws configure          # set up AWS credentials
sky check aws          # verify SkyPilot can see your AWS account

GCP

gcloud auth login
gcloud auth application-default login
gcloud config set project <YOUR_PROJECT_ID>
sky check gcp          # verify SkyPilot can see your GCP account

Note: GCP spot instances (preemptible VMs) require GPU quota in your project. Check your quota at https://console.cloud.google.com/iam-admin/quotas and search for "Preemptible" GPU quotas in your target region. GCP preemptible VMs have a 24-hour maximum lifetime; SkyPilot handles recovery automatically.

3. Set up your serverless provider (pick one)

Modal

modal token new

RunPod

export RUNPOD_API_KEY=<your-key>  # https://www.runpod.io/console/user/settings

Add this to your ~/.bashrc or ~/.zshrc to persist it.

Cold start note: RunPod does not currently support cold start optimization via the API. RunPod's "Cached Models" feature is console-only, and network volumes pin workers to a single datacenter, which reduces GPU availability and increases queue times. As a result, RunPod downloads model weights from HuggingFace on every cold start (~10 min for a 4B model). If cold start latency matters, consider Modal or Cerebrium, which cache weights automatically.

Cloud Run

Requires the gcloud CLI.

gcloud auth login
gcloud auth application-default login    # required for the Python SDK
gcloud config set project <YOUR_PROJECT_ID>

You also need billing enabled and the Cloud Run API (run.googleapis.com) enabled on your project.

GPU deployments: For reliable GPU deploys, set HF_TOKEN so model downloads aren't rate-limited by HuggingFace:

export HF_TOKEN=<your-token>  # https://huggingface.co/settings/tokens

Baseten

Step 1: Create account — sign up at https://app.baseten.co/signup/

Step 2: Get API key — go to Settings > API Keys (https://app.baseten.co/settings/api_keys), create a key, copy it immediately

Step 3: Set the API key

export BASETEN_API_KEY=<your-api-key>

Add to ~/.bashrc or ~/.zshrc to persist.

Step 4: Install and authenticate the Truss CLI

pip install --upgrade truss
truss login --api-key "$BASETEN_API_KEY"

Step 5: (For gated models) Add HuggingFace token — go to Settings > Secrets (https://app.baseten.co/settings/secrets), add a secret named hf_access_token with your HF token.

Azure Container Apps

Requires the Azure CLI.

Step 1: Install Azure CLI and log in

az login

Step 2: Register required resource providers

az provider register --namespace Microsoft.App
az provider register --namespace Microsoft.OperationalInsights

Registration can take a few minutes. Check status with az provider show --namespace Microsoft.App --query registrationState.

Step 3: Create a resource group (if you don't have one)

az group create --name tuna-rg --location eastus

Step 4: Set environment variables

export AZURE_SUBSCRIPTION_ID=$(az account show --query id -o tsv)
export AZURE_RESOURCE_GROUP=tuna-rg
export AZURE_REGION=eastus

Add to ~/.bashrc or ~/.zshrc to persist.

Step 5: Install the Azure SDK

pip install "tandemn-tuna[azure]"

Step 6: Verify setup

tuna check --provider azure

GPU availability: Azure Container Apps supports T4 ($0.26/hr) and A100 80GB ($1.90/hr) GPUs. GPU quota must be requested via the Azure portal — search "Quotas" and request Managed Environment Consumption T4 Gpus or Managed Environment Consumption NCA100 Gpus capacity for Container Apps in your region. Note: this is separate from VM-level (Compute) GPU quota.

Environment reuse: The first Azure deploy creates a Container Apps environment (~30 min). Subsequent deploys reuse it (~2 min). Environments are preserved on destroy — use --azure-cleanup-env to remove them. An idle environment with no running apps incurs no charges.

Cerebrium

Step 1: Create account — sign up at https://www.cerebrium.ai/ ($30 free credits on Hobby plan)

Step 2: Install the Cerebrium CLI

pip install "tandemn-tuna[cerebrium]"

Step 3: Create a service account token — go to Dashboard > API Keys > Create Service Account > Copy the token

Step 4: Set the API key

export CEREBRIUM_API_KEY=<your-service-account-token>

Add to ~/.bashrc or ~/.zshrc to persist.

Step 5: Set your project context

The service account token contains your project ID, but the CLI needs it set explicitly:

# List your projects to find the ID
cerebrium projects list

# Set the project context
cerebrium project set <your-project-id>

Note: Your project ID (e.g. p-ad42316a) can be found in the Cerebrium dashboard URL or by running cerebrium projects list. This step is required — without it, deploys will fail with "no project configured".

For CI/CD / headless environments: Set CEREBRIUM_API_KEY and run cerebrium project set before deploying. No cerebrium login needed.

Step 6: Verify setup

tuna check --provider cerebrium

GPU availability: Hobby plan ($0/mo) gives access to T4, A10, L4, L40S. A100 and H100 require the Enterprise plan.

4. (Optional) Set HuggingFace token for gated models

export HF_TOKEN=<your-token>  # https://huggingface.co/settings/tokens

Required for models like Llama, Mistral, Gemma, and other gated models. Not needed for open models like Qwen.

5. Validate your setup

tuna check --provider modal                          # check Modal credentials
tuna check --provider runpod                         # check RunPod API key
tuna check --provider cloudrun --gcp-project <id> --gcp-region us-central1  # check Cloud Run
tuna check --provider baseten                        # check Baseten API key + truss CLI
tuna check --provider azure                          # check Azure CLI + SDK + resource providers
tuna check --provider cerebrium                      # check Cerebrium API key + CLI
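
Before running tuna check, a quick local pre-check can confirm that the environment variables named in the setup steps above are exported. The sketch below is illustrative only and is not part of Tuna: the REQUIRED_ENV mapping and the missing_env helper are hypothetical names, and the mapping simply collects the variables this README mentions. Modal and Cloud Run authenticate through their own CLIs (modal token new, gcloud auth), so they have no entry here.

```python
import os

# Hypothetical mapping, assembled from the export lines in the setup steps above.
REQUIRED_ENV = {
    "runpod": ["RUNPOD_API_KEY"],
    "baseten": ["BASETEN_API_KEY"],
    "azure": ["AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_REGION"],
    "cerebrium": ["CEREBRIUM_API_KEY"],
}


def missing_env(provider, environ=None):
    """Return the variables for `provider` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV.get(provider, []) if not env.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_ENV:
        missing = missing_env(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

This only checks that the variables exist; tuna check remains the authoritative validation of credentials and API access.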

6. Deploy a model

tuna deploy --model Qwen/Qwen3-0.6B --gpu L4 --service-name my-first-deploy

Tuna auto-selects the cheapest serverless provider for your GPU, launches spot instances on AWS, and gives you a single endpoint. The router handles everything — serverless covers traffic immediately while spot boots up in the background.

# Deploy with GCP spot instances instead of AWS
tuna deploy --model Qwen/Qwen3-0.6B --gpu T4 --spots-cloud gcp --service-name my-gcp-deploy

6a. (Alternative) Deploy serverless-only

Skip spot + router for dev/test or low-traffic workloads:

tuna deploy --model Qwen/Qwen3-0.6B --gpu L4 --serverless-only

Returns the provider's direct endpoint. No AWS credentials needed.

7. Send requests (OpenAI-compatible)

curl http://<router-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}]}'
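
The same request can be sent from Python using only the standard library. This is a minimal sketch mirroring the curl call above; build_chat_request and send_chat_request are hypothetical helper names, and the router address is a placeholder you must replace. If the deployment was not created with --public, an authentication header is probably required as well, which this sketch does not add.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder address: substitute the router IP reported by your deployment.
    req = build_chat_request("http://<router-ip>:8080", "Qwen/Qwen3-0.6B", "Hello")
    print(req.full_url)
    # print(send_chat_request(req))  # uncomment once the address is real
```

Because the endpoint is OpenAI-compatible, an OpenAI client library pointed at the router's /v1 base URL should also work.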

8. Monitor and manage

tuna status --service-name my-first-deploy    # check deployment status
tuna cost --service-name my-first-deploy      # real-time cost dashboard
tuna list                                     # list all deployments
tuna destroy --service-name my-first-deploy   # tear down a specific deployment
tuna destroy --all                            # tear down all active deployments

Tip: If you don't pass --service-name during deploy, Tuna auto-generates a name like tuna-a3f8c21b. Use tuna list to find it.

9. Browse GPU pricing

tuna show-gpus                                    # compare serverless pricing across providers
tuna show-gpus --spot                             # include AWS spot prices (default)
tuna show-gpus --spot --spots-cloud gcp           # include GCP spot prices
tuna show-gpus --gpu H100                         # detailed pricing for a specific GPU
tuna show-gpus --provider runpod                  # filter to one provider

Architecture

                ┌──────────────────────┐
                │    User Traffic      │
                │ (OpenAI-compatible)  │
                └──────────┬───────────┘
                           │
                  ┌────────▼────────┐
                  │  Smart Router   │
                  │  (meta_lb)      │
                  └────────┬────────┘
                           │
              ┌────────────┴────────────┐
              │                         │
     ┌────────▼──────────┐    ┌─────────▼──────────┐
     │ Serverless        │    │ Spot GPUs          │
     │ Modal / RunPod /  │    │ AWS / GCP via      │
     │ Cloud Run         │    │ SkyPilot           │
     │                   │    │ • 3-5x cheaper     │
     │ • Fast cold start │    │ • Slower cold start│
     │ • Per-second bill │    │ • Auto-failover    │
     │ • Always ready    │    │ • Scale to zero    │
     └───────────────────┘    └────────────────────┘

The router uses a 3-state machine (COLD → WARMING → READY):

  • COLD — spot is down, all traffic goes to serverless
  • WARMING — spot is booting, background pokes keep it alive, serverless handles requests
  • READY — spot is up, all traffic routes to spot (cheapest path)

Key behaviors:

  • Auto-enters WARMING on startup — spot starts provisioning immediately
  • Falls back to serverless if spot fails (zero data loss)
  • Scales spot to zero when traffic stops (min_replicas=0, downscale_delay=60s)
  • Cold starts from zero in ~5 min when traffic resumes
  • Streams responses token-by-token (no buffering)
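
The routing decision described above can be sketched as a small state machine. This is a toy model under stated assumptions, not the actual meta_lb implementation: the class, method, and event names are hypothetical, and the real router also handles health probing, streaming, and scale-to-zero timing that are omitted here.

```python
from enum import Enum


class SpotState(Enum):
    COLD = "cold"
    WARMING = "warming"
    READY = "ready"


class RouterSketch:
    """Toy model of the COLD -> WARMING -> READY routing decision."""

    def __init__(self):
        # Per the list above, the router enters WARMING on startup.
        self.state = SpotState.WARMING

    def on_spot_provisioning(self):
        # Spot has started booting (for example, after traffic resumes).
        self.state = SpotState.WARMING

    def on_spot_healthy(self):
        # Spot replicas are up and can take traffic.
        self.state = SpotState.READY

    def on_spot_failed(self):
        # Preemption or failure: drop back so serverless takes over.
        self.state = SpotState.COLD

    def pick_backend(self):
        # Only READY routes to spot; COLD and WARMING both use serverless.
        return "spot" if self.state is SpotState.READY else "serverless"


if __name__ == "__main__":
    router = RouterSketch()
    print(router.state.value, router.pick_backend())
    router.on_spot_healthy()
    print(router.state.value, router.pick_backend())
    router.on_spot_failed()
    print(router.state.value, router.pick_backend())
```

The key property is that serverless is the default in every state except READY, so a spot failure never leaves requests without a backend.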

CLI Reference

| Command | Description |
| --- | --- |
| deploy | Deploy a model across serverless + spot |
| destroy | Tear down a deployment (--service-name <name> or --all for all active) |
| status | Check deployment status |
| cost | Show cost dashboard (requires running deployment) |
| list | List all deployments (filter with --status active|destroyed|failed) |
| show-gpus | GPU pricing across providers (filter with --provider, --gpu, --spot) |
| check | Validate provider credentials and setup |
| benchmark cold-start | Measure cold start latency across providers |
| benchmark load-test | Load test with TTFT/ITL/throughput metrics + cost tracking |

deploy flags

| Flag | Default | Description |
| --- | --- | --- |
| --model | (required) | HuggingFace model ID (e.g. Qwen/Qwen3-0.6B) |
| --gpu | (required) | GPU type (e.g. L4, L40S, A100, H100) |
| --gpu-count | 1 | Number of GPUs |
| --serverless-provider | auto (cheapest for GPU) | modal, runpod, cloudrun, baseten, azure, or cerebrium |
| --spots-cloud | aws | Cloud provider for spot GPUs |
| --region | | Cloud region for spot instances |
| --tp-size | 1 | Tensor parallelism degree |
| --max-model-len | 4096 | Maximum sequence length (context window) |
| --concurrency | | Override serverless concurrency limit |
| --workers-max | | Max serverless workers (RunPod only) |
| --cold-start-mode | fast_boot | fast_boot (uses --enforce-eager, faster startup but lower throughput) or no_fast_boot |
| --no-scale-to-zero | off | Keep minimum 1 spot replica running |
| --scaling-policy | | Path to scaling YAML (see below) |
| --service-name | auto-generated | Custom service name (recommended; makes status/destroy easier) |
| --serverless-only | off | Serverless only (no spot, no router). No AWS needed. |
| --quantization | | Quantization method for vLLM (e.g. awq, gptq, fp8) |
| --public | off | Make service publicly accessible (no auth) |
| --use-different-vm-for-lb | off | Launch router on a separate VM instead of colocating on controller |
| --gcp-project | | Google Cloud project ID |
| --gcp-region | | Google Cloud region (e.g. us-central1) |
| --azure-subscription | | Azure subscription ID |
| --azure-resource-group | | Azure resource group name |
| --azure-region | | Azure region (e.g. eastus) |
| --azure-environment | | Name of existing Container Apps environment to reuse |

Use -v / --verbose with any command for debug logging.

Scaling Policy

All autoscaling parameters can be configured via a YAML file passed with --scaling-policy. If omitted, sane defaults apply.

spot:
  min_replicas: 0        # 0 = scale to zero (default)
  max_replicas: 5
  target_qps: 10         # per-replica QPS target
  upscale_delay: 5       # seconds before adding replicas
  downscale_delay: 60    # seconds before removing replicas (default: 60)

serverless:
  concurrency: 32        # max concurrent requests per container
  scaledown_window: 60   # seconds idle before scaling down
  timeout: 600           # request timeout in seconds
  workers_min: 0         # min workers (RunPod only)
  workers_max: 1         # max workers (RunPod only)
  scaler_value: 4        # queue delay scaler threshold (RunPod only)
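
To build intuition for how target_qps, min_replicas, and max_replicas interact, the sketch below computes a desired spot replica count from an observed request rate. The formula (ceiling of observed QPS over the per-replica target, clamped to the configured bounds) is an assumption about typical QPS-based autoscaling, not a statement of what Tuna or SkyPilot does internally; desired_replicas is a hypothetical name, and the upscale and downscale delays are ignored.

```python
import math


def desired_replicas(observed_qps, target_qps=10, min_replicas=0, max_replicas=5):
    """Assumed rule: ceil(qps / per-replica target), clamped to [min, max]."""
    if target_qps <= 0:
        raise ValueError("target_qps must be positive")
    wanted = math.ceil(observed_qps / target_qps)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for qps in (0, 4, 25, 1000):
        print(qps, "qps ->", desired_replicas(qps), "replicas")
```

With min_replicas set to 0, zero traffic yields zero replicas, which is the scale-to-zero case; raising min_replicas to 1 keeps one replica warm.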

Precedence: defaults <- YAML file <- CLI flags. For example, --concurrency 64 overrides serverless.concurrency from the YAML. --no-scale-to-zero forces spot.min_replicas to at least 1 and sets serverless.scaledown_window to 300s.

Unknown keys in the YAML raise an error immediately, which catches typos early.
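
The precedence and validation rules can be sketched as a layered merge. This is an illustration under assumptions, not Tuna's actual code: resolve_policy and DEFAULTS are hypothetical names, the policies are plain dicts rather than parsed YAML, and the DEFAULTS values are copied from the example YAML above purely for illustration (only min_replicas and downscale_delay are documented there as defaults).

```python
import copy

# Illustrative values copied from the example YAML; not confirmed defaults.
DEFAULTS = {
    "spot": {"min_replicas": 0, "max_replicas": 5, "target_qps": 10,
             "upscale_delay": 5, "downscale_delay": 60},
    "serverless": {"concurrency": 32, "scaledown_window": 60, "timeout": 600,
                   "workers_min": 0, "workers_max": 1, "scaler_value": 4},
}


def resolve_policy(yaml_policy, cli_overrides, no_scale_to_zero=False):
    """Merge defaults <- YAML <- CLI, rejecting unknown sections and keys."""
    policy = copy.deepcopy(DEFAULTS)
    # Later layers win: the YAML is applied first, then CLI overrides.
    for layer in (yaml_policy, cli_overrides):
        for section, values in layer.items():
            if section not in policy:
                raise ValueError(f"unknown section: {section}")
            for key, value in values.items():
                if key not in policy[section]:
                    raise ValueError(f"unknown key: {section}.{key}")
                policy[section][key] = value
    if no_scale_to_zero:
        # Mirrors the documented effect of --no-scale-to-zero.
        policy["spot"]["min_replicas"] = max(1, policy["spot"]["min_replicas"])
        policy["serverless"]["scaledown_window"] = 300
    return policy


if __name__ == "__main__":
    merged = resolve_policy({"serverless": {"concurrency": 16}},
                            {"serverless": {"concurrency": 64}})
    print(merged["serverless"]["concurrency"])
```

In this sketch a misspelled key such as spot.min_replica fails fast with a ValueError instead of being silently ignored.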

Benchmarking

Tuna includes built-in benchmarking for both cold starts and sustained load testing.

Cold start benchmark — measure time-to-inference across providers:

tuna benchmark cold-start --provider modal,cerebrium --gpu L4 --model Qwen/Qwen3-0.6B

Load test — measure TTFT, ITL, throughput, and cost savings using aiperf:

# Install aiperf (separate from tuna due to dependency conflict with truss)
uv pip install aiperf

# Run load test against a deployment
tuna benchmark load-test \
  --endpoint-url http://<router-ip>:8080 \
  --concurrency 30 \
  --duration 2h \
  --profile day-cycle \
  --model Qwen/Qwen3-0.6B \
  --api-key <key>

The day-cycle profile generates Poisson-distributed traffic with 3 zero-traffic gaps that trigger spot scale-to-zero and recovery. The benchmark reports both aiperf metrics (TTFT, ITL, token throughput) and tuna cost metrics (spot vs serverless split, savings vs alternatives).
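
To illustrate what Poisson-distributed traffic with zero-traffic gaps looks like, the sketch below generates arrival timestamps with exponential inter-arrival times and drops any that fall inside a gap window. It is a simplified model only: the function name, rate, and gap positions are hypothetical, and the actual day-cycle profile may shape traffic differently.

```python
import random


def poisson_arrivals(rate_per_s, duration_s, gaps, seed=0):
    """Arrival times (seconds) of a Poisson process that is silent inside `gaps`.

    `gaps` is a list of (start, end) windows. Discarding the arrivals of a
    constant-rate Poisson process inside those windows is equivalent to a
    process whose rate is zero there.
    """
    rng = random.Random(seed)
    arrivals = []
    t = 0.0
    while True:
        # Exponential inter-arrival times produce a Poisson process.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            break
        if not any(start <= t < end for start, end in gaps):
            arrivals.append(t)
    return arrivals


if __name__ == "__main__":
    # Hypothetical 15-minute run with three idle windows.
    idle = [(100.0, 200.0), (400.0, 500.0), (700.0, 800.0)]
    times = poisson_arrivals(rate_per_s=2.0, duration_s=900.0, gaps=idle, seed=42)
    print(len(times), "requests; first at", round(times[0], 3), "s")
```

Each idle window must be longer than downscale_delay for spot to scale to zero during it, which is what lets the benchmark exercise the recovery path.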

See benchmark docs for all options and profiles.

Troubleshooting

Setup issues

Start with the built-in diagnostic tool:

tuna check --provider runpod
tuna check --provider modal
tuna check --provider cloudrun --gcp-project <id> --gcp-region us-central1
tuna check --provider baseten
tuna check --provider azure

This validates credentials, API access, project configuration, and GPU region availability.

Endpoint not responding

# Check your deployment status
tuna status --service-name <name>

# Check router health directly
curl http://<router-ip>:8080/router/health

# Check SkyServe status
sky status --refresh

High latency

Check which backend is serving traffic:

curl http://<router-ip>:8080/router/health

Check spot_state: cold = spot is down (all traffic on serverless), warming = spot is booting (serverless handles requests), ready = spot is serving. Seeing cold or warming is expected behavior rather than a fault; traffic shifts to spot automatically once it is ready.
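
If you want to script this check, a small helper can translate the state into a readable message. The sketch assumes the health endpoint returns a JSON object with a top-level spot_state string; that response shape is an assumption, so inspect the real output and adjust the lookup. describe_spot_state is a hypothetical name.

```python
import json

SPOT_STATE_MEANING = {
    "cold": "spot is down; all traffic is on serverless",
    "warming": "spot is booting; serverless handles requests",
    "ready": "spot is serving traffic",
}


def describe_spot_state(health_body):
    """Map the spot_state field of a health response body to a description."""
    payload = json.loads(health_body)
    # Assumed shape: {"spot_state": "cold" | "warming" | "ready", ...}
    state = str(payload.get("spot_state", "")).lower()
    return SPOT_STATE_MEANING.get(state, f"unrecognized spot_state: {state!r}")


if __name__ == "__main__":
    print(describe_spot_state('{"spot_state": "warming"}'))
```

Feed it the body returned by the curl command above, for example by piping the output into a short script.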

Gated model fails to load

If the deployment succeeds but the model fails to start, you likely need a HuggingFace token:

export HF_TOKEN=<your-token>

Then redeploy.

Contact

License

MIT

This project depends on SkyPilot (Apache License 2.0).
