Hybrid GPU Inference Orchestrator — serverless for cold starts, spot for scale

Tuna

Spot GPUs are 3-5x cheaper than on-demand, but they take minutes to start and can be interrupted at any time. Serverless GPUs start in seconds and never get interrupted, but you pay a premium for that convenience. What if you didn't have to choose?

Tuna is a smart router that combines both behind a single OpenAI-compatible endpoint. It serves requests from serverless while spot instances boot up, shifts traffic to spot once ready, and falls back to serverless if spot gets preempted. You only pay for the compute you actually use — spot rates for steady traffic, serverless only during cold starts and failover.
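
As a rough illustration of the pricing argument above, the blended hourly cost can be estimated from the share of serving time that lands on serverless. This is only a sketch: the function name and the dollar rates are hypothetical placeholders, not real quotes from any provider.

```python
def blended_hourly_cost(spot_rate: float, serverless_rate: float,
                        serverless_fraction: float) -> float:
    """Average hourly cost when a fraction of serving time runs on serverless."""
    if not 0.0 <= serverless_fraction <= 1.0:
        raise ValueError("serverless_fraction must be between 0 and 1")
    return (serverless_fraction * serverless_rate
            + (1.0 - serverless_fraction) * spot_rate)


# Hypothetical rates for illustration only: spot at $1.00/h, serverless at $4.00/h,
# with serverless covering 10% of serving time (cold starts and failover).
cost = blended_hourly_cost(spot_rate=1.00, serverless_rate=4.00,
                           serverless_fraction=0.10)
print(f"${cost:.2f}/h")
```

With these placeholder numbers the blend stays close to the spot rate, which is the effect the router aims for.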

Serverless: Modal, RunPod, Cloud Run, Baseten

Spot: AWS via SkyPilot

Prerequisites

Note: Tuna always deploys both a serverless backend and a spot backend. Spot instances run on AWS via SkyPilot, so AWS credentials are required whichever serverless provider you choose (Modal, RunPod, Cloud Run, or Baseten).

Quick Start

1. Install

pip install "tandemn-tuna[modal]" --pre     # Modal as serverless provider
pip install "tandemn-tuna[cloudrun]" --pre  # Cloud Run as serverless provider
pip install "tandemn-tuna[baseten]" --pre   # Baseten as serverless provider
pip install tandemn-tuna --pre              # RunPod (no extra deps needed)
pip install "tandemn-tuna[all]" --pre       # everything

This project is under active development and experimental. For the latest version, install from source:

git clone https://github.com/Tandemn-Labs/tandemn-tuna.git
cd tandemn-tuna
pip install -e ".[all]"

2. Set up AWS (required for all deployments)

aws configure          # set up AWS credentials
sky check              # verify SkyPilot can see your AWS account

3. Set up your serverless provider (pick one)

Modal

modal token new

RunPod

export RUNPOD_API_KEY=<your-key>  # https://www.runpod.io/console/user/settings

Add this to your ~/.bashrc or ~/.zshrc to persist it.

Cloud Run

Requires the gcloud CLI.

gcloud auth login
gcloud auth application-default login    # required for the Python SDK
gcloud config set project <YOUR_PROJECT_ID>

You also need billing enabled and the Cloud Run API (run.googleapis.com) enabled on your project.

Baseten

Step 1: Create account — sign up at https://app.baseten.co/signup/

Step 2: Get API key — go to Settings > API Keys (https://app.baseten.co/settings/api_keys), create a key, copy it immediately

Step 3: Set the API key

export BASETEN_API_KEY=<your-api-key>

Add to ~/.bashrc or ~/.zshrc to persist.

Step 4: Install and authenticate the Truss CLI

pip install --upgrade truss
truss login --api-key "$BASETEN_API_KEY"

Step 5: (For gated models) Add HuggingFace token — go to Settings > Secrets (https://app.baseten.co/settings/secrets), add a secret named hf_access_token with your HF token.

4. (Optional) Set HuggingFace token for gated models

export HF_TOKEN=<your-token>  # https://huggingface.co/settings/tokens

Required for models like Llama, Mistral, Gemma, and other gated models. Not needed for open models like Qwen.
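
Before deploying, it can help to confirm that the environment variables from the steps above are actually set. The following is a minimal sketch, not part of Tuna: the function name and the mapping are hypothetical, and the mapping only lists the variables this guide mentions (Modal and Cloud Run authenticate through their own CLIs).

```python
import os

# Environment variables named in the setup steps above. Modal and Cloud Run
# authenticate through their own CLIs, so they need no variable here.
PROVIDER_ENV_VARS = {
    "modal": [],
    "runpod": ["RUNPOD_API_KEY"],
    "cloudrun": [],
    "baseten": ["BASETEN_API_KEY"],
}


def missing_env_vars(provider, gated_model=False, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    required = list(PROVIDER_ENV_VARS[provider])
    if gated_model:
        # Gated HuggingFace models additionally need HF_TOKEN.
        required.append("HF_TOKEN")
    return [name for name in required if not env.get(name)]


print(missing_env_vars("runpod", gated_model=True))
```

An empty list means nothing is missing. The built-in tuna check command described in the next step remains the authoritative validation.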

5. Validate your setup

tuna check --provider modal                          # check Modal credentials
tuna check --provider runpod                         # check RunPod API key
tuna check --provider cloudrun --gcp-project <id> --gcp-region us-central1  # check Cloud Run
tuna check --provider baseten                        # check Baseten API key + truss CLI

6. Deploy a model

tuna deploy --model Qwen/Qwen3-0.6B --gpu L4 --service-name my-first-deploy

Tuna auto-selects the cheapest serverless provider for your GPU, launches spot instances on AWS, and gives you a single endpoint. The router handles everything — serverless covers traffic immediately while spot boots up in the background.

7. Send requests (OpenAI-compatible)

curl http://<router-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}]}'
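
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the helper names are hypothetical, and reading the reply assumes the router returns the usual OpenAI chat completion schema (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(base_url, model, prompt, timeout=60):
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (replace <router-ip> with the address of your deployment):
#   reply = send_chat("http://<router-ip>:8080", "Qwen/Qwen3-0.6B", "Hello")
#   print(reply["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, an OpenAI client library pointed at the router's base URL should also work, though that is not shown here.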

8. Monitor and manage

tuna status --service-name my-first-deploy    # check deployment status
tuna cost --service-name my-first-deploy      # real-time cost dashboard
tuna list                                     # list all deployments
tuna destroy --service-name my-first-deploy   # tear down everything

Tip: If you don't pass --service-name during deploy, Tuna auto-generates a name like tuna-a3f8c21b. Use tuna list to find it.

9. Browse GPU pricing

tuna show-gpus                     # compare serverless pricing across providers
tuna show-gpus --spot              # include AWS spot prices
tuna show-gpus --gpu H100          # detailed pricing for a specific GPU
tuna show-gpus --provider runpod   # filter to one provider

Architecture

                ┌──────────────────────┐
                │    User Traffic      │
                │ (OpenAI-compatible)  │
                └──────────┬───────────┘
                           │
                  ┌────────▼────────┐
                  │  Smart Router   │
                  │  (meta_lb)      │
                  └────────┬────────┘
                           │
              ┌────────────┴────────────┐
              │                         │
     ┌────────▼─────────┐    ┌─────────▼─────────┐
     │ Serverless        │    │ Spot GPUs          │
     │ Modal / RunPod /  │    │ AWS via SkyPilot   │
     │ Cloud Run         │    │                    │
     │                   │    │ • 3-5x cheaper     │
     │ • Fast cold start │    │ • Slower cold start│
     │ • Per-second bill │    │ • Auto-failover    │
     │ • Always ready    │    │ • Scale to zero    │
     └───────────────────┘    └────────────────────┘

The router:

  • Routes to serverless while spot instances are starting up
  • Shifts traffic to spot once ready (cheaper)
  • Falls back to serverless if spot has issues or high latency
  • Scales serverless down to zero when spot is serving
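
The routing rules above can be sketched as a small decision function. This is an illustration only and not the actual meta_lb code: the SpotState fields, the function name, and the 2000 ms latency threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SpotState:
    ready: bool        # at least one spot replica has finished booting
    healthy: bool      # recent requests to spot succeeded
    latency_ms: float  # recently observed latency on the spot backend


def choose_backend(spot: SpotState, max_latency_ms: float = 2000.0) -> str:
    """Pick the backend for the next request (hypothetical policy)."""
    if not spot.ready:
        # Spot is still starting up: serverless covers the cold start.
        return "serverless"
    if not spot.healthy or spot.latency_ms > max_latency_ms:
        # Spot has issues or is too slow: fall back to serverless.
        return "serverless"
    # Spot is ready and behaving: prefer it because it is cheaper.
    return "spot"


print(choose_backend(SpotState(ready=False, healthy=True, latency_ms=0.0)))
print(choose_backend(SpotState(ready=True, healthy=True, latency_ms=150.0)))
```

Once requests stop reaching serverless, its own idle scale-down brings it to zero, which corresponds to the last bullet.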

CLI Reference

Command     Description
deploy      Deploy a model across serverless + spot
destroy     Tear down a deployment
status      Check deployment status
cost        Show cost dashboard (requires running deployment)
list        List all deployments (filter with --status active|destroyed|failed)
show-gpus   GPU pricing across providers (filter with --provider, --gpu, --spot)
check       Validate provider credentials and setup

deploy flags

Flag                        Default                   Description
--model                     (required)                HuggingFace model ID (e.g. Qwen/Qwen3-0.6B)
--gpu                       (required)                GPU type (e.g. L4, L40S, A100, H100)
--gpu-count                 1                         Number of GPUs
--serverless-provider       auto (cheapest for GPU)   modal, runpod, cloudrun, or baseten
--spots-cloud               aws                       Cloud provider for spot GPUs
--region                    -                         Cloud region for spot instances
--tp-size                   1                         Tensor parallelism degree
--max-model-len             4096                      Maximum sequence length (context window)
--concurrency               -                         Override serverless concurrency limit
--workers-max               -                         Max serverless workers (RunPod only)
--cold-start-mode           fast_boot                 fast_boot (uses --enforce-eager, faster startup but lower throughput) or no_fast_boot
--no-scale-to-zero          off                       Keep minimum 1 spot replica running
--scaling-policy            -                         Path to scaling YAML (see below)
--service-name              auto-generated            Custom service name (recommended; makes status/destroy easier)
--public                    off                       Make service publicly accessible (no auth)
--use-different-vm-for-lb   off                       Launch router on a separate VM instead of colocating on controller
--gcp-project               -                         Google Cloud project ID
--gcp-region                -                         Google Cloud region (e.g. us-central1)

Use -v / --verbose with any command for debug logging.

Scaling Policy

All autoscaling parameters can be configured via a YAML file passed with --scaling-policy. If omitted, sane defaults apply.

spot:
  min_replicas: 0        # 0 = scale to zero (default)
  max_replicas: 5
  target_qps: 10         # per-replica QPS target
  upscale_delay: 5       # seconds before adding replicas
  downscale_delay: 300   # seconds before removing replicas

serverless:
  concurrency: 32        # max concurrent requests per container
  scaledown_window: 60   # seconds idle before scaling down
  timeout: 600           # request timeout in seconds
  workers_min: 0         # min workers (RunPod only)
  workers_max: 1         # max workers (RunPod only)
  scaler_value: 4        # queue delay scaler threshold (RunPod only)
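
To make the spot settings concrete, here is a minimal sketch of how a per-replica QPS target could translate into a replica count. It is an assumption about the general technique (ceiling division, clamped to the configured bounds), not Tuna's or SkyServe's actual autoscaler, and it ignores upscale_delay and downscale_delay. The function name is hypothetical; the default values mirror the sample YAML above.

```python
import math


def desired_spot_replicas(current_qps, target_qps=10, min_replicas=0, max_replicas=5):
    """Replicas needed so each one handles at most target_qps, within bounds."""
    if target_qps <= 0:
        raise ValueError("target_qps must be positive")
    needed = math.ceil(current_qps / target_qps)
    return max(min_replicas, min(max_replicas, needed))


for qps in (0, 25, 500):
    print(qps, desired_spot_replicas(qps))
```

With min_replicas set to 0, an idle service needs no spot replicas at all, which is what scale to zero means here.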

Precedence, from lowest to highest: built-in defaults, then the YAML file, then CLI flags. For example, --concurrency 64 overrides serverless.concurrency from the YAML. --no-scale-to-zero forces spot.min_replicas to at least 1 and sets serverless.scaledown_window to 300s.

Unknown keys in the YAML will error immediately (catches typos).
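
The precedence and validation rules described above can be sketched as a layered merge. This is a hedged illustration rather than Tuna's implementation: resolve_policy is a hypothetical name, the YAML is assumed to be already parsed into a dict, and DEFAULTS simply reuses the values from the sample file, which may differ from the real built-in defaults.

```python
import copy

# Assumed defaults, copied from the sample YAML above.
DEFAULTS = {
    "spot": {"min_replicas": 0, "max_replicas": 5, "target_qps": 10,
             "upscale_delay": 5, "downscale_delay": 300},
    "serverless": {"concurrency": 32, "scaledown_window": 60, "timeout": 600,
                   "workers_min": 0, "workers_max": 1, "scaler_value": 4},
}


def resolve_policy(yaml_policy=None, concurrency=None, no_scale_to_zero=False):
    """Layer defaults, then YAML values, then CLI flags; reject unknown keys."""
    policy = copy.deepcopy(DEFAULTS)
    for section, values in (yaml_policy or {}).items():
        if section not in policy:
            raise ValueError(f"unknown section: {section}")
        for key, value in values.items():
            if key not in policy[section]:
                raise ValueError(f"unknown key: {section}.{key}")
            policy[section][key] = value
    # CLI flags are applied last, so they win over the YAML file.
    if concurrency is not None:
        policy["serverless"]["concurrency"] = concurrency
    if no_scale_to_zero:
        policy["spot"]["min_replicas"] = max(1, policy["spot"]["min_replicas"])
        policy["serverless"]["scaledown_window"] = 300
    return policy


resolved = resolve_policy({"serverless": {"concurrency": 16}}, concurrency=64)
print(resolved["serverless"]["concurrency"])
```

Rejecting unknown keys up front means a misspelled setting such as max_replica fails loudly instead of being silently ignored.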

Troubleshooting

Setup issues

Start with the built-in diagnostic tool:

tuna check --provider runpod
tuna check --provider modal
tuna check --provider cloudrun --gcp-project <id> --gcp-region us-central1
tuna check --provider baseten

This validates credentials, API access, project configuration, and GPU region availability.

Endpoint not responding

# Check your deployment status
tuna status --service-name <name>

# Check router health directly
curl http://<router-ip>:8080/router/health

# Check SkyPilot cluster status (use sky serve status to list SkyServe services)
sky status --refresh

High latency

Check which backend is serving traffic:

curl http://<router-ip>:8080/router/health

If skyserve_ready is false, spot instances are still booting and requests are being served by serverless, which is the expected behavior. Once spot is ready, traffic shifts automatically.
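
If you want to script this check, a small helper can read the flag from the health response. This is a sketch that assumes the endpoint returns a JSON object containing a boolean skyserve_ready field; any other fields in the response are unknown here, and the function name is hypothetical.

```python
import json


def spot_is_ready(health_body: str) -> bool:
    """Return True only if the health JSON reports skyserve_ready as true."""
    health = json.loads(health_body)
    return health.get("skyserve_ready") is True


# Example with a hand-written body; in practice pass the text returned by
# curl http://<router-ip>:8080/router/health
print(spot_is_ready('{"skyserve_ready": false}'))
```

A missing field is treated as not ready, so the helper errs on the side of reporting that spot is still booting.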

Gated model fails to load

If the deployment succeeds but the model fails to start, you likely need a HuggingFace token:

export HF_TOKEN=<your-token>

Then redeploy.

License

MIT

This project depends on SkyPilot (Apache License 2.0).
