Deployment tools for OverlayBD on GCE VMs

OverlayBD Deployment for Colab Runtime

Deploy OverlayBD on a GCE VM to run large container images with fast startup using block-device-level snapshotting.

What This Does

OverlayBD converts OCI container images into a block-device format that containerd can mount efficiently. Instead of extracting every layer sequentially, the image is presented as a virtual block device via the Linux TCMU (Target Core Module Userspace) subsystem.

Tested results for a ~27 GB Colab runtime image: ~170 ms container startup with a warm cache. With lazy loading, the pull takes ~2.2 s and the first (cold) container start takes ~5.6 s.

Prerequisites

  • GCE VM running Debian 11 (Bullseye) with target_core_user kernel support
  • containerd installed (the install script handles this)
  • gcloud CLI (for OAuth2 token generation) or a GCP service account key
  • Root/sudo access

Installation

pip install .
# or for development (with pytest, mypy, ruff):
pip install -e ".[dev]"

This installs the overlaybd-deploy CLI (subcommands require sudo):

overlaybd-deploy install
overlaybd-deploy setup-credentials
overlaybd-deploy pull-image
overlaybd-deploy convert-image
overlaybd-deploy profile-startup
overlaybd-deploy manage-cache
overlaybd-deploy health-check

Quick Start

# 1. Install OverlayBD (idempotent)
sudo overlaybd-deploy install

# 2. Set up credentials (pick one)
sudo overlaybd-deploy setup-credentials oauth2                          # temporary (1 hour)
sudo overlaybd-deploy setup-credentials service-account /path/to/key.json  # permanent

# 3. Pull and run
export GOOGLE_CLOUD_PROJECT=${GOOGLE_CLOUD_PROJECT:-$(gcloud config get-value project)}
sudo overlaybd-deploy pull-image
sudo ctr run --snapshotter overlaybd --rm \
  "us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd" \
  test /bin/echo "hello from overlaybd"

Detailed Walkthrough

1. Install

sudo overlaybd-deploy install

This command:

  • Loads the target_core_user kernel module
  • Installs overlaybd-tcmu and overlaybd-snapshotter packages
  • Writes config files to /etc/overlaybd/ and /etc/overlaybd-snapshotter/
  • Adds the OverlayBD proxy plugin to /etc/containerd/config.toml
  • Creates /opt/overlaybd/cred.json (empty, for credentials)
  • Starts all three services (overlaybd-tcmu, overlaybd-snapshotter, containerd) and verifies they're healthy

2. Configure Credentials

OverlayBD needs registry credentials stored in /opt/overlaybd/cred.json. The format must be:

{
  "auths": {
    "us-docker.pkg.dev": {
      "username": "oauth2accesstoken",
      "password": "<token>"
    }
  }
}

Important: The {"auths": {...}} wrapper is required. Flat credential objects will not work.
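The shape requirement can be checked before restarting any service. The following is a minimal sketch and not part of overlaybd-deploy; the function names build_cred_doc and has_auths_wrapper are hypothetical, and the check only covers the structure shown above, not whether the token is valid.

```python
import json


def build_cred_doc(registry: str, username: str, password: str) -> dict:
    """Return a credential document in the {"auths": {...}} shape."""
    return {"auths": {registry: {"username": username, "password": password}}}


def has_auths_wrapper(doc: object) -> bool:
    """True if doc looks like {"auths": {registry: {username, password}}}."""
    if not isinstance(doc, dict):
        return False
    auths = doc.get("auths")
    if not isinstance(auths, dict) or not auths:
        return False
    # Every registry entry must carry both fields.
    return all(
        isinstance(entry, dict) and entry.keys() >= {"username", "password"}
        for entry in auths.values()
    )


if __name__ == "__main__":
    doc = build_cred_doc("us-docker.pkg.dev", "oauth2accesstoken", "<token>")
    print(json.dumps(doc, indent=2))
    # A flat object without the "auths" wrapper fails the check.
    flat = {"username": "oauth2accesstoken", "password": "<token>"}
    print(has_auths_wrapper(doc), has_auths_wrapper(flat))
```

To check the file on disk, pass has_auths_wrapper the result of json.load on /opt/overlaybd/cred.json (reading it requires root).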

Option A: Service Account Key (Recommended)

Service account keys don't expire and are suitable for production/automation.

sudo overlaybd-deploy setup-credentials service-account /path/to/sa-key.json

The contents of the key file are stored as the password, with _json_key as the username.

Option B: OAuth2 Access Token

Quick setup for testing. Tokens expire in ~60 minutes.

sudo overlaybd-deploy setup-credentials oauth2

Re-run to refresh when the token expires.
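For automation, the refresh decision can be made from the token's age. This is a sketch under stated assumptions: the ~60 minute lifetime comes from the text above, the 5 minute safety margin is an arbitrary choice, and needs_refresh is a hypothetical helper rather than something overlaybd-deploy provides.

```python
import time
from typing import Optional

TOKEN_LIFETIME_S = 60 * 60  # approximate lifetime stated above
SAFETY_MARGIN_S = 5 * 60    # assumed margin: refresh a little early


def needs_refresh(
    issued_at: float,
    now: Optional[float] = None,
    lifetime_s: float = TOKEN_LIFETIME_S,
    margin_s: float = SAFETY_MARGIN_S,
) -> bool:
    """True once the token is within margin_s of its expected expiry."""
    if now is None:
        now = time.time()
    return (now - issued_at) >= (lifetime_s - margin_s)
```

One possible source for issued_at is os.path.getmtime("/opt/overlaybd/cred.json"), assuming the file is rewritten on every refresh.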

Verify credentials

sudo overlaybd-deploy setup-credentials verify

3. Pull an Image

# Pull the pre-converted Colab runtime image (downloads all blobs)
sudo overlaybd-deploy pull-image

# Pull a custom image
sudo overlaybd-deploy pull-image us-docker.pkg.dev/my-project/my-repo/my-image:tag_obd

The command runs rpull with --user and --download-blobs, which:

  1. Fetches the OverlayBD manifest and layer metadata
  2. Downloads all blob data locally (reliable for large/private images)

Lazy Loading (no blob download)

For faster pulls, use --no-download to skip downloading blobs. Layers are fetched on-demand from the registry when the container reads them:

sudo overlaybd-deploy pull-image --no-download

When --no-download is used, the command automatically:

  1. Refreshes OAuth2 credentials (via overlaybd-deploy setup-credentials oauth2)
  2. Restarts overlaybd-tcmu to clear stale cached state

This prevents a known issue where stale tokens cause TCMU to fail authentication.

To skip the automatic refresh (e.g., when using service account keys):

sudo overlaybd-deploy pull-image --no-download --skip-refresh
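The two pull modes differ only in whether blobs are downloaded up front. The sketch below builds the underlying rpull command line, using the ctr path and flags shown in the Troubleshooting section of this README. It assumes that lazy mode simply omits --download-blobs; how overlaybd-deploy constructs the command internally is not shown in this document, and rpull_argv is a hypothetical helper.

```python
from typing import List

# Path of the snapshotter's ctr binary as used elsewhere in this README.
RPULL_CTR = "/opt/overlaybd/snapshotter/ctr"


def rpull_argv(image: str, user: str, download_blobs: bool = True) -> List[str]:
    """Build an rpull command line; lazy pulls leave out --download-blobs."""
    argv = [RPULL_CTR, "rpull", "--user", user]
    if download_blobs:
        argv.append("--download-blobs")
    argv.append(image)
    return argv


if __name__ == "__main__":
    image = "us-docker.pkg.dev/my-project/my-repo/my-image:tag_obd"
    print(rpull_argv(image, "oauth2accesstoken:<token>"))
    print(rpull_argv(image, "oauth2accesstoken:<token>", download_blobs=False))
```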

4. Run a Container

# Quick test
sudo ctr run --snapshotter overlaybd --rm \
  "us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd" \
  test /bin/echo "hello"

# Interactive shell
sudo ctr run --snapshotter overlaybd --rm -t \
  "us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd" \
  shell /bin/bash

5. Convert Your Own Images

If you have a standard OCI image and want to convert it to OverlayBD format:

sudo overlaybd-deploy convert-image \
  us-docker.pkg.dev/colab-images/public/runtime \
  "us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd"

Requirements: Push access to the target repository. The obdconv method pulls the source, converts locally, and pushes the result.
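The examples in this README tag converted images with an _obd suffix (latest_obd, tag_obd). A small helper can derive the target reference from a source reference. This is only a sketch: obd_target_ref is hypothetical, treating an untagged source as latest is an assumption that matches the example above, and digest references are rejected rather than handled.

```python
def obd_target_ref(source_ref: str, target_repo: str, suffix: str = "_obd") -> str:
    """Return target_repo tagged with the source's tag plus suffix."""
    if "@" in source_ref:
        raise ValueError("digest references are not handled in this sketch")
    # Only look for the tag in the last path component, so a registry
    # port such as localhost:5000 is not mistaken for a tag.
    last = source_ref.rsplit("/", 1)[-1]
    tag = last.split(":", 1)[1] if ":" in last else "latest"
    return f"{target_repo}:{tag}{suffix}"


if __name__ == "__main__":
    print(
        obd_target_ref(
            "us-docker.pkg.dev/colab-images/public/runtime",
            "us-docker.pkg.dev/my-project/colab-optimized/runtime",
        )
    )
```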

6. Manage Cache (Profile, Warm, Snapshot, Deploy)

To optimize container startup, you can profile an application's initial block accesses, use that profile to create a warm cache, and then snapshot and deploy that cache to other machines.

Step 1: Profile Application Startup

overlaybd-deploy profile-startup clears the cache and runs a container to record which data blocks are accessed during its initial startup. This generates a startup-profile.json file.

# Profile the default runtime image
sudo overlaybd-deploy profile-startup

# Profile a specific image and command
sudo overlaybd-deploy profile-startup --cmd "/bin/echo hello" <image-ref>

Step 2: Pre-warm the Local Cache

overlaybd-deploy manage-cache warm reads the startup-profile.json and downloads all the required blobs into the local SSD cache.

# Pre-warm the cache using the generated profile
sudo overlaybd-deploy manage-cache warm

After this step, subsequent container starts read from the local SSD instead of the network, which is the difference between the warm (~170 ms) and cold (~5.6 s) startup figures reported under Performance Numbers.

Step 3: Snapshot the Warm Cache

overlaybd-deploy manage-cache snapshot exports the entire warm cache to either Google Cloud Storage (GCS) or a persistent disk snapshot. This creates a portable artifact that can be deployed to other VMs.

# Snapshot the cache to a GCS bucket (default method)
sudo overlaybd-deploy manage-cache snapshot --bucket my-project-overlaybd-cache

# Snapshot the cache to a GCE disk snapshot
sudo overlaybd-deploy manage-cache snapshot --method disk --name my-cache-snapshot-v1

Step 4: Deploy the Cache to a New VM

overlaybd-deploy manage-cache deploy is the final step. On a new VM, it imports a cache from GCS or a disk snapshot, making it ready for immediate warm starts.

# Deploy the latest cache from a GCS bucket
sudo overlaybd-deploy manage-cache deploy --bucket my-project-overlaybd-cache

# Deploy a specific cache version from GCS
sudo overlaybd-deploy manage-cache deploy --bucket my-project-overlaybd-cache --name my-cache-v1

# Deploy from a disk snapshot
sudo overlaybd-deploy manage-cache deploy --method disk --snapshot my-cache-snapshot-v1

This workflow ensures that new VMs can be provisioned with a fully populated cache, providing consistent, fast container startup times across a fleet.
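The four steps split naturally into a "builder" VM (profile, warm, snapshot) and any number of "consumer" VMs (deploy). A sketch of that split for the GCS method follows, using only the subcommands and flags shown above; the function names and the builder/consumer naming are illustrative and not part of the tool.

```python
import subprocess
from typing import List, Optional

CLI = ["sudo", "overlaybd-deploy"]


def builder_commands(bucket: str) -> List[List[str]]:
    """Commands run once on the VM that produces the warm cache."""
    return [
        CLI + ["profile-startup"],
        CLI + ["manage-cache", "warm"],
        CLI + ["manage-cache", "snapshot", "--bucket", bucket],
    ]


def consumer_commands(bucket: str, name: Optional[str] = None) -> List[List[str]]:
    """Commands run on each new VM that imports the cache."""
    cmd = CLI + ["manage-cache", "deploy", "--bucket", bucket]
    if name:
        cmd += ["--name", name]
    return [cmd]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling run_all(builder_commands("my-project-overlaybd-cache")) executes the first three steps in order; check=True stops the sequence if a step fails, so a snapshot is never taken from a cache that was not warmed.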

Health Check

sudo overlaybd-deploy health-check         # quick check
sudo overlaybd-deploy health-check -v      # verbose output

Checks: kernel module, services, containerd plugin, config files, credentials, disk space, and loaded images.
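As an illustration of one of these checks, the kernel module test can be done by reading /proc/modules, which lists loaded modules with the module name as the first field on each line. This sketch is not the health-check implementation, which is not shown in this document; the function names are hypothetical. A module compiled into the kernel rather than loaded does not appear in /proc/modules, so a False result is not conclusive on such kernels.

```python
def module_loaded(name: str, proc_modules_text: str) -> bool:
    """True if name is the first field of any line of /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def kernel_module_loaded(name: str = "target_core_user") -> bool:
    """Check the running kernel; returns False if /proc/modules is unreadable."""
    try:
        with open("/proc/modules", encoding="utf-8") as fh:
            return module_loaded(name, fh.read())
    except OSError:
        return False
```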

Performance Numbers (Actual Tested)

| Metric | Value |
|---|---|
| Image size (Colab runtime) | ~27 GB |
| Image layers | 63 |
| Container startup (warm cache) | ~170 ms |
| Container startup (cold, lazy) | ~5.6 s (210 fetches, 66 MB) |
| rpull with --download-blobs | Depends on network (downloads full image) |
| rpull with --no-download (lazy) | ~2.2 s (96 KB metadata only) |
| Cache warm profile | 62 blobs, 66 MB |
| Lazy pull + warm start | ~2.4 s total |
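How the figures above were measured is not stated here. One simple way to take comparable measurements is to time the whole ctr run invocation from the outside. The sketch below does that with a median over a few runs; time_command is a hypothetical helper, and wall-clock time measured this way includes process startup overhead in addition to container startup.

```python
import statistics
import subprocess
import sys
import time
from typing import List


def time_command(argv: List[str], runs: int = 3) -> float:
    """Return the median wall-clock seconds over `runs` executions of argv."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; substitute the
    # "sudo ctr run --snapshotter overlaybd --rm ..." line as a list.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"median: {elapsed:.3f} s")
```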

Architecture

containerd
  └── overlaybd snapshotter (proxy plugin)
        ├── overlaybd-snapshotter  (manages snapshots, serves gRPC)
        └── overlaybd-tcmu         (presents layers as TCMU block devices)
              └── target_core_user (kernel module)

Key files:

  • /etc/overlaybd/overlaybd.json — TCMU config (cache, credentials, logging)
  • /etc/overlaybd-snapshotter/config.json — Snapshotter config (socket, root dir)
  • /etc/containerd/config.toml — Containerd proxy plugin registration
  • /opt/overlaybd/cred.json — Registry credentials
  • /opt/overlaybd/startup-profile.json — Startup block access profile (generated by overlaybd-deploy profile-startup)

Known Limitations

  1. Must use registryFsVersion: "v1" with Google Artifact Registry: The default v2 HTTP client cannot handle relative 302 redirects that Artifact Registry returns for blob downloads, causing connections to 0.0.0.0:80. The v1 client uses libcurl which handles this correctly. The config template already sets "v1".

  2. OAuth2 tokens expire: Tokens from gcloud auth print-access-token last ~60 minutes. For long-running or automated setups, use service account keys.

  3. TurboOCI conversion requires push access: The turboOCIconv method needs push access to the source repository (to write acceleration metadata). Use obdconv instead, which pushes to a separate target ref.

  4. Credential format: Must use {"auths": {"registry": {...}}} Docker config format. Other formats are silently ignored.

  5. Cache tuning: The bundled overlaybd.json config is tuned for performance: 40 GB SSD cache, download.delay=0 (background download starts immediately after lazy pull), and download.maxMBps=1000. During profiling, overlaybd-deploy profile-startup temporarily sets delay=999999 to disable background download so that real on-demand fetches are captured.
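Limitations 1 and 5 both come down to values in /etc/overlaybd/overlaybd.json, so they can be checked mechanically. The sketch below assumes that registryFsVersion is a top-level key and that delay sits inside a nested "download" object, as the dotted names above suggest; config_problems is a hypothetical helper, so verify the key layout against the bundled config before relying on it.

```python
from typing import List

PROFILING_DELAY = 999999  # value profile-startup sets temporarily (see item 5)


def config_problems(cfg: dict) -> List[str]:
    """Return human-readable problems found in a parsed overlaybd.json."""
    problems = []
    if cfg.get("registryFsVersion") != "v1":
        problems.append('registryFsVersion should be "v1" for Artifact Registry')
    download = cfg.get("download")
    if isinstance(download, dict) and download.get("delay", 0) >= PROFILING_DELAY:
        # A leftover profiling value means background download never starts.
        problems.append("download.delay still has the profiling value")
    return problems
```

Load the file with json.load and pass the resulting dict; an empty list means neither problem was found.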

Troubleshooting

Services won't start

# Check logs
sudo journalctl -u overlaybd-tcmu -n 50
sudo journalctl -u overlaybd-snapshotter -n 50

# Verify kernel module
lsmod | grep target_core_user
sudo modprobe target_core_user

Lazy loading connects to 0.0.0.0:80

This happens when registryFsVersion is set to "v2" (the default). The v2 HTTP client cannot follow relative 302 redirects from Google Artifact Registry. Fix: set "registryFsVersion": "v1" in /etc/overlaybd/overlaybd.json and restart overlaybd-tcmu.

It can also happen with expired OAuth2 tokens. Refresh with:

sudo overlaybd-deploy setup-credentials oauth2
sudo systemctl restart overlaybd-tcmu

For a permanent fix, use service account keys:

sudo overlaybd-deploy setup-credentials service-account /path/to/sa-key.json

rpull fails with auth errors

# Verify credential file format (note: this prints the stored secret to the terminal)
sudo cat /opt/overlaybd/cred.json | python3 -m json.tool

# Refresh OAuth2 token
sudo overlaybd-deploy setup-credentials oauth2

# Test with explicit credentials
TOKEN=$(gcloud auth print-access-token)
sudo /opt/overlaybd/snapshotter/ctr rpull \
  --user "oauth2accesstoken:${TOKEN}" \
  --download-blobs \
  "us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd"

containerd doesn't see overlaybd plugin

# Check plugin is registered
sudo ctr plugin ls | grep overlaybd

# Verify config
grep -A3 'proxy_plugins.overlaybd' /etc/containerd/config.toml

# Restart everything in order
sudo systemctl restart overlaybd-tcmu
sudo systemctl restart overlaybd-snapshotter
sudo systemctl restart containerd
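The grep check above can also be scripted. This sketch looks for an uncommented [proxy_plugins.overlaybd] table header in the text of config.toml. It deliberately avoids a TOML parser so it runs on older Python versions; has_overlaybd_proxy_plugin is a hypothetical helper, and it does not validate the keys inside the table.

```python
import re

# Matches the table header at the start of a line, with optional quotes
# around the plugin name; commented-out lines do not match.
_HEADER = re.compile(r'^\s*\[proxy_plugins\.("?)overlaybd\1\]\s*$', re.MULTILINE)


def has_overlaybd_proxy_plugin(config_toml_text: str) -> bool:
    """True if the containerd config declares the overlaybd proxy plugin."""
    return _HEADER.search(config_toml_text) is not None


if __name__ == "__main__":
    with open("/etc/containerd/config.toml", encoding="utf-8") as fh:
        print(has_overlaybd_proxy_plugin(fh.read()))
```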

Container fails to start

# Check overlaybd logs
sudo tail -50 /var/log/overlaybd.log

# Check audit log
sudo tail -50 /var/log/overlaybd-audit.log

# Verify image is properly loaded
sudo /opt/overlaybd/snapshotter/ctr image ls

File Layout

overlaybd-deploy/
├── pyproject.toml                     # Package config (pip install -e .)
├── README.md                          # This file
├── INSTALL.md                         # End-to-end deployment guide
├── overlaybd_deploy/                  # Python package
│   ├── __init__.py
│   ├── constants.py                   # Shared paths, URLs, config
│   ├── utils.py                       # Logging, subprocess wrappers
│   ├── registry.py                    # Registry/image reference utilities
│   ├── config.py                      # Bundled config file access
│   ├── data/                          # Bundled config templates
│   │   ├── overlaybd.json             # TCMU config (registryFsVersion v1)
│   │   └── snapshotter-config.json    # Snapshotter config
│   ├── cli.py                         # Single entry point dispatcher
│   └── commands/                      # Subcommand implementations
│       ├── install.py                 # overlaybd-deploy install
│       ├── setup_credentials.py       # overlaybd-deploy setup-credentials
│       ├── pull_image.py              # overlaybd-deploy pull-image
│       ├── convert_image.py           # overlaybd-deploy convert-image
│       ├── profile_startup.py         # overlaybd-deploy profile-startup
│       ├── manage_cache.py            # overlaybd-deploy manage-cache
│       └── health_check.py            # overlaybd-deploy health-check
└── tests/                             # pytest test suite
    ├── conftest.py
    ├── test_utils.py
    ├── test_registry.py
    ├── test_config.py
    └── commands/
        └── test_*.py                  # One test file per command
