Deployment tools for OverlayBD on GCE VMs
OverlayBD Deployment for Colab Runtime
Deploy OverlayBD on a GCE VM to run large container images with fast startup using block-device-level snapshotting.
What This Does
OverlayBD converts OCI container images into a block-device format that containerd can mount efficiently. Instead of extracting every layer sequentially, the image is presented as a virtual block device via the Linux TCMU (Target Core Module Userspace) subsystem.
Tested results: ~170ms container startup (warm cache) for a 27GB Colab runtime image. With lazy loading: ~2.2s lazy pull + ~5.6s cold start.
Prerequisites
- GCE VM running Debian 11 (Bullseye) with target_core_user kernel support
- containerd installed (the install script handles this)
- gcloud CLI (for OAuth2 token generation) or a GCP service account key
- Root/sudo access
Installation
pip install .
# or for development (with pytest, mypy, ruff):
pip install -e ".[dev]"
This installs the overlaybd-deploy CLI (subcommands require sudo):
overlaybd-deploy install
overlaybd-deploy setup-credentials
overlaybd-deploy pull-image
overlaybd-deploy convert-image
overlaybd-deploy profile-startup
overlaybd-deploy manage-cache
overlaybd-deploy health-check
Quick Start
# 1. Install OverlayBD (idempotent)
sudo overlaybd-deploy install
# 2. Set up credentials (pick one)
sudo overlaybd-deploy setup-credentials oauth2 # temporary (1 hour)
sudo overlaybd-deploy setup-credentials service-account /path/to/key.json # permanent
# 3. Pull and run
export GOOGLE_CLOUD_PROJECT=${GOOGLE_CLOUD_PROJECT:-$(gcloud config get-value project)}
sudo overlaybd-deploy pull-image
sudo ctr run --snapshotter overlaybd --rm \
"us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd" \
test /bin/echo "hello from overlaybd"
Detailed Walkthrough
1. Install
sudo overlaybd-deploy install
This command:
- Loads the target_core_user kernel module
- Installs the overlaybd-tcmu and overlaybd-snapshotter packages
- Writes config files to /etc/overlaybd/ and /etc/overlaybd-snapshotter/
- Adds the OverlayBD proxy plugin to /etc/containerd/config.toml
- Creates /opt/overlaybd/cred.json (empty, ready for credentials)
- Starts all three services and verifies they're healthy
2. Configure Credentials
OverlayBD needs registry credentials stored in /opt/overlaybd/cred.json. The format must be:
{
"auths": {
"us-docker.pkg.dev": {
"username": "oauth2accesstoken",
"password": "<token>"
}
}
}
Important: The {"auths": {...}} wrapper is required. Flat credential objects will not work.
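If you generate this file from your own tooling, a minimal sketch of writing it in the required shape (the path is the one this guide uses; obtaining the token is up to you) looks like:

```python
import json

def make_cred_file(registry, username, password, path="/opt/overlaybd/cred.json"):
    """Write registry credentials in the Docker-config format OverlayBD expects.

    Note the mandatory {"auths": {...}} wrapper; a flat credential object
    is silently ignored.
    """
    creds = {"auths": {registry: {"username": username, "password": password}}}
    with open(path, "w") as f:
        json.dump(creds, f, indent=2)
    return creds
```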
Option A: Service Account Key (Recommended)
Service account keys don't expire and are suitable for production/automation.
sudo overlaybd-deploy setup-credentials service-account /path/to/sa-key.json
The key file is used as the password with _json_key as the username.
Option B: OAuth2 Access Token
Quick setup for testing. Tokens expire in ~60 minutes.
sudo overlaybd-deploy setup-credentials oauth2
Re-run to refresh when the token expires.
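If a VM needs to keep working with OAuth2 tokens unattended, one option is to re-run the refresh on a timer, well inside the ~60-minute lifetime. A hypothetical cron entry (the binary path is an assumption; adjust to where pip installed the CLI):

```
# /etc/cron.d/overlaybd-cred (hypothetical): refresh the registry token twice an hour
0,30 * * * * root /usr/local/bin/overlaybd-deploy setup-credentials oauth2
```

For anything long-running, service account keys (Option A) remain the simpler choice.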
Verify credentials
sudo overlaybd-deploy setup-credentials verify
3. Pull an Image
# Pull the pre-converted Colab runtime image (downloads all blobs)
sudo overlaybd-deploy pull-image
# Pull a custom image
sudo overlaybd-deploy pull-image us-docker.pkg.dev/my-project/my-repo/my-image:tag_obd
The script uses rpull --user --download-blobs which:
- Fetches the OverlayBD manifest and layer metadata
- Downloads all blob data locally (reliable for large/private images)
Lazy Loading (no blob download)
For faster pulls, use --no-download to skip downloading blobs. Layers are fetched on-demand from the registry when the container reads them:
sudo overlaybd-deploy pull-image --no-download
When --no-download is used, the command automatically:
- Refreshes OAuth2 credentials (via overlaybd-deploy setup-credentials oauth2)
- Restarts overlaybd-tcmu to clear stale cached state
This prevents a known issue where stale tokens cause TCMU to fail authentication.
To skip the automatic refresh (e.g., when using service account keys):
sudo overlaybd-deploy pull-image --no-download --skip-refresh
4. Run a Container
# Quick test
sudo ctr run --snapshotter overlaybd --rm \
"us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd" \
test /bin/echo "hello"
# Interactive shell
sudo ctr run --snapshotter overlaybd --rm -t \
"us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd" \
shell /bin/bash
5. Convert Your Own Images
If you have a standard OCI image and want to convert it to OverlayBD format:
sudo overlaybd-deploy convert-image \
us-docker.pkg.dev/colab-images/public/runtime \
"us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd"
Requirements: Push access to the target repository. The obdconv method pulls the source, converts locally, and pushes the result.
6. Manage Cache (Profile, Warm, Snapshot, Deploy)
To optimize container startup, you can profile an application's initial block accesses, use that profile to create a warm cache, and then snapshot and deploy that cache to other machines.
Step 1: Profile Application Startup
overlaybd-deploy profile-startup clears the cache and runs a container to record which data blocks are accessed during its initial startup. This generates a startup-profile.json file.
# Profile the default runtime image
sudo overlaybd-deploy profile-startup
# Profile a specific image and command
sudo overlaybd-deploy profile-startup --cmd "/bin/echo hello" <image-ref>
Step 2: Pre-warm the Local Cache
overlaybd-deploy manage-cache warm reads the startup-profile.json and downloads all the required blobs into the local SSD cache.
# Pre-warm the cache using the generated profile
sudo overlaybd-deploy manage-cache warm
After this step, subsequent container starts will be much faster as they will read from the local SSD instead of the network.
Step 3: Snapshot the Warm Cache
overlaybd-deploy manage-cache snapshot exports the entire warm cache to either Google Cloud Storage (GCS) or a persistent disk snapshot. This creates a portable artifact that can be deployed to other VMs.
# Snapshot the cache to a GCS bucket (default method)
sudo overlaybd-deploy manage-cache snapshot --bucket my-project-overlaybd-cache
# Snapshot the cache to a GCE disk snapshot
sudo overlaybd-deploy manage-cache snapshot --method disk --name my-cache-snapshot-v1
Step 4: Deploy the Cache to a New VM
overlaybd-deploy manage-cache deploy is the final step. On a new VM, it imports a cache from GCS or a disk snapshot, making it ready for immediate warm starts.
# Deploy the latest cache from a GCS bucket
sudo overlaybd-deploy manage-cache deploy --bucket my-project-overlaybd-cache
# Deploy a specific cache version from GCS
sudo overlaybd-deploy manage-cache deploy --bucket my-project-overlaybd-cache --name my-cache-v1
# Deploy from a disk snapshot
sudo overlaybd-deploy manage-cache deploy --method disk --snapshot my-cache-snapshot-v1
This workflow ensures that new VMs can be provisioned with a fully populated cache, providing consistent, fast container startup times across a fleet.
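The template-VM half of this workflow (profile, warm, snapshot) can be scripted; a sketch that shells out to the documented subcommands, with the bucket name as a placeholder and no retry logic:

```python
import subprocess

# The documented cache pipeline on a template VM: profile -> warm -> snapshot.
# The bucket name is a placeholder; new VMs would then run "manage-cache deploy".
PIPELINE = [
    ["overlaybd-deploy", "profile-startup"],
    ["overlaybd-deploy", "manage-cache", "warm"],
    ["overlaybd-deploy", "manage-cache", "snapshot",
     "--bucket", "my-project-overlaybd-cache"],
]

def run_pipeline(steps=PIPELINE, dry_run=False):
    """Run each step under sudo, stopping on the first failure."""
    cmds = [["sudo", *step] for step in steps]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```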
Health Check
sudo overlaybd-deploy health-check # quick check
sudo overlaybd-deploy health-check -v # verbose output
Checks: kernel module, services, containerd plugin, config files, credentials, disk space, and loaded images.
Performance Numbers (Measured)
| Metric | Value |
|---|---|
| Image size (Colab runtime) | ~27 GB |
| Image layers | 63 |
| Container startup (warm cache) | ~170ms |
| Container startup (cold, lazy) | ~5.6s (210 fetches, 66 MB) |
| rpull with --download-blobs | Depends on network (downloads full image) |
| rpull with --no-download (lazy) | ~2.2s (96KB metadata only) |
| Cache warm profile | 62 blobs, 66 MB |
| Lazy pull + warm start | ~2.4s total |
Architecture
containerd
└── overlaybd snapshotter (proxy plugin)
├── overlaybd-snapshotter (manages snapshots, serves gRPC)
└── overlaybd-tcmu (presents layers as TCMU block devices)
└── target_core_user (kernel module)
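The proxy plugin registration in /etc/containerd/config.toml looks roughly like this (the socket path shown is the upstream overlaybd-snapshotter default and may differ in this deployment):

```toml
# Registers the overlaybd snapshotter as a containerd proxy plugin.
[proxy_plugins.overlaybd]
  type = "snapshot"
  address = "/run/overlaybd-snapshotter/overlaybd.sock"
```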
Key files:
- /etc/overlaybd/overlaybd.json: TCMU config (cache, credentials, logging)
- /etc/overlaybd-snapshotter/config.json: Snapshotter config (socket, root dir)
- /etc/containerd/config.toml: containerd proxy plugin registration
- /opt/overlaybd/cred.json: Registry credentials
- /opt/overlaybd/startup-profile.json: Startup block access profile (generated by overlaybd-deploy profile-startup)
Known Limitations
- Must use registryFsVersion: "v1" with Google Artifact Registry: the default v2 HTTP client cannot handle the relative 302 redirects that Artifact Registry returns for blob downloads, causing connections to 0.0.0.0:80. The v1 client uses libcurl, which handles this correctly. The config template already sets "v1".
- OAuth2 tokens expire: tokens from gcloud auth print-access-token last ~60 minutes. For long-running or automated setups, use service account keys.
- TurboOCI conversion requires push access: the turboOCIconv method needs push access to the source repository (to write acceleration metadata). Use obdconv instead, which pushes to a separate target ref.
- Credential format: must use the {"auths": {"registry": {...}}} Docker config format. Other formats are silently ignored.
- Cache tuning: the bundled overlaybd.json config is tuned for performance: a 40 GB SSD cache, download.delay=0 (background download starts immediately after lazy pull), and download.maxMBps=1000. During profiling, overlaybd-deploy profile-startup temporarily sets delay=999999 to disable background download so that real on-demand fetches are captured.
Troubleshooting
Services won't start
# Check logs
sudo journalctl -u overlaybd-tcmu -n 50
sudo journalctl -u overlaybd-snapshotter -n 50
# Verify kernel module
lsmod | grep target_core_user
sudo modprobe target_core_user
Lazy loading connects to 0.0.0.0:80
This happens when registryFsVersion is set to "v2" (the default). The v2 HTTP client cannot follow relative 302 redirects from Google Artifact Registry. Fix: set "registryFsVersion": "v1" in /etc/overlaybd/overlaybd.json and restart overlaybd-tcmu.
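An excerpt of /etc/overlaybd/overlaybd.json with the fix applied (a fragment, not a full config; the download values shown are the ones this guide's bundled template uses, and all other fields are left as shipped):

```json
{
  "registryFsVersion": "v1",
  "download": { "delay": 0, "maxMBps": 1000 }
}
```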
It can also happen with expired OAuth2 tokens. Refresh with:
sudo overlaybd-deploy setup-credentials oauth2
sudo systemctl restart overlaybd-tcmu
For a permanent fix, use service account keys:
sudo overlaybd-deploy setup-credentials service-account /path/to/sa-key.json
rpull fails with auth errors
# Verify credential file format
sudo cat /opt/overlaybd/cred.json | python3 -m json.tool
# Refresh OAuth2 token
sudo overlaybd-deploy setup-credentials oauth2
# Test with explicit credentials
TOKEN=$(gcloud auth print-access-token)
sudo /opt/overlaybd/snapshotter/ctr rpull \
--user "oauth2accesstoken:${TOKEN}" \
--download-blobs \
"us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/colab-optimized/runtime:latest_obd"
containerd doesn't see overlaybd plugin
# Check plugin is registered
sudo ctr plugin ls | grep overlaybd
# Verify config
grep -A3 'proxy_plugins.overlaybd' /etc/containerd/config.toml
# Restart everything in order
sudo systemctl restart overlaybd-tcmu
sudo systemctl restart overlaybd-snapshotter
sudo systemctl restart containerd
Container fails to start
# Check overlaybd logs
sudo tail -50 /var/log/overlaybd.log
# Check audit log
sudo tail -50 /var/log/overlaybd-audit.log
# Verify image is properly loaded
sudo /opt/overlaybd/snapshotter/ctr image ls
File Layout
overlaybd-deploy/
├── pyproject.toml # Package config (pip install -e .)
├── README.md # This file
├── INSTALL.md # End-to-end deployment guide
├── overlaybd_deploy/ # Python package
│ ├── __init__.py
│ ├── constants.py # Shared paths, URLs, config
│ ├── utils.py # Logging, subprocess wrappers
│ ├── registry.py # Registry/image reference utilities
│ ├── config.py # Bundled config file access
│ ├── data/ # Bundled config templates
│ │ ├── overlaybd.json # TCMU config (registryFsVersion v1)
│ │ └── snapshotter-config.json # Snapshotter config
│ ├── cli.py # Single entry point dispatcher
│ └── commands/ # Subcommand implementations
│ ├── install.py # overlaybd-deploy install
│ ├── setup_credentials.py # overlaybd-deploy setup-credentials
│ ├── pull_image.py # overlaybd-deploy pull-image
│ ├── convert_image.py # overlaybd-deploy convert-image
│ ├── profile_startup.py # overlaybd-deploy profile-startup
│ ├── manage_cache.py # overlaybd-deploy manage-cache
│ └── health_check.py # overlaybd-deploy health-check
└── tests/ # pytest test suite
├── conftest.py
├── test_utils.py
├── test_registry.py
├── test_config.py
└── commands/
└── test_*.py # One test file per command