Dynamo KVBM
The Dynamo KVBM is a distributed KV-cache block management system designed for scalable LLM inference. It cleanly separates memory management from inference runtimes (vLLM, TensorRT-LLM, and SGLang), enabling GPU↔CPU↔Disk/Remote tiering, asynchronous block offload/onboard, and efficient block reuse.
Feature Highlights
- Distributed KV-Cache Management: Unified GPU↔CPU↔Disk↔Remote tiering for scalable LLM inference.
- Async Offload & Reuse: Seamlessly move KV blocks between memory tiers using GDS-accelerated transfers powered by NIXL, without recomputation.
- Runtime-Agnostic: Works out-of-the-box with vLLM, TensorRT-LLM, and SGLang via lightweight connectors.
- Memory-Safe & Modular: RAII lifecycle and pluggable design for reliability, portability, and backend extensibility.
Build and Installation
The pip wheel is built through a Docker build process:

```shell
# Build the Docker image with KVBM enabled (from the dynamo repo root)
./container/build.sh --framework none --enable-kvbm --tag local-kvbm
```

Once built, you can either:

Option 1: Run and use the container directly

```shell
./container/run.sh --framework none -it
```

Option 2: Extract the wheel file to your local filesystem

```shell
# Create a temporary container from the built image
docker create --name temp-kvbm-container local-kvbm:latest

# Copy the KVBM wheel to your current directory
docker cp temp-kvbm-container:/opt/dynamo/wheelhouse/ ./dynamo_wheelhouse

# Clean up the temporary container
docker rm temp-kvbm-container

# Install the wheel locally
pip install ./dynamo_wheelhouse/kvbm*.whl
```
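After installing, a quick way to confirm the wheel is importable is to look the module up with the standard library. This is a minimal sketch; it only checks that a `kvbm` package can be found on the path, nothing more:

```python
import importlib.util

# Report whether the kvbm package is importable in the current environment.
status = "installed" if importlib.util.find_spec("kvbm") else "not installed"
print(status)
```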
Note: the default pip wheel is not currently compatible with CUDA 13.
Integrations
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `DYN_KVBM_CPU_CACHE_GB` | CPU pinned memory cache size (GB) | required |
| `DYN_KVBM_DISK_CACHE_GB` | SSD disk/storage system cache size (GB) | optional |
| `DYN_KVBM_DISK_CACHE_DIR` | Disk cache directory | `/tmp/` |
| `DYN_KVBM_DISK_ZEROFILL_FALLBACK` | Enable zero-fill when `fallocate()` is unsupported (e.g., Lustre) | `false` |
| `DYN_KVBM_DISK_DISABLE_O_DIRECT` | Disable O_DIRECT for disk I/O (debug/compatibility) | `false` |
| `DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS` | Timeout (seconds) for the KVBM leader and workers to synchronize and allocate the required memory and storage; increase when allocating large amounts | `120` |
| `DYN_KVBM_METRICS` | Enable metrics endpoint | `false` |
| `DYN_KVBM_METRICS_PORT` | Metrics port | `6880` |
| `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` | Disable disk offload filtering (removes SSD lifespan protection) | `false` |
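To avoid typos in these variable names, the settings above can be assembled programmatically before launching a server process. The helper below is a sketch, not part of the KVBM API: the `kvbm_env` function and `KVBM_DEFAULTS` dict are hypothetical names, with defaults mirroring the table.

```python
import os

# Defaults as listed in the table above (all values are strings in the environment).
KVBM_DEFAULTS = {
    "DYN_KVBM_DISK_CACHE_DIR": "/tmp/",
    "DYN_KVBM_DISK_ZEROFILL_FALLBACK": "false",
    "DYN_KVBM_DISK_DISABLE_O_DIRECT": "false",
    "DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS": "120",
    "DYN_KVBM_METRICS": "false",
    "DYN_KVBM_METRICS_PORT": "6880",
}

def kvbm_env(cpu_cache_gb, disk_cache_gb=None, **overrides):
    """Build an environment dict for a KVBM-enabled server.

    DYN_KVBM_CPU_CACHE_GB is required; DYN_KVBM_DISK_CACHE_GB is optional.
    """
    env = dict(KVBM_DEFAULTS)
    env["DYN_KVBM_CPU_CACHE_GB"] = str(cpu_cache_gb)
    if disk_cache_gb is not None:
        env["DYN_KVBM_DISK_CACHE_GB"] = str(disk_cache_gb)
    env.update({k: str(v) for k, v in overrides.items()})
    # Layer on top of the current environment, e.g. for subprocess.Popen(env=...).
    return {**os.environ, **env}

env = kvbm_env(100, disk_cache_gb=500, DYN_KVBM_METRICS="true")
```

The resulting dict can be passed as the `env=` argument of `subprocess.Popen` when spawning `vllm serve` or `trtllm-serve`.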
Disk Storage Configuration
Why special configuration may be needed:
Some filesystems (e.g., Lustre, certain network filesystems) don't support fallocate(), which KVBM uses for fast disk space allocation. Additionally, KVBM uses O_DIRECT I/O for GPU DirectStorage (GDS) performance, which requires strict 4096-byte alignment.
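The 4096-byte constraint means buffer addresses, file offsets, and transfer sizes must all be multiples of 4096 when O_DIRECT is in use. A small illustrative helper (hypothetical, not part of KVBM) for rounding sizes up to that boundary:

```python
O_DIRECT_ALIGNMENT = 4096  # strict alignment required by O_DIRECT / GDS

def align_up(n: int, alignment: int = O_DIRECT_ALIGNMENT) -> int:
    """Round n up to the next multiple of alignment (ceiling division)."""
    return -(-n // alignment) * alignment

# A transfer of 10 MB + 1 byte must be padded to the next 4096-byte boundary.
padded = align_up(10 * 1024 * 1024 + 1)
```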
Setup for filesystems without fallocate() support:

```shell
export DYN_KVBM_DISK_CACHE_DIR=/mnt/storage/kvbm_cache
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true  # Enables zero-fill fallback when fallocate() is unsupported
```
What happens:
- Without `DYN_KVBM_DISK_ZEROFILL_FALLBACK=true`: disk cache allocation may fail with "Operation not supported".
- With `DYN_KVBM_DISK_ZEROFILL_FALLBACK=true`: KVBM writes zeros using page-aligned buffers compatible with O_DIRECT requirements.
Troubleshooting: If you encounter "write all error" or EINVAL (errno 22), try disabling O_DIRECT: `export DYN_KVBM_DISK_DISABLE_O_DIRECT=true`
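The fallback mechanism can be illustrated with a standalone sketch: attempt `fallocate()` (via `os.posix_fallocate`), and write page-aligned zeros when the filesystem rejects it. This is an illustration of the idea, not KVBM's actual implementation:

```python
import errno
import os
import tempfile

CHUNK = 4096  # page-aligned chunk size, compatible with O_DIRECT requirements

def preallocate(path: str, size: int) -> str:
    """Reserve `size` bytes at `path`, falling back to zero-fill when the
    filesystem (e.g., Lustre) does not support fallocate()."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            os.posix_fallocate(fd, 0, size)
            return "fallocate"
        except OSError as e:
            # EOPNOTSUPP/EINVAL signal an unsupported operation on this filesystem.
            if e.errno not in (errno.EOPNOTSUPP, errno.EINVAL):
                raise
            zeros = b"\x00" * CHUNK
            written = 0
            while written < size:
                written += os.write(fd, zeros[: min(CHUNK, size - written)])
            return "zero-fill"
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "kvbm_cache.bin")
mode = preallocate(path, 8 * CHUNK)
```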
vLLM
```shell
DYN_KVBM_CPU_CACHE_GB=100 vllm serve \
  --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both","kv_connector_module_path":"kvbm.vllm_integration.connector"}' \
  Qwen/Qwen3-8B
```
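The `--kv-transfer-config` value is plain JSON, so it can be generated rather than hand-edited. A sketch, with the connector fields taken from the command above:

```python
import json
import shlex

# Connector settings for KVBM's vLLM integration, as in the serve command above.
kv_transfer_config = {
    "kv_connector": "DynamoConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "kvbm.vllm_integration.connector",
}

# Serialize and shell-quote the JSON for use on a command line.
arg = "--kv-transfer-config " + shlex.quote(json.dumps(kv_transfer_config))
```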
For more detailed integration with Dynamo, disaggregated serving support, and benchmarking, see the vllm-setup guide.
TensorRT-LLM
```shell
cat >/tmp/kvbm_llm_api_config.yaml <<EOF
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
kv_connector_config:
  connector_module: kvbm.trtllm_integration.connector
  connector_scheduler_class: DynamoKVBMConnectorLeader
  connector_worker_class: DynamoKVBMConnectorWorker
EOF

DYN_KVBM_CPU_CACHE_GB=100 trtllm-serve Qwen/Qwen3-8B \
  --host localhost --port 8000 \
  --backend pytorch \
  --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
```
For more detailed integration with Dynamo and benchmarking, see the trtllm-setup guide.
File details
Details for the file kvbm-0.8.1-cp310-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: kvbm-0.8.1-cp310-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 10.9 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `97621dcc85ac16203fa8e0db5845208d2bf5787fc669c45825aec243d4d3ed28` |
| MD5 | `125c26fd86793c557cdb0654c0af08a1` |
| BLAKE2b-256 | `9f9b9f249c8b9527fab87956f6e70483e447aac0342fb004b3186dae1b0ee07f` |
File details
Details for the file kvbm-0.8.1-cp310-abi3-manylinux_2_28_aarch64.whl.
File metadata
- Download URL: kvbm-0.8.1-cp310-abi3-manylinux_2_28_aarch64.whl
- Upload date:
- Size: 10.0 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `35b439f550e99d8374623f63871ae4e208d4687e7e801a7ccba05cdb7dcb4546` |
| MD5 | `30c45438e099d46766bb310d25312929` |
| BLAKE2b-256 | `558081e9577eca30d651d2d2456e079cc08d2f9c6e4eab068584c9b7b98caae6` |