Skip to main content

Dynamo KVBM

Project description

Dynamo KVBM

The Dynamo KVBM is a distributed KV-cache block management system designed for scalable LLM inference. It cleanly separates memory management from inference runtimes (vLLM, TensorRT-LLM, and SGLang), enabling GPU↔CPU↔Disk/Remote tiering, asynchronous block offload/onboard, and efficient block reuse.

A block diagram showing a layered architecture view of Dynamo KV Block manager.

Feature Highlights

  • Distributed KV-Cache Management: Unified GPU↔CPU↔Disk↔Remote tiering for scalable LLM inference.
  • Async Offload & Reuse: Seamlessly move KV blocks between memory tiers using GDS-accelerated transfers powered by NIXL, without recomputation.
  • Runtime-Agnostic: Works out-of-the-box with vLLM, TensorRT-LLM, and SGLang via lightweight connectors.
  • Memory-Safe & Modular: RAII lifecycle and pluggable design for reliability, portability, and backend extensibility.

Installation

pip install kvbm

See the support matrix for version compatibility questions.

Build from Source

The pip wheel is built through a Docker build process:

# Render and build the Docker image with KVBM enabled (from the dynamo repo root)
python container/render.py --framework dynamo --target runtime --output-short-filename
docker build --build-arg ENABLE_KVBM="true" -f container/rendered.Dockerfile .

Once built, you can either:

Option 1: Run and use the container directly

./container/run.sh --framework none -it

Option 2: Extract the wheel file to your local filesystem

# Create a temporary container from the built image
docker create --name temp-kvbm-container local-kvbm:latest

# Copy the KVBM wheel to your current directory
docker cp temp-kvbm-container:/opt/dynamo/wheelhouse/ ./dynamo_wheelhouse

# Clean up the temporary container
docker rm temp-kvbm-container

# Install the wheel locally
pip install ./dynamo_wheelhouse/kvbm*.whl

Note that the default pip wheel built is not compatible with CUDA 13 at the moment.

Integrations

Environment Variables

Variable Description Default
DYN_KVBM_CPU_CACHE_GB CPU pinned memory cache size (GB) required
DYN_KVBM_DISK_CACHE_GB SSD Disk/Storage system cache size (GB) optional
DYN_KVBM_DISK_CACHE_DIR Disk cache directory /tmp/
DYN_KVBM_DISK_ZEROFILL_FALLBACK Enable zero-fill when fallocate() unsupported (e.g., Lustre) false
DYN_KVBM_DISK_DISABLE_O_DIRECT Disable O_DIRECT for disk I/O (debug/compatibility) false
DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS Timeout (in seconds) for the KVBM leader and worker to synchronize and allocate the required memory and storage. Increase this value if allocating large amounts of memory or storage. 120
DYN_KVBM_METRICS Enable metrics endpoint false
DYN_KVBM_METRICS_PORT Metrics port 6880
DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER Disable disk offload filtering to remove SSD lifespan protection false
DYN_KVBM_HOST_OFFLOAD_PREFIX_MIN_PRIORITY Minimum priority (0-100) for CPU offload with contiguous (prefix) semantics: offloading stops at the first block below threshold, and all subsequent blocks are also skipped. Used for priority-based filtering. 0 (no filtering)
DYN_KVBM_NCCL_MLA_MODE Enable NCCL replicated mode for MLA (Multi-Layer Attention) models (e.g., DeepSeek). When set to true, rank 0 loads KV blocks from G2/G3 storage and broadcasts to all GPUs via NCCL instead of each GPU loading independently. Requires MPI and optional nccl feature for optimal behavior. false

Disk Storage Configuration

Why special configuration may be needed:

Some filesystems (e.g., Lustre, certain network filesystems) don't support fallocate(), which KVBM uses for fast disk space allocation. Additionally, KVBM uses O_DIRECT I/O for GPU DirectStorage (GDS) performance, which requires strict 4096-byte alignment.

Setup for filesystems without fallocate() support:

export DYN_KVBM_DISK_CACHE_DIR=/mnt/storage/kvbm_cache
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true  # Enables zero-fill fallback when fallocate() unsupported

What happens:

  • Without ZEROFILL_FALLBACK=true: Disk cache allocation may fail with "Operation not supported"
  • With ZEROFILL_FALLBACK=true: KVBM writes zeros using page-aligned buffers compatible with O_DIRECT requirements

Troubleshooting: If you encounter "write all error" or EINVAL (errno 22), try disabling O_DIRECT: export DYN_KVBM_DISK_DISABLE_O_DIRECT=true

vLLM

DYN_KVBM_CPU_CACHE_GB=100 vllm serve \
  --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both","kv_connector_module_path":"kvbm.vllm_integration.connector"}' \
  Qwen/Qwen3-8B

For more detailed integration with dynamo, disaggregated serving support and benchmarking, please check vllm-setup

TensorRT-LLM

cat >/tmp/kvbm_llm_api_config.yaml <<EOF
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
kv_connector_config:
  connector_module: kvbm.trtllm_integration.connector
  connector_scheduler_class: DynamoKVBMConnectorLeader
  connector_worker_class: DynamoKVBMConnectorWorker
EOF

DYN_KVBM_CPU_CACHE_GB=100 trtllm-serve Qwen/Qwen3-8B \
  --host localhost --port 8000 \
  --backend pytorch \
  --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml

For more detailed integration with dynamo and benchmarking, please check trtllm-setup

📚 Docs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

kvbm-1.1.1-cp310-abi3-manylinux_2_28_x86_64.whl (11.1 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.28+ x86-64

kvbm-1.1.1-cp310-abi3-manylinux_2_28_aarch64.whl (9.7 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.28+ ARM64

File details

Details for the file kvbm-1.1.1-cp310-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for kvbm-1.1.1-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 51946d8a197911beeaca185d9e3e4fe95dcc7140cd3f5d33250513f34ca2b297
MD5 16d2d08562a46c53fded115971ecf621
BLAKE2b-256 00ab0d645197d0b8f3ab3b533c5965bd8b7b067a306b49fe5de58aa311d828de

See more details on using hashes here.

File details

Details for the file kvbm-1.1.1-cp310-abi3-manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for kvbm-1.1.1-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 f30bf0ddd6fe4feee2a0c9913058933b63227e4bf09db7c1cc38c09a4e39a9a6
MD5 24b7a8323220bb1b57a3acbddb1c31ed
BLAKE2b-256 59b0db2aecb3beb4a21440414270f52cd4113918cdf9e5b2f15c2d899f8003cb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page