# Agent Vector Protocol (AVP) — KV-Cache Transfer for Multi-Agent LLMs

Transfer KV-cache between LLM agents instead of regenerating text. Same multi-agent pipeline, up to 74% fewer tokens and 2-4.3x faster (see Benchmarks below).
## How Text Chains Waste Compute

```mermaid
graph LR
    subgraph text["Text Chain (today)"]
        direction LR
        A1["Agent A<br/>generates text"] -->|"serialize to text<br/>re-tokenize everything"| B1["Agent B<br/>re-processes from scratch"]
    end
    subgraph avp["AVP Latent Transfer"]
        direction LR
        A2["Agent A<br/>generates KV-cache"] -->|"binary transfer<br/>28-130 MB"| B2["Agent B<br/>picks up where A left off"]
    end
    style text fill:#fff3f3,stroke:#d44,stroke-width:2px
    style avp fill:#f3fff3,stroke:#4a4,stroke-width:2px
```
Every multi-agent framework today — LangChain, CrewAI, AutoGen, OpenAI Swarm — copies text between agents. Each agent re-tokenizes and re-processes everything prior agents already computed. Our benchmarks show 47-53% of all tokens in text chains are redundant re-processing.
AVP eliminates this by transferring the KV-cache (the computed attention states) directly. The receiving agent reads prior reasoning from attention states instead of re-computing it from text.
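To make the redundancy concrete, here is a toy token-accounting sketch for a 2-agent chain. The token counts are invented for illustration and are not the benchmark numbers:

```python
# Toy token accounting for a 2-agent chain (hypothetical counts, not benchmarks).
# In a text chain, agent B must re-process everything agent A read and wrote;
# with a KV-cache handoff, B only processes its own new prompt.

prompt_a, output_a = 200, 300   # tokens agent A reads / writes
prompt_b, output_b = 50, 100    # tokens agent B adds on top

# Text chain: B re-tokenizes A's prompt and output before doing its own work.
text_chain_tokens = (prompt_a + output_a) + (prompt_a + output_a + prompt_b + output_b)

# Latent handoff: A's work arrives as attention states, so B skips it.
latent_tokens = (prompt_a + output_a) + (prompt_b + output_b)

redundant = text_chain_tokens - latent_tokens
print(f"text chain: {text_chain_tokens} tokens, latent: {latent_tokens} tokens")
print(f"redundant re-processing: {redundant / text_chain_tokens:.0%}")
```

The redundant fraction grows with chain length, since every later agent re-processes all earlier context.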
## Key Results

| Metric | Value |
|---|---|
| Token savings vs text chains | 19-74% across 4 benchmarks |
| Speed improvement | 2-4.3x faster |
| HotpotQA: latent beats text AND direct | 35% EM / 0.54 F1 (vs 30% EM direct, 20% EM text) |
| Models validated | Qwen2.5, DeepSeek-R1, Llama 3.2 |
| Tests | 288 passing (276 unit + 12 integration) |

Full results: docs/BENCHMARKS.md
## Quick Start

High-level API (5 lines):

```python
from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Agent A: latent reasoning (no text output, builds KV-cache)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)

# Agent B: generate with Agent A's context
answer = connector.generate("Now compute the final answer.", context=context)
```
Cross-process transfer:

```python
# Process A: serialize context
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B: restore and generate
from avp import AVPContext

restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate("Solve it.", context=restored)
```
Production serving (vLLM):

vLLM can't expose per-step hidden states, so latent transfer happens at the engine level via a KV connector plugin — transparent to your application code:

```bash
# Launch vLLM with AVP KV connector
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --kv-connector AVPKVConnectorV1Dynamic \
    --kv-connector-module-path avp.connectors.vllm_kv_connector
```

```python
# Application code stays simple — KV transfer happens behind the scenes
from avp import VLLMConnector

connector = VLLMConnector(model_id="Qwen/Qwen2.5-7B-Instruct")
answer = connector.generate("Analyze and solve: 24 * 17 + 3")
```
The AVPKVConnectorV1Dynamic plugin saves/loads KV-cache between vLLM instances via a file-based store, so agents on the same machine share computed attention states without re-processing.
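For intuition, here is a minimal sketch of what a file-based KV store can look like: one file per prompt prefix, keyed by a content hash. The class name, pickle serialization, and layout are all illustrative assumptions, not the actual plugin, which operates inside the vLLM engine:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

class FileKVStore:
    """Illustrative file-based KV-cache store (NOT the real plugin).
    Each prompt prefix maps to one file named by a content hash, so a
    second engine instance can find what the first one computed."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _key(self, prompt: str) -> Path:
        digest = hashlib.sha256(prompt.encode()).hexdigest()[:16]
        return self.root / f"{digest}.kv"

    def save(self, prompt: str, kv_tensors) -> None:
        # A production store would use a binary tensor codec, not pickle.
        self._key(prompt).write_bytes(pickle.dumps(kv_tensors))

    def load(self, prompt: str):
        path = self._key(prompt)
        return pickle.loads(path.read_bytes()) if path.exists() else None

# Usage: instance B finds the cache instance A computed for the same prefix.
store = FileKVStore(Path(tempfile.mkdtemp()))
store.save("shared prefix", [[1.0, 2.0], [3.0, 4.0]])  # stand-in for KV tensors
cached = store.load("shared prefix")
```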
Check model compatibility (low-level):

```python
from avp import extract_model_identity, CompatibilityResolver

local = extract_model_identity(model_a)
remote = extract_model_identity(model_b)
session = CompatibilityResolver.resolve(local, remote)
# session.mode → LATENT (same model) or JSON (different)
```
## Requirements
- Self-hosted models only. AVP needs direct access to model weights, KV-cache, and hidden states. Cloud APIs (OpenAI, Anthropic, Google) don't expose these internals — AVP cannot work with them.
- Same model on all agents for latent transfer. Same-family models (e.g. Qwen2.5-1.5B ↔ 0.5B) are supported via cross-model projection. Different families fall back to JSON automatically.
- GPU recommended. Benchmarks run on NVIDIA RTX 3070 Ti (8GB VRAM). CPU works but is significantly slower.
- Datacenter bandwidth for cross-machine transfer (28-130 MB per hop at fp32). Same-machine or shared-memory is ideal.
- Python 3.9+, PyTorch 2.0+, HuggingFace Transformers 4.36+ (for latent features).
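The per-hop wire size can be estimated from the model config. The numbers below are assumptions for Qwen2.5-1.5B (28 layers, 2 KV heads via grouped-query attention, head_dim 128); check your model's `config.json` before relying on them:

```python
# Back-of-envelope KV-cache wire size at fp32.
# Config values are ASSUMED for Qwen2.5-1.5B — verify against config.json.

layers, kv_heads, head_dim = 28, 2, 128
bytes_per_value = 4  # fp32, matching the bandwidth note above

# Keys AND values (factor of 2), across all layers, per cached token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value

seq_len = 1024
total_mb = seq_len * bytes_per_token / 1e6
print(f"{bytes_per_token} bytes/token, ~{total_mb:.0f} MB for {seq_len} tokens")
```

Under these assumptions a 1024-token context lands around 59 MB, inside the 28-130 MB per-hop range quoted above; fp16 or compression would halve it or better.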
## When NOT to Use AVP

- Cloud API models (OpenAI `gpt-4o`, Anthropic `claude`, Google `gemini`) — no KV-cache access available
- Single-agent applications — no inter-agent communication to optimize
- Different model families without a shared tokenizer (e.g. Llama ↔ Qwen) — falls back to JSON, no latent benefit
- Low-bandwidth cross-machine links (<1 Gbps) — 28-130 MB per hop makes latent transfer impractical over the internet
- Edge or mobile deployment — requires GPU and significant VRAM (1-2 GB for 1.5B-3B models)
## How It Works

AVP defines a binary format, handshake, and codec — not the transport. It works alongside any agent protocol (LangChain, CrewAI, AutoGen, or custom).

```
┌──────────────────────────────────────────────────────────────┐
│ Your Orchestrator (LangChain / CrewAI / AutoGen / custom)    │
│                                                              │
│   Agent A                              Agent B               │
│     │                                    ▲                   │
│     │ connector.think()      ──►         │ connector.generate()
│     │ AVPContext                         │ with context=...  │
│     │                                    │                   │
│     │ context.to_bytes()                 │ AVPContext.from_bytes()
│     ▼                                    │                   │
│   ┌────────────────────────────────────────────┐             │
│   │ AVP (this library)                         │             │
│   │  • Handshake — resolves LATENT/JSON mode   │             │
│   │  • Codec — serialize/deserialize KV-cache  │             │
│   │  • Session — TTL, thread safety            │             │
│   └────────────────────────────────────────────┘             │
│                        │                                     │
│   Transport: HTTP/2, gRPC, shared memory, file, any          │
└──────────────────────────────────────────────────────────────┘
```
Three communication modes, auto-negotiated via handshake:
| Mode | When | What Happens |
|---|---|---|
| Latent | Same model | KV-cache + hidden state transfer, zero re-processing |
| Cross-model | Same family (e.g. Qwen2.5-1.5B ↔ 0.5B) | Vocabulary-mediated projection (Rosetta Stone v2), no training needed |
| JSON fallback | Incompatible models | Standard text, auto-negotiated |
Transport-agnostic: HTTP/2 (reference), gRPC, A2A, MCP, WebSockets, shared memory. AVP handles the latent communication layer — not discovery, routing, or orchestration.
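The negotiation logic in the table above can be sketched in a few lines. The `Identity` dataclass and `resolve` function below are hypothetical stand-ins for AVP's `ModelIdentity` and `CompatibilityResolver`, reduced to the decision structure:

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    LATENT = "latent"            # same model: direct KV-cache transfer
    CROSS_MODEL = "cross_model"  # same family: vocabulary-mediated projection
    JSON = "json"                # incompatible: plain-text fallback

@dataclass(frozen=True)
class Identity:
    """Hypothetical stand-in for AVP's ModelIdentity (family, dims, hash)."""
    family: str
    hidden_dim: int
    weights_hash: str

def resolve(local: Identity, remote: Identity) -> Mode:
    if local.weights_hash == remote.weights_hash:
        return Mode.LATENT
    if local.family == remote.family:
        return Mode.CROSS_MODEL
    return Mode.JSON

a = Identity("qwen2.5", 1536, "abc")   # toy identities
b = Identity("qwen2.5", 896, "def")
print(resolve(a, a), resolve(a, b), resolve(a, Identity("llama3", 3072, "x")))
```

The real handshake also carries session metadata and a validation step; this sketch only captures the three-way branch.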
## Features

### Protocol
- Binary codec with 12-byte header + protobuf metadata
- KV-cache serialization (DynamicCache, tuple format)
- Session management with TTL and thread safety
- zstd compression
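A fixed 12-byte header is a natural fit for `struct`. The field layout below (magic, version, mode, flags, payload length) is a hypothetical illustration; the actual field order and widths are defined in the AVP specification:

```python
import struct

# HYPOTHETICAL 12-byte header layout — the real one lives in the AVP spec:
# magic (4s), version (B), mode (B), flags (H), payload length (I), big-endian.
HEADER = struct.Struct(">4sBBHI")
assert HEADER.size == 12

def pack_header(version: int, mode: int, flags: int, payload_len: int) -> bytes:
    return HEADER.pack(b"AVP0", version, mode, flags, payload_len)

def unpack_header(raw: bytes):
    magic, version, mode, flags, payload_len = HEADER.unpack(raw[:12])
    if magic != b"AVP0":
        raise ValueError("not an AVP frame")
    return version, mode, flags, payload_len

raw = pack_header(version=1, mode=0, flags=0, payload_len=57344)
print(unpack_header(raw))  # (1, 0, 0, 57344)
```

The protobuf metadata and zstd-compressed tensor payload would follow the header on the wire.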
### Connectors
- HuggingFace Transformers (full hidden state + KV-cache access)
- vLLM (KVConnectorBase_V1 plugin + SDK wrapper + PagedAttention conversion)
### Cross-Model (Rosetta Stone v2)
- Vocabulary-mediated projection for same-family models
- Two-tier projection validation (cosine similarity + pseudo-perplexity)
- HYBRID mode (KV-cache + text summary fallback)
### Benchmarks
- GSM8K 4-agent chain (3 model families)
- 2-agent handoff (most common real-world pattern)
- HotpotQA multi-hop QA (reading comprehension transfer)
- Fan-out aggregation (parallel specialists)
## Roadmap
- CacheGen-style compression (3-4x wire size reduction)
- SGLang connector
- Larger model validation (7B+)
## Works With

- vLLM — KVConnectorBase_V1 plugin for production serving
- HuggingFace Transformers — Full hidden state and KV-cache access
- A2A — Transport binding via `multipart/related` with binary payloads
- MCP — Complementary: MCP handles tools and context, AVP handles tensor transfer
## API Reference

### High-Level API (most users)

| Import | What It Does |
|---|---|
| `HuggingFaceConnector` | Main connector. `think()` builds KV-cache (returns `AVPContext`), `generate()` produces text. `from_pretrained()` for easy setup. |
| `VLLMConnector` | Production connector. `generate()` returns text. Latent transfer happens at engine level via KV connector plugin. |
| `AVPContext` | Wraps KV-cache + model metadata. Pass between `think()` and `generate()`, or serialize with `to_bytes()` / `from_bytes()` for cross-process transfer. |
### Protocol Layer

| Import | What It Does |
|---|---|
| `encode` / `decode` | Binary codec for hidden states, KV-cache, and hybrid payloads. |
| `extract_model_identity` | Extract `ModelIdentity` (family, dimensions, hash) from a HuggingFace model. |
| `CompatibilityResolver.resolve()` | Handshake: compares two `ModelIdentity` objects, returns LATENT, HYBRID, or JSON mode. |
| `SessionManager` | Manage communication sessions with TTL and thread safety. |
| `AVPClient` / `AVPAsyncClient` | HTTP/2 client (sync and async) for sending AVP messages over the network. |
| `create_app` | Create a FastAPI server that receives AVP messages. |
### Cross-Model (Rosetta Stone)

| Import | What It Does |
|---|---|
| `calibrate` | Build a projection map (`AVPMap`) between two models for cross-model transfer. |
| `vocabulary_mediated_projection` | Project hidden states from source model to target model using shared vocabulary as a bridge. |
| `validate_projection` | Quality gate: cosine similarity (fast) + pseudo-perplexity (thorough). Returns LATENT/HYBRID/JSON recommendation. |
| `save_map` / `load_map` / `find_map` | Persist and retrieve `.avp-map` files for reuse. |
### Error Types

All errors inherit from `AVPError`. Key types: `IncompatibleModelsError`, `HandshakeError`, `DecodeError`, `ShapeMismatchError`, `RealignmentError`, `SessionExpiredError`, `EngineNotAvailableError`, `FallbackRequested`.
## Benchmarks
| Benchmark | Latent Accuracy | Text Accuracy | Token Savings | Speed vs Text |
|---|---|---|---|---|
| GSM8K 4-agent (Llama 3.2-3B) | 70% | 65% | 74% | 2.1x |
| 2-agent handoff (Qwen 1.5B) | 55% | 55% | 57% | 2.0x |
| HotpotQA (Qwen 1.5B) | 35% EM | 20% EM | 19% | 4.3x |
| Fan-out (Qwen 1.5B) | 30% | 60% | 62% | 2.2x |
HotpotQA is the standout: latent transfer preserves reading comprehension better than text summaries, beating both text chains and single-agent direct on exact match and F1.
Full methodology, per-hop analysis, cost projections, and raw data: docs/BENCHMARKS.md
## Install

```bash
# Core SDK (codec, handshake, session, fallback)
pip install avp

# With latent communication (realignment, KV-cache, HuggingFace connector)
pip install "avp[latent]"

# With HTTP/2 transport server
pip install "avp[server]"

# Everything including dev tools
pip install "avp[all]"
```

From source:

```bash
git clone https://github.com/VectorArc/avp-python.git
cd avp-python
pip install -e ".[all]"
```
## Documentation
- AVP Specification — Binary format, handshake, transport, security, test vectors
- Benchmark Results — Full results across 4 benchmarks and 3 model families
- Examples — Agent demo, mixed-model demo, quickstart
- Contributing — Dev setup, tests, code style
## Key Concepts
| Term | What It Means |
|---|---|
| KV-cache | During text generation, each transformer layer computes key and value vectors for the attention mechanism. These are cached so they don't need to be recomputed for each new token. AVP transfers this cache between agents so the receiving agent doesn't recompute what the sender already processed. |
| Hidden states | The internal vector representations at each transformer layer — the model's "understanding" of the input at that point in the network. Richer than text because they carry information that gets lost when converting to tokens. |
| Latent transfer | Sending KV-cache or hidden states (the "latent" internal representations) instead of converting to text and back. Avoids the lossy text bottleneck. |
| Realignment | Normalizing hidden states before injecting them into another model instance, so they match the expected input distribution. Required because hidden state magnitudes can drift. |
| Tied weights | When a model reuses the same weight matrix for both input embeddings and output projection (common in smaller models like Qwen <=3B, Llama 3.2 <=3B). Requires a special softmax-based projection instead of simple normalization. |
| Vocabulary-mediated projection | Cross-model transfer technique: convert source hidden states to token probabilities using the source model's output head, then reconstruct target-compatible representations using the target model's input embeddings. Works for same-family models that share a tokenizer. |
| PagedAttention | vLLM's memory management for KV-cache — stores cache in non-contiguous pages. AVP's page_convert module handles conversion between paged and contiguous formats. |
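The vocabulary-mediated projection described above can be sketched with toy matrices. Shapes and names are invented for illustration; the real implementation adds calibration (`calibrate`) and a validation gate (`validate_projection`):

```python
import numpy as np

def vocab_mediated_projection(h_src, W_out_src, E_in_tgt):
    """Toy sketch: source hidden state -> probabilities over the shared
    vocabulary (via the source output head) -> target-space vector
    (via the target model's input embeddings)."""
    logits = h_src @ W_out_src.T                 # (vocab,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over shared vocab
    return probs @ E_in_tgt                      # (d_tgt,)

rng = np.random.default_rng(0)
vocab, d_src, d_tgt = 16, 8, 4                   # toy sizes, not real models
h_src = rng.standard_normal(d_src)
W_out_src = rng.standard_normal((vocab, d_src))  # stand-in source LM head
E_in_tgt = rng.standard_normal((vocab, d_tgt))   # stand-in target embeddings
h_tgt = vocab_mediated_projection(h_src, W_out_src, E_in_tgt)
print(h_tgt.shape)  # (4,)
```

The shared vocabulary is what makes the bridge possible, which is why this only works for same-family models with a common tokenizer.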
## Research Foundation
AVP builds on LatentMAS: Latent Collaboration in Multi-Agent Systems (Gen-Verse, 2025), which demonstrated same-model latent communication via hidden state transfer and KV-cache sharing. AVP productionizes this into a transport-agnostic binary protocol with cross-model support, compression, and engine connectors.
## License
Apache 2.0 — see LICENSE