Agent Vector Protocol (AVP) — KV-Cache Transfer for Multi-Agent LLMs
Multi-agent text handoffs discard KV-cache and attention state. AVP transfers that state directly — 46-78% fewer tokens, 2-4x faster, across models and families. Built on LatentMAS (2025).
```shell
pip install avp
```
Self-hosted models on GPUs only. AVP needs access to model internals (KV-cache, hidden states) that cloud APIs don't expose. If you use OpenAI, Anthropic, or Google APIs — AVP can't help you. Good fit: multi-agent pipelines on vLLM or HuggingFace Transformers with datacenter or same-machine connectivity.
Quick Start
```python
from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
prompt = "Analyze this math problem: 24 * 17 + 3"

# Agent A: latent reasoning (no text output, builds KV-cache)
context = connector.think(prompt, steps=10)

# Agent B: generate with Agent A's context
answer = connector.generate(prompt, context=context)
```
Results
| Benchmark | Direct | Latent (AVP) | Text |
|---|---|---|---|
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
| GSM8K (Llama 3B, n=200) | 75.0% | 78.0% | 75.5% |
+8.6pp on code generation (p=0.029). 46-78% fewer tokens. 2-4x faster. Tested on NVIDIA A100.
Cross-model (zero training, 6 KB wire):
| Source → Target | GSM8K | HumanEval |
|---|---|---|
| Qwen 7B → Llama 3B | 74.5% | 47.0% |
| Llama 3B → Qwen 7B | 90.0% | 79.3% |
Full results: Benchmarks — 8 benchmarks, 5 models, 2 families.
How It Works
```mermaid
graph LR
    subgraph text["Text Chain (today)"]
        direction LR
        A1["Agent A<br/>generates text"] -->|"serialize to text<br/>re-tokenize everything"| B1["Agent B<br/>re-processes from scratch"]
    end
    subgraph avp["AVP Latent Transfer"]
        direction LR
        A2["Agent A<br/>generates KV-cache"] -->|"binary transfer<br/>28-130 MB"| B2["Agent B<br/>picks up where A left off"]
    end
    style text fill:#fff3f3,stroke:#d44,stroke-width:2px
    style avp fill:#f3fff3,stroke:#4a4,stroke-width:2px
```
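The 28-130 MB transfer size in the diagram is plain KV-cache arithmetic. A back-of-envelope sketch, assuming Qwen2.5-7B's published configuration (28 layers, 4 KV heads under GQA, head dim 128, fp16); these numbers come from the model card, not from AVP:

```python
# Back-of-envelope KV-cache size for a GQA model. Defaults assume
# Qwen2.5-7B: 28 layers, 4 KV heads, head_dim 128, 2 bytes/value (fp16).
def kv_cache_bytes(seq_len: int, n_layers: int = 28, n_kv_heads: int = 4,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

print(kv_cache_bytes(512) / 2**20)   # ~28 MB at 512 tokens
print(kv_cache_bytes(2400) / 2**20)  # ~131 MB at ~2.4k tokens
```

At 512 tokens this lands at the low end of the quoted range; a roughly 2.4k-token context reaches the high end.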
AVP transfers the KV-cache (computed attention states) directly between agents. The receiving agent reads prior reasoning from attention states instead of re-computing it from text. Three modes, auto-negotiated:
| Mode | When | What Happens |
|---|---|---|
| Latent | Same model | KV-cache transfer, zero re-processing |
| Cross-model | Same or different family | Vocabulary-mediated projection, zero training |
| JSON fallback | No compatible path | Standard text, auto-negotiated |
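The negotiation itself is not shown in this README; as a purely illustrative sketch (function and argument names are hypothetical, the real handshake is defined in the AVP Specification), the three-way choice in the table reduces to:

```python
# Hypothetical sketch of the auto-negotiation described above. The real
# handshake lives in the AVP spec; these names are illustrative only.
def negotiate_mode(source_model: str, target_model: str,
                   target_supports_latent: bool = True) -> str:
    if not target_supports_latent:
        return "json"          # fallback: standard text handoff
    if source_model == target_model:
        return "latent"        # same model: raw KV-cache, zero re-processing
    return "cross-model"       # vocabulary-mediated projection, zero training

print(negotiate_mode("qwen2.5-7b", "qwen2.5-7b"))  # latent
```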
Cross-model transfer
```python
from avp import HuggingFaceConnector

researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

prompt = "Solve step by step: 24 * 17 + 3"
context = researcher.think(prompt, steps=10)
answer = solver.generate(prompt, context=context, source=researcher)
```
Cross-model calibration runs once per model pair (~0.5-2 s); the resulting map is cached to `~/.avp/maps/`.
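How the cache under `~/.avp/maps/` is keyed is not documented here; a stdlib-only sketch of one plausible scheme (the file-name format is hypothetical) keys the map on the ordered (source, target) pair, so each direction gets its own entry:

```python
import hashlib
from pathlib import Path

# Hypothetical cache-key sketch: AVP's actual on-disk layout may differ.
def calibration_map_path(source_id: str, target_id: str,
                         root: Path = Path.home() / ".avp" / "maps") -> Path:
    # Order matters: Qwen->Llama and Llama->Qwen need distinct maps.
    pair = f"{source_id}->{target_id}"
    digest = hashlib.sha256(pair.encode()).hexdigest()[:16]
    return root / f"{digest}.pt"

p = calibration_map_path("Qwen/Qwen2.5-7B-Instruct",
                         "meta-llama/Llama-3.2-3B-Instruct")
print(p.name)
```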
Easy API (convenience wrappers)
```python
import avp

# One-liner: think + generate
answer = avp.generate("Solve: 24 * 17 + 3", model="Qwen/Qwen2.5-7B-Instruct")

# Cross-model
answer = avp.generate(
    "Solve: 24 * 17 + 3",
    model="meta-llama/Llama-3.2-3B-Instruct",
    source_model="Qwen/Qwen2.5-7B-Instruct",
)
```
vLLM integration (experimental)
Status: Experimental. `VLLMConnector` works for text generation and identity extraction. The KV connector plugin (`AVPKVConnectorV1Dynamic`) for latent KV-cache transfer between vLLM instances has not been validated end-to-end and has known issues with PagedAttention format conversion. Use `HuggingFaceConnector` for production latent transfer. See CHANGELOG for details.
```python
from avp import VLLMConnector

connector = VLLMConnector(model_id="Qwen/Qwen2.5-7B-Instruct")
answer = connector.generate("Analyze and solve: 24 * 17 + 3")
```
Cross-process transfer
```python
# Process A: serialize context
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B: restore and generate
from avp import AVPContext, HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate(prompt, context=restored)
```
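Since the serialized context is opaque bytes, any byte transport works between processes. A minimal stdlib sketch using a length-prefixed local TCP socket; the payload here is a placeholder standing in for real AVP wire bytes:

```python
import socket
import threading

def _recv_exact(conn: socket.socket, n: int) -> bytes:
    # Loop until exactly n bytes arrive (recv may return partial reads).
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf

def send_context(conn: socket.socket, payload: bytes) -> None:
    # 8-byte big-endian length prefix, then the raw wire bytes.
    conn.sendall(len(payload).to_bytes(8, "big") + payload)

def recv_context(conn: socket.socket) -> bytes:
    size = int.from_bytes(_recv_exact(conn, 8), "big")
    return _recv_exact(conn, size)

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

received = {}

def agent_b():
    conn, _ = server.accept()
    received["ctx"] = recv_context(conn)  # feed this to AVPContext.from_bytes
    conn.close()

t = threading.Thread(target=agent_b)
t.start()
client = socket.socket()
client.connect(("127.0.0.1", port))
send_context(client, b"\x00avp-wire-bytes-placeholder")
client.close()
t.join()
server.close()
```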
Works With
AVP works with your orchestration framework, not instead of it. Replace `llm.invoke()` with `avp.generate()` — your framework sees text in, text out.
| Framework | Integration |
|---|---|
| LangGraph | Graph node — avp.generate() replaces LLM call |
| CrewAI | BaseLLM.call() override |
| PydanticAI | FunctionModel callback |
| LlamaIndex | CustomLLM.complete() override |
| vLLM | KVConnectorBase_V1 plugin (experimental — text generation works, latent transfer in progress) |
| HuggingFace | Full hidden state and KV-cache access |
| A2A / MCP | Complementary — AVP handles tensor transfer |
See Framework Integration Guide for examples.
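The drop-in pattern from the table above can be sketched framework-agnostically: an adapter exposing a single `call(prompt) -> str` that delegates to `avp.generate`. Everything here is illustrative; the `avp_generate` callable is injected so the sketch stays self-contained and runnable without a GPU:

```python
# Generic "text in, text out" adapter: the framework calls `call()` like any
# LLM; AVP handles latent transfer behind the scenes. In real use,
# `avp_generate` would be avp.generate and kwargs could carry source_model.
class AVPAdapter:
    def __init__(self, avp_generate, model: str, **kwargs):
        self._generate = avp_generate
        self._model = model
        self._kwargs = kwargs

    def call(self, prompt: str) -> str:
        # The framework sees an ordinary string completion.
        return self._generate(prompt, model=self._model, **self._kwargs)

# Stub generate function stands in for avp.generate in this sketch.
llm = AVPAdapter(lambda p, model: f"[{model}] answer to: {p}",
                 model="Qwen/Qwen2.5-7B-Instruct")
print(llm.call("Solve: 24 * 17 + 3"))
```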
Roadmap
- Bidirectional latent communication (A→B + B→A latent)
- vLLM serving throughput benchmarks
- CacheGen-style compression (3-4x KV-cache size reduction)
Documentation
- AVP Specification — Binary format, handshake, transport
- Benchmarks — 8 benchmarks, 5 models, 2 families
- Framework Integration — LangGraph, CrewAI, cross-model examples
- Examples — Quickstart and agent demos
License
Apache 2.0 — see LICENSE
Project details
Release history Release notifications | RSS feed
Download files
Source Distribution
Built Distribution
File details
Details for the file avp-0.3.0.tar.gz.
File metadata
- Download URL: avp-0.3.0.tar.gz
- Upload date:
- Size: 247.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `d48ca0b7cb5363060f855686ab54f17b94295228e816bc3b251de795e19345f8` |
| MD5 | `2c34b598eb3353c0f29d0bcf5025cd39` |
| BLAKE2b-256 | `c3f854fd55942bdbb2129e82776c45e9f1530e2adc0a98cc5937f34580c4928d` |
Provenance
The following attestation bundles were made for avp-0.3.0.tar.gz:

Publisher: publish.yml on VectorArc/avp-python

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: avp-0.3.0.tar.gz
- Subject digest: d48ca0b7cb5363060f855686ab54f17b94295228e816bc3b251de795e19345f8
- Sigstore transparency entry: 1058208351
- Sigstore integration time:
- Permalink: VectorArc/avp-python@33e871013ba8abc635217b3e7d4d193e43cf7ac8
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/VectorArc
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@33e871013ba8abc635217b3e7d4d193e43cf7ac8
- Trigger Event: push
File details
Details for the file avp-0.3.0-py3-none-any.whl.
File metadata
- Download URL: avp-0.3.0-py3-none-any.whl
- Upload date:
- Size: 81.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e0e0516bb10d55cd5ddd4142a8f36715732d0b11ff30c898983d5dc65d0cac22` |
| MD5 | `612210386038403a503f3f4c70f9e4c5` |
| BLAKE2b-256 | `c8f95d21045541382fddcb9bfc35c205788e3690f71bd9b0806a17fd41544823` |
Provenance
The following attestation bundles were made for avp-0.3.0-py3-none-any.whl:

Publisher: publish.yml on VectorArc/avp-python

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: avp-0.3.0-py3-none-any.whl
- Subject digest: e0e0516bb10d55cd5ddd4142a8f36715732d0b11ff30c898983d5dc65d0cac22
- Sigstore transparency entry: 1058208536
- Sigstore integration time:
- Permalink: VectorArc/avp-python@33e871013ba8abc635217b3e7d4d193e43cf7ac8
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/VectorArc
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@33e871013ba8abc635217b3e7d4d193e43cf7ac8
- Trigger Event: push