# AVP – Agents Share Thoughts, Not Text

Multi-agent text handoffs discard KV-cache and attention state. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster.
When LLM agents hand off work as text, the next agent re-processes everything from scratch. AVP (Agent Vector Protocol) transfers the actual computation (KV-cache, hidden states, attention) so the receiving agent picks up where the sender left off. Zero tokens between agents, 2-3x faster pipelines, same or better accuracy. Built on LatentMAS, extended with cross-model vocabulary-mediated projection. Zero training, works across model families.
```bash
pip install avp[hf]
```

Requires self-hosted models on GPUs. AVP accesses model internals (KV-cache, hidden states) that cloud APIs don't expose. Other engines: `avp[ollama]`, `avp[llamacpp]`, `avp[vllm]` – see Works With.
## Quick Start

Same model – two agents share a KV-cache:

```python
from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Agent A thinks (builds KV-cache, no text output)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)

# Agent B generates using Agent A's KV-cache
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)
```
Cross-model – different architectures, zero training:

```python
researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

context = researcher.think("Analyze this problem", steps=20)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)
```
Cross-process – serialize context over any transport:

```python
from avp import AVPContext

# Process A
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B
restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate(prompt, context=restored)
```
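Since the wire format is plain bytes, "any transport" really is any byte channel. A minimal sketch, pushing a placeholder payload through a local socket pair; the length-prefix framing here is ours, not part of AVP:

```python
import socket
import struct

# Stand-in for the output of context.to_bytes(); in real use this
# would be the serialized AVP context.
wire_bytes = b"\x00avp-context-placeholder"

# Two connected endpoints standing in for Process A and Process B.
a, b = socket.socketpair()

# Process A: send a 4-byte big-endian length, then the payload.
a.sendall(struct.pack("!I", len(wire_bytes)) + wire_bytes)

# Process B: read the length, then loop until the full payload arrives.
(length,) = struct.unpack("!I", b.recv(4))
received = b""
while len(received) < length:
    received += b.recv(length - len(received))

a.close()
b.close()
print(received == wire_bytes)  # True
```

The same framing works unchanged over TCP, a pipe, or a message queue; only the endpoint objects differ.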
You don't choose the transfer mode. The handshake auto-negotiates based on model compatibility: same model → full KV-cache, different models → vocabulary-mediated projection (~6 KB), incompatible models → JSON text fallback.
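That negotiation order can be sketched as a cascade. This is a simplified illustration, collapsing the handshake to two signals (model hash and vocabulary overlap); the function and its inputs are hypothetical, not the AVP API:

```python
def negotiate_mode(src, dst):
    """Pick a transfer mode from coarse model-compatibility signals."""
    if src["model_hash"] == dst["model_hash"]:
        return "latent"              # same weights: ship the full KV-cache
    shared = src["vocab"] & dst["vocab"]
    if len(shared) >= 100:           # enough shared BPE tokens to project through
        return "cross-model"         # vocabulary-mediated projection (~6 KB)
    return "json"                    # no projection path: plain-text fallback

# Toy agent descriptors with made-up hashes and vocabularies.
qwen7b = {"model_hash": "a1", "vocab": set(range(150_000))}
qwen3b = {"model_hash": "b2", "vocab": set(range(150_000))}
other  = {"model_hash": "c3", "vocab": set(range(150_000, 150_050))}

print(negotiate_mode(qwen7b, qwen7b))  # latent
print(negotiate_mode(qwen7b, qwen3b))  # cross-model
print(negotiate_mode(qwen7b, other))   # json
```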
## Results

Direct = single model, no pipeline. Latent = AVP transfer. Text Chain = standard text handoff between agents.

| Benchmark | Direct | Latent (AVP) | Text Chain |
|---|---|---|---|
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
| GSM8K (Llama 3B, n=200) | 74.5% | 76.0% | 79.0% |
HumanEval: +12.4pp vs text across 4 seeds (p=0.004). GSM8K and DebugBench: neutral across all modes, but the pipeline runs 3x faster (7.6s vs 22.8s end-to-end on DebugBench). Llama 3B: text wins on GSM8K; latent overhead has more impact on smaller models. All benchmarks used steps=20 on NVIDIA A100.
Trade-off: 20 latent steps cost ~0.9s on A100. If Agent A would normally generate 22+ tokens of text, latent is faster.
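The break-even point is just the fixed latent cost divided by per-token decode latency. A quick sketch; the latencies below are illustrative assumptions, not measurements:

```python
# Latent transfer has a fixed cost (~0.9 s for 20 steps on an A100),
# so it wins whenever the sender would otherwise decode more than
# 0.9 / t tokens of text, for per-token latency t.
LATENT_COST_S = 0.9

def break_even_tokens(per_token_s):
    """Text-token count at which latent transfer becomes cheaper."""
    return LATENT_COST_S / per_token_s

for ms in (30, 41, 60):
    print(f"{ms} ms/token -> break-even at {break_even_tokens(ms / 1000):.0f} tokens")
```

At roughly 41 ms/token the break-even lands near the 22 tokens quoted above; faster decoding pushes it higher, slower decoding lower.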
Cross-model (zero training):
| Source → Target | GSM8K (Rosetta / Text) | HumanEval (Rosetta / Text) |
|---|---|---|
| Qwen 7B → Qwen 3B | 82.5% / 88.5% | 66.5% / 62.2% |
| Qwen 7B → Llama 3B | 77.0% / 86.5% | 47.0% / 57.9% |
| Llama 3B → Qwen 7B | 90.0% / 82.0% | 79.3% / 61.6% |
Target solo baselines (GSM8K / HumanEval): Qwen 3B = 82.5% / 61.0%, Llama 3B = 76.0% / 50.6%, Qwen 7B = 91.0% / 58.5%.
Full results: Benchmarks – 7 benchmarks, 5 models, 2 families, reproducible.
## How It Works

AVP auto-negotiates the transfer mode via a handshake at connection time. You write the same `think()` / `generate()` code regardless of which mode is selected:
| Mode | When | What transfers | Size |
|---|---|---|---|
| Latent | Same model | Full KV-cache | ~390 MB for 7B |
| Cross-model | Different model or family | Projected hidden state via shared vocabulary | ~6 KB |
| JSON fallback | No compatible projection path | Plain text | Varies |
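The "~390 MB" entry is roughly what a KV-cache costs at a few thousand tokens of context. A back-of-envelope sizing, assuming Qwen2.5-7B-style GQA geometry (28 layers, 4 KV heads, head_dim 128, fp16); these architecture numbers are our assumption, not taken from AVP's docs:

```python
def kv_cache_bytes(tokens, layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    """Bytes of KV-cache: keys + values, across all layers and KV heads."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)
print(per_token)                       # 57344 bytes, ~56 KB per token
print(kv_cache_bytes(7000) / 1e6)      # ~401 MB at ~7k tokens of context
```

Under these assumptions the full-cache mode is roughly five orders of magnitude larger than the ~6 KB projected state, which is why cross-model transfer is cheap on the wire.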
The handshake checks model hash → structural match → shared tokenizer → vocabulary overlap (≥100 BPE tokens) → JSON. You never configure this manually.
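For intuition only, here is a toy of what "projection via shared vocabulary" can mean; the actual AVP projection may differ. The source expresses its hidden state as a distribution over tokens both models share, and the target rebuilds a state in its own embedding space from that distribution; all shapes and matrices below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
shared_vocab = 128                     # tokens both tokenizers share
d_src, d_tgt = 16, 8                   # different hidden sizes per model

E_src = rng.standard_normal((shared_vocab, d_src))  # source unembedding rows
E_tgt = rng.standard_normal((shared_vocab, d_tgt))  # target embedding rows

h_src = rng.standard_normal(d_src)     # sender's final hidden state

logits = E_src @ h_src                 # score each shared token
p = np.exp(logits - logits.max())
p /= p.sum()                           # softmax over the shared vocab

h_tgt = p @ E_tgt                      # target-space state, shape (d_tgt,)
print(h_tgt.shape)                     # (8,)
```

Only the distribution over shared tokens crosses the wire, which is why the payload stays small regardless of either model's hidden size.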
## Works With

### Engines

| Engine | Latent Pipeline | Cross-model |
|---|---|---|
| HuggingFace `avp[hf]` | Full think/generate | Yes |
| Ollama `avp[ollama]` | Full think/generate, auto-resolves GGUF | Yes |
| llama.cpp `avp[llamacpp]` | Full think/generate on GGUF | Yes |
| vLLM `avp[vllm]` | KV connector + model plugin | Yes |
### Frameworks

| Framework | Integration | Extra |
|---|---|---|
| LangChain | `ChatAVP` (BaseChatModel) | `avp[langchain]` |
| CrewAI | `AVPLLM` (BaseLLM) | `avp[crewai]` |
| AutoGen | `AVPChatCompletionClient` | `avp[autogen]` |
| A2A / MCP | Complementary: AVP handles tensor transfer, they handle routing | – |
See Framework Integration Guide for per-engine code examples.
## Roadmap
- Bidirectional latent communication (both agents share thinking, not just one)
- CacheGen-style KV-cache compression (3-4x reduction)
## Documentation
- AVP Specification – binary format, handshake, transport
- Benchmarks – 7 benchmarks, 5 models, 2 families
- Framework Integration – engines, frameworks, per-engine examples
- Examples – quickstart, cross-model, and agent demos
- CHANGELOG
## License
Apache 2.0 – see LICENSE