Multi-agent text handoffs discard KV-cache and attention state. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster.

Project description

AVP – Agents Share Thoughts, Not Text

When LLM agents hand off work as text, the next agent re-processes everything from scratch. AVP (Agent Vector Protocol) transfers the actual computation (KV-cache, hidden states, attention) so the receiving agent picks up where the sender left off. Zero tokens between agents, 2-3x faster pipelines, same or better accuracy. Built on LatentMAS, extended with cross-model vocabulary-mediated projection. Zero training, works across model families.

pip install avp[hf]

Requires self-hosted models on GPUs. AVP accesses model internals (KV-cache, hidden states) that cloud APIs don't expose. Other engines: avp[ollama], avp[llamacpp], avp[vllm] – see Works With.

Quick Start

Same model – two agents share a KV-cache:

from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Agent A thinks (builds KV-cache, no text output)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)

# Agent B generates using Agent A's KV-cache
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)

Cross-model – different architectures, zero training:

researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

context = researcher.think("Analyze this problem", steps=20)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)

Cross-process – serialize context over any transport:

# Process A
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B
from avp import AVPContext

restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate(prompt, context=restored)
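Since the wire format is opaque bytes, any transport works. A minimal sketch of length-prefixed framing over a byte stream — the `frame`/`unframe` helpers are illustrative, not part of AVP's API:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(payload)) + payload

def unframe(stream: bytes) -> bytes:
    """Read one length-prefixed message from the start of a byte stream."""
    (length,) = struct.unpack(">I", stream[:4])
    return stream[4 : 4 + length]

# Stand-in for the output of context.to_bytes(...)
wire_bytes = b"avp-context-payload"

sent = frame(wire_bytes)    # Process A: write to a socket, queue, or file
received = unframe(sent)    # Process B: read one message back
assert received == wire_bytes
```

The same framing works over TCP sockets, message queues, or shared files; only the read/write calls change.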

You don't choose the transfer mode. The handshake auto-negotiates based on model compatibility: same model → full KV-cache, different models → vocabulary-mediated projection (~6 KB), incompatible models → JSON text fallback.

Results

Direct = single model, no pipeline. Latent = AVP transfer. Text Chain = standard text handoff between agents.

| Benchmark | Direct | Latent (AVP) | Text Chain |
|---|---|---|---|
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
| GSM8K (Llama 3B, n=200) | 74.5% | 76.0% | 79.0% |

HumanEval: +12.4pp vs text across 4 seeds (p=0.004). GSM8K and DebugBench: neutral across all modes, but the pipeline runs 3x faster (7.6s vs 22.8s end-to-end on DebugBench). Llama 3B: text wins on GSM8K; latent overhead has more impact on smaller models. All benchmarks used steps=20 on NVIDIA A100.

Trade-off: 20 latent steps cost ~0.9s on A100. If Agent A would normally generate 22+ tokens of text, latent is faster.
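That break-even point follows from simple arithmetic. Assuming ~45 ms per latent step and ~40 ms per decoded text token — illustrative numbers consistent with the 0.9 s figure above, not measured constants:

```python
LATENT_STEP_MS = 45   # assumed per-step cost on A100 (20 steps ≈ 0.9 s)
TEXT_TOKEN_MS = 40    # assumed per-token decode latency

latent_cost_ms = 20 * LATENT_STEP_MS          # 900 ms for steps=20
break_even = latent_cost_ms / TEXT_TOKEN_MS   # 22.5 tokens

print(f"latent cost: {latent_cost_ms} ms, break-even at ~{break_even:.0f} tokens")
```

Past roughly 22 tokens of would-be text output, the latent handoff is the cheaper path; below that, plain text wins.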

Cross-model (zero training):

| Source → Target | GSM8K (Rosetta / Text) | HumanEval (Rosetta / Text) |
|---|---|---|
| Qwen 7B → Qwen 3B | 82.5% / 88.5% | 66.5% / 62.2% |
| Qwen 7B → Llama 3B | 77.0% / 86.5% | 47.0% / 57.9% |
| Llama 3B → Qwen 7B | 90.0% / 82.0% | 79.3% / 61.6% |

Target solo baselines (GSM8K / HumanEval): Qwen 3B = 82.5% / 61.0%, Llama 3B = 76.0% / 50.6%, Qwen 7B = 91.0% / 58.5%.

Full results: Benchmarks – 7 benchmarks, 5 models, 2 families, reproducible.

How It Works

How AVP works

AVP auto-negotiates the transfer mode via a handshake at connection time. You write the same think() / generate() code regardless of which mode is selected:

| Mode | When | What transfers | Size |
|---|---|---|---|
| Latent | Same model | Full KV-cache | ~390 MB for 7B |
| Cross-model | Different model or family | Projected hidden state via shared vocabulary | ~6 KB |
| JSON fallback | No compatible projection path | Plain text | Varies |

The handshake checks model hash → structural match → shared tokenizer → vocabulary overlap (≥100 BPE tokens) → JSON. You never configure this manually.
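The check order above can be sketched as a plain fallback chain. The `ModelInfo` descriptor and `negotiate` helper here are illustrative, not AVP's internal API:

```python
from dataclasses import dataclass, field

@dataclass
class ModelInfo:
    weights_hash: str
    architecture: str        # e.g. family + layer/head layout
    tokenizer_id: str
    vocab: set = field(default_factory=set)

def negotiate(src: ModelInfo, dst: ModelInfo) -> str:
    """Pick a transfer mode, most capable first (mirrors the handshake order)."""
    if src.weights_hash == dst.weights_hash:
        return "latent"        # identical model: ship the full KV-cache
    if src.architecture == dst.architecture and src.tokenizer_id == dst.tokenizer_id:
        return "latent"        # structural match with a shared tokenizer
    if len(src.vocab & dst.vocab) >= 100:
        return "cross-model"   # enough shared BPE tokens for projection
    return "json"              # no projection path: fall back to text

qwen7b = ModelInfo("h1", "qwen2/28", "qwen-bpe", {f"tok{i}" for i in range(500)})
llama3b = ModelInfo("h2", "llama3/28", "llama-bpe", {f"tok{i}" for i in range(200)})
print(negotiate(qwen7b, qwen7b))    # latent
print(negotiate(qwen7b, llama3b))   # cross-model: 200 shared tokens >= 100
```

Each agent advertises its descriptor once at connection time, so the decision costs nothing per handoff.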

Works With

Engines

| Engine | Extra | Latent Pipeline | Cross-model |
|---|---|---|---|
| HuggingFace | avp[hf] | Full think/generate | Yes |
| Ollama | avp[ollama] | Full think/generate, auto-resolves GGUF | Yes |
| llama.cpp | avp[llamacpp] | Full think/generate on GGUF | Yes |
| vLLM | avp[vllm] | KV connector + model plugin | Yes |

Frameworks

| Framework | Integration | Extra |
|---|---|---|
| LangChain | ChatAVP (BaseChatModel) | avp[langchain] |
| CrewAI | AVPLLM (BaseLLM) | avp[crewai] |
| AutoGen | AVPChatCompletionClient | avp[autogen] |
| A2A / MCP | Complementary: AVP handles tensor transfer, they handle routing | — |

See Framework Integration Guide for per-engine code examples.

Roadmap

  • Bidirectional latent communication (both agents share thinking, not just one)
  • CacheGen-style KV-cache compression (3-4x reduction)

Documentation

License

Apache 2.0 – see LICENSE
