
Multi-agent text handoffs discard KV-cache and attention state. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster.

Project description

AVP – Agents Share Thoughts, Not Text


When LLM agents hand off work as text, the next agent re-processes everything from scratch. AVP (Agent Vector Protocol) transfers the actual computation (KV-cache, hidden states, attention) so the receiving agent picks up where the sender left off. Zero tokens between agents, 2-3x faster pipelines, same or better accuracy. Built on LatentMAS, extended with cross-model vocabulary-mediated projection. Zero training, works across model families.

pip install avp[hf]

Requires self-hosted models on GPUs. AVP accesses model internals (KV-cache, hidden states) that cloud APIs don't expose. Other engines: avp[ollama], avp[llamacpp], avp[vllm] – see Works With.

Quick Start

Same model – two agents share a KV-cache:

from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Agent A thinks (builds KV-cache, no text output)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)

# Agent B generates using Agent A's KV-cache
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)

Cross-model – different architectures, zero training:

researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

context = researcher.think("Analyze this problem", steps=20)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)

Cross-process – serialize context over any transport:

# Process A
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B
from avp import AVPContext

restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate(prompt, context=restored)
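The wire format is transport-agnostic, so any channel that moves bytes will do. A minimal sketch of one option, length-prefixed TCP framing, using only the Python standard library (nothing below is part of the avp API):

```python
# One way to move the wire bytes between processes: a length-prefixed
# frame over a socket. Plain stdlib Python -- the transport is entirely
# up to you.
import socket
import struct

def send_context(sock: socket.socket, wire_bytes: bytes) -> None:
    """Prefix the payload with its 8-byte big-endian length, then send."""
    sock.sendall(struct.pack("!Q", len(wire_bytes)) + wire_bytes)

def recv_context(sock: socket.socket) -> bytes:
    """Read the length prefix, then exactly that many payload bytes."""
    (size,) = struct.unpack("!Q", _recv_exact(sock, 8))
    return _recv_exact(sock, size)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf
```

On the receiving side, the bytes returned by `recv_context` go straight into `AVPContext.from_bytes`.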

You don't choose the transfer mode. The handshake auto-negotiates based on model compatibility: same model → full KV-cache, different models → vocabulary-mediated projection (~6 KB), incompatible models → JSON text fallback.

Results

Direct = single model, no pipeline. Latent = AVP transfer. Text Chain = standard text handoff between agents.

| Benchmark | Direct | Latent (AVP) | Text Chain |
|---|---|---|---|
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
| GSM8K (Llama 3B, n=200) | 74.5% | 76.0% | 79.0% |

HumanEval: +12.4pp vs text across 4 seeds (p=0.004). GSM8K and DebugBench: neutral across all modes, but the pipeline runs 3x faster (7.6s vs 22.8s end-to-end on DebugBench). Llama 3B: text wins on GSM8K; latent overhead has more impact on smaller models. All benchmarks used steps=20 on NVIDIA A100.

Trade-off: 20 latent steps cost ~0.9s on A100. If Agent A would normally generate 22+ tokens of text, latent is faster.
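A quick break-even check on those numbers. The per-token decode latency is an assumption inferred from the README's own figures (0.9 s / 22 tokens ≈ 41 ms), not an independent measurement:

```python
# Break-even arithmetic for the latency trade-off above.
latent_time_s = 0.9            # cost of 20 latent steps on an A100
decode_s_per_token = 0.041     # assumed autoregressive decode latency

# Latent handoff wins once Agent A would otherwise emit this many tokens:
break_even_tokens = latent_time_s / decode_s_per_token
print(round(break_even_tokens))   # -> 22
```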

Cross-model (zero training):

| Source → Target | GSM8K (Rosetta / Text) | HumanEval (Rosetta / Text) |
|---|---|---|
| Qwen 7B → Qwen 3B | 82.5% / 88.5% | 66.5% / 62.2% |
| Qwen 7B → Llama 3B | 77.0% / 86.5% | 47.0% / 57.9% |
| Llama 3B → Qwen 7B | 90.0% / 82.0% | 79.3% / 61.6% |

Target solo baselines (GSM8K / HumanEval): Qwen 3B = 82.5% / 61.0%, Llama 3B = 76.0% / 50.6%, Qwen 7B = 91.0% / 58.5%.

Full results: Benchmarks – 7 benchmarks, 5 models, 2 families, reproducible.

How It Works

How AVP works

AVP auto-negotiates the transfer mode via a handshake at connection time. You write the same think() / generate() code regardless of which mode is selected:

| Mode | When | What transfers | Size |
|---|---|---|---|
| Latent | Same model | Full KV-cache | ~390 MB for 7B |
| Cross-model | Different model or family | Projected hidden state via shared vocabulary | ~6 KB |
| JSON fallback | No compatible projection path | Plain text | Varies |
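The sizes in the table line up with simple back-of-envelope arithmetic. The sketch below assumes fp16 storage, the KV shape from Qwen2.5-7B's public config (28 layers, 4 KV heads, head dimension 128), and a 3072-dim hidden vector (as in Llama 3.2 3B) for the cross-model payload; none of these assumptions come from the README itself:

```python
# Back-of-envelope check on the payload sizes above.
BYTES_FP16 = 2
layers, kv_heads, head_dim = 28, 4, 128   # Qwen2.5-7B's published KV shape

# Keys and values, every layer, every KV head, per cached token:
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * BYTES_FP16
print(kv_bytes_per_token)                       # 57344 bytes ~= 56 KB/token
print(390 * 1024 * 1024 // kv_bytes_per_token)  # ~7131 tokens fit in ~390 MB

# Cross-model mode sends roughly one hidden-state vector instead:
print(3072 * BYTES_FP16)                        # 6144 bytes ~= 6 KB
```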

The handshake checks model hash → structural match → shared tokenizer → vocabulary overlap (≥100 BPE tokens) → JSON. You never configure this manually.
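The fallback chain can be pictured as a cascade of compatibility checks. Every name in the sketch below is a hypothetical stand-in, not the avp library's actual API; it only mirrors the order described above, and the assumption that a structural match still permits KV-cache transfer is mine:

```python
# Illustrative sketch of the handshake's fallback chain (hypothetical names).
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentInfo:
    model_hash: str
    architecture: str        # e.g. "qwen2", "llama"
    tokenizer_id: str
    vocab: frozenset         # the tokenizer's BPE token strings

def negotiate_mode(a: AgentInfo, b: AgentInfo, min_shared_bpe: int = 100) -> str:
    if a.model_hash == b.model_hash:
        return "latent"          # identical weights: transfer the full KV-cache
    if a.architecture == b.architecture and a.tokenizer_id == b.tokenizer_id:
        return "latent"          # structural match (assumed KV-compatible)
    if len(a.vocab & b.vocab) >= min_shared_bpe:
        return "cross_model"     # enough overlap for vocabulary-mediated projection
    return "json"                # no projection path: plain-text fallback
```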

Works With

Engines

| Engine | Extra | Latent Pipeline | Cross-model |
|---|---|---|---|
| HuggingFace | avp[hf] | Full think/generate | Yes |
| Ollama | avp[ollama] | Full think/generate, auto-resolves GGUF | Yes |
| llama.cpp | avp[llamacpp] | Full think/generate on GGUF | Yes |
| vLLM | avp[vllm] | KV connector + model plugin | Yes |

Frameworks

| Framework | Integration | Extra |
|---|---|---|
| LangChain | ChatAVP (BaseChatModel) | avp[langchain] |
| CrewAI | AVPLLM (BaseLLM) | avp[crewai] |
| AutoGen | AVPChatCompletionClient | avp[autogen] |
| A2A / MCP | Complementary: AVP handles tensor transfer, they handle routing | — |

See Framework Integration Guide for per-engine code examples.

Roadmap

  • Bidirectional latent communication (both agents share thinking, not just one)
  • CacheGen-style KV-cache compression (3-4x reduction)

Documentation

License

Apache 2.0 – see LICENSE

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

avp-0.6.1.tar.gz (347.8 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

avp-0.6.1-py3-none-any.whl (128.2 kB)


File details

Details for the file avp-0.6.1.tar.gz.

File metadata

  • Download URL: avp-0.6.1.tar.gz
  • Upload date:
  • Size: 347.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for avp-0.6.1.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b8febf0a4c194f6b79071c0c36fce2221e07ebb14af9bf0d0c7729b78dfbcc8 |
| MD5 | a29d8f7b7d9b4d25865d4e9f609745b3 |
| BLAKE2b-256 | 248cda8b57fae7cce171bf6721c4b187b3390769989b268e0212402b31c1e14e |

See more details on using hashes here.

Provenance

The following attestation bundles were made for avp-0.6.1.tar.gz:

Publisher: publish.yml on VectorArc/avp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file avp-0.6.1-py3-none-any.whl.

File metadata

  • Download URL: avp-0.6.1-py3-none-any.whl
  • Upload date:
  • Size: 128.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for avp-0.6.1-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | f090661e4183b9101b24bb7155333d48515545a09068a2503c9dc92dd632142c |
| MD5 | ccf61f27685a504a0257e514b07460e2 |
| BLAKE2b-256 | 127466e854cb2c297de221c5fa2bed0bc50653b91097914ffa998dba1b8cc1f8 |

See more details on using hashes here.

Provenance

The following attestation bundles were made for avp-0.6.1-py3-none-any.whl:

Publisher: publish.yml on VectorArc/avp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
