Multi-agent text handoffs discard KV-cache and attention state. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster.


AVP – Agents Share Thoughts, Not Text


When LLM agents hand off work as text, the next agent re-processes everything from scratch. AVP (Agent Vector Protocol) transfers the actual computation (KV-cache, hidden states, attention) so the receiving agent picks up where the sender left off. Zero tokens between agents, 2-3x faster pipelines, same or better accuracy. Built on LatentMAS, extended with cross-model vocabulary-mediated projection. Zero training, works across model families.

pip install avp[hf]

Requires self-hosted models on GPUs. AVP accesses model internals (KV-cache, hidden states) that cloud APIs don't expose. Other engines: avp[ollama], avp[llamacpp], avp[vllm] – see Works With.

Quick Start

Same model – two agents share a KV-cache:

from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Agent A thinks (builds KV-cache, no text output)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)

# Agent B generates using Agent A's KV-cache
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)

Cross-model – different architectures, zero training:

researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

context = researcher.think("Analyze this problem", steps=20)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)

Cross-process – serialize context over any transport:

# Process A
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B
from avp import AVPContext

restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate(prompt, context=restored)
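Because to_bytes() returns plain bytes, any transport works: a socket, HTTP, a message queue, or a file. Here is a self-contained sketch of length-prefixed framing over a local socket pair; the payload is a placeholder standing in for a real serialized context, not actual AVP output:

```python
import socket

sender, receiver = socket.socketpair()

# "Process A" side: frame and ship the serialized context.
wire_bytes = b"avp-context-placeholder"      # stand-in for context.to_bytes(...)
sender.sendall(len(wire_bytes).to_bytes(4, "big") + wire_bytes)

# "Process B" side: read the 4-byte length prefix, then the payload.
size = int.from_bytes(receiver.recv(4), "big")
payload = b""
while len(payload) < size:
    payload += receiver.recv(size - len(payload))

# payload is now ready for AVPContext.from_bytes(payload)
sender.close(); receiver.close()
```

The length prefix matters in practice: TCP is a byte stream, so without framing the receiver cannot tell where one context ends and the next begins.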

You don't choose the transfer mode. The handshake auto-negotiates based on model compatibility: same model → full KV-cache, different models → vocabulary-mediated projection (~6 KB), incompatible models → JSON text fallback.

Results

Direct = single model, no pipeline. Latent = AVP transfer. Text Chain = standard text handoff between agents.

| Benchmark | Direct | Latent (AVP) | Text Chain |
|---|---|---|---|
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
| GSM8K (Llama 3B, n=200) | 74.5% | 76.0% | 79.0% |

HumanEval: +12.4pp vs text across 4 seeds (p=0.004). GSM8K and DebugBench: neutral across all modes, but the pipeline runs 3x faster (7.6s vs 22.8s end-to-end on DebugBench). Llama 3B: text wins on GSM8K; latent overhead has more impact on smaller models. All benchmarks used steps=20 on NVIDIA A100.

Trade-off: 20 latent steps cost ~0.9s on A100. If Agent A would normally generate 22+ tokens of text, latent is faster.
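As a quick sanity check on that break-even: the 0.9 s figure comes from the paragraph above, while the 41 ms/token decode latency below is an assumed value chosen to show how the ~22-token threshold falls out, not a measured number:

```python
import math

latent_cost_s = 0.9                      # 20 latent steps on an A100 (stated above)
per_step_ms = latent_cost_s / 20 * 1000
print(f"{per_step_ms:.0f} ms per latent step")            # 45 ms

per_token_s = 0.041                      # assumed autoregressive decode latency
break_even = math.ceil(latent_cost_s / per_token_s)
print(f"text handoff loses beyond ~{break_even} tokens")  # ~22 tokens
```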

Cross-model (zero training):

| Source → Target | GSM8K (Rosetta / Text) | HumanEval (Rosetta / Text) |
|---|---|---|
| Qwen 7B → Qwen 3B | 82.5% / 88.5% | 66.5% / 62.2% |
| Qwen 7B → Llama 3B | 77.0% / 86.5% | 47.0% / 57.9% |
| Llama 3B → Qwen 7B | 90.0% / 82.0% | 79.3% / 61.6% |

Target solo baselines: Qwen 3B = 82.5% / 61.0%, Llama 3B = 76.0% / 50.6%, Qwen 7B = 91.0% / 58.5%.

Full results: Benchmarks – 7 benchmarks, 5 models, 2 families, reproducible.

How It Works


AVP auto-negotiates the transfer mode via a handshake at connection time. You write the same think() / generate() code regardless of which mode is selected:

| Mode | When | What transfers | Size |
|---|---|---|---|
| Latent | Same model | Full KV-cache | ~390 MB for 7B |
| Cross-model | Different model or family | Projected hidden state via shared vocabulary | ~6 KB |
| JSON fallback | No compatible projection path | Plain text | Varies |
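The ~390 MB figure squares with a back-of-the-envelope KV-cache size estimate. The model shape used below (28 layers, 4 KV heads under grouped-query attention, head dim 128, fp16) is Qwen2.5-7B's published configuration, taken from the model card rather than from this page:

```python
layers, kv_heads, head_dim = 28, 4, 128    # Qwen2.5-7B config (GQA)
bytes_per_elt = 2                          # fp16

# K and V each store layers * kv_heads * head_dim values per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt
print(bytes_per_token)                     # 57344 -> ~56 KB per token

# So ~390 MB of KV-cache corresponds to roughly 7,000 tokens of context.
print(round(390 * 2**20 / bytes_per_token))
```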

The handshake checks model hash → structural match → shared tokenizer → vocabulary overlap (≥100 BPE tokens) → JSON. You never configure this manually.
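That cascade can be pictured as an ordinary function. The sketch below is illustrative, not AVP's internal API: the dataclass, field names, and return values are assumptions; only the ordering of the checks and the ≥100-token overlap threshold come from the description above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelInfo:
    """Toy stand-in for what each agent advertises during the handshake."""
    weights_hash: str
    architecture: str        # layer count, head layout, hidden size, ...
    tokenizer_id: str
    vocab: frozenset

def negotiate(src: ModelInfo, dst: ModelInfo) -> str:
    """Walk the fallback chain and pick a transfer mode."""
    if src.weights_hash == dst.weights_hash:
        return "latent"                  # identical model: ship the full KV-cache
    if src.architecture == dst.architecture and src.tokenizer_id == dst.tokenizer_id:
        return "latent"                  # structurally identical, caches line up
    if len(src.vocab & dst.vocab) >= 100:
        return "cross-model"             # enough shared BPE tokens to project through
    return "json"                        # plain-text fallback

qwen = ModelInfo("a1", "qwen2-7b", "qwen2", frozenset(f"t{i}" for i in range(500)))
llama = ModelInfo("b2", "llama3-3b", "llama3", frozenset(f"t{i}" for i in range(400)))
print(negotiate(qwen, qwen))     # latent
print(negotiate(qwen, llama))    # cross-model
```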

Works With

Engines

| Engine | Install | Latent Pipeline | Cross-model |
|---|---|---|---|
| HuggingFace | avp[hf] | Full think/generate | Yes |
| Ollama | avp[ollama] | Full think/generate, auto-resolves GGUF | Yes |
| llama.cpp | avp[llamacpp] | Full think/generate on GGUF | Yes |
| vLLM | avp[vllm] | KV connector + model plugin | Yes |

Frameworks

| Framework | Integration | Extra |
|---|---|---|
| LangChain | ChatAVP (BaseChatModel) | avp[langchain] |
| CrewAI | AVPLLM (BaseLLM) | avp[crewai] |
| AutoGen | AVPChatCompletionClient | avp[autogen] |
| A2A / MCP | Complementary: AVP handles tensor transfer, they handle routing | – |

See Framework Integration Guide for per-engine code examples.

Roadmap

  • Bidirectional latent communication (both agents share thinking, not just one)
  • CacheGen-style KV-cache compression (3-4x reduction)

License

Apache 2.0 – see LICENSE
