Multi-agent text handoffs discard KV-cache and attention state. AVP transfers that state directly — 51-78% fewer tokens, 1.5-5x faster.

AVP – Agents Share Thoughts, Not Text

When LLM agents hand off work as text, the next agent re-processes everything from scratch. AVP (Agent Vector Protocol) instead transfers the computation itself (KV-cache, hidden states, attention state), so the receiving agent picks up where the sender left off. No tokens pass between agents, pipelines run 2-3x faster, and accuracy matches or beats text handoffs on most of the benchmarks below. AVP builds on LatentMAS and extends it with cross-model, vocabulary-mediated projection. It needs no training and works across model families.

pip install "avp[hf]"

Requires self-hosted models on GPUs. AVP accesses model internals (KV-cache, hidden states) that cloud APIs don't expose. Other engines: avp[ollama], avp[llamacpp], avp[vllm] – see Works With.

Quick Start

Same model – two agents share a KV-cache:

from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Agent A thinks (builds KV-cache, no text output)
context = connector.think("Analyze this math problem: 24 * 17 + 3", steps=20)

# Agent B generates using Agent A's KV-cache
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)

Cross-model – different architectures, zero training:

researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

context = researcher.think("Analyze this problem", steps=20)
answer = solver.generate("Solve it", context=context, source=researcher, cross_model=True)

Cross-process – serialize context over any transport:

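# Process A: serialize the context built by think()
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B: rebuild the context on the receiving side and keep generating
from avp import AVPContext  # assumed top-level export; adjust if your version differs

restored = AVPContext.from_bytes(wire_bytes, device="cuda")
prompt = "Solve step by step: 24 * 17 + 3"
answer = connector.generate(prompt, context=restored)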
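
Since the serialized context is plain bytes, any byte transport will do. The sketch below shows one common option, length-prefixed framing over a socket, using only the standard library. It is not part of AVP: send_frame, recv_frame, and demo_roundtrip are hypothetical helper names, and the payload is a stand-in for the output of context.to_bytes().

```python
import socket
import struct

HEADER = struct.Struct("!Q")  # 8-byte big-endian length prefix


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Send one frame: the payload length, then the payload itself."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full frame arrived")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(sock: socket.socket) -> bytes:
    """Receive one frame written by send_frame()."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return _recv_exact(sock, length)


def demo_roundtrip(payload: bytes) -> bytes:
    """Push a small payload through a local socket pair and read it back."""
    sender, receiver = socket.socketpair()
    try:
        send_frame(sender, payload)
        return recv_frame(receiver)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    fake_wire_bytes = b"\x00" * 6144  # placeholder, roughly the ~6 KB cross-model size
    assert demo_roundtrip(fake_wire_bytes) == fake_wire_bytes
```

In a real deployment the sender and receiver live in different processes; the single-process demo only works because the payload is small enough to sit in the socket buffer. A full same-model KV-cache is far larger and needs a reader running concurrently.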

You don't choose the transfer mode. The handshake auto-negotiates based on model compatibility: same model → full KV-cache, different models → vocabulary-mediated projection (~6 KB), incompatible models → JSON text fallback.

Results

Direct = single model, no pipeline. Latent = AVP transfer. Text Chain = standard text handoff between agents.

| Benchmark | Direct | Latent (AVP) | Text Chain |
| --- | --- | --- | --- |
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
| GSM8K (Llama 3B, n=200) | 74.5% | 76.0% | 79.0% |

HumanEval: +12.4pp vs text across 4 seeds (p=0.004). GSM8K and DebugBench: neutral across all modes, but the pipeline runs 3x faster (7.6s vs 22.8s end-to-end on DebugBench). Llama 3B: text wins on GSM8K; latent overhead has more impact on smaller models. All benchmarks used steps=20 on NVIDIA A100.

Trade-off: 20 latent steps cost ~0.9s on A100. If Agent A would normally generate 22+ tokens of text, latent is faster.
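
The break-even point can be reproduced with a little arithmetic. The sketch below takes the two figures from the paragraph above (~0.9 s for 20 latent steps, break-even at 22 tokens) and derives the per-token decode time they imply, about 41 ms. The function names are illustrative, and the per-token time is an inference from those two numbers rather than a measurement reported here.

```python
LATENT_COST_S = 0.9        # from the text: 20 latent steps on an A100
BREAK_EVEN_TOKENS = 22     # from the text: beyond this, latent is faster


def implied_seconds_per_token(latent_cost_s: float, break_even_tokens: int) -> float:
    """Per-token decode time at which text and latent handoffs cost the same."""
    return latent_cost_s / break_even_tokens


def latent_is_faster(text_tokens: int, seconds_per_token: float,
                     latent_cost_s: float = LATENT_COST_S) -> bool:
    """True when generating the text handoff would take longer than thinking latently."""
    return text_tokens * seconds_per_token > latent_cost_s


if __name__ == "__main__":
    spt = implied_seconds_per_token(LATENT_COST_S, BREAK_EVEN_TOKENS)
    print(f"implied decode time: {spt * 1000:.0f} ms/token")
    for tokens in (10, 30, 200):
        print(tokens, "tokens ->", "latent" if latent_is_faster(tokens, spt) else "text")
```

On slower hardware or with a larger model the per-token time rises, so the break-even token count shifts; measure both costs on your own setup before relying on the 22-token figure.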

Cross-model (zero training):

| Source → Target | GSM8K (Rosetta / Text) | HumanEval (Rosetta / Text) |
| --- | --- | --- |
| Qwen 7B → Qwen 3B | 82.5% / 88.5% | 66.5% / 62.2% |
| Qwen 7B → Llama 3B | 77.0% / 86.5% | 47.0% / 57.9% |
| Llama 3B → Qwen 7B | 90.0% / 82.0% | 79.3% / 61.6% |

Target solo baselines (GSM8K / HumanEval): Qwen 3B = 82.5% / 61.0%, Llama 3B = 76.0% / 50.6%, Qwen 7B = 91.0% / 58.5%.

Full results: Benchmarks – 7 benchmarks, 5 models, 2 families, reproducible.

How It Works

AVP auto-negotiates the transfer mode via a handshake at connection time. You write the same think() / generate() code regardless of which mode is selected:

| Mode | When | What transfers | Size |
| --- | --- | --- | --- |
| Latent | Same model | Full KV-cache | ~390 MB for 7B |
| Cross-model | Different model or family | Projected hidden state via shared vocabulary | ~6 KB |
| JSON fallback | No compatible projection path | Plain text | Varies |
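
The size gap between the first two rows follows from what each mode ships: a KV-cache holds a key and a value vector per layer, per KV head, per token, while the cross-model path sends a single projected state. Below is a rough sketch of the standard KV-cache size estimate. The function name and the example configuration are illustrative assumptions, not values taken from AVP, and the ~390 MB figure in the table depends on a sequence length and precision that are not stated here.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size in bytes.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Illustrative configuration for a grouped-query-attention model (assumed values).
    layers, kv_heads, head_dim = 28, 4, 128
    for seq_len in (1_000, 4_000, 8_000):
        size_mb = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 1e6
        print(f"{seq_len:>5} tokens -> {size_mb:.1f} MB")
```

The estimate grows linearly with sequence length, which is why the same-model path is best suited to agents on the same host or a fast interconnect.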

The handshake checks model hash → structural match → shared tokenizer → vocabulary overlap (≥100 BPE tokens) → JSON. You never configure this manually.
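
The check order above can be pictured as a simple cascade that stops at the first test that passes. The sketch below is a hypothetical illustration, not AVP's implementation: ModelInfo, its fields, and negotiate() are invented names, and the mapping of the two middle checks to a transfer mode is an assumption. Only the order of the checks, the 100-token threshold, and the first and last outcomes come from the text.

```python
from dataclasses import dataclass

MIN_SHARED_TOKENS = 100  # threshold stated in the text


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical descriptor exchanged during the handshake."""
    model_hash: str
    architecture: tuple      # e.g. (num_layers, hidden_size, num_kv_heads)
    tokenizer_id: str
    vocab: frozenset         # set of token strings


def negotiate(a: ModelInfo, b: ModelInfo) -> tuple:
    """Return (mode, reason) for the first check that passes."""
    if a.model_hash == b.model_hash:
        return ("latent", "model hash")
    if a.architecture == b.architecture:
        # Assumption: structurally identical models can also share a KV-cache.
        return ("latent", "structural match")
    if a.tokenizer_id == b.tokenizer_id:
        # Assumption: a shared tokenizer enables the projection path.
        return ("cross-model", "shared tokenizer")
    if len(a.vocab & b.vocab) >= MIN_SHARED_TOKENS:
        return ("cross-model", "vocabulary overlap")
    return ("json", "fallback")


if __name__ == "__main__":
    shared = frozenset(f"tok{i}" for i in range(150))
    qwen = ModelInfo("hash-a", (28, 3584, 4), "tok-a", shared)
    llama = ModelInfo("hash-b", (28, 3072, 8), "tok-b", shared)
    print(negotiate(qwen, qwen))   # same hash, so the latent path
    print(negotiate(qwen, llama))  # differs everywhere except the vocabulary
```

Ordering the checks from cheapest and strictest to loosest means the highest-fidelity transfer is always preferred when it is available.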

Works With

Engines

| Engine | Latent Pipeline | Cross-model |
| --- | --- | --- |
| HuggingFace avp[hf] | Full think/generate | Yes |
| Ollama avp[ollama] | Full think/generate, auto-resolves GGUF | Yes |
| llama.cpp avp[llamacpp] | Full think/generate on GGUF | Yes |
| vLLM avp[vllm] | KV connector + model plugin | Yes |

Frameworks

| Framework | Integration | Extra |
| --- | --- | --- |
| LangChain | ChatAVP BaseChatModel | avp[langchain] |
| CrewAI | AVPLLM BaseLLM | avp[crewai] |
| AutoGen | AVPChatCompletionClient | avp[autogen] |
| A2A / MCP | Complementary: AVP handles tensor transfer, they handle routing | |

See Framework Integration Guide for per-engine code examples.

Roadmap

  • Bidirectional latent communication (both agents share thinking, not just one)
  • CacheGen-style KV-cache compression (3-4x reduction)

License

Apache 2.0 – see LICENSE
