
Agent Vector Protocol (AVP) — KV-Cache Transfer for Multi-Agent LLMs


Multi-agent text handoffs discard KV-cache and attention state. AVP transfers that state directly — 46-78% fewer tokens, 2-4x faster, across models and families. Built on LatentMAS (2025).

pip install avp

Self-hosted models on GPUs only. AVP needs access to model internals (KV-cache, hidden states) that cloud APIs don't expose. If you use OpenAI, Anthropic, or Google APIs — AVP can't help you. Good fit: multi-agent pipelines on vLLM or HuggingFace Transformers with datacenter or same-machine connectivity.

Quick Start

from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
prompt = "Analyze this math problem: 24 * 17 + 3"

# Agent A: latent reasoning (no text output, builds KV-cache)
context = connector.think(prompt, steps=10)

# Agent B: generate with Agent A's context
answer = connector.generate(prompt, context=context)

Results

Benchmark                     Direct   Latent (AVP)   Text
HumanEval (Qwen 7B, n=164)    58.5%    67.1%          53.0%
GSM8K (Qwen 7B, n=200)        91.0%    90.5%          87.0%
DebugBench (Qwen 7B, n=100)   50.0%    51.0%          49.0%
GSM8K (Llama 3B, n=200)       75.0%    78.0%          75.5%

+8.6pp on code generation (p=0.029). 46-78% fewer tokens. 2-4x faster. Tested on NVIDIA A100.

Cross-model (zero training, 6 KB wire):

Source → Target       GSM8K   HumanEval
Qwen 7B → Llama 3B    74.5%   47.0%
Llama 3B → Qwen 7B    90.0%   79.3%

Full results: Benchmarks — 8 benchmarks, 5 models, 2 families.

How It Works

graph LR
    subgraph text["Text Chain (today)"]
        direction LR
        A1["Agent A<br/>generates text"] -->|"serialize to text<br/>re-tokenize everything"| B1["Agent B<br/>re-processes from scratch"]
    end

    subgraph avp["AVP Latent Transfer"]
        direction LR
        A2["Agent A<br/>generates KV-cache"] -->|"binary transfer<br/>28-130 MB"| B2["Agent B<br/>picks up where A left off"]
    end

    style text fill:#fff3f3,stroke:#d44,stroke-width:2px
    style avp fill:#f3fff3,stroke:#4a4,stroke-width:2px
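The 28-130 MB payload in the diagram is the KV-cache itself, so its size follows directly from model geometry. A back-of-envelope sketch, using Qwen2.5-7B's published config (28 layers, 4 KV heads under grouped-query attention, head dim 128) and assuming fp16:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Two tensors (K and V) per layer, each [num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * dtype_bytes

# Qwen2.5-7B geometry, 2048 tokens of context, fp16:
print(kv_cache_bytes(28, 4, 128, seq_len=2048) / 1e6)  # 117.440512 (MB)
```

Shorter contexts and smaller models land at the low end of the quoted range; longer contexts at the high end.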

AVP transfers the KV-cache (computed attention states) directly between agents. The receiving agent reads prior reasoning from attention states instead of re-computing it from text. Three modes, auto-negotiated:

Mode            When                       What happens
Latent          Same model                 KV-cache transfer, zero re-processing
Cross-model     Same or different family   Vocabulary-mediated projection, zero training
JSON fallback   No compatible path         Standard text, auto-negotiated
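The negotiation in the table can be pictured as a capability check over model descriptors. A hypothetical sketch (the `ModelInfo` type, its fields, and `negotiate_mode` are illustrative names, not AVP's actual API):

```python
from dataclasses import dataclass

@dataclass
class ModelInfo:
    """Hypothetical descriptor, for illustration only."""
    model_id: str
    family: str  # empty string if internals are not accessible

def negotiate_mode(source: ModelInfo, target: ModelInfo) -> str:
    """Pick the richest transfer mode both ends support."""
    if source.model_id == target.model_id:
        return "latent"        # identical model: raw KV-cache transfer
    if source.family and target.family:
        return "cross-model"   # vocabulary-mediated projection
    return "json"              # no compatible path: plain text fallback

qwen = ModelInfo("Qwen/Qwen2.5-7B-Instruct", "qwen2")
llama = ModelInfo("meta-llama/Llama-3.2-3B-Instruct", "llama")
print(negotiate_mode(qwen, qwen))   # latent
print(negotiate_mode(qwen, llama))  # cross-model
```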
Cross-model transfer
from avp import HuggingFaceConnector

researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

prompt = "Solve step by step: 24 * 17 + 3"
context = researcher.think(prompt, steps=10)
answer = solver.generate(prompt, context=context, source=researcher)

Cross-model calibration runs once per model pair (~0.5-2 s) and is cached to ~/.avp/maps/.
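A deterministic per-pair key is one way to implement such a cache. A hypothetical sketch (the hashing scheme and .map extension are assumptions, not AVP's actual on-disk layout):

```python
import hashlib
from pathlib import Path

def calibration_cache_path(source_id: str, target_id: str,
                           root: Path = Path.home() / ".avp" / "maps") -> Path:
    """One cached projection map per ordered (source, target) model pair.

    The pair is ordered because calibration is directional: projecting
    Qwen -> Llama is a different map than Llama -> Qwen.
    """
    key = hashlib.sha256(f"{source_id}->{target_id}".encode()).hexdigest()[:16]
    return root / f"{key}.map"

path = calibration_cache_path("Qwen/Qwen2.5-7B-Instruct",
                              "meta-llama/Llama-3.2-3B-Instruct")
```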

Easy API (convenience wrappers)
import avp

# One-liner: think + generate
answer = avp.generate("Solve: 24 * 17 + 3", model="Qwen/Qwen2.5-7B-Instruct")

# Cross-model
answer = avp.generate("Solve: 24 * 17 + 3",
                       model="meta-llama/Llama-3.2-3B-Instruct",
                       source_model="Qwen/Qwen2.5-7B-Instruct")
vLLM integration (experimental)

Status: Experimental. VLLMConnector works for text generation and identity extraction. The KV connector plugin (AVPKVConnectorV1Dynamic) for latent KV-cache transfer between vLLM instances has not been validated end-to-end and has known issues with PagedAttention format conversion. Use HuggingFaceConnector for production latent transfer. See CHANGELOG for details.

from avp import VLLMConnector

connector = VLLMConnector(model_id="Qwen/Qwen2.5-7B-Instruct")
answer = connector.generate("Analyze and solve: 24 * 17 + 3")
Cross-process transfer
# Process A: serialize context
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B: restore and generate
from avp import AVPContext, HuggingFaceConnector
connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate(prompt, context=restored)
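One common way to frame such payloads on the wire is a length-prefixed envelope: a fixed-size header length, a JSON header carrying the session metadata, then the raw tensor bytes. This sketch is illustrative only; the header fields and framing here are assumptions, not AVP's actual wire format:

```python
import json
import struct

def pack_envelope(session_id: str, source_agent_id: str, payload: bytes) -> bytes:
    """[4-byte big-endian header length][JSON header][raw payload]."""
    header = json.dumps({"session_id": session_id,
                         "source_agent_id": source_agent_id,
                         "payload_len": len(payload)}).encode()
    return struct.pack(">I", len(header)) + header + payload

def unpack_envelope(wire: bytes) -> tuple[dict, bytes]:
    """Invert pack_envelope: read the header, then slice out the payload."""
    (hlen,) = struct.unpack(">I", wire[:4])
    header = json.loads(wire[4:4 + hlen])
    payload = wire[4 + hlen:4 + hlen + header["payload_len"]]
    return header, payload

wire = pack_envelope("s1", "agent-a", b"\x00" * 16)
header, payload = unpack_envelope(wire)
```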

Works With

AVP works with your orchestration framework, not instead of it. Replace llm.invoke() with avp.generate() — your framework sees text in, text out.
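Because the boundary is plain text in, text out, an adapter for any of the frameworks below reduces to wrapping one callable. A generic sketch, with a stand-in function in place of a live model so it runs anywhere:

```python
from typing import Callable

def make_llm_node(generate: Callable[[str], str]) -> Callable[[dict], dict]:
    """Wrap a text-in/text-out generate() as a graph-node-style callable."""
    def node(state: dict) -> dict:
        return {**state, "answer": generate(state["prompt"])}
    return node

# Stand-in for avp.generate(prompt, model=...); swap in the real call on a GPU box.
fake_generate = lambda prompt: f"echo: {prompt}"
node = make_llm_node(fake_generate)
print(node({"prompt": "24 * 17 + 3"})["answer"])  # echo: 24 * 17 + 3
```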

Framework     Integration
LangGraph     Graph node: avp.generate() replaces the LLM call
CrewAI        BaseLLM.call() override
PydanticAI    FunctionModel callback
LlamaIndex    CustomLLM.complete() override
vLLM          KVConnectorBase_V1 plugin (experimental: text generation works, latent transfer in progress)
HuggingFace   Full hidden-state and KV-cache access
A2A / MCP     Complementary: AVP handles tensor transfer

See Framework Integration Guide for examples.

Roadmap

  • Bidirectional latent communication (A→B + B→A latent)
  • vLLM serving throughput benchmarks
  • CacheGen-style compression (3-4x KV-cache size reduction)

License

Apache 2.0 — see LICENSE
