Transfer LLM KV-cache between agents instead of regenerating text: 51-78% fewer tokens, 1.5-5x faster multi-agent pipelines

Maintainers

sstas

Project description

Agent Vector Protocol (AVP) — KV-Cache Transfer for Multi-Agent LLMs

Transfer KV-cache between LLM agents instead of regenerating text. Same multi-agent pipeline, 51-78% fewer tokens, 1.5-5x faster. Cross-model projection with zero training.

pip install avp

Who This Is For

Self-hosted models on GPUs. AVP needs access to model internals (KV-cache, hidden states) that cloud APIs don't expose. If you use OpenAI, Anthropic, or Google APIs — AVP can't help you.

Good fit: Multi-agent pipelines on self-hosted models (vLLM, HuggingFace Transformers) with datacenter or same-machine connectivity.

Not a fit: Cloud API models, single-agent apps, edge/mobile, cross-internet links (<1 Gbps).
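The bandwidth cutoff can be sanity-checked with simple arithmetic. The sketch below estimates the ideal transfer time for the 28-130 MB payload range quoted in the How It Works diagram at several link speeds. It ignores protocol overhead and latency, and the function name `transfer_seconds` is only illustrative, not part of AVP.

```python
def transfer_seconds(payload_mb: float, link_gbps: float) -> float:
    """Ideal time to move payload_mb megabytes over a link_gbps link.

    Ignores latency, framing, and congestion, so real transfers are slower.
    """
    bits = payload_mb * 1e6 * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    # 0.1 Gbps stands in for a slow cross-internet link, 100 Gbps for a datacenter fabric.
    for link_gbps in (0.1, 1.0, 10.0, 100.0):
        low = transfer_seconds(28, link_gbps)
        high = transfer_seconds(130, link_gbps)
        print(f"{link_gbps:>6} Gbps: {low:.3f}s to {high:.3f}s per transfer")
```

On a slow link the transfer alone can take seconds per hop, which would likely cancel out the time saved by skipping re-processing.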

Install

Requires: Python 3.9+, PyTorch >= 2.0. For vLLM integration: vLLM >= 0.15.

# Core SDK (codec, handshake, session, fallback)
pip install avp

# With latent communication (realignment, KV-cache, HuggingFace connector)
pip install "avp[latent]"

# With HTTP/2 transport server
pip install "avp[server]"

# Everything including dev tools
pip install "avp[all]"

From source:

git clone https://github.com/VectorArc/avp-python.git
cd avp-python
pip install -e ".[all]"

Quick Start

Layer 0 — JSON messaging (no GPU, no model download):

import avp

msg = avp.pack("Analyze this math problem: 24 * 17 + 3")
text = avp.unpack(msg)  # → "Analyze this math problem: 24 * 17 + 3"

wire = msg.to_bytes()   # send over any transport

Layer 1 — Add model identity (downloads config only, ~1 KB):

msg = avp.pack("Analyze this", model="Qwen/Qwen2.5-7B-Instruct")
# msg.identity contains model_id; with `transformers` installed,
# also includes model family, dimensions, and hash
# Receiving agent can check compatibility before GPU work
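
To illustrate why a small identity record is enough for a pre-flight compatibility check, here is a minimal standalone sketch. `ModelFingerprint`, its fields, and the example values are hypothetical and do not reflect AVP's actual identity format. The idea is only that a stable hash over a few config fields lets two agents compare models without loading weights.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelFingerprint:
    """Hypothetical identity record; field names are illustrative only."""
    model_id: str
    family: str
    hidden_size: int
    num_layers: int
    num_kv_heads: int

    def digest(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example values are placeholders, not verified model configs.
sender = ModelFingerprint("example/model-7b", "family-a", 3584, 28, 4)
receiver = ModelFingerprint("example/model-7b", "family-a", 3584, 28, 4)
other = ModelFingerprint("example/model-3b", "family-b", 3072, 28, 8)

same_model = sender.digest() == receiver.digest()
different_model = sender.digest() != other.digest()
print(same_model, different_model)
```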

Layer 2 — Latent transfer (requires GPU + model):

msg = avp.pack("Analyze this math problem: 24 * 17 + 3",
               model="Qwen/Qwen2.5-1.5B-Instruct", think_steps=20)
answer = avp.unpack(msg, model="Qwen/Qwen2.5-1.5B-Instruct")

Key Results

Metric	Value
Token savings vs text chains	51-78% across 4 benchmarks, 5 models
Speed improvement	1.5-5x faster (model and task dependent)
Cross-model (zero training)	72% GSM8K accuracy, Qwen 7B to Llama 3B, 6 KB wire
Models validated	Qwen2.5 (1.5B, 7B), DeepSeek-R1 (1.5B), Llama 3.2 (1B, 3B)
Hardware	A100 (cloud), RTX 3070 Ti (local)

Same-model: Latent matches direct accuracy on Qwen 7B (85%) and beats text by 10pp.

Cross-model: Zero-training vocabulary projection hits the solver ceiling on structured tasks (math: 72%) but fails on comprehension (HotpotQA: 8%). See Benchmark Results under Documentation for the full results.

How It Works

graph LR
    subgraph text["Text Chain (today)"]
        direction LR
        A1["Agent A<br/>generates text"] -->|"serialize to text<br/>re-tokenize everything"| B1["Agent B<br/>re-processes from scratch"]
    end

    subgraph avp["AVP Latent Transfer"]
        direction LR
        A2["Agent A<br/>generates KV-cache"] -->|"binary transfer<br/>28-130 MB"| B2["Agent B<br/>picks up where A left off"]
    end

    style text fill:#fff3f3,stroke:#d44,stroke-width:2px
    style avp fill:#f3fff3,stroke:#4a4,stroke-width:2px

Every multi-agent framework today — LangChain, CrewAI, AutoGen, OpenAI Swarm — copies text between agents. Each agent re-tokenizes and re-processes everything prior agents already computed. Our benchmarks show 47-53% of all tokens in text chains are redundant re-processing. (See Works With for integration examples.)
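
A toy model shows where the redundancy comes from. It assumes a linear chain in which every agent reads the original prompt plus all earlier agents' outputs and then generates a fixed number of tokens. The token counts are made-up inputs, and the resulting percentage is not meant to reproduce the benchmark figure above.

```python
def text_chain_tokens(prompt_tokens: int, output_tokens: int, num_agents: int) -> int:
    """Total tokens processed when each agent re-reads everything before it."""
    total = 0
    for k in range(num_agents):
        prefill = prompt_tokens + k * output_tokens  # prompt + all earlier outputs
        total += prefill + output_tokens             # plus this agent's own generation
    return total


def unique_tokens(prompt_tokens: int, output_tokens: int, num_agents: int) -> int:
    """Tokens processed if every token were computed exactly once."""
    return prompt_tokens + num_agents * output_tokens


total = text_chain_tokens(prompt_tokens=100, output_tokens=300, num_agents=3)
unique = unique_tokens(prompt_tokens=100, output_tokens=300, num_agents=3)
redundant_fraction = 1 - unique / total
print(f"total={total} unique={unique} redundant={redundant_fraction:.1%}")
```

The redundant share grows with chain length because agent k re-reads k earlier outputs.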

AVP eliminates this by transferring the KV-cache (the computed attention states) directly. The receiving agent reads prior reasoning from attention states instead of re-computing it from text.
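
For a sense of how large the transferred cache is, the standard size formula for a transformer KV-cache can be written out directly. The shape values below are illustrative assumptions for a grouped-query-attention model in fp16. They are not taken from AVP's benchmarks, so check them against the model config you actually use.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Size of a dense KV-cache: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape: 28 layers, 4 KV heads, head_dim 128, 1024 cached tokens.
size = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128, seq_len=1024)
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

The size scales linearly with sequence length, so longer reasoning traces produce proportionally larger payloads.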

AVP defines a binary format, handshake, and codec — not the transport. It works alongside any agent framework or protocol.

┌──────────────────────────────────────────────────────────────┐
│  Your Orchestrator (LangGraph / CrewAI / PydanticAI / any)    │
│                                                              │
│  Agent A                          Agent B                    │
│    │                                ▲                        │
│    │  connector.think() ──►         │  connector.generate()  │
│    │  AVPContext                     │  with context=...      │
│    │                                │                        │
│    │    context.to_bytes()          │  AVPContext.from_bytes()│
│    ▼                                │                        │
│  ┌────────────────────────────────────────────┐              │
│  │  AVP (this library)                        │              │
│  │  • Handshake — resolves LATENT/JSON mode   │              │
│  │  • Codec — serialize/deserialize KV-cache  │              │
│  │  • Session — TTL, thread safety            │              │
│  └────────────────────────────────────────────┘              │
│         │                                                    │
│    Transport: HTTP/2, gRPC, shared memory, file, any         │
└──────────────────────────────────────────────────────────────┘

Three communication modes, auto-negotiated via handshake:

Mode	When	What Happens
Latent	Same model	KV-cache + hidden state transfer, zero re-processing
Cross-model	Different models, same or different family (e.g. Qwen 7B to Llama 3B)	Vocabulary-mediated projection, zero training needed
JSON fallback	No compatible projection path	Standard text, auto-negotiated
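
The decision in the table above can be sketched as a small function. This is a hypothetical simplification for illustration: `Mode`, `negotiate`, and the set of projection pairs are invented names, and the real handshake compares richer model identities than plain strings.

```python
from enum import Enum
from typing import Set, Tuple


class Mode(Enum):
    LATENT = "latent"
    CROSS_MODEL = "cross-model"
    JSON = "json"


def negotiate(local_id: str, remote_id: str,
              projection_pairs: Set[Tuple[str, str]]) -> Mode:
    """Pick the richest mode both sides can support (illustrative rules only)."""
    if local_id == remote_id:
        return Mode.LATENT            # same model: transfer the KV-cache directly
    if (local_id, remote_id) in projection_pairs:
        return Mode.CROSS_MODEL       # a projection path exists between the two
    return Mode.JSON                  # nothing compatible: fall back to text


# Hypothetical projection table with one known pair.
pairs = {("model-a-7b", "model-b-3b")}
print(negotiate("model-a-7b", "model-a-7b", pairs))
print(negotiate("model-a-7b", "model-b-3b", pairs))
print(negotiate("model-a-7b", "model-c-1b", pairs))
```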

Transport-agnostic: HTTP/2 (reference), gRPC, A2A, MCP, WebSockets, shared memory. AVP handles the latent communication layer — not discovery, routing, or orchestration.

Connector API

For full control over model loading, device placement, and context serialization:

High-level API (5 lines):

from avp import HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

prompt = "Analyze this math problem: 24 * 17 + 3"

# Agent A: latent reasoning (no text output, builds KV-cache)
context = connector.think(prompt, steps=20)

# Agent B: generate with Agent A's context (same prompt — KV-cache continues from it)
answer = connector.generate(prompt, context=context)

Cross-process transfer:

# Process A: serialize context
wire_bytes = context.to_bytes(session_id="s1", source_agent_id="agent-a")

# Process B: restore and generate
from avp import AVPContext, HuggingFaceConnector

connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
# Process B does not share Process A's variables, so define the same prompt here
prompt = "Analyze this math problem: 24 * 17 + 3"
restored = AVPContext.from_bytes(wire_bytes, device="cuda")
answer = connector.generate(prompt, context=restored)

Check model compatibility:

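from transformers import AutoModelForCausalLM
from avp import extract_model_identity, CompatibilityResolver

# Pass loaded HuggingFace model objects (not strings).
# The two checkpoints below are example choices.
model_a = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model_b = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

local = extract_model_identity(model_a)
remote = extract_model_identity(model_b)
session = CompatibilityResolver.resolve(local, remote)
# session.mode → LATENT (same model) or JSON (different)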

Production: vLLM Integration

vLLM can't expose per-step hidden states, so latent transfer happens at the engine level via a KV connector plugin — transparent to your application code:

# Launch vLLM with AVP KV connector
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --kv-connector AVPKVConnectorV1Dynamic \
    --kv-connector-module-path avp.connectors.vllm_kv_connector

# Application code stays simple — KV transfer happens behind the scenes
from avp import VLLMConnector

connector = VLLMConnector(model_id="Qwen/Qwen2.5-7B-Instruct")
answer = connector.generate("Analyze and solve: 24 * 17 + 3")

The AVPKVConnectorV1Dynamic plugin saves/loads KV-cache between vLLM instances via a file-based store, so agents on the same machine share computed attention states without re-processing.
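
The idea of a file-based store can be sketched with the standard library alone. The class below is a hypothetical illustration, not the plugin's implementation: it keys opaque cache bytes by a hash of the prompt token IDs and writes atomically so a reader never sees a half-written file.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional, Sequence


class FileKVStore:
    """Hypothetical file-backed store for serialized KV-cache blobs."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, token_ids: Sequence[int]) -> Path:
        # Identical token prefixes map to the same file name.
        key = hashlib.sha256(",".join(map(str, token_ids)).encode("ascii")).hexdigest()
        return self.root / f"{key}.kv"

    def save(self, token_ids: Sequence[int], blob: bytes) -> None:
        # Write to a temp file, then rename: os.replace is atomic on one filesystem.
        fd, tmp_name = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        os.replace(tmp_name, self._path(token_ids))

    def load(self, token_ids: Sequence[int]) -> Optional[bytes]:
        path = self._path(token_ids)
        return path.read_bytes() if path.exists() else None


with tempfile.TemporaryDirectory() as tmp_dir:
    store = FileKVStore(tmp_dir)
    store.save([101, 2023, 2003], b"fake-kv-bytes")
    hit = store.load([101, 2023, 2003])
    miss = store.load([1, 2, 3])
    print(hit, miss)
```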

API Reference

Easy API (start here)

Import	What It Does
`pack(content, *, model=, think_steps=)`	Pack text for transfer. Layer 0: JSON. Layer 1: + model identity. Layer 2: + latent context. Returns `PackedMessage`.
`unpack(data, *, model=)`	Unpack any AVP format to text. With `model=`, generates a response using latent context.
`generate(prompt, *, model=, context=)`	Generate text with optional latent context. Shortcut for connector setup + generation.
`PackedMessage`	Result of `pack()`. `str(msg)` for text, `msg.to_bytes()` for wire format, `.identity` for model info, `.context` for latent data.

Connector API (advanced)

Import	What It Does
`HuggingFaceConnector`	Main connector. `think()` builds KV-cache (returns `AVPContext`), `generate()` produces text. `from_pretrained()` for easy setup.
`VLLMConnector`	Production connector. `generate()` returns text. Latent transfer happens at engine level via KV connector plugin.
`AVPContext`	Wraps KV-cache + model metadata. Pass between `think()` and `generate()`, or serialize with `to_bytes()` / `from_bytes()` for cross-process transfer.
`ContextStore`	In-memory store for sharing `AVPContext` objects between agents by session ID. Thread-safe.
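
As a rough picture of what a thread-safe, session-keyed store with expiry involves, here is a standalone sketch. `TTLContextStore` is a hypothetical stand-in and does not mirror the signatures of `ContextStore` or `SessionManager`. The injectable clock exists only to make expiry easy to test.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class TTLContextStore:
    """Hypothetical in-memory store: session ID -> payload, with expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._items: Dict[str, Tuple[float, bytes]] = {}

    def put(self, session_id: str, payload: bytes) -> None:
        with self._lock:
            self._items[session_id] = (self._clock() + self._ttl, payload)

    def get(self, session_id: str) -> Optional[bytes]:
        with self._lock:
            entry = self._items.get(session_id)
            if entry is None:
                return None
            expires_at, payload = entry
            if self._clock() >= expires_at:
                del self._items[session_id]  # drop expired entries lazily
                return None
            return payload


fake_now = [0.0]
store = TTLContextStore(ttl_seconds=10.0, clock=lambda: fake_now[0])
store.put("s1", b"context-bytes")
fresh = store.get("s1")
fake_now[0] = 11.0  # advance the fake clock past the TTL
expired = store.get("s1")
print(fresh, expired)
```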

Protocol Layer

Import	What It Does
`encode` / `decode`	Binary codec for hidden states, KV-cache, and hybrid payloads.
`extract_model_identity`	Extract `ModelIdentity` (family, dimensions, hash) from a HuggingFace model.
`CompatibilityResolver.resolve()`	Handshake: compares two `ModelIdentity` objects, returns LATENT, HYBRID, or JSON mode.
`SessionManager`	Manage communication sessions with TTL and thread safety.
`AVPClient` / `AVPAsyncClient`	HTTP/2 client (sync and async) for sending AVP messages over the network.
`create_app`	Create a FastAPI server that receives AVP messages.
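
To make the idea of a binary tensor codec concrete, the sketch below packs a small float32 tensor behind a fixed header using the standard `struct` module. The layout (a `DEMO` magic, one byte for rank, 32-bit dimensions, little-endian float32 data) is invented for this example and is not the AVP wire format, which is defined in the AVP Specification.

```python
import struct
from math import prod
from typing import List, Sequence, Tuple

MAGIC = b"DEMO"  # placeholder magic, not AVP's


def encode_tensor(shape: Sequence[int], values: Sequence[float]) -> bytes:
    """Serialize a row-major float32 tensor as header + raw data."""
    if prod(shape) != len(values):
        raise ValueError("shape does not match number of values")
    header = MAGIC + struct.pack("<B", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    body = struct.pack(f"<{len(values)}f", *values)
    return header + body


def decode_tensor(blob: bytes) -> Tuple[Tuple[int, ...], List[float]]:
    """Inverse of encode_tensor; validates the magic before reading."""
    if blob[:4] != MAGIC:
        raise ValueError("bad magic")
    (ndim,) = struct.unpack_from("<B", blob, 4)
    shape = struct.unpack_from(f"<{ndim}I", blob, 5)
    values = struct.unpack_from(f"<{prod(shape)}f", blob, 5 + 4 * ndim)
    return tuple(shape), list(values)


wire = encode_tensor((2, 3), [1.0, 2.5, -3.0, 0.0, 4.0, 8.0])
shape, values = decode_tensor(wire)
print(len(wire), shape, values)
```

A fixed header of this kind lets a receiver validate shape before allocating memory for the tensor data.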

Cross-Model (Rosetta Stone)

Import	What It Does
`calibrate`	Build a projection map (`AVPMap`) between two models for cross-model transfer. Auto-detects same-family (vocab mediated) vs cross-family (vocab overlap).
`vocabulary_mediated_projection`	Project hidden states via shared vocabulary (same tokenizer).
`vocab_overlap_projection`	Project hidden states via overlapping BPE tokens (different tokenizers).
`validate_projection`	Quality gate: cosine similarity (fast) + pseudo-perplexity (thorough). Returns LATENT/HYBRID/JSON recommendation.
`save_map` / `load_map` / `find_map`	Persist and retrieve `.avp-map` files for reuse.
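
The fast half of a quality gate like this can be pictured as a cosine-similarity check followed by a threshold decision. The sketch below is a hypothetical illustration: the thresholds are arbitrary placeholders, the function names are invented, and the pseudo-perplexity check mentioned above is not modeled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(mean_cosine: float,
              latent_threshold: float = 0.9,   # placeholder value
              hybrid_threshold: float = 0.6) -> str:  # placeholder value
    """Map a similarity score to a mode recommendation."""
    if mean_cosine >= latent_threshold:
        return "LATENT"
    if mean_cosine >= hybrid_threshold:
        return "HYBRID"
    return "JSON"


projected = [1.0, 0.0, 1.0]
reference = [1.0, 0.0, 1.0]
score = cosine_similarity(projected, reference)
print(score, recommend(score))
```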

Error Types

All errors inherit from AVPError. Key types: IncompatibleModelsError, HandshakeError, DecodeError, ShapeMismatchError, RealignmentError, SessionExpiredError, EngineNotAvailableError, FallbackRequested.

Roadmap

Compact hidden state mode (same-model, ~60x smaller wire than full KV-cache)
Bidirectional latent communication (A→B + B→A latent, not just one-way)
CacheGen-style compression (3-4x KV-cache wire size reduction)
vLLM serving throughput benchmarks

Works With

Agent Frameworks

AVP works with your orchestration framework, not instead of it. Your framework handles routing, state, and agent lifecycle. AVP handles the communication primitive between agents.

The integration pattern is the same across all six tested frameworks:

# Agent A's output → AVP → Agent B's input
packed = avp.pack(agent_a_output, model="Qwen/Qwen2.5-7B-Instruct")
agent_b_input = str(packed)  # works as plain text in any framework

Framework	Layer 0/1 Friction	Integration Point
PydanticAI	Zero — plain strings	`FunctionModel` callback
LangGraph	Low — wrap in `AIMessage`	Graph node function
CrewAI	Zero — plain strings	`BaseLLM.call()` override
LlamaIndex	Zero — plain strings	`CustomLLM.complete()` override
OpenAI Agents SDK	Low — custom `Model` class	`Model.get_response()` override
Google ADK	Low — async generator	`BaseLlm.generate_content_async()` override

Layer 2 (latent transfer) works in all six but requires a side-channel dict for KV-cache context — no framework natively supports binary tensor data between agents.
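
One way to picture the side-channel: the framework carries only text plus a short string handle, while the binary payload waits in a dict outside the framework. The `SideChannel` class and the `avp_handle` key below are hypothetical. In a real pipeline the stored bytes would presumably come from `context.to_bytes()`.

```python
import uuid
from typing import Dict


class SideChannel:
    """Hypothetical holder for binary payloads the framework cannot carry."""

    def __init__(self) -> None:
        self._payloads: Dict[str, bytes] = {}

    def put(self, payload: bytes) -> str:
        handle = uuid.uuid4().hex  # short string that fits in any text message
        self._payloads[handle] = payload
        return handle

    def take(self, handle: str) -> bytes:
        # pop() so each payload is consumed once and memory is released
        return self._payloads.pop(handle)


channel = SideChannel()

# Agent A: stash the binary context, pass only text and a handle through the framework.
handle = channel.put(b"serialized-context-bytes")
framework_message = {"text": "Analysis complete.", "avp_handle": handle}

# Agent B: recover the binary context using the handle from the message.
restored = channel.take(framework_message["avp_handle"])
print(restored)
```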

Infrastructure & Protocols

vLLM — KVConnectorBase_V1 plugin for production serving
HuggingFace Transformers — Full hidden state and KV-cache access
A2A — Transport binding via multipart/related with binary payloads
MCP — Complementary: MCP handles tools and context, AVP handles tensor transfer

Key Concepts

Term	What It Means
KV-cache	During text generation, each transformer layer computes key and value vectors for the attention mechanism. These are cached so they don't need to be recomputed for each new token. AVP transfers this cache between agents so the receiving agent doesn't recompute what the sender already processed.
Hidden states	The internal vector representations at each transformer layer — the model's "understanding" of the input at that point in the network. Richer than text because they carry information that gets lost when converting to tokens.
Latent transfer	Sending KV-cache or hidden states (the "latent" internal representations) instead of converting to text and back. Avoids the lossy text bottleneck.
Realignment	Normalizing hidden states before injecting them into another model instance, so they match the expected input distribution. Required because hidden state magnitudes can drift.
Tied weights	When a model reuses the same weight matrix for both input embeddings and output projection (common in smaller models like Qwen <=3B, Llama 3.2 <=3B). Requires a special softmax-based projection instead of simple normalization.
Vocabulary-mediated projection	Cross-model transfer technique: convert source hidden states to token probabilities using the source model's output head, then reconstruct target-compatible representations using the target model's input embeddings. Works across families — when tokenizers differ, AVP projects through overlapping vocabulary tokens (~85% overlap for Qwen/Llama).
PagedAttention	vLLM's memory management for KV-cache — stores cache in non-contiguous pages. AVP's `page_convert` module handles conversion between paged and contiguous formats.
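
The vocabulary-mediated projection described in the table can be sketched in a few lines of plain Python. This toy version assumes both models share the same vocabulary in the same order and uses tiny made-up matrices. It follows the two steps named above (source output head to token probabilities, then a probability-weighted mix of target input embeddings) and is not AVP's implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(h_src: Sequence[float],
            src_output_head: Sequence[Sequence[float]],       # vocab x d_src
            tgt_input_embeddings: Sequence[Sequence[float]],  # vocab x d_tgt
            ) -> List[float]:
    """Map a source hidden state into the target model's embedding space."""
    # Step 1: source hidden state -> token probabilities via the output head.
    logits = [sum(w * h for w, h in zip(row, h_src)) for row in src_output_head]
    probs = softmax(logits)
    # Step 2: probabilities -> weighted average of target input embeddings.
    d_tgt = len(tgt_input_embeddings[0])
    return [sum(p * row[j] for p, row in zip(probs, tgt_input_embeddings))
            for j in range(d_tgt)]


# Toy setup: 3-token vocabulary, d_src = 2, d_tgt = 3.
head = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

uniform = project([0.0, 0.0], head, embeddings)  # all logits equal
peaked = project([50.0, 0.0], head, embeddings)  # token 0 dominates
print(uniform, peaked)
```

Because the result is a mixture of token embeddings, anything the source state encodes beyond its next-token distribution is lost in this toy version.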

Documentation

AVP Specification — Binary format, handshake, transport, security, test vectors
Benchmark Results — Full results: 4 benchmarks, 5 models, same-model + cross-model
Examples — Quickstart, agent demo, mixed-model demo, pack/unpack
Contributing — Dev setup, tests, code style

Research Foundation

AVP builds on LatentMAS: Latent Collaboration in Multi-Agent Systems (Gen-Verse, 2025), which demonstrated same-model latent communication via hidden state transfer and KV-cache sharing. AVP productionizes this into a transport-agnostic binary protocol with cross-model support, compression, and engine connectors.

License

Apache 2.0 — see LICENSE

Release history

Version	Released
0.6.1	Apr 5, 2026
0.6.0	Apr 4, 2026
0.5.1	Apr 3, 2026
0.5.0	Apr 3, 2026
0.4.2	Mar 30, 2026
0.4.1	Mar 26, 2026
0.4.0	Mar 23, 2026
0.3.2	Mar 13, 2026
0.3.1	Mar 8, 2026
0.3.0	Mar 7, 2026
0.2.3 (this version)	Mar 2, 2026
0.2.2	Feb 28, 2026
0.2.1	Feb 25, 2026
0.2.0	Feb 25, 2026

Download files

Source Distribution

avp-0.2.3.tar.gz (205.5 kB)

Uploaded Mar 2, 2026 Source

Built Distribution

avp-0.2.3-py3-none-any.whl (78.8 kB)

Uploaded Mar 2, 2026 Python 3

File details

Details for the file avp-0.2.3.tar.gz.

File metadata

Download URL: avp-0.2.3.tar.gz
Upload date: Mar 2, 2026
Size: 205.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for avp-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`1d1e4b487b79f441d0aba83105ec3b13e76f2099dd7649f0d0650e751bb78ea5`
MD5	`26bc5791644ca887f66433ac1d002fe9`
BLAKE2b-256	`981f36cb55c7a19cc2e311339a2045eb8fefb3ce020cf4e2400f0a13b4c8d3c5`

Provenance

The following attestation bundles were made for avp-0.2.3.tar.gz:

Publisher: publish.yml on VectorArc/avp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: avp-0.2.3.tar.gz
- Subject digest: 1d1e4b487b79f441d0aba83105ec3b13e76f2099dd7649f0d0650e751bb78ea5
- Sigstore transparency entry: 1008303687
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: VectorArc/avp-python@ac1c6ede38cd6700989c6aeae36acfc34c9a9620
- Branch / Tag: refs/tags/v0.2.3
- Owner: https://github.com/VectorArc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ac1c6ede38cd6700989c6aeae36acfc34c9a9620
- Trigger Event: push

File details

Details for the file avp-0.2.3-py3-none-any.whl.

File metadata

Download URL: avp-0.2.3-py3-none-any.whl
Upload date: Mar 2, 2026
Size: 78.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for avp-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d5b99b4d2f2e826e523cf96e41454f8c38f3a74c656bde6bc1474dcd44142f5`
MD5	`c9b3f6053ef4680de6e9d095a6ebe9cd`
BLAKE2b-256	`ce1f56cd390d29874db2abd9a832c2ba4de5097379244c50bcba3c44c1af310f`

Provenance

The following attestation bundles were made for avp-0.2.3-py3-none-any.whl:

Publisher: publish.yml on VectorArc/avp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: avp-0.2.3-py3-none-any.whl
- Subject digest: 9d5b99b4d2f2e826e523cf96e41454f8c38f3a74c656bde6bc1474dcd44142f5
- Sigstore transparency entry: 1008303688
- Sigstore integration time: Mar 2, 2026
Source repository:
- Permalink: VectorArc/avp-python@ac1c6ede38cd6700989c6aeae36acfc34c9a9620
- Branch / Tag: refs/tags/v0.2.3
- Owner: https://github.com/VectorArc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ac1c6ede38cd6700989c6aeae36acfc34c9a9620
- Trigger Event: push
