Anticipatory context allocation for long-horizon black-box LLM agents

These details have not been verified by PyPI

Project links

Project description

Foveance

Cut your LLM token bill by 60%+ — without changing your code or your answers.

What is this?

When you chat with an AI agent for a while, the conversation history keeps piling up. You pay for every old message on every new turn, and past a point the model actually gets worse because the important facts are buried under clutter.

Foveance fixes that automatically. It keeps the parts of the history that still matter, trims the parts that don't, and hands the model a shorter context — so you get the same answers for a fraction of the tokens. Nothing is deleted forever, and you don't change a single line of your app.

In real tests it kept full accuracy while using 60–64% fewer tokens, and it correctly recalled a buried fact that the full, uncompressed history got wrong.

Get started in 30 seconds

Option A — you use a coding agent (Claude Code, Codex, aider, …)

One command. It runs your tool exactly as before, just cheaper, and prints how much you saved:

pip install foveance
foveance wrap claude          # or:  foveance wrap -- codex "fix the tests"

foveance wrap demo — 3,590 to 1,677 input tokens, -53%, same answer

That's the whole thing. Your API key is untouched, nothing is stored, and a live "tokens saved ≈ $" dashboard runs at http://localhost:8799/ while you work.

Option B — you write Python

One install, one function. No server, no config, nothing to run:

pip install foveance

from foveance import shrink

smaller = shrink(messages, budget=2000)   # messages = your OpenAI-style list
# ...now send `smaller` to your model instead of `messages`. Same answers, fewer tokens.

shrink keeps your system prompt and your latest message exactly as-is and intelligently compresses the older turns. That's all you need to start.

Option C — try it right now, no API key, no GPU

pip install foveance
foveance demo

Prints a side-by-side table showing the token savings on a built-in example.

Does it actually work? (real numbers, nothing invented)

Measured on Gemma 2 (2B), Llama 3.2 (1B), and Qwen 2.5 (1.5B) via Ollama, 5 seeds each. At a tight token budget, Foveance matched the full, uncompressed accuracy using ~⅓ of the tokens, while the naive shortcuts (keep-recent, truncate, spread-evenly) failed:

with vs without Foveance: same accuracy, 64% fewer tokens

Model	full (no compression)	keep-recent	spread-evenly	Foveance
gemma2:2b	1.00 (10.2k tok)	0.67	0.00	1.00 (3.7k tok)
llama3.2:1b	1.00 (8.3k tok)	0.67	1.00	1.00 (3.1k tok)
qwen2.5:1.5b	1.00 (9.9k tok)	0.67	0.00	1.00 (3.6k tok)

Accuracy is "did it recall the buried fact." Foveance holds 1.00 at ~⅓ the tokens on every model; the shortcuts drop the fact. Every number traces to a CSV in bench/results/ — nothing is hand-entered.

_{Full benchmark, head-to-head vs LLMLingua-2, and the theory are further down and in
bench/report.md / docs/.}

Install options (click to expand)

pip install foveance          # everything you normally need: shrink(), foveance wrap, the proxy, and the demo
pip install "foveance[all]"   # the above plus the ML embedder and benchmark tooling (numpy, torch, matplotlib, …)

The allocator/predictor core imports no heavy libraries; the base install adds only the small web-server packages that power foveance wrap and the proxy.

Under the hood (the technical part)

Everything above is all most people need. The rest of this document is for people who want the proxy details, the full benchmark, and the theory.

The drop-in proxy — cut tokens for any tool, zero code changes

foveance wrap <tool> is a convenience wrapper around a small reverse proxy you can also run yourself. It speaks the OpenAI Chat Completions, OpenAI Responses, and Anthropic Messages wire protocols, streams, and forwards your credentials untouched. It keeps a per-conversation multi-fidelity store and spends a token budget on the context most likely to matter next, before forwarding upstream.

foveance proxy --upstream https://api.openai.com/v1      # OpenAI
foveance proxy --upstream https://api.anthropic.com/v1   # Anthropic / Claude
foveance proxy --upstream http://localhost:11434/v1      # Ollama (local), vLLM, TGI, LM Studio

# then point any client at it with one variable (your API key still goes straight upstream):
export OPENAI_BASE_URL=http://localhost:8799/v1          # OpenAI SDK, Codex, Ollama-backed apps
export ANTHROPIC_BASE_URL=http://localhost:8799          # Anthropic SDK, Claude Code

Works with anything that lets you set its base URL — the OpenAI and Anthropic SDKs, Claude Code, Codex (with an API key), aider, Continue, Cursor, LangChain, LiteLLM, and local runtimes like Ollama / vLLM / LM Studio. Foveance is auth-free: it adds no login of its own and stores no key. The only thing it can't intercept is a client that cryptographically hard-pins its endpoint (e.g. ChatGPT-subscription Codex); give such a tool an API key and it works like everything else.

It listens on http://localhost:8799 and exposes POST /v1/chat/completions, POST /v1/messages, POST /v1/responses, GET /v1/models, GET /health, GET /admin/stats (JSON), and a live dashboard at GET / (tokens saved and ≈$ at --price-per-mtok). "stream": true is passed through verbatim. Plain chat is compressed by the anticipatory allocator; tool-using (agentic) requests are compressed structurally in place, preserving every message and tool-call pairing.

Prompt-cache aware: blocks carrying an Anthropic cache_control breakpoint are never modified, and with --cache-aware the proxy never touches anything at or before the last breakpoint — so it never invalidates the provider's prompt cache. See docs/limitations.md for the cost arithmetic.

foveance works with Claude Code, Codex, Ollama, and any OpenAI/Anthropic-compatible tool

Client / agent	How to route it through Foveance
OpenAI SDK (Python/JS)	`base_url="http://localhost:8799/v1"` (or `OPENAI_BASE_URL`)
Anthropic SDK / Claude Code	`ANTHROPIC_BASE_URL=http://localhost:8799`
Ollama	`foveance proxy --upstream http://localhost:11434/v1`; point your app at `:8799/v1`
OpenAI Codex CLI	API-key custom provider in `~/.codex/config.toml`: `base_url="http://localhost:8799/v1"`, `wire_api="responses"` (subscription Codex can't be proxied — use an API key)
Cursor / Continue / Antigravity	set the custom OpenAI base URL to `http://localhost:8799/v1`
aider / opencode / Crush	set the OpenAI-compatible base URL to `http://localhost:8799/v1`
LangChain / LlamaIndex / LiteLLM	pass `base_url=`/`api_base="http://localhost:8799/v1"`
Node / npm tools	`npx foveance-proxy --upstream https://api.openai.com/v1`

Measured real-world results

Setting	Tokens	Outcome
`foveance wrap` (live) — llama3.2:1b via Ollama, buried-fact recall	2,127 → 186 est. tokens (−91%)	fact recalled correctly through the compressed context
Local model — llama3.2:1b, long chat with a buried fact	3,590 → 1,677 tokens (−53%)	Foveance correct; full replay hallucinated the value
Claude Code (live, Anthropic OAuth) — agentic in-place compression	~71% fewer tokens on an 8-tool-call transcript	works end-to-end, tool pairing preserved
Benchmark — Gemma/Llama/Qwen, 5 seeds	62–64% fewer at iso-accuracy	matches full-replay accuracy

Head-to-head vs other methods (real model + real LLMLingua-2)

A long trajectory hides one load-bearing fact early amid filler; each method compresses to a budget, then the real model (llama3.2:1b) is asked to recall it. Only the query-aware allocators recall it at every budget, at 5–10× fewer tokens than full replay:

recall @ budget	full	truncate	spread-evenly	LLMLingua-2	reactive (AFM)	Foveance
200 (tight)	1.00	0.00	0.00	0.00	1.00	1.00
300	1.00	1.00	1.00	0.00	1.00	1.00
500	1.00	0.00	1.00	0.33	1.00	1.00

baseline comparison

The same ordering holds in the full multi-turn agent loop, ruling out a one-shot artifact:

full agent-loop comparison

Reproduce: python bench/compare_baselines.py --with-llmlingua && python bench/plot_baselines.py. LLMLingua-2 is a real run via the llmlingua package (CPU).

Library usage (beyond `shrink`)

from foveance import Controller, Item
from foveance.llm import MockLLM   # or OllamaLLM("gemma2:9b"), OpenAICompatLLM(...)

ctrl = Controller(MockLLM(), budget=2000, policy="foveance", drift=0.7)
ctrl.add_item(Item("obs0", "tool_output", "FACT api_key=sk-123\n...lots of logs...", created_turn=0))
rec = ctrl.step("recall api_key", turn=0)
print(rec.answer, rec.input_tokens, rec.peak_tokens)

Swap policy="reactive_afm" (the AFM baseline), "recency", "full", or "oracle" to compare.

The public API at a glance (from foveance import ...):

Name	What it is
`shrink(messages, budget=2000)`	the one-liner — compress a messages list, no setup
`Controller`, `Item`	the full stepping loop (add items, `step(query, turn)`)
`index_allocate`, `dp_allocate`, `lp_bound`	the index policy, exact DP optimum, and LP bound (`index ≤ OPT ≤ LP`)
`AnticipatoryPredictor`, `PredictorConfig`	the anticipatory future-relevance scorer (`drift` knob)
`MultiFidelityStore`, `Fidelity`	the reversible multi-fidelity store
`HashingEmbedder`, `cosine`	the offline embedder + similarity
`baselines`, `metrics`	policy arms (`full`/`recency`/`reactive_afm`/`oracle`/…) and scoring helpers
`foveance.proxy.FoveanceProxy`	the proxy core, if you want to embed it

Honest positioning

As of mid-2026 this space is crowded. Per-message multi-fidelity tiering under a token budget already exists — see AFM (Cruz 2025), ContextBudget, ACON, MemAct. That mechanism is substrate, not the contribution here. Foveance ships a faithful AFM-style reactive policy as a first-class baseline — it is literally the drift = 0 special case of the predictor. The defensible novelty is narrow and specific:

an anticipatory allocation criterion (expected future relevance) — the reactive AFM-style criterion is the drift = 0 special case;
a fundamental-limits theory for the black-box, multi-turn, task-success setting;
a near-optimal index policy with a measured greedy gap, plus a theorem for when anticipation beats the reactive heuristics everyone ships;
successive-refinability conditions making reversible re-inflation "free";
an open benchmark placing all methods on one accuracy–token frontier vs the bound.

The deployable index allocator stays within ~1.8% of the exact DP optimum and below the LP bound (index ≤ OPT ≤ LP). Full claim boundaries and the prior-art table are in docs/NOVELTY.md.

What's in the package

src/foveance/   store.py · predictor.py (anticipatory future-relevance) · allocator.py
              (index + exact DP + LP bound) · controller.py · compressors.py · embedders.py ·
              baselines.py · metrics.py · learned.py · proxy.py · cli.py · llm.py
tests/        store/predictor/allocator/controller (100% covered) + integration
bench/        run_bench.py · analyze.py · plots.py · report.md · results/ (real CSVs)
docs/         architecture.md · theory.md · baselines.md · limitations.md · NOVELTY.md

Reproduce the benchmark

bash scripts/run_everything.sh       # real models via Ollama (installs + pulls + runs + plots)
bash scripts/run_offline_demo.sh     # no GPU: identical chain with a deterministic mock model

Outputs land in bench/report.md, bench/results/, and bench/plots/. No number is hand-entered; every figure traces to a CSV.

License

Apache-2.0. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.2

Jul 4, 2026

0.1.1

Jul 3, 2026

0.1.0

Jul 3, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

foveance-0.1.2.tar.gz (1.3 MB view details)

Uploaded Jul 4, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

foveance-0.1.2-py3-none-any.whl (46.9 kB view details)

Uploaded Jul 4, 2026 Python 3

File details

Details for the file foveance-0.1.2.tar.gz.

File metadata

Download URL: foveance-0.1.2.tar.gz
Upload date: Jul 4, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for foveance-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`06d5fc5315bc9a14cd625f1995179b15627218f04dc244603c12ff0418b7f3c1`
MD5	`b470d03edf006180435fc795eec2ab88`
BLAKE2b-256	`12789e6e3f50f2ef5f8e6babf30ac1e2921e64db77ee244ab7b099c516afe73d`

See more details on using hashes here.

File details

Details for the file foveance-0.1.2-py3-none-any.whl.

File metadata

Download URL: foveance-0.1.2-py3-none-any.whl
Upload date: Jul 4, 2026
Size: 46.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for foveance-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fa8b545597ed5ff3b5bb224638093227f637dc0cc7df08a5fdbb153147219a5a`
MD5	`da2338bb7af4aa4a7ceac95f6ce88496`
BLAKE2b-256	`3972941030e5ee5c050f912112fff6c7628553e910901b10300ebcd8f94b8473`

See more details on using hashes here.

foveance 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

What is this?

Get started in 30 seconds

Option A — you use a coding agent (Claude Code, Codex, aider, …)

Option B — you write Python

Option C — try it right now, no API key, no GPU

Does it actually work? (real numbers, nothing invented)

Under the hood (the technical part)

The drop-in proxy — cut tokens for any tool, zero code changes

Measured real-world results

Head-to-head vs other methods (real model + real LLMLingua-2)

Library usage (beyond shrink)

Honest positioning

What's in the package

Reproduce the benchmark

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Library usage (beyond `shrink`)