foveance

Anticipatory context allocation for long-horizon black-box LLM agents

These details have not been verified by PyPI

Project links

Project description

Foveance

Anticipatory context allocation for long-horizon LLM agents.
Spend a fixed token budget where the future of the trajectory will look, not just where the current query points — black-box (prompt/proxy layer, no model internals, no fine-tuning), reversible (nothing is destroyed; anything can be re-inflated), and theory-backed.

Coverage

Core is dependency-free and offline-capable. Real-model results on Gemma, Qwen, and Llama are in bench/report.md; the theory is summarized in docs/theory.md. Positioning against prior art is stated honestly in Honest positioning below.

One command, any tool: `foveance wrap`

pip install "foveance[proxy]"
foveance wrap claude          # or:  foveance wrap -- codex "fix the tests"

That's it. wrap starts the proxy, routes the tool through it (env is set for the child process only), runs it exactly as before — your API key / OAuth untouched — and prints a tokens-saved summary when you exit. A live dashboard with a running "tokens saved ≈ $" figure serves at http://localhost:8799/ while it runs.

foveance wrap demo (real measured session: 3,590 -> 1,677 input tokens, -53%)

_{The session shown replays the real measured Ollama demo below (3,590 → 1,677 input tokens,
−53%, correct answer where full replay hallucinated) — no invented numbers.}

Or run the proxy standalone

foveance proxy --upstream https://api.openai.com/v1    # now listening on http://localhost:8799

# point any OpenAI- or Anthropic-compatible client at it (one variable) and keep your API key:
export OPENAI_BASE_URL=http://localhost:8799/v1        # OpenAI SDK, Codex, Ollama-backed apps, ...
export ANTHROPIC_BASE_URL=http://localhost:8799        # Anthropic SDK, Claude Code

Other install flavors

pip install "foveance"             # core only: dependency-free, offline (library + `foveance demo`)
pip install "foveance[all]"        # proxy + ML embedder + benchmark, everything in one shot
pip install -e ".[dev]"            # from a clone: core + test tooling
pip install -e ".[dev,bench]"      # + numpy/matplotlib/tiktoken for the benchmark

60-second quickstart

make test            # pytest, ≥90% coverage on core modules
foveance demo          # offline Pareto table (MockLLM, no GPU/network)
make bench           # offline benchmark + analysis + plots

foveance demo prints accuracy/tokens per arm across budgets: budgeted policies (reactive_afm, foveance, oracle) match full-replay accuracy at a fraction of its tokens and dominate naive recency. As the theory predicts, reactive_afm ≈ foveance on the easy per-turn-recompute regime; the anticipation gap appears under harder conditions (unnamed targets, fidelity-change cost, no-retrieve) — see the drift sweep in bench/report.md.

Results (real models)

Run on Gemma 2 (2B), Llama 3.2 (1B), and Qwen 2.5 (1.5B) via Ollama, 5 seeds each. At a binding token budget, budgeted policies reach full-replay accuracy at 62–64% fewer tokens, while naive recency truncation collapses to 0.67. Every number traces to bench/results/ (no hand-entry); see bench/report.md.

with vs without Foveance: same accuracy, 64% fewer tokens (real numbers from bench/report.md)

Full 6-arm comparison at the tight budget (400), accuracy across all three models — the naive baselines fail while foveance matches full replay at ~⅓ the tokens:

Model	full	recency	uniform	reactive (AFM)	foveance
gemma2:2b	1.00 (10.2k)	0.67	0.00	1.00	1.00 (3.7k)
llama3.2:1b	1.00 (8.3k)	0.67	1.00	1.00	1.00 (3.1k)
qwen2.5:1.5b	1.00 (9.9k)	0.67	0.00	1.00	1.00 (3.6k)

recency plateaus at 0.67 everywhere; truncate collapses on all models; uniform fails on 2 of 3 (on gemma it spends more tokens than foveance yet still fails, because it spreads fidelity instead of concentrating it on the load-bearing item). Foveance reaches full accuracy at 62–64% fewer tokens than full replay. Every number traces to bench/results/ (no hand-entry).

The deployable index allocator stays within ~1.8% of the exact DP optimum and below the LP bound (index ≤ OPT ≤ LP), and the drift-twin audit confirms reactive_afm and foveance differ only in predictor drift. As the locality-gap theorem predicts, reactive_afm ≈ foveance on the named-target, low-drift headline; the drift sweep (--ablations) shows where anticipation pulls ahead. Reproduce with bash scripts/run_everything.sh (or run_offline_demo.sh with no GPU).

Library usage

from foveance import Controller, Item
from foveance.llm import MockLLM   # or OllamaLLM("gemma2:9b"), OpenAICompatLLM(...)

ctrl = Controller(MockLLM(), budget=2000, policy="foveance", drift=0.7)
ctrl.add_item(Item("obs0", "tool_output", "FACT api_key=sk-123\n...lots of logs...", created_turn=0))
rec = ctrl.step("recall api_key", turn=0)
print(rec.answer, rec.input_tokens, rec.peak_tokens)

Swap policy="reactive_afm" (the AFM baseline), "recency", "full", or "oracle" to compare.

Cut your token usage everywhere (drop-in proxy, zero client changes)

Foveance runs as a transparent reverse proxy that speaks the OpenAI Chat Completions, OpenAI Responses, and Anthropic Messages wire protocols, streams, and forwards your credentials untouched. It keeps a per-conversation multi-fidelity store and spends a token budget on the context most likely to matter next, before forwarding upstream.

Works with anything that lets you point its base URL at the proxy — which is essentially every client and agent: the OpenAI and Anthropic SDKs, Claude Code, Codex (with an API key), aider, Continue, Cursor, LangChain, LiteLLM, and local runtimes like Ollama / vLLM / LM Studio. Foveance is auth-free: it adds no login of its own and stores no key — you keep your existing credentials, and local models need none at all. The only thing it cannot intercept is a client that cryptographically hard-pins its endpoint (e.g. ChatGPT-subscription Codex, which refuses any redirect); that is the tool's design, and no proxy can route around it. Give such a tool an API key instead and it works like everything else.

Real-world results (measured, not hypothetical)

Setting	Tokens	Accuracy / validity
`foveance wrap` (live) — llama3.2:1b via Ollama, buried-fact recall at budget 500	2,127 → 186 est. input tokens (−91%)	Fact recalled correctly through the compressed context; full wrap→proxy→model→summary path
Local model — llama3.2:1b via Ollama, long chat with a buried fact	3,590 → 1,677 input tokens (−53%)	Foveance answered correctly; full-replay hallucinated the value (context dilution)
Claude Code (live, Anthropic OAuth) — agentic, in-place compression	~71% fewer input tokens on an 8-tool-call transcript*	Works end-to-end, correct answer, no errors; tool_use/tool_result pairing preserved
Benchmark — Gemma 2 2B / Llama 3.2 1B / Qwen 2.5 1.5B, 5 seeds	62–64% fewer at iso-accuracy	matches full-replay accuracy; recency truncation dominated

_{*Agentic token reduction is measured on a controlled transcript where token accounting is exact; the live Claude Code run confirms validity and correctness. Numbers reproduce via scripts/ and the proxy; see bench/report.md and docs/usage.md.}

Head-to-head vs other methods (real model + real LLMLingua-2, single-shot recall)

A long trajectory hides one load-bearing fact early amid filler; each method compresses to a token budget, then the real model (llama3.2:1b) is asked to recall it. Only the query-aware budgeted allocators (reactive_afm, foveance) recall the fact at every budget, at 5–10× fewer tokens than full replay. Every relevance-blind method drops it under pressure: recency fails at every budget, and real LLMLingua-2 (query-agnostic token deletion) recalls at most 1/3 even at the loosest budget while spending the same tokens as foveance. Reproduce: python bench/compare_baselines.py --with-llmlingua && python bench/plot_baselines.py.

recall accuracy @ budget	full	recency	truncate	uniform	LLMLingua-2	reactive (AFM)	foveance
200 (tight)	1.00	0.00	0.00	0.00	0.00	1.00	1.00
300	1.00	0.00	1.00	1.00	0.00	1.00	1.00
500	1.00	0.00	0.00	1.00	0.33	1.00	1.00
input tokens (≈)	3782	1493	270–880	270–650	280–570	350–790	350–790

baseline comparison

The same ordering holds in the full multi-turn agent loop (recall queries throughout), which rules out any "one-shot artifact" objection: recency plateaus at 0.67, truncate collapses to 0.0–0.40 and at the loosest budget spends more tokens than full replay, while foveance matches full accuracy at ~⅓ the tokens.

full agent-loop comparison

_{Honest reading: only allocation by relevance to the query holds accuracy where the
relevance-blind methods (recency, truncation, uniform, and LLMLingua-2's query-agnostic deletion)
collapse, and it never underperforms full replay while using a fraction of its tokens. LLMLingua-2 is
a real run via the llmlingua package (CPU). The single-shot study includes LLMLingua-2 across all
budgets; the full-trajectory figure is llama3.2:1b (5 seeds) — extending the trajectory sweep with
these arms to the other models is a GPU/overnight job on CPU (qwen ≈ 57 s per large call).}

pip install -e ".[proxy]"
# Point it at whatever you already use as the upstream:
foveance proxy --upstream https://api.openai.com/v1      # OpenAI
foveance proxy --upstream https://api.anthropic.com/v1   # Anthropic / Claude
foveance proxy --upstream http://localhost:11434/v1      # Ollama (local), vLLM, TGI, LM Studio
# Config via flags or env: FOVEANCE_UPSTREAM, FOVEANCE_BUDGET, FOVEANCE_DRIFT, FOVEANCE_POLICY

It listens on http://localhost:8799 and speaks three wire protocols: POST /v1/chat/completions (OpenAI Chat), POST /v1/messages (Anthropic Messages), and POST /v1/responses (OpenAI Responses, used by Codex and the Agents SDK), plus GET /v1/models, GET /health, GET /admin/stats (JSON), and a live dashboard at GET / showing tokens saved and the ≈$ equivalent (--price-per-mtok sets the assumed input price). "stream": true is passed through verbatim. Plain chat is compressed by the anticipatory allocator; tool-using (agentic) requests are compressed structurally in place, preserving every message and tool-call pairing.

Prompt-cache aware: blocks carrying an Anthropic cache_control breakpoint are never modified, and with --cache-aware the proxy never touches anything at or before the last breakpoint — so it never invalidates the provider's prompt cache. See docs/limitations.md for the exact cost arithmetic of when to enable it.

Point your tool at it (one variable)

foveance works with Claude Code, Codex, Ollama, and any OpenAI/Anthropic-compatible SDK or tool

Client / agent	How to route it through Foveance
OpenAI SDK (Python/JS)	`base_url="http://localhost:8799/v1"` (or `OPENAI_BASE_URL`)
Anthropic SDK / Claude Code	`ANTHROPIC_BASE_URL=http://localhost:8799`
Ollama	run `foveance proxy --upstream http://localhost:11434/v1`; point your app at `:8799/v1`
OpenAI Codex CLI	define an API-key custom provider in `~/.codex/config.toml` with `base_url="http://localhost:8799/v1"`, `wire_api="responses"` (the proxy speaks Responses). Note: ChatGPT-subscription Codex cannot be proxied — its built-in `openai` provider is locked and the OAuth is pinned to the ChatGPT backend; use an API key.
Google Antigravity / Cursor / Continue	set the custom OpenAI base URL to `http://localhost:8799/v1`
opencode / Crush / aider	set the OpenAI-compatible base URL / `--openai-api-base` to `http://localhost:8799/v1`
LangChain / LlamaIndex / LiteLLM	pass `base_url=`/`api_base="http://localhost:8799/v1"`
curl / anything else	`POST http://localhost:8799/v1/chat/completions`

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8799/v1", api_key="sk-...")  # nothing else changes

# Claude Code: route the whole agent through Foveance
ANTHROPIC_BASE_URL=http://localhost:8799 claude

# Node / npm tools: launch the proxy without a manual Python step
npx foveance-proxy --upstream https://api.openai.com/v1     # see npm/ (wraps the Python proxy)

Every client header (Authorization/x-api-key of any kind, anthropic-beta feature flags, and tool-specific headers) is forwarded upstream unchanged and never stored, so OAuth-authenticated agents connect through it without extra config. Plain multi-turn chat is compressed by the anticipatory allocator (where it saves the most). Agentic requests that declare a tools array (Claude Code, Codex, and similar) are compressed structurally: every message, role, and tool_use/tool_result pair is kept intact and recent turns are protected, while large stale tool outputs in older turns are digested in place — so the request stays valid for the provider and the agent keeps working. Verified live with Claude Code (71% fewer input tokens on an 8-tool-call transcript offline; correct end-to-end answers live). See docs/usage.md and docs/limitations.md. The core (foveance.proxy.FoveanceProxy) is pure and unit-tested against a real HTTP upstream, covering OpenAI, Anthropic, streaming, and /v1/models. Full guide with copy-paste recipes per tool: docs/usage.md.

Honest positioning

As of mid-2026 this space is crowded. Per-message multi-fidelity tiering under a token budget already exists — see AFM (Cruz 2025), ContextBudget, ACON, MemAct. That mechanism is substrate, not our contribution. We ship a faithful AFM-style reactive policy as a first-class baseline — it is literally the drift = 0 special case of our predictor. Foveance's defensible novelty is narrow and specific:

an anticipatory allocation criterion (expected future relevance via a posterior over upcoming needs) — the reactive AFM-style criterion is the drift = 0 special case;
a fundamental-limits theory for the black-box, multi-turn, task-success setting (a predictive trajectory rate–distortion function);
a near-optimal index policy with a measured greedy gap, plus a theorem for when anticipation beats the reactive heuristics everyone ships;
successive-refinability conditions making reversible re-inflation "free";
an open benchmark placing all methods on one Pareto frontier vs the theoretical bound.

The full claim boundaries and prior-art table are in docs/NOVELTY.md.

What's in the package

src/foveance/   store.py · predictor.py (anticipatory future-relevance) · allocator.py
              (index + exact DP + LP bound) · controller.py (policy seam + retrieve tool) ·
              compressors.py (heuristic + LLM renderers) · embedders.py (hashing/ST/API) ·
              baselines.py (full/recency/reactive_afm/oracle/llmlingua2) · metrics.py ·
              learned.py (logistic future-relevance) · proxy.py · cli.py · llm.py
tests/        store/predictor/allocator/controller (100% covered) + integration + modules
bench/        tasks.py (Suite interface + synthetic + LongBench/RULER/AppWorld/OfficeBench
              adapters) · run_bench.py · analyze.py · plots.py · report.md · results/ (real CSVs)
docs/         architecture.md · theory.md · baselines.md · limitations.md · NOVELTY.md

Run the real comparison (Gemma + others, on your hardware)

One command — installs Ollama, pulls the models, runs the budget sweep + greedy-gap + ablations, does the full analysis (bootstrap CIs, paired Wilcoxon, iso-accuracy token savings, Pareto AUC), and makes the plots:

bash scripts/run_everything.sh
MODELS="gemma2:9b,qwen2.5:7b,llama3.1:8b" BUDGETS="600,1200,2400,4800" TASKS=8 \
    bash scripts/run_everything.sh

Outputs: bench/report.md, bench/results/{headline.json,summary.csv,...}, and bench/plots/*.png.

No GPU? Prove the whole chain offline

bash scripts/run_offline_demo.sh   # mock model -> benchmark -> analysis -> plots

Identical pipeline with a deterministic mock model, clearly marked illustrative until you run real models. No number in the repo is hand-entered — every figure traces to a CSV in bench/results/.

Reproduce a single result

python bench/run_bench.py --backend mock --models mock --suite synthetic \
    --budgets 600,1200,1600,2500,4000 --tasks 6 --turns 40 --drift 0.7 \
    --name-target false --fidelity-cost true --greedy-gap --ablations
python bench/analyze.py && python bench/plots.py

Integrity

Do not fabricate benchmark numbers — fill tables from your actual runs. See docs/NOVELTY.md for the claim boundaries, and docs/limitations.md for the honest failure modes.

License

Apache-2.0. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.2

Jul 4, 2026

0.1.1

Jul 3, 2026

This version

0.1.0

Jul 3, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

foveance-0.1.0.tar.gz (1.3 MB view details)

Uploaded Jul 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

foveance-0.1.0-py3-none-any.whl (48.4 kB view details)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file foveance-0.1.0.tar.gz.

File metadata

Download URL: foveance-0.1.0.tar.gz
Upload date: Jul 3, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for foveance-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2467c4005bdbeb2e842ebdacd8633eafff0f0cbe64f1fac80151a10738e06013`
MD5	`f2323ed56bf58a4bc8f33831ee2b2472`
BLAKE2b-256	`24ee7e5329a23b0eb65c458059c34990193470c42dbeabbfc6ceb9cdc33b923c`

See more details on using hashes here.

File details

Details for the file foveance-0.1.0-py3-none-any.whl.

File metadata

Download URL: foveance-0.1.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 48.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for foveance-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9125e772c236cdcf4e1d904b5766139201e3b59047a4621b174789e66b315910`
MD5	`bf91fe2964c871c88d46c367048a9134`
BLAKE2b-256	`5ccd2ca9f9d1a57449bcdee7ab2f1a7865a7c36922b76c06e7d4a4d7783dd49b`

See more details on using hashes here.

foveance 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

One command, any tool: foveance wrap

Or run the proxy standalone

Other install flavors

60-second quickstart

Results (real models)

Library usage

Cut your token usage everywhere (drop-in proxy, zero client changes)

Real-world results (measured, not hypothetical)

Head-to-head vs other methods (real model + real LLMLingua-2, single-shot recall)

Point your tool at it (one variable)

Honest positioning

What's in the package

Run the real comparison (Gemma + others, on your hardware)

No GPU? Prove the whole chain offline

Reproduce a single result

Integrity

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

One command, any tool: `foveance wrap`