Test whether your LLM retrieves information from every position in its context window.
context-lens
Test whether your LLM actually retrieves information from every position in its context window — before it silently fails in production.
Your LLM passes all your evals. You ship it. Users start complaining that it ignores half their documents. You check the logs — technically successful calls, no errors. The bug is invisible. This is the lost-in-the-middle problem, and context-lens is the missing test gate.
The Problem
Research on long-context retrieval (notably Liu et al., 2023, "Lost in the Middle") confirmed what production engineers already knew: LLMs attend heavily to information at the start and end of a long context, and silently drop information buried in the middle.
Context position:  [start] ====== [middle] ====== [end]
LLM attention:      HIGH            LOW            HIGH
This breaks:
- RAG pipelines — retrieved chunks in positions 3-7 of 10 may be ignored
- Long document analysis — key clauses in the middle of contracts get missed
- Multi-turn agents — prior tool outputs buried in conversation history get lost
- System prompts with long instructions — middle constraints are violated silently
The root cause is always the same: traditional evals test correctness on a single input, with the key information in one fixed place. They do not test whether the model is reliable at every position in the context window.
context-lens fills this gap.
What context-lens Does
context-lens places a "needle" (a key fact or instruction) at every position across your context window, runs your LLM, and produces a PositionHeatmap — a complete picture of where your model is reliable and where it fails.
position 1/10 (fraction=0.00) [OK] ##
position 2/10 (fraction=0.11) [OK] ##
position 3/10 (fraction=0.22) [OK] ##
position 4/10 (fraction=0.33) [MISS] .. <- FAULT ZONE
position 5/10 (fraction=0.44) [MISS] .. <- FAULT ZONE
position 6/10 (fraction=0.56) [MISS] .. <- FAULT ZONE
position 7/10 (fraction=0.67) [OK] ##
position 8/10 (fraction=0.78) [OK] ##
position 9/10 (fraction=0.89) [OK] ##
position 10/10 (fraction=1.00) [OK] ##
Retrieval Score: 70% — CONDITIONAL
Fault zones: MIDDLE-HEAVY FAILURE (lost-in-the-middle pattern detected)
Then it gives you a CI gate that fails your pipeline if the score drops below your threshold.
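How a needle might be placed at each position can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `position_fractions` and `build_context` are hypothetical helper names, and the sketch assumes positions are spaced evenly from 0.0 to 1.0, which is consistent with the fractions in the sample report above.

```python
def position_fractions(positions: int) -> list[float]:
    """Evenly spaced depths from 0.0 (start of context) to 1.0 (end)."""
    if positions < 2:
        return [0.0]
    return [i / (positions - 1) for i in range(positions)]


def build_context(filler: list[str], needle: str, fraction: float) -> str:
    """Insert the needle among the filler sentences at the given depth."""
    index = round(fraction * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


# Hypothetical filler and needle, for illustration only.
filler = [f"Filler sentence {n}." for n in range(9)]
needle = "The API rate limit is 1000 requests per minute per key."

for i, fraction in enumerate(position_fractions(10), start=1):
    context = build_context(filler, needle, fraction)
    offset = context.index(needle)  # character offset of the needle
    print(f"position {i}/10 (fraction={fraction:.2f}) needle at char {offset}")
```

Each generated context would then be sent to the model together with the needle's question, one call per position.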
Installation
pip install context-lens
# With Anthropic support:
pip install "context-lens[anthropic]"
# With OpenAI support:
pip install "context-lens[openai]"
# With YAML config support:
pip install "context-lens[yaml]"
# Everything:
pip install "context-lens[all]"
Zero hard dependencies: the core of context-lens uses only the Python standard library. Provider SDKs are optional. Install them through the extras above or separately.
Quick Start
from context_lens import ContextLens, Needle, HaystackTemplate
import anthropic

# 1. Wrap your LLM in a str -> str function
client = anthropic.Anthropic()

def my_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# 2. Define what must be found (the "needle")
needle = Needle(
    label="API rate limit",
    content="The API rate limit is 1000 requests per minute per key.",
    question="What is the API rate limit?",
    expected_answer="1000 requests per minute",
    answer_keywords=["1000", "per minute"],
)

# 3. Define the surrounding context (the "haystack")
haystack = HaystackTemplate(
    filler_text="This document describes the system API. All endpoints require authentication. ",
    target_tokens=4000,
    tokens_per_filler=15,
)

# 4. Run the audit
lens = ContextLens(model_fn=my_llm, model_name="claude-haiku")
heatmap = lens.audit(needle=needle, haystack=haystack, positions=10)

# 5. Read the result
heatmap.report()
print(f"Score: {heatmap.retrieval_score:.1%}")
print(f"Verdict: {heatmap.verdict}")
print(f"Fault zones: {heatmap.fault_zones}")
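The `answer_keywords` field suggests keyword-based scoring. Below is a rough sketch of how such a check might work. It assumes every keyword must appear in the response, compared case-insensitively; the library's actual judge may behave differently, and `keywords_found` and `retrieval_score` are illustrative names, not part of the context-lens API.

```python
def keywords_found(response: str, answer_keywords: list[str]) -> bool:
    """True when every keyword occurs in the response (case-insensitive)."""
    text = response.casefold()
    return all(keyword.casefold() in text for keyword in answer_keywords)


def retrieval_score(hits: list[bool]) -> float:
    """Fraction of positions at which the needle was retrieved."""
    return sum(hits) / len(hits) if hits else 0.0


keywords = ["1000", "per minute"]
print(keywords_found("The limit is 1000 requests per minute.", keywords))  # True
print(keywords_found("The document does not mention a limit.", keywords))  # False

# Hit/miss pattern from the sample report: positions 4 to 6 missed.
hits = [True, True, True, False, False, False, True, True, True, True]
print(f"{retrieval_score(hits):.0%}")  # 70%
```

Plain substring matching is cheap and deterministic, but short keywords such as "3" can also match unrelated text like "30", so longer and more specific keywords give a more trustworthy signal.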
Multi-Needle Audit
Test multiple pieces of critical information in one run:
needles = [
    Needle(
        label="Rate limit",
        content="Rate limit: 1000 req/min.",
        question="What is the rate limit?",
        expected_answer="1000 req/min",
        answer_keywords=["1000"],
    ),
    Needle(
        label="Retry policy",
        content="On 429 errors, use exponential backoff starting at 2 seconds.",
        question="How should you handle 429 errors?",
        expected_answer="exponential backoff, 2 seconds",
        answer_keywords=["exponential backoff", "2 seconds"],
    ),
    Needle(
        label="Token expiry",
        content="Session tokens expire after 24 hours.",
        question="When do session tokens expire?",
        expected_answer="24 hours",
        answer_keywords=["24 hours"],
    ),
]

heatmaps = lens.audit_multi(needles, haystack, positions=10)
summary = lens.summary_report(heatmaps)
print(f"Overall score: {summary['overall_score']:.1%}")
print(f"Overall verdict: {summary['overall_verdict']}")
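One plausible way to combine per-needle results into a single number is a simple mean. The sketch below assumes exactly that. How `summary_report` actually aggregates is not documented here, the scores are made-up illustration values, and `summarize` and the `weakest_needle` key are hypothetical.

```python
from statistics import mean


def summarize(scores_by_needle: dict[str, float]) -> dict:
    """Average the per-needle retrieval scores and flag the weakest needle."""
    overall = mean(scores_by_needle.values())
    weakest = min(scores_by_needle, key=scores_by_needle.get)
    return {"overall_score": overall, "weakest_needle": weakest}


# Illustrative scores only, not real audit output.
scores = {"Rate limit": 1.0, "Retry policy": 0.7, "Token expiry": 0.9}
summary = summarize(scores)
print(f"Overall score: {summary['overall_score']:.1%}")
print(f"Weakest needle: {summary['weakest_needle']}")
```

A mean can hide one badly failing needle behind several healthy ones, which is why the sketch also reports the weakest needle.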
CI Gate
Block deployment if context retrieval is unreliable:
import sys

heatmaps = lens.audit_multi(needles, haystack, positions=10)
passed, message = lens.ci_gate(heatmaps, min_score=0.80)
print(message)
sys.exit(0 if passed else 1)
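The gate logic itself is simple enough to sketch. The version below assumes the threshold is applied to each needle individually, whereas the real `ci_gate` may compare the overall score instead. `gate` is a hypothetical stand-in that only mirrors the `(passed, message)` return shape used above.

```python
def gate(scores: dict[str, float], min_score: float = 0.80) -> tuple[bool, str]:
    """Return (passed, message); fail if any needle scores below min_score."""
    failing = {label: s for label, s in scores.items() if s < min_score}
    if failing:
        details = ", ".join(f"{label}={s:.0%}" for label, s in failing.items())
        return False, f"FAIL: below {min_score:.0%}: {details}"
    return True, f"PASS: all needles at or above {min_score:.0%}"


# Illustrative scores only.
passed, message = gate({"Rate limit": 0.9, "Retry policy": 0.7}, min_score=0.80)
print(message)  # FAIL: below 80%: Retry policy=70%
```

Returning the message alongside the boolean keeps the decision and its explanation together, so the CI log shows which needle caused the failure.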
CLI Usage
# Run an audit from config file
context-lens audit --config my_audit.yaml
# Write results to JSON
context-lens audit --config my_audit.yaml --output results.json
# CI gate (exits 1 on failure)
context-lens ci --config my_audit.yaml --min-score 0.85
# View audit history
context-lens history --limit 10
Config File Format (YAML)
# my_audit.yaml
model_name: claude-haiku-4-5-20251001
provider: anthropic  # anthropic | openai | mock
model: claude-haiku-4-5-20251001
positions: 10
reliable_threshold: 0.90
conditional_threshold: 0.70

haystack:
  filler_text: "This document contains system documentation. "
  target_tokens: 4000
  tokens_per_filler: 10
  system_prompt: "Answer questions using only the provided context."

needles:
  - label: "Database connection string"
    content: "The database connection string is db://prod-server:5432/myapp"
    question: "What is the database connection string?"
    expected_answer: "db://prod-server:5432/myapp"
    answer_keywords: ["prod-server", "5432"]
  - label: "Retry limit"
    content: "The maximum retry count is 3 attempts with 5-second intervals."
    question: "How many retries are allowed and at what interval?"
    expected_answer: "3 retries, 5-second intervals"
    answer_keywords: ["3", "5-second"]
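Before spending API calls, it can help to sanity-check a config once it has been loaded into a dict. The sketch below is a hypothetical pre-flight check, not a feature of context-lens. It verifies that the thresholds are ordered and that every keyword actually occurs in its needle's content, since a keyword that is absent from the needle can never be retrieved from it. The default threshold values are assumptions copied from the example file.

```python
def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    # Defaults mirror the example file above (assumed, not documented).
    reliable = config.get("reliable_threshold", 0.90)
    conditional = config.get("conditional_threshold", 0.70)
    if not 0.0 <= conditional <= reliable <= 1.0:
        problems.append("thresholds must satisfy 0 <= conditional <= reliable <= 1")
    for needle in config.get("needles", []):
        label = needle.get("label", "<unlabeled>")
        keywords = needle.get("answer_keywords", [])
        if not keywords:
            problems.append(f"{label}: no answer_keywords")
        content = needle.get("content", "").casefold()
        for keyword in keywords:
            if keyword.casefold() not in content:
                problems.append(f"{label}: keyword {keyword!r} not in content")
    return problems


config = {
    "reliable_threshold": 0.90,
    "conditional_threshold": 0.70,
    "needles": [
        {
            "label": "Retry limit",
            "content": "The maximum retry count is 3 attempts with 5-second intervals.",
            "answer_keywords": ["3", "5-second"],
        },
    ],
}
print(check_config(config))  # []
```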
GitHub Actions Integration
# .github/workflows/context-lens.yml
name: Context Window Audit
on: [push, pull_request]

jobs:
  context-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install context-lens
        run: pip install "context-lens[all]"

      - name: Run context position audit
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          context-lens ci --config .context_lens.yaml --min-score 0.80

      - name: Upload audit results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: context-lens-results
          path: .context_lens.db
          # .context_lens.db is a dotfile; upload-artifact v4.4+ skips hidden files unless this is set
          include-hidden-files: true
Verdicts Explained
| Verdict | Score Range | Meaning |
|---|---|---|
| RELIABLE | >= 90% | LLM consistently retrieves information from all context positions. Safe to ship. |
| CONDITIONAL | >= 70% and < 90% | LLM has some positional failures. Review fault zones before shipping. |
| UNRELIABLE | < 70% | LLM has significant positional failures. Do not ship this configuration. |
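The table maps directly onto two thresholds. Here is a minimal sketch of that mapping, using the 0.90 and 0.70 cut-offs from the table and the example config; `verdict_for` is an illustrative name rather than a documented function.

```python
def verdict_for(score: float, reliable: float = 0.90, conditional: float = 0.70) -> str:
    """Map a retrieval score in [0, 1] to a verdict label."""
    if score >= reliable:
        return "RELIABLE"
    if score >= conditional:
        return "CONDITIONAL"
    return "UNRELIABLE"


for score in (0.95, 0.70, 0.65):
    print(f"{score:.0%} -> {verdict_for(score)}")
```

Both boundaries are inclusive on the lower end, so a score of exactly 70% is CONDITIONAL and exactly 90% is RELIABLE.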
Fault Zone Patterns
context-lens identifies three failure patterns:
1. Middle-Heavy Failure (Lost-in-the-Middle)
Positions: [OK] [OK] [MISS] [MISS] [MISS] [OK] [OK]
Information in the middle of the context is not retrieved. This is the classic lost-in-the-middle pattern. Fix: reorder retrieved chunks so that critical information comes first or last, or reduce the total context size.
2. Edge Failure
Positions: [MISS] [OK] [OK] [OK] [OK] [OK] [MISS]
Rare — usually indicates prompt structure issues.
3. Scattered Failures
Positions: [OK] [MISS] [OK] [MISS] [OK] [MISS] [OK]
General degradation. Often indicates context is too long for the model's reliable attention window.
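The three patterns can be told apart from a plain list of hits and misses. The heuristic below is one possible classification, offered as a sketch rather than the library's actual rule: misses only at the first or last position count as an edge failure, one contiguous run of misses strictly inside the context counts as middle-heavy, and anything else counts as scattered. `classify_faults` and its return labels are hypothetical.

```python
def classify_faults(hits: list[bool]) -> str:
    """Classify a hit/miss sequence into a fault-zone pattern."""
    misses = [i for i, ok in enumerate(hits) if not ok]
    if not misses:
        return "NONE"
    last = len(hits) - 1
    if all(i in (0, last) for i in misses):
        return "EDGE"
    # One unbroken run of misses that touches neither end of the context.
    contiguous = misses == list(range(misses[0], misses[-1] + 1))
    if contiguous and misses[0] > 0 and misses[-1] < last:
        return "MIDDLE-HEAVY"
    return "SCATTERED"


OK, MISS = True, False
print(classify_faults([OK, OK, MISS, MISS, MISS, OK, OK]))  # MIDDLE-HEAVY
print(classify_faults([MISS, OK, OK, OK, OK, OK, MISS]))    # EDGE
print(classify_faults([OK, MISS, OK, MISS, OK, MISS, OK]))  # SCATTERED
```

Under this heuristic a single interior miss is also reported as middle-heavy; a stricter rule might require a minimum run length before drawing that conclusion.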
Why context-lens?
| Tool | What it tests |
|---|---|
| DeepEval, Promptfoo | Whether specific inputs produce correct outputs |
| prompt-shield | Whether outputs are stable across paraphrase variants |
| drift-guard | Whether PR code matches PR intent |
| context-lens | Whether the LLM retrieves information from all context positions |
Each of these tools solves a different problem. context-lens tests a specific failure mode that the others do not cover: positional sensitivity in the context window.
Roadmap
- v0.1 (current): KeywordJudge, PositionHeatmap, CLI, SQLite history, CI gate, GitHub Action
- v0.2: LLM-as-judge for semantic retrieval checking (beyond keyword matching)
- v0.3: Automatic fault zone diagnosis with remediation suggestions
- v0.4: Token-precise position control (place needle at exact token offset)
- v0.5: Multi-model comparison (which model is more position-robust?)
- v1.0: pytest plugin, pre-commit hook
License
MIT License. Copyright 2026 BuildWorld.