Test whether your LLM retrieves information from every position in its context window.

context-lens

Test whether your LLM actually retrieves information from every position in its context window — before it silently fails in production.

Your LLM passes all your evals. You ship it. Users start complaining that it ignores half their documents. You check the logs — technically successful calls, no errors. The bug is invisible. This is the lost-in-the-middle problem, and context-lens is the missing test gate.

The Problem

Research confirmed what production engineers already knew: LLMs pay heavy attention to information at the start and end of long contexts, and silently drop information buried in the middle.

Context position:  [start] ====== [middle] ====== [end]
LLM attention:     HIGH               LOW           HIGH

This breaks:

RAG pipelines — retrieved chunks in positions 3-7 of 10 may be ignored
Long document analysis — key clauses in the middle of contracts get missed
Multi-turn agents — prior tool outputs buried in conversation history get lost
System prompts with long instructions — middle constraints are violated silently

The blind spot is the same in every case: traditional evals test correctness on a fixed input. They do not test whether the model stays reliable when the same information sits at a different position in the context window.

context-lens fills this gap.

What context-lens Does

context-lens places a "needle" (a key fact or instruction) at every position across your context window, runs your LLM, and produces a PositionHeatmap — a complete picture of where your model is reliable and where it fails.

position 1/10  (fraction=0.00)  [OK]  ##
position 2/10  (fraction=0.11)  [OK]  ##
position 3/10  (fraction=0.22)  [OK]  ##
position 4/10  (fraction=0.33)  [MISS]  ..    <- FAULT ZONE
position 5/10  (fraction=0.44)  [MISS]  ..    <- FAULT ZONE
position 6/10  (fraction=0.56)  [MISS]  ..    <- FAULT ZONE
position 7/10  (fraction=0.67)  [OK]  ##
position 8/10  (fraction=0.78)  [OK]  ##
position 9/10  (fraction=0.89)  [OK]  ##
position 10/10 (fraction=1.00)  [OK]  ##

Retrieval Score: 70%  — CONDITIONAL
Fault zones: MIDDLE-HEAVY FAILURE (lost-in-the-middle pattern detected)

Then it gives you a CI gate that fails your pipeline if the score drops below your threshold.
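
The numbers in the sample output can be reproduced by hand. The sketch below is not the library's implementation. It assumes positions are evenly spaced from 0.0 to 1.0 and that the score is the share of positions marked OK, which is consistent with the fractions and the 70% shown above. The function names are hypothetical, and the 0.90 / 0.70 cut-offs are taken from the verdict ranges listed later in this README.

```python
def position_fractions(n_positions: int) -> list[float]:
    """Evenly spaced fractions from 0.0 (start) to 1.0 (end)."""
    if n_positions < 1:
        raise ValueError("n_positions must be at least 1")
    if n_positions == 1:
        return [0.0]
    return [i / (n_positions - 1) for i in range(n_positions)]


def retrieval_score(hits: list[bool]) -> float:
    """Share of positions at which the needle was retrieved."""
    if not hits:
        raise ValueError("hits must not be empty")
    return sum(hits) / len(hits)


def verdict_for(score: float, reliable: float = 0.90, conditional: float = 0.70) -> str:
    """Map a score to a verdict label (thresholds are assumed defaults)."""
    if score >= reliable:
        return "RELIABLE"
    if score >= conditional:
        return "CONDITIONAL"
    return "UNRELIABLE"


# The sample run above: positions 4-6 of 10 are misses.
hits = [True] * 3 + [False] * 3 + [True] * 4
score = retrieval_score(hits)
print([f"{f:.2f}" for f in position_fractions(len(hits))])
print(f"{score:.0%} {verdict_for(score)}")
```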

Installation

pip install context-lens

# With Anthropic support:
pip install "context-lens[anthropic]"

# With OpenAI support:
pip install "context-lens[openai]"

# With YAML config support:
pip install "context-lens[yaml]"

# Everything:
pip install "context-lens[all]"

Zero hard dependencies: the core package uses only the Python standard library. Provider SDKs are not bundled; install them separately or through the extras above.

Quick Start

from context_lens import ContextLens, Needle, HaystackTemplate
import anthropic

# 1. Wrap your LLM in a str -> str function
client = anthropic.Anthropic()
def my_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# 2. Define what must be found (the "needle")
needle = Needle(
    label="API rate limit",
    content="The API rate limit is 1000 requests per minute per key.",
    question="What is the API rate limit?",
    expected_answer="1000 requests per minute",
    answer_keywords=["1000", "per minute"],
)

# 3. Define the surrounding context (the "haystack")
haystack = HaystackTemplate(
    filler_text="This document describes the system API. All endpoints require authentication. ",
    target_tokens=4000,
    tokens_per_filler=15,
)

# 4. Run the audit
lens = ContextLens(model_fn=my_llm, model_name="claude-haiku")
heatmap = lens.audit(needle=needle, haystack=haystack, positions=10)

# 5. Read the result
heatmap.report()
print(f"Score: {heatmap.retrieval_score:.1%}")
print(f"Verdict: {heatmap.verdict}")
print(f"Fault zones: {heatmap.fault_zones}")
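
To make the haystack parameters concrete, here is a rough sketch of how a prompt with a needle at a given fraction could be assembled. This is an assumption about the general technique, not a description of what HaystackTemplate does internally: build_prompt is a hypothetical helper, the filler is simply repeated target_tokens // tokens_per_filler times, and the "Question:" suffix is an invented prompt layout.

```python
def build_prompt(
    filler_text: str,
    target_tokens: int,
    tokens_per_filler: int,
    needle_content: str,
    question: str,
    fraction: float,
) -> str:
    """Repeat the filler to roughly target_tokens and insert the needle at `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    # Rough token budget: each filler copy is assumed to cost tokens_per_filler tokens.
    n_fillers = max(1, target_tokens // tokens_per_filler)
    parts = [filler_text] * n_fillers
    # fraction=0.0 puts the needle before all filler, 1.0 after all filler.
    insert_at = round(fraction * n_fillers)
    parts.insert(insert_at, needle_content + " ")
    context = "".join(parts).strip()
    return f"{context}\n\nQuestion: {question}"


prompt = build_prompt(
    filler_text="This document describes the system API. All endpoints require authentication. ",
    target_tokens=4000,
    tokens_per_filler=15,
    needle_content="The API rate limit is 1000 requests per minute per key.",
    question="What is the API rate limit?",
    fraction=0.5,
)
print(len(prompt), prompt.count("All endpoints require authentication."))
```

Counting fillers rather than real tokens keeps the sketch dependency-free, at the cost of only approximating the true context length.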

Multi-Needle Audit

Test multiple pieces of critical information in one run:

needles = [
    Needle(
        label="Rate limit",
        content="Rate limit: 1000 req/min.",
        question="What is the rate limit?",
        expected_answer="1000 req/min",
        answer_keywords=["1000"],
    ),
    Needle(
        label="Retry policy",
        content="On 429 errors, use exponential backoff starting at 2 seconds.",
        question="How should you handle 429 errors?",
        expected_answer="exponential backoff, 2 seconds",
        answer_keywords=["exponential backoff", "2 seconds"],
    ),
    Needle(
        label="Token expiry",
        content="Session tokens expire after 24 hours.",
        question="When do session tokens expire?",
        expected_answer="24 hours",
        answer_keywords=["24 hours"],
    ),
]

heatmaps = lens.audit_multi(needles, haystack, positions=10)
summary = lens.summary_report(heatmaps)
print(f"Overall score: {summary['overall_score']:.1%}")
print(f"Overall verdict: {summary['overall_verdict']}")
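
Each needle carries answer_keywords, and the roadmap names a KeywordJudge. How that judge combines keywords is not documented here, so the following is only a sketch of one plausible rule: a response counts as a hit when every keyword appears in it, ignoring case. keyword_match is a hypothetical name.

```python
def keyword_match(response: str, answer_keywords: list[str]) -> bool:
    """Return True if every keyword occurs in the response (case-insensitive)."""
    if not answer_keywords:
        raise ValueError("answer_keywords must not be empty")
    text = response.casefold()
    return all(keyword.casefold() in text for keyword in answer_keywords)


keywords = ["exponential backoff", "2 seconds"]
print(keyword_match("Use Exponential Backoff, starting at 2 seconds.", keywords))
print(keyword_match("Retry with exponential backoff.", keywords))
```

Under this rule, short keywords match as substrings: a keyword such as "3" would also match a response containing "30". Choosing longer, more specific keywords reduces false hits.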

CI Gate

Block deployment if context retrieval is unreliable:

import sys

heatmaps = lens.audit_multi(needles, haystack, positions=10)
passed, message = lens.ci_gate(heatmaps, min_score=0.80)
print(message)

# A non-zero exit code fails the CI job
sys.exit(0 if passed else 1)
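
For intuition, the gate decision can be pictured as a threshold check over per-needle scores. The sketch below is an assumption, not the behavior of lens.ci_gate: it averages the scores (the real gate might use the minimum or another aggregate), and gate is a hypothetical function that mirrors only the (passed, message) return shape shown above.

```python
from statistics import mean


def gate(scores: dict[str, float], min_score: float = 0.80) -> tuple[bool, str]:
    """Pass when the mean per-needle score reaches min_score."""
    if not scores:
        raise ValueError("scores must not be empty")
    overall = mean(scores.values())
    # Needles that individually fall short, listed even if the gate passes overall.
    below = sorted(label for label, s in scores.items() if s < min_score)
    passed = overall >= min_score
    status = "PASS" if passed else "FAIL"
    message = f"{status}: overall {overall:.1%} (threshold {min_score:.1%})"
    if below:
        message += "; below threshold: " + ", ".join(below)
    return passed, message


passed, message = gate({"Rate limit": 1.0, "Retry policy": 0.7, "Token expiry": 0.9})
print(message)
```

Averaging can hide a single weak needle, which is why the message also lists needles that fall below the threshold on their own.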

CLI Usage

# Run an audit from config file
context-lens audit --config my_audit.yaml

# Write results to JSON
context-lens audit --config my_audit.yaml --output results.json

# CI gate (exits 1 on failure)
context-lens ci --config my_audit.yaml --min-score 0.85

# View audit history
context-lens history --limit 10

Config File Format (YAML)

# my_audit.yaml
model_name: claude-haiku-4-5-20251001
provider: anthropic     # anthropic | openai | mock
model: claude-haiku-4-5-20251001
positions: 10
reliable_threshold: 0.90
conditional_threshold: 0.70

haystack:
  filler_text: "This document contains system documentation. "
  target_tokens: 4000
  tokens_per_filler: 10
  system_prompt: "Answer questions using only the provided context."

needles:
  - label: "Database connection string"
    content: "The database connection string is db://prod-server:5432/myapp"
    question: "What is the database connection string?"
    expected_answer: "db://prod-server:5432/myapp"
    answer_keywords: ["prod-server", "5432"]

  - label: "Retry limit"
    content: "The maximum retry count is 3 attempts with 5-second intervals."
    question: "How many retries are allowed and at what interval?"
    expected_answer: "3 retries, 5-second intervals"
    answer_keywords: ["3", "5-second"]
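
The config lists a mock provider, but its behavior is not described here. When working through the Python API, a hand-written str -> str stand-in can serve a similar purpose for offline dry runs. The sketch below is hypothetical: make_positional_mock returns a fake model that "forgets" a needle whenever it sits in the middle third of the prompt, which is enough to exercise reporting and gate logic without API calls.

```python
from typing import Callable


def make_positional_mock(
    needle_content: str,
    answer: str,
    low: float = 1 / 3,
    high: float = 2 / 3,
) -> Callable[[str], str]:
    """Build a fake LLM that misses needles located between `low` and `high`."""

    def mock_llm(prompt: str) -> str:
        index = prompt.find(needle_content)
        if index < 0:
            return "I don't know."
        # Relative character offset of the needle: 0.0 at the start, 1.0 at the end.
        span = max(1, len(prompt) - len(needle_content))
        relative = index / span
        if low <= relative <= high:
            return "I don't know."
        return answer

    return mock_llm


mock = make_positional_mock("NEEDLE", "1000 requests per minute")
print(mock("NEEDLE" + "x" * 100))
print(mock("x" * 50 + "NEEDLE" + "x" * 50))
```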

GitHub Actions Integration

# .github/workflows/context-lens.yml
name: Context Window Audit

on: [push, pull_request]

jobs:
  context-audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install context-lens
        run: pip install "context-lens[all]"

      - name: Run context position audit
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          context-lens ci --config .context_lens.yaml --min-score 0.80

      - name: Upload audit results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: context-lens-results
          path: .context_lens.db

Verdicts Explained

Verdict	Score Range	Meaning
RELIABLE	>= 90%	LLM consistently retrieves information from all context positions. Safe to ship.
CONDITIONAL	>= 70% and < 90%	LLM has some positional failures. Review fault zones before shipping.
UNRELIABLE	< 70%	LLM has significant positional failures. Do not ship this configuration.

Fault Zone Patterns

context-lens identifies three failure patterns:

1. Middle-Heavy Failure (Lost-in-the-Middle)

Positions: [OK] [OK] [MISS] [MISS] [MISS] [OK] [OK]

Information in the middle of the context is not retrieved. This is the classic lost-in-the-middle attention pattern. Fix: reorder retrieved chunks so that critical information comes first or last, or reduce the total context size.
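
One common way to apply the reordering fix is to interleave chunks so the most relevant ones land at the edges and the least relevant in the middle. The sketch below assumes the input list is already sorted from most to least relevant; reorder_for_edges is a hypothetical helper and is not part of context-lens.

```python
def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end, the least relevant in the middle."""
    front: list[str] = []
    back: list[str] = []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 0, 2, 4... fill the front; ranks 1, 3, 5... fill the back.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(reorder_for_edges(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

After reordering, rerunning the audit shows whether the change actually moved the critical information out of the fault zone.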

2. Edge Failure

Positions: [MISS] [OK] [OK] [OK] [OK] [OK] [MISS]

Information at the start or end of the context is not retrieved. This is rare and usually indicates prompt structure issues.

3. Scattered Failures

Positions: [OK] [MISS] [OK] [MISS] [OK] [MISS] [OK]

Misses follow no positional pattern, which points to general degradation. This often indicates that the context is too long for the model's reliable attention window.
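
The three patterns can be told apart with a simple rule over the per-position results. The sketch below is an assumption about one workable rule, not the classifier context-lens uses: it treats positions whose fraction falls between 1/3 and 2/3 (inclusive) as the middle band, and classify_fault_pattern and its return labels are hypothetical.

```python
def classify_fault_pattern(hits: list[bool]) -> str:
    """Classify misses as NONE, MIDDLE_HEAVY, EDGE, or SCATTERED."""
    n = len(hits)
    if n < 3:
        raise ValueError("need at least 3 positions to classify a pattern")
    misses = [i for i, hit in enumerate(hits) if not hit]
    if not misses:
        return "NONE"

    def in_middle(i: int) -> bool:
        # Integer form of 1/3 <= i / (n - 1) <= 2/3, avoiding float comparisons.
        return (n - 1) <= 3 * i <= 2 * (n - 1)

    middle_misses = [i for i in misses if in_middle(i)]
    if len(middle_misses) == len(misses):
        return "MIDDLE_HEAVY"
    if not middle_misses:
        return "EDGE"
    return "SCATTERED"


OK, MISS = True, False
print(classify_fault_pattern([OK, OK, MISS, MISS, MISS, OK, OK]))
print(classify_fault_pattern([MISS, OK, OK, OK, OK, OK, MISS]))
print(classify_fault_pattern([OK, MISS, OK, MISS, OK, MISS, OK]))
```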

Why context-lens?

Tool	What it tests
DeepEval, Promptfoo	Whether specific inputs produce correct outputs
prompt-shield	Whether outputs are stable across paraphrase variants
drift-guard	Whether PR code matches PR intent
context-lens	Whether the LLM retrieves information from all context positions

These tools solve different problems. context-lens tests a specific failure mode that is invisible to the others: positional sensitivity in the context window.

Roadmap

v0.1 (current): KeywordJudge, PositionHeatmap, CLI, SQLite history, CI gate, GitHub Action
v0.2: LLM-as-judge for semantic retrieval checking (beyond keyword matching)
v0.3: Automatic fault zone diagnosis with remediation suggestions
v0.4: Token-precise position control (place needle at exact token offset)
v0.5: Multi-model comparison (which model is more position-robust?)
v1.0: pytest plugin, pre-commit hook

License

This version

0.1.0, released Mar 28, 2026

Download files

Source Distribution: context_recall-0.1.0.tar.gz (21.5 kB), uploaded Mar 28, 2026
Built Distribution: context_recall-0.1.0-py3-none-any.whl (15.0 kB), uploaded Mar 28, 2026, Python 3

File details

Details for the file context_recall-0.1.0.tar.gz.

File metadata

Download URL: context_recall-0.1.0.tar.gz
Upload date: Mar 28, 2026
Size: 21.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for context_recall-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3a58669f237b01b7d0cee6a0680b5d8d096987aeb144721b74fcb0bd90a5e2cf`
MD5	`eaa5d85a0c7a31dc490ddde070b0df6a`
BLAKE2b-256	`09edc1724a73590ad89db040d4d542a0a0d6321faa01cd7817e6aac19c208e7c`

File details

Details for the file context_recall-0.1.0-py3-none-any.whl.

File metadata

Download URL: context_recall-0.1.0-py3-none-any.whl
Upload date: Mar 28, 2026
Size: 15.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for context_recall-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e44468552930a655aaa7f328dd84cb081949c9399399ae0847a0db4c65ba838`
MD5	`9c8c09eb240cdd2b31ebddf82931fa3f`
BLAKE2b-256	`08c5ca4f910bed9e1cac829417d17e44374d244b9b34a866b7d0c6c6a12557c6`
