Skip to main content

Pressure-test LLM long-context retrieval with the Needle In A Haystack benchmark.

Project description

Needle In A Haystack

Pressure-test LLM long-context retrieval. Now in v2.

niah runs a sweep of (context length × needle depth) cells against any configured model, scores each response, and writes one result row per cell to a JSONL file. Built-in tasks include single-fact lookup, multi-fact recall, single-UUID retrieval, and UUID-chain hops for testing multi-step reasoning over long contexts.

Supported providers out of the box: OpenAI, Anthropic, Cohere. Adding more is a small plugin.


Quick start

# 1. Clone + install (uv handles the venv + lock)
git clone https://github.com/gkamradt/needle-in-a-haystack.git
cd needle-in-a-haystack
uv sync

# 2. Drop your API key(s) in a .env file (auto-loaded by niah; .env is gitignored)
echo "OPENAI_API_KEY=sk-..." >> .env
# or: export OPENAI_API_KEY=sk-...   if you prefer not to use .env

# 3. Smoke-test the full pipeline with no API calls
uv run niah run configs/runs/smoke.fake.yaml

# 4. Validate then run an example against a real model
uv run niah validate configs/runs/single_needle.example.yaml
uv run niah run      configs/runs/single_needle.example.yaml

The run writes one JSONL row per sweep cell. Each row carries the score, token usage, cost, and a tiny recipe that lets you exactly reconstruct the context the model saw:

niah reconstruct results/single-needle-opus.jsonl --row 0

What you configure

You point niah at one run config (YAML) that references one model config (also YAML). Examples live in configs/.

Run config (configs/runs/uuid_chain.example.yaml)

run_name: "uuid-chain-opus"
model: "anthropic-opus-4-medium"   # resolved against configs/models/

task:
  type: "uuid_chain"
  chain_length: 5

haystack:
  type: "files"
  path: "PaulGrahamEssays"

sweep:
  context_lengths: {min: 2000, max: 32000, num: 8,  scale: "linear"}
  depth_percents:  {min: 0,    max: 100,   num: 11, scale: "sigmoid"}
  seeds: [1, 2, 3]

runner:
  concurrency: 2
  retries: 2
  resume: true

store:
  type: "jsonl"
  path: "results/uuid-chain-opus.jsonl"

Model config (configs/models/anthropic-opus-4-medium.yaml)

id: "anthropic-opus-4-medium"
runtime:
  sdk: "anthropic-python"
  api: "messages"
client:
  api_key_env: "ANTHROPIC_API_KEY"
request:
  model: "claude-opus-4"
  max_tokens: 120000
  thinking:
    type: "adaptive"
  output_config:
    effort: "medium"
pricing:
  input: 5.00      # USD per 1M input tokens
  output: 25.00    # USD per 1M output tokens

Anything under request: is forwarded verbatim to the SDK, so adding new provider-specific knobs (thinking, reasoning_effort, top_p, …) doesn't require a code change.


Built-in tasks

task.type What it does
single One fact placed at one depth; exact-match scored.
multi N facts spread evenly through the context; fractional score.
uuid One fresh UUID at one depth; model must repeat it.
uuid_chain Chain of A → B → C → … links spread through the context. The question asks "what is the value associated with A?" without revealing the chain structure — the model has to discover the hops on its own.

Tasks are a small Protocol — see needlehaystack/tasks/. Adding your own is one file and a registry call; nothing in the runner needs to change.

from needlehaystack.tasks import register_task

class MyCustomTask:
    name = "my_task"
    inserter_name = "single_depth"
    def generate_needle(self, seed): ...
    def insert(self, ctx, needle, depth): ...
    def question(self, needle): ...
    def score(self, response, needle): ...

register_task("my_task", MyCustomTask)

Reference it from a run config with task.type: "my_task".


CLI

niah run        <run.yaml>            run a sweep, append to JSONL
niah run        <run.yaml> --dry-run  validate, resolve model, print plan, exit
niah validate   <run.yaml>            parse + resolve without running
niah reconstruct <results.jsonl> --row N [--out file]
                                      rebuild the exact context shown to the model

--model-dir DIR (repeatable) adds extra search paths for bare model ids.


Result rows & reconstruction

Each row in the JSONL is small (a few KB) regardless of context size. We don't store the rendered 200k-token context per row — that would balloon a single sweep into gigabytes. Instead each row carries a recipe:

{
  "schema_version": 2,
  "run_name": "uuid-chain-opus",
  "model_id": "anthropic-opus-4-medium",
  "task_type": "uuid_chain",
  "context_length": 32000,
  "target_depth_percent": 50.0,
  "recipe": {
    "haystack": {"type": "files", "path": "PaulGrahamEssays"},
    "inserter": "even_spread",
    "needle_placements": [
      {"text": "abc... maps to def...", "insertion_token_index": 15876, "actual_depth_percent": 49.6},
      ...
    ],
    "final_context_token_count": 32082
  },
  "expected_answer": "the-final-uuid",
  "prompt_question": "What is the value associated with abc-...?",
  "response": "...",
  "score": {"value": 0.6, "details": {"hops_correct": 3, "chain_length": 5}},
  "usage": {"input_tokens": 32100, "output_tokens": 412},
  "cost_usd": 0.171,
  "duration_seconds": 12.4,
  "status": "ok",
  "seed": 1,
  "timestamp_utc": "2026-..."
}

niah reconstruct walks the recipe and produces a byte-identical string of what the model actually saw, which is what you want when a result is surprising and you want to read the prompt.


Extending

  • New provider: write a class satisfying ModelProvider and call register_provider(sdk, api, factory). See needlehaystack/providers/openai.py as a reference.
  • New task: as above; see needlehaystack/tasks/uuid_chain.py.
  • New haystack source: implement HaystackSource (load(min_tokens) + descriptor()).
  • New scorer: implement Scorer (score(response, needle)).

The system is intentionally a set of small Protocols connected by registries so contributors never need to edit the runner.


Contributing

uv sync --extra dev
uv run ruff check .
uv run ruff format --check .
uv run mypy needlehaystack
uv run pytest

CI runs all of the above on every PR.


Original story & historical results

The original 2023 runs that started all this:

Needle In A Haystack code snippet

OpenAI's GPT-4-128K (Run 11/8/2023)

GPT-4-128 Context Testing

Anthropic's Claude 2.1 (Run 11/21/2023)

Claude 2.1 Context Testing

The raw result files from those original runs are preserved in original_results/ for posterity — the schema does not match v2, so they don't load with the new tooling.

How multi-needle spacing works (still accurate in v2)

Given N needles and a starting depth_percent, the EvenSpreadInserter places the first needle at depth_percent, then distributes the rest evenly through the remaining context up to 100%. The interval is:

depth_percent_interval = (100 - depth_percent) / N

So for N=10 needles starting at depth_percent=40:

depth_percent_interval = (100 - 40) / 10 = 6

Needle 1: 40
Needle 2: 46
Needle 3: 52
Needle 4: 58
Needle 5: 64
Needle 6: 70
Needle 7: 76
Needle 8: 82
Needle 9: 88
Needle 10: 94

v2 fixes a bug in the v1 multi-needle code where each needle's reported depth was off by however much the earlier needles had inflated the token count. The new inserter computes target depths against the pre-insertion length and reports the true depth each needle landed at.


License

MIT — see LICENSE.txt. Use of this software requires attribution to the original author and project.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

needlehaystack-2.0.0.tar.gz (1.8 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

needlehaystack-2.0.0-py3-none-any.whl (336.2 kB view details)

Uploaded Python 3

File details

Details for the file needlehaystack-2.0.0.tar.gz.

File metadata

  • Download URL: needlehaystack-2.0.0.tar.gz
  • Upload date:
  • Size: 1.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.8

File hashes

Hashes for needlehaystack-2.0.0.tar.gz
Algorithm Hash digest
SHA256 0cf0beabf35b2de155da4278e306c3a8dca2bafa656f45729db943592175d187
MD5 20d940f9cdcef752241c2e57030b3cfe
BLAKE2b-256 8d3f831e00b94b4c1cd3fb9ba71010f66c80ee358e2fff71468b27e02065d8e6

See more details on using hashes here.

File details

Details for the file needlehaystack-2.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for needlehaystack-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6ee22c9ef34bc84bf9e3addbdb7de5669b8dbcc5c0b824740e5b61414c06538c
MD5 4d259c18877657761ff869119443b137
BLAKE2b-256 3aa91fd0e3013a7fab746c7791b7b97768cff1ab1eb131ec4cf3939d2c6e296e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page