
Task-aware context selection for LLMs


cognitive-cache

Requires Python 3.11+.

Existing LLM tools (Cursor, Claude Code, Copilot, and the rest) pick what to put in the context window with heuristics: grep for some symbols, embed and cosine-similarity search, or stuff as many files as will fit. Few of them treat it as an explicit selection problem.

This project is one attempt at treating it as one.

It runs entirely locally, with no LLM calls, API keys, or cloud dependencies, and supports Python, JavaScript, TypeScript, Go, Rust, Java, Ruby, C, and C++.

[Demo image: cognitive-cache finding the right files for a real GitHub issue]

the problem: context as an OS-level resource

Think of it this way:

| Classic OS | LLM Equivalent | Current State |
| --- | --- | --- |
| RAM | Context window | Manually managed |
| Virtual memory / page swaps | Context eviction + retrieval | Crude summarization |
| Process scheduler | Agent orchestration | Hand-coded loops |
| File system cache | Knowledge retrieval | Cosine similarity |
| Memory allocator | Token budget allocation | Nobody does this |

Early computers had programmers managing memory addresses by hand. Then virtual memory shipped, and it changed what was possible.

Right now LLM context is managed by hand: grep, RAG, cram-everything. This project is an attempt at an algorithm to fill that gap. Whether it ends up being the right one is an open question; the point is that the gap is real.

what it does

Given a task (like a GitHub issue) and a codebase in any of the nine supported languages, cognitive-cache picks which files to include in the LLM's context window. It combines six signals (symbol matching, dependency graph distance, git recency, lexical TF-IDF similarity, redundancy penalties, file role awareness) into a scoring function and runs greedy submodular optimization to select the highest-value set of files that fits within a token budget.

The key insight is treating context selection as a constrained optimization problem rather than a retrieval problem. RAG systems ask "what's most similar to the query?" but what you actually want is "what maximizes the chance the model gets this right?" Those are different questions.

benchmark results

Benchmarked on 23 real bug-fix PRs across 8 open-source repositories. For each issue, we run every strategy with a 12K token budget and measure file recall: did the algorithm select the files that were actually modified in the fix?
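As a minimal sketch of the metric, assuming file recall is the fraction of files modified in the fix that also appear in the selected set (the function name and example paths are hypothetical, not taken from the benchmark code):

```python
def file_recall(selected: set[str], modified: set[str]) -> float:
    """Fraction of the files changed in the fix that the strategy selected."""
    if not modified:
        return 0.0
    return len(selected & modified) / len(modified)


# Hypothetical issue: the fix touched two files and the selector found one of them.
selected = {"src/app.py", "src/auth.py", "src/utils.py"}
modified = {"src/auth.py", "src/session.py"}
print(f"{file_recall(selected, modified):.0%}")  # 50%
```

Selecting extra files does not lower this number; only the token budget limits how much a strategy can include.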

overall performance

| Strategy | Avg Recall | Median | cognitive-cache record vs. this strategy |
| --- | --- | --- | --- |
| cognitive-cache | 34.7% | 40% | |
| llm-triage | 25.7% | 20% | 8W / 11T / 4L |
| embedding (RAG) | 25.9% | 0% | 9W / 9T / 5L |
| grep | 22.8% | 0% | 7W / 15T / 1L |
| random | 4.7% | 0% | 12W / 11T / 0L |
| full-stuff | 2.2% | 0% | 13W / 10T / 0L |

Head-to-head, cognitive-cache wins more issues than it loses against every baseline (the W / T / L counts are its wins, ties, and losses against each strategy), including llm-triage (asking the LLM to pick its own files), the hardest one to beat. Ties are common, though, and the absolute numbers stay modest: at 34.7% average recall it still misses most of the files changed in a typical fix, just fewer than the alternatives.

per-repo breakdown

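| Repository | cognitive-cache | llm-triage | embedding | grep | Issues |
| --- | --- | --- | --- | --- | --- |
| Textualize/rich | 100% | 0% | 50% | 100% | 1 |
| pallets/werkzeug | 62% | 38% | 44% | 44% | 4 |
| psf/requests | 50% | 25% | 25% | 25% | 2 |
| pallets/flask | 38% | 36% | 23% | 22% | 3 |
| pallets/click | 33% | 33% | 0% | 33% | 1 |
| pallets/jinja | 20% | 40% | 20% | 20% | 5 |
| fastify/fastify | 17% | 0% | 25% | 0% | 6 |
| tiangolo/fastapi | 0% | 50% | 0% | 0% | 1 |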

per-issue detail

Every issue, every strategy:

| Issue | CC | LLM | Emb | Grep | Rand | Full |
| --- | --- | --- | --- | --- | --- | --- |
| Textualize/rich#4006 | 100% | 0% | 50% | 100% | 0% | 0% |
| fastify/fastify#6013 | 0% | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6021 | 50% | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6026 | 0% | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6030 | 0% | 0% | 50% | 0% | 0% | 0% |
| fastify/fastify#6064 | 0% | 0% | 0% | 0% | 0% | 0% |
| fastify/fastify#6613 | 50% | 0% | 100% | 0% | 0% | 0% |
| pallets/click#3225 | 33% | 33% | 0% | 33% | 33% | 0% |
| pallets/flask#5899 | 50% | 50% | 50% | 0% | 0% | 0% |
| pallets/flask#5917 | 40% | 20% | 20% | 40% | 0% | 0% |
| pallets/flask#5928 | 25% | 38% | 0% | 25% | 0% | 0% |
| pallets/jinja#1663 | 0% | 100% | 50% | 50% | 0% | 0% |
| pallets/jinja#1665 | 0% | 0% | 0% | 0% | 0% | 0% |
| pallets/jinja#1706 | 0% | 0% | 50% | 0% | 0% | 0% |
| pallets/jinja#1852 | 0% | 50% | 0% | 0% | 0% | 0% |
| pallets/jinja#2061 | 100% | 50% | 0% | 50% | 0% | 0% |
| pallets/werkzeug#3038 | 50% | 50% | 75% | 25% | 25% | 0% |
| pallets/werkzeug#3078 | 50% | 50% | 50% | 50% | 0% | 0% |
| pallets/werkzeug#3080 | 50% | 0% | 0% | 0% | 0% | 0% |
| pallets/werkzeug#3129 | 100% | 50% | 50% | 100% | 0% | 0% |
| psf/requests#7308 | 50% | 50% | 0% | 0% | 0% | 50% |
| psf/requests#7309 | 50% | 0% | 50% | 50% | 50% | 0% |
| tiangolo/fastapi#15139 | 0% | 50% | 0% | 0% | 0% | 0% |

Bold = best recall for that issue. All runs use Qwen 3.5 9B (Q4_K_M) via llama.cpp at zero API cost.
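The head-to-head record can be recomputed from the per-issue table. The sketch below copies the CC and LLM columns in row order and assumes a tie means equal recall on an issue; the helper name is hypothetical.

```python
from statistics import mean, median

# Per-issue recall (%) for cognitive-cache and llm-triage, in table row order.
cc = [100, 0, 50, 0, 0, 0, 50, 33, 50, 40, 25, 0,
      0, 0, 0, 100, 50, 50, 50, 100, 50, 50, 0]
llm = [0, 0, 0, 0, 0, 0, 0, 33, 50, 20, 38, 100,
       0, 0, 50, 50, 50, 50, 0, 50, 50, 0, 50]


def head_to_head(ours: list[int], theirs: list[int]) -> tuple[int, int, int]:
    """Count issues where `ours` has higher, equal, or lower recall than `theirs`."""
    wins = sum(a > b for a, b in zip(ours, theirs))
    ties = sum(a == b for a, b in zip(ours, theirs))
    losses = sum(a < b for a, b in zip(ours, theirs))
    return wins, ties, losses


print(head_to_head(cc, llm))            # (8, 11, 4)
print(round(mean(cc), 1), median(cc))   # 34.7 40
```

Both results agree with the overall performance table: 8W / 11T / 4L against llm-triage, and 34.7% average with a 40% median for cognitive-cache.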

what the baselines do

  • random: picks files at random until the token budget is full
  • full-stuff: crams files alphabetically until the budget is full, which is roughly what most tools approximate
  • embedding (RAG): TF-IDF cosine similarity (scikit-learn, up to 5K features) between the issue text and file contents. This uses TF-IDF rather than neural embeddings, so it's a simpler version of what most RAG tools do; real-world RAG with a proper embedding model would likely score somewhat higher
  • grep: searches for symbols and identifiers mentioned in the issue
  • llm-triage: gives the LLM (same Qwen 3.5 9B) the issue plus a full file listing and asks it to pick the most relevant files. This simulates what tools like Cursor and Claude Code do when they decide what to read, and is the hardest baseline to beat
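To make the grep baseline concrete, here is a rough sketch of the idea: pull identifier-like tokens out of the issue text and rank files by how many of them they contain. The regex, the "looks like code" heuristic, and the example files are all assumptions for illustration, not the project's actual implementation.

```python
import re

# Identifier-like tokens, optionally dotted (an assumed heuristic).
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*")


def extract_identifiers(issue_text: str) -> set[str]:
    """Keep tokens that look like code (snake_case, camelCase, dotted) rather than prose."""
    found = set()
    for tok in IDENT.findall(issue_text):
        if "_" in tok or "." in tok or re.search(r"[a-z][A-Z]", tok):
            found.add(tok)
    return found


def grep_rank(issue_text: str, files: dict[str, str]) -> list[tuple[str, int]]:
    """Rank files by the number of distinct issue identifiers they mention."""
    idents = extract_identifiers(issue_text)
    hits = {path: sum(1 for i in idents if i in text) for path, text in files.items()}
    ranked = [(path, n) for path, n in hits.items() if n > 0]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical issue text and file contents.
issue = "url_for raises BuildError when SERVER_NAME is unset"
files = {
    "app.py": "def url_for(endpoint): raise BuildError(endpoint)",
    "config.py": "SERVER_NAME = None",
    "cli.py": "import click",
}
print(grep_rank(issue, files))  # [('app.py', 2), ('config.py', 1)]
```

An issue written in plain prose yields no identifiers at all under this heuristic, which is one way a grep-style strategy ends up with 0% recall.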

how it works

Six signals score each file, with configurable weights:

| Signal | Weight | What it does |
| --- | --- | --- |
| symbol overlap | 0.45 | Does this file define or mention identifiers from the task? Dominant signal when it fires |
| graph distance | 0.20 | How many imports away from task-mentioned files? Built with networkx from the dependency graph (resolves imports for all nine languages, including TS path aliases from tsconfig.json) |
| change recency | 0.03 | Recently changed in git? Only fires when the file also has structural relevance, to prevent recently-touched-but-unrelated files from flooding results |
| redundancy penalty | 0.10 | Already selected a file with similar symbols? This one is worth less, preventing budget waste on duplicate context |
| lexical similarity | 0.15 | TF-IDF cosine similarity between task and file content. Not neural embeddings; a neural upgrade is an open path gated on a benchmark win |
| file role prior | 0.07 | Source files score higher than test files by default. Test files get boosted only when the task mentions testing |

The signals are combined into a weighted score, and a two-phase greedy selector picks files: first by absolute score (core files), then by value per token (supporting context). Redundancy is re-evaluated after each pick. If a file is too large to fit (like a 13K-token app.py when your budget is 12K), it gets chunked to extract just the relevant functions.
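A minimal sketch of how such a selector could look, using the weights from the table. It is not the project's code: subtracting the redundancy penalty, measuring redundancy as Jaccard overlap of symbol sets, the 0.3 core threshold, and all names here are assumptions, and chunking of oversized files is left out.

```python
from dataclasses import dataclass

WEIGHTS = {"symbol_overlap": 0.45, "graph_distance": 0.20, "change_recency": 0.03,
           "lexical_similarity": 0.15, "file_role": 0.07}
REDUNDANCY_WEIGHT = 0.10


@dataclass
class Candidate:
    path: str
    tokens: int                # must be > 0
    signals: dict[str, float]  # each signal normalized to [0, 1]
    symbols: frozenset[str]


def redundancy(c: Candidate, chosen: list[Candidate]) -> float:
    """Highest Jaccard overlap between this file's symbols and any picked file's."""
    best = 0.0
    for other in chosen:
        union = c.symbols | other.symbols
        if union:
            best = max(best, len(c.symbols & other.symbols) / len(union))
    return best


def value(c: Candidate, chosen: list[Candidate]) -> float:
    base = sum(w * c.signals.get(name, 0.0) for name, w in WEIGHTS.items())
    return base - REDUNDANCY_WEIGHT * redundancy(c, chosen)


def select(candidates: list[Candidate], budget: int, core_threshold: float = 0.3):
    chosen: list[Candidate] = []
    pool = list(candidates)
    left = budget

    def pick(rank, keep):
        nonlocal left
        while True:
            # Re-score each round so redundancy reflects the picks made so far.
            fits = [c for c in pool if c.tokens <= left and keep(c)]
            if not fits:
                return
            best = max(fits, key=rank)
            pool.remove(best)
            chosen.append(best)
            left -= best.tokens

    # Phase 1: core files, ranked by absolute score.
    pick(lambda c: value(c, chosen), lambda c: value(c, chosen) >= core_threshold)
    # Phase 2: supporting context, ranked by value per token.
    pick(lambda c: value(c, chosen) / c.tokens, lambda c: value(c, chosen) > 0)
    return chosen


cands = [
    Candidate("auth.py", 4000, {"symbol_overlap": 1.0, "graph_distance": 1.0,
              "lexical_similarity": 0.5}, frozenset({"login", "redirect"})),
    Candidate("auth_legacy.py", 4000, {"symbol_overlap": 0.6, "graph_distance": 0.5},
              frozenset({"login", "redirect"})),
    Candidate("session.py", 1500, {"graph_distance": 0.5, "lexical_similarity": 0.4},
              frozenset({"session"})),
    Candidate("utils.py", 2000, {"graph_distance": 0.5, "lexical_similarity": 0.2},
              frozenset({"helpers"})),
]
print([c.path for c in select(cands, budget=8000)])
# ['auth.py', 'session.py', 'utils.py']
```

In this made-up example auth_legacy.py duplicates the symbols of auth.py, so its score drops below the core threshold after the first pick, and the remaining budget goes to two smaller files with better value per token.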

Test files are automatically excluded unless the task mentions testing keywords (test, spec, coverage, fixture, mock, stub), though you can override this with include_tests=True/False.

install

pip install cognitive-cache

Or with uv:

uv add cognitive-cache

using it

as a library

from cognitive_cache import select_context_from_repo

result = select_context_from_repo(".", "fix the login redirect bug")
for item in result.selected:
    print(f"{item.source.path} (score: {item.score:.3f})")

For repeated queries against the same repo, build the index once and reuse it:

from cognitive_cache import RepoIndex, select_context

index = RepoIndex.build(".")
r1 = select_context(index, "fix the login bug")
r2 = select_context(index, "add rate limiting to the API")

All the scoring parameters are exposed if you need control over them:

from cognitive_cache import RepoIndex, select_context
from cognitive_cache.core.value_function import WeightConfig

index = RepoIndex.build(".")
result = select_context(
    index,
    "add test coverage for the auth module",
    budget=20000,
    include_tests=True,
    max_files=10,
    min_score=0.15,
    weights=WeightConfig(symbol_overlap=0.50, graph_distance=0.25),
)

as a CLI

cognitive-cache select --repo . --task "fix the login redirect bug"
cognitive-cache select --repo . --task "fix login" --json              # machine-readable
cognitive-cache select --repo . --task "fix login" --output ctx.txt    # dump context to file
cognitive-cache select --repo . --task "fix login" --include-tests no  # exclude test files
cognitive-cache select --repo . --task "fix login" --max-files 5 --min-score 0.2

as an MCP server (for Claude Code, Cursor, etc.)

Claude Code (registers at user scope, available in all projects):

claude mcp add --scope user cognitive-cache -- uvx --from "cognitive-cache[mcp]" cognitive-cache-mcp

Cursor / Windsurf / other editors (add to your MCP config file):

{
  "mcpServers": {
    "cognitive-cache": {
      "command": "uvx",
      "args": ["--from", "cognitive-cache[mcp]", "cognitive-cache-mcp"]
    }
  }
}

telling the model when to use it

The tool is most useful when the relevant files aren't obvious: cross-cutting bugs, unfamiliar parts of a large codebase, or tasks that span multiple layers. For small repos or tasks where you already know which files to touch, it doesn't add much over grep.

Add this to your CLAUDE.md (or equivalent system prompt / rules file):

## context selection

When a task spans multiple files or you're not sure where to look, call `select_context_tool`
from the `cognitive-cache` MCP server before reading files manually:

- `repo_path`: absolute path to the repo root
- `task`: specific description of the task (the more precise, the better the results)
- `budget`: token budget for returned context (default 12000; raise for complex tasks)
- `include_tests`: true to always include test files, false to exclude, null for auto-detection
- `max_files`: cap on number of files returned (default 15)
- `min_score`: minimum relevance score threshold (default 0.0)

The tool returns file contents directly, so use them instead of separate file reads.
Call it at the start of investigation; the index is cached and follow-up calls are fast.

Skip it when you already know which files to read.

A precise task description outperforms a vague one because symbol overlap and lexical similarity both depend on the exact words used: "users get 401 after OIDC callback when session token is present" will score better than "fix auth bug". If nothing clears the internal confidence floor, the selector falls back to returning the top-scoring files with a stderr warning rather than an empty result.
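The fallback behaviour can be pictured roughly as follows. The floor value, the number of fallback files, the warning text, and the function name are placeholders; the source only says that a floor exists and that a warning goes to stderr.

```python
import sys


def apply_confidence_floor(scored, floor=0.2, fallback_k=3):
    """scored: list of (path, score). Keep files at or above the floor, else the top k."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    confident = [item for item in ranked if item[1] >= floor]
    if confident:
        return confident
    # Nothing cleared the floor: warn, but still return something usable.
    print(f"warning: no file scored above {floor}; returning top {fallback_k}",
          file=sys.stderr)
    return ranked[:fallback_k]


# A vague task tends to produce uniformly low scores (made-up numbers).
vague = [("a.py", 0.05), ("b.py", 0.10), ("c.py", 0.02), ("d.py", 0.01)]
print(apply_confidence_floor(vague))
# [('b.py', 0.1), ('a.py', 0.05), ('c.py', 0.02)]
```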

as a GitHub Action

The included workflow (.github/workflows/context-suggest.yml) automatically comments on new issues with the most relevant files. It runs entirely in CI with no API keys required.

running the benchmark

uv sync --dev
uv run pytest tests/  # 140 tests

To run the benchmark with a local llama.cpp server:

LLAMACPP_BASE_URL=http://localhost:8080 PYTHONPATH=. uv run python benchmark/run_local.py   # full 23-issue benchmark
PYTHONPATH=. uv run python benchmark/run_test.py                                            # quick 3-issue test

Configure the llama.cpp connection with environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| LLAMACPP_BASE_URL | http://localhost:8080 | llama.cpp server URL |
| LLAMACPP_MODEL | Qwen3.5-9B-Q4_K_M | Model name to request |
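If you script around the benchmark, the two variables can be read with their documented defaults along these lines. The dataclass and function names are hypothetical, and stripping a trailing slash from the URL is an added convenience, not documented behaviour.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaCppConfig:
    base_url: str
    model: str


def load_config(env=None) -> LlamaCppConfig:
    """Read the llama.cpp settings, falling back to the defaults in the table."""
    env = os.environ if env is None else env
    return LlamaCppConfig(
        base_url=env.get("LLAMACPP_BASE_URL", "http://localhost:8080").rstrip("/"),
        model=env.get("LLAMACPP_MODEL", "Qwen3.5-9B-Q4_K_M"),
    )


print(load_config({}))
# LlamaCppConfig(base_url='http://localhost:8080', model='Qwen3.5-9B-Q4_K_M')
```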

To expand the benchmark dataset (needs a GitHub token for API access):

GITHUB_TOKEN=ghp_xxx uv run python benchmark/curate_dataset.py

project structure

src/cognitive_cache/
    api.py              # public API: RepoIndex, select_context
    cli.py              # CLI entry point
    mcp_server.py       # MCP server for Claude Code / Cursor
    models.py           # core types (Source, Task, ScoredSource, SelectionResult)
    indexer/             # turns a repo directory into a list of Source objects
    signals/             # the six scoring signals
    core/                # value function, greedy selector, file chunker
    baselines/           # the five baseline strategies we benchmark against
    llm/                 # adapters for calling LLMs (claude, openai, llama.cpp)
benchmark/
    dataset/             # curated github issues with known fixes
    runner.py            # orchestrates benchmark runs
    evaluator.py         # computes recall and efficiency metrics

what's next

  • Adaptive replanning that re-optimizes context mid-conversation after the model calls a tool or asks a followup
  • Task-aware compression that goes beyond chunking to actually compress file content while preserving what's relevant
  • Expanding the benchmark dataset to include Go, Rust, and Java repositories now that multi-language indexing is in place

why this matters

Context windows keep growing (1M tokens, more soon) and the temptation is to assume that fixes the problem. It doesn't: more capacity means more choices, and stuffing everything in burns compute and dilutes attention. Selection gets more useful as windows grow, not less.

license

Apache 2.0. See LICENSE.
