# combfind

Queryable concept map of a codebase for LLM coding agents.
When an AI coding agent gets a ticket like "users get logged out randomly on mobile," it has two failure modes: it reads too many files, burning tokens and time, or it finds a relevant file and patches it locally, missing that the bug actually lives in shared code, an interface, or a sibling implementation.
combfind fixes this. It builds a concept map of a codebase so an agent can query "session token refresh" and get back ranked symbols with files and line ranges. The key is what it tells you about structure: is this an interface, an implementation, or one of several siblings that all need to change together? That context is what prevents a local patch to the wrong layer. In practice it cuts orientation-phase token cost by 50-66% (measured on one dev loop; your mileage will vary): the agent reads 3-5 targeted files instead of scanning dozens.
Runs entirely locally. Doesn't require paid APIs.
## Install

```shell
# Local LLM (llama.cpp)
pip install "combfind[llm]" \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

# Apple Silicon (MLX)
pip install "combfind[mlx]"

# Remote OpenAI-compatible API
pip install "combfind[openai]"

# Gleam support
pip install "combfind[gleam]"
```

Download the default local model (~2.5 GB):

```shell
combfind download-model
```
## Quick start

```shell
# Build the index
combfind init /path/to/repo --db repo.db

# Query it
combfind query "how does authentication work" --db repo.db

# Inspect a symbol from the results
combfind inspect auth.service.AuthService --db repo.db
```
## Usage

### init: build the index

```shell
# Basic
combfind init /path/to/repo --db repo.db

# Exclude test files (recommended for cleaner concepts)
combfind init /path/to/repo --db repo.db --exclude-regex '.*test.*'

# OpenAI-compatible API
COMBFIND_LLM_API_KEY=sk-... COMBFIND_LLM_MODEL=gpt-4o-mini \
  combfind init /path/to/repo --db repo.db --llm-mode openai

# Apple Silicon MLX
combfind init /path/to/repo --db repo.db --llm-mode mlx \
  --llm-model mlx-community/Qwen2.5-7B-Instruct-4bit
```
| Flag | Default | Description |
|---|---|---|
| `--db` | `<repo_path>/.combfind.db` | Output database path |
| `--llm-mode` | `local` | LLM backend: `local`, `openai`, or `mlx` |
| `--llm-model` | auto-detected | GGUF path (local) or HF repo ID (mlx) |
| `--exclude-paths` | | Paths to skip, relative to repo root (repeatable) |
| `--exclude-regex` | | Regex matched against file paths to skip |
| `--llm-workers` | `1` | Parallel LLM calls (useful with `--llm-mode openai`) |
| `--docgen` | off | Generate docstrings for undocumented symbols (slow) |
| `--force` | off | Re-run all stages, ignoring the cache |
### query: search the index

```shell
combfind query "users get logged out randomly" --db repo.db
combfind query "where are database migrations" --db repo.db --format json
```
Text output:

```
[1] Token Refresh (implementation) - 0.87
    why: Handles session token validation and refresh logic.
    auth/service.py
      auth.service.AuthService.refresh :42-67
      auth.service.AuthService.validate :70-91
```
JSON output:

```json
[
  {
    "rank": 1,
    "concept": "Token Refresh",
    "role": "implementation",
    "score": 0.87,
    "files": [
      {
        "path": "auth/service.py",
        "symbols": [
          {"name": "refresh", "qualified_name": "auth.service.AuthService.refresh", "start_line": 42, "end_line": 67},
          {"name": "validate", "qualified_name": "auth.service.AuthService.validate", "start_line": 70, "end_line": 91}
        ]
      }
    ],
    "why_relevant": "Handles session token validation and refresh logic.",
    "sibling_implementations": []
  }
]
```
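For an agent, the useful part of this payload is the set of file paths and line ranges to read next. A minimal sketch of flattening the JSON schema shown above into read targets (the sample data is copied from the example output; the helper name is ours, not part of combfind):

```python
import json

# Sample query output in the --format json shape shown above.
raw = """
[{"rank": 1, "concept": "Token Refresh", "role": "implementation", "score": 0.87,
  "files": [{"path": "auth/service.py", "symbols": [
    {"name": "refresh", "qualified_name": "auth.service.AuthService.refresh", "start_line": 42, "end_line": 67},
    {"name": "validate", "qualified_name": "auth.service.AuthService.validate", "start_line": 70, "end_line": 91}]}],
  "why_relevant": "Handles session token validation and refresh logic.",
  "sibling_implementations": []}]
"""

def read_targets(results):
    """Flatten ranked concepts into (path, start_line, end_line) read targets."""
    targets = []
    for concept in results:
        for f in concept["files"]:
            for sym in f["symbols"]:
                targets.append((f["path"], sym["start_line"], sym["end_line"]))
    return targets

results = json.loads(raw)
print(read_targets(results))
# → [('auth/service.py', 42, 67), ('auth/service.py', 70, 91)]
```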
| Flag | Default | Description |
|---|---|---|
| `--db` | `.combfind.db` | Database to query |
| `--top-k` | `5` | Number of results |
| `--format` | `text` | `text` or `json` |
| `--rerank` | off | Re-score results with LLM (requires `--llm-mode`) |
| `--agentic` | off | Iterative query loop: LLM steers follow-up searches until satisfied (requires `--llm-mode`) |
| `--agentic-limit` | `3` | Max iterations for `--agentic` |
| `--llm-mode` | | LLM backend for `--rerank` / `--agentic`: `local`, `openai`, or `mlx` |
### inspect: look up a symbol

```shell
combfind inspect auth.service.AuthService --db repo.db
combfind inspect auth.service.AuthService auth.service.TokenService --db repo.db --format json
```

Output:

```
auth.service.AuthService (class, auth/service.py:10-80)
  concept: Token Refresh [implementation]
  sig: class AuthService
  callers (1):
    auth.mock.MockAuthService  auth/mock.py:5
  callees (1):
    auth.service.AuthService.validate  auth/service.py:20
  concept siblings (1):
    auth.service.AuthService.validate [method]  auth/service.py
```
| Flag | Default | Description |
|---|---|---|
| `--db` | `.combfind.db` | Database to query |
| `--format` | `text` | `text` or `json` |
## How it works

The init pipeline runs six stages, each reading from and writing to a SQLite file:

- parse: tree-sitter extracts files and symbols (signatures, line ranges, docstrings, imports)
- index: SCIP or tree-sitter heuristics populate a `references` table of calls, imports, and inheritance edges
- embed: sentence-transformers produces a vector per symbol
- cluster: symbols are grouped by package/directory, then sub-clustered with KMeans (~20 symbols per concept)
- label: a local LLM names and describes each cluster and assigns a structural role (see Concept roles below)
- embed concepts: sentence-transformers produces a vector per concept description

At query time: embed the query, run a cosine search over concept embeddings, optionally rerank with the LLM, expand the top concepts to their member symbols and 1-hop callers/callees, and return ranked symbols and code regions.
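The query-time ranking step can be sketched in a few lines. The vectors below are toy stand-ins for the real sentence-transformer embeddings; only the cosine-search mechanic matches the description above:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy concept embeddings (the real ones come from sentence-transformers).
concepts = {
    "Token Refresh": [0.9, 0.1, 0.0],
    "DB Migrations": [0.1, 0.8, 0.3],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "session token refresh"

# Rank concepts by similarity to the query; top hits get expanded to symbols.
ranked = sorted(concepts, key=lambda c: cosine(query_vec, concepts[c]), reverse=True)
print(ranked[0])  # → Token Refresh
```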
Stages are cached by a content hash of their inputs. When you re-run `init`, only stages affected by changed files are re-executed; the rest are skipped. Pass `--force` to rebuild from scratch.
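The cache-key idea can be sketched with a content hash over a stage's inputs (the function and data layout here are illustrative, not combfind's actual schema):

```python
import hashlib

def stage_fingerprint(input_blobs):
    """Hash stage inputs; unchanged inputs give an unchanged key, so the stage is skipped."""
    h = hashlib.sha256()
    for name in sorted(input_blobs):  # sort for a deterministic key
        h.update(name.encode())
        h.update(input_blobs[name])
    return h.hexdigest()

files = {
    "auth/service.py": b"class AuthService: ...",
    "auth/mock.py": b"class MockAuthService: ...",
}
before = stage_fingerprint(files)
assert stage_fingerprint(files) == before  # nothing changed: cache hit, stage skipped

files["auth/service.py"] = b"class AuthService: pass"
assert stage_fingerprint(files) != before  # a file changed: stage re-runs
```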
## Performance
All numbers below are from my own ~50k LOC Go codebase using Qwen2.5:7b via Ollama. Treat them as directional, not a cross-repo benchmark.
The initial index builds in ~5 minutes. A query takes around 7 seconds, most of which is loading the local model on the first call. In `--agentic` mode the model is loaded once and kept warm across iterations, so a 3-iteration run costs roughly 7s plus two extra steering calls, not 3x7s.
Incremental reindexing is fast. When a handful of files change, re-running init takes around 30 seconds; only the stages affected by changed files are re-executed. The index is also crash-safe: progress is committed to SQLite in batches within each stage, so if a run is interrupted it picks up close to where it left off rather than starting over.
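The crash-safety pattern described above, committing every N rows so an interrupted run resumes near where it stopped, can be sketched like this (the table layout and function are illustrative, not combfind's actual code):

```python
import sqlite3

def index_symbols(db_path, symbols, batch_size=100):
    """Insert symbols, committing every batch_size rows so a crash loses at most one batch."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS symbols (qualified_name TEXT PRIMARY KEY, file TEXT)")
    inserted = 0
    for qname, path in symbols:
        conn.execute("INSERT OR REPLACE INTO symbols VALUES (?, ?)", (qname, path))
        inserted += 1
        if inserted % batch_size == 0:
            conn.commit()  # durable checkpoint: a crash after this loses at most one batch
    conn.commit()          # flush the final partial batch
    conn.close()
    return inserted

n = index_symbols(":memory:", [
    ("auth.service.AuthService.refresh", "auth/service.py"),
    ("auth.service.AuthService.validate", "auth/service.py"),
], batch_size=1)
print(n)  # → 2
```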
The goal is not to replace careful code reading. It is to give an agent a cheap orientation pass so it knows which 3-5 files to read rather than all 500. On that goal, combfind achieves file_recall@3 of 0.75 on structural queries with `--rerank`, evaluated against 10 hand-picked bug fixes from that codebase (n=10, single repo). No API costs, no multi-step LLM pipelines, runs fully local.
## How to query well
combfind matches against concept descriptions, so structural queries outperform symptom descriptions.
"Where are user creation request DTOs and their field definitions?" finds the right code immediately. "EmailVerified boolean gets rejected by the validator" does not, because the symptom vocabulary has no overlap with the code structure.
When an agent receives a bug ticket, the right move is to translate the symptom into a structural question before querying: not what went wrong, but where does this kind of code live.
## Concept roles

Every concept cluster is tagged with one of seven roles. An agent that finds `TokenRefresh` tagged `interface` knows to also look at all `implementation` siblings before touching anything. Not because the agent is smart, but because combfind surfaced them.
| Role | Meaning |
|---|---|
| `interface` | Contract or protocol definition; changes here propagate to all implementations |
| `implementation` | Concrete implementation of an interface; there may be siblings that also need updating |
| `orchestrator` | Coordinates other components; high fan-out, changes ripple broadly |
| `entry_point` | Top-level handlers (HTTP routes, CLI commands, queue consumers) |
| `domain_model` | Core data structures and business entities |
| `infrastructure` | I/O, persistence, external service clients |
| `cross_cutting` | Utilities, logging, auth middleware used throughout |
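One way an agent can act on these roles is a simple pre-edit policy: for roles whose changes tend to propagate, pull in the siblings before patching. A toy sketch (the policy set is our assumption, not part of combfind):

```python
# Roles whose edits tend to propagate beyond the matched symbol (an assumed policy).
NEEDS_SIBLING_CHECK = {"interface", "implementation", "orchestrator"}

def plan(concept_role, siblings):
    """Decide what to read before editing, based on the concept's structural role."""
    if concept_role in NEEDS_SIBLING_CHECK and siblings:
        return ["read siblings first"] + siblings
    return ["safe to patch locally"]

print(plan("interface", ["auth.mock.MockAuthService"]))
# → ['read siblings first', 'auth.mock.MockAuthService']
```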
## Supported languages

Python, Go, Java, Gleam, Erlang.
## Optional SCIP tools

These are not required but produce more accurate call and import edges than the tree-sitter fallback:

| Tool | Language | Install |
|---|---|---|
| scip-go | Go | `go install github.com/scip-code/scip-go/cmd/scip-go@latest` |
| scip-python | Python | `npm install -g @sourcegraph/scip-python` |
| scip-java | Java | scip-java releases |
## Using a remote LLM

Pass `--llm-mode openai` to use any OpenAI-compatible API:

```shell
export COMBFIND_LLM_BASE_URL=https://api.openai.com/v1
export COMBFIND_LLM_API_KEY=sk-...
export COMBFIND_LLM_MODEL=gpt-4o-mini
combfind init /path/to/repo --db repo.db --llm-mode openai
```

Works with OpenAI, Ollama (`http://localhost:11434/v1`), LM Studio (`http://localhost:1234/v1`), and any other OpenAI-compatible server.
## Environment variables

| Variable | Default | Description |
|---|---|---|
| `COMBFIND_LOG_LEVEL` | `info` | Log verbosity: `debug`, `info`, `warning`, `error` |
| `COMBFIND_MODEL` | auto-detected | GGUF path (local) or HF repo ID (mlx); equivalent to `--llm-model` |
| `COMBFIND_LLM_BASE_URL` | | Base URL for OpenAI-compatible API |
| `COMBFIND_LLM_API_KEY` | | API key for remote LLM |
| `COMBFIND_LLM_MODEL` | `gpt-4o-mini` | Model name for `--llm-mode openai` |
| `HF_HUB_OFFLINE` | | Set to `1` to use cached embedding models without network access |
## Contributing
See CONTRIBUTING.md for dev setup, commit conventions, and the release pipeline.