Big Indexer — language-agnostic hierarchical code intelligence
BGI - Big Indexer
Copyright (c) 2026 bigindexer.com — Licensed under Apache-2.0.
BGI is a static architecture analysis tool for large codebases.
It groups code units by behavioral role and emits explicit architectural boundaries.
Project domain: bigindexer.com
What problem this solves
Most architecture graphs fail at scale in two ways:
- too many noisy edges
- giant clusters that collapse unrelated components together
BGI is built to keep both under control, so the output remains usable on large repos.
What you can do with it (practical outcomes)
- Find probable component boundaries for refactoring and ownership.
- Spot high-coupling seams between subsystems.
- Generate machine-readable architecture artifacts (bgi-graph.json, fuse-graph.json) for automation and review.
- Feed AI agents implementation-oriented MCP context (task_fingerprint, behavioral_twins, twin_context) so they start from proven in-repo patterns.
5-minute example
Run BGI on the included fixture repo:
python3 - <<'PY'
from pipeline import run_scan
run_scan("tests/fixtures", language="python", output="/tmp/bgi-example.json")
PY
Observed result on this repository:
- units: 12
- edges: 14
- clusters: 2
- max cluster in sample: 6 units
One produced edge looks like:
{
  "source": "auth_module.py::AuthService::__init__",
  "target": "auth_module.py::AuthService::__del__",
  "key": "COV.INIT",
  "lock": "COV.TEARDOWN",
  "type": "HARD"
}
Why this matters: instead of raw syntax references only, you get behavioral relationships plus cluster structure that can drive architecture decisions.
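To make the edge shape concrete, here is a minimal sketch of how a script might consume such edges. It assumes the graph file exposes a top-level "edges" list, which this README does not confirm; only the five edge fields come from the example above. The helper names summarize_edges and file_of are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample: the one edge shown above, wrapped in an ASSUMED
# top-level "edges" list. The real bgi-graph.json layout may differ.
SAMPLE = """
{
  "edges": [
    {
      "source": "auth_module.py::AuthService::__init__",
      "target": "auth_module.py::AuthService::__del__",
      "key": "COV.INIT",
      "lock": "COV.TEARDOWN",
      "type": "HARD"
    }
  ]
}
"""


def summarize_edges(graph: dict) -> Counter:
    """Count edges per (key, lock, type) triple."""
    return Counter(
        (edge["key"], edge["lock"], edge["type"])
        for edge in graph.get("edges", [])
    )


def file_of(unit_id: str) -> str:
    """Return the file part of a 'file::Class::method' unit identifier."""
    return unit_id.split("::", 1)[0]


graph = json.loads(SAMPLE)
counts = summarize_edges(graph)
# Edges whose endpoints live in the same file.
same_file = [
    e for e in graph["edges"] if file_of(e["source"]) == file_of(e["target"])
]
print(counts)
print(len(same_file), "edge(s) stay within one file")
```

Grouping by key/lock pair is one way to see which behavioral relationships dominate a repository before looking at clusters.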
Plain-English glossary
| BGI term | Plain meaning |
|---|---|
| COV token | A behavior label for a unit (for example: FETCH, PERSIST, AUTHENTICATE) |
| Key-Lock edge | A behavioral connection between two units with complementary roles |
| DRS cluster | A group of units likely belonging to one architectural component |
| Fuse edge / fuse event | A refused merge because cluster growth hit the cap; treated as boundary signal |
| Spectral masks | Scope rules that limit where matching is allowed (global, directory, file) |
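The glossary terms can be illustrated with a toy model. The sketch below is not BGI's implementation: the Unit class, the COMPLEMENTS table, and the file paths are hypothetical, and the only key/lock pairing used is the one that appears in the example edge (COV.INIT with COV.TEARDOWN). It shows how a scope rule (global, directory, file) can limit which complementary units are connected.

```python
from dataclasses import dataclass
from itertools import permutations
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Unit:
    unit_id: str  # e.g. "pkg/auth.py::AuthService::__init__" (hypothetical path)
    token: str    # behavior label, e.g. "COV.INIT"


# ASSUMED key -> lock pairing, taken from the single example edge in this README.
COMPLEMENTS = {"COV.INIT": "COV.TEARDOWN"}


def path_of(unit: Unit) -> PurePosixPath:
    """File path portion of a unit identifier."""
    return PurePosixPath(unit.unit_id.split("::", 1)[0])


def in_scope(a: Unit, b: Unit, scope: str) -> bool:
    """Apply a scope rule: 'global', 'directory', or 'file'."""
    if scope == "global":
        return True
    if scope == "directory":
        return path_of(a).parent == path_of(b).parent
    if scope == "file":
        return path_of(a) == path_of(b)
    raise ValueError(f"unknown scope: {scope}")


def key_lock_edges(units, scope):
    """Yield (source, target) pairs with complementary tokens that are in scope."""
    for a, b in permutations(units, 2):
        if COMPLEMENTS.get(a.token) == b.token and in_scope(a, b, scope):
            yield (a.unit_id, b.unit_id)


units = [
    Unit("pkg/auth.py::AuthService::__init__", "COV.INIT"),
    Unit("pkg/auth.py::AuthService::__del__", "COV.TEARDOWN"),
    Unit("other/db.py::Pool::close", "COV.TEARDOWN"),
]
file_edges = list(key_lock_edges(units, "file"))
global_edges = list(key_lock_edges(units, "global"))
print(len(file_edges), "edge(s) with file scope")
print(len(global_edges), "edge(s) with global scope")
```

With file scope only the two AuthService methods are linked; with global scope the unrelated Pool::close also matches, which is the kind of noisy edge a tighter scope is meant to suppress.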
Architecture in one view
Source files
->
Gate 1: fingerprint unit behavior (COV tokens)
->
Gate 2: create behavioral edges with scoped matching
->
Gate 3: cluster with hard size cap + boundary emission
->
Artifacts: bgi-graph.json, fuse-graph.json, optional routes/graphml/html
Core approach:
- TOKEN-CENSUS - classify token frequency per repo.
- SPECTRAL-MASKS - restrict match scope by token frequency.
- FUSE-MAP - cap cluster growth and record refused merges.
- MASK-4-GATE-3 - use import proximity as clustering signal.
- WATER-CLOCK + .scm - single-pass query extraction path in Gate 1.
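The FUSE-MAP idea (cap cluster growth and record refused merges) can be sketched with a size-capped union-find. This is an illustrative stand-in under stated assumptions, not BGI's actual Gate 3 code: the class name CappedClusters, the max_size parameter, and the single-letter unit names are all hypothetical.

```python
class CappedClusters:
    """Union-find that refuses any merge that would exceed max_size."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.parent = {}
        self.size = {}
        self.fuse_events = []  # refused merges, kept as boundary signals

    def find(self, unit):
        """Return the cluster root for a unit, registering it if new."""
        self.parent.setdefault(unit, unit)
        self.size.setdefault(unit, 1)
        # Path halving keeps later lookups cheap.
        while self.parent[unit] != unit:
            self.parent[unit] = self.parent[self.parent[unit]]
            unit = self.parent[unit]
        return unit

    def merge(self, a, b) -> bool:
        """Merge the clusters of a and b; return False if the cap refuses it."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if self.size[ra] + self.size[rb] > self.max_size:
            self.fuse_events.append((a, b))
            return False
        # Attach the smaller cluster under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


clusters = CappedClusters(max_size=3)
for source, target in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]:
    clusters.merge(source, target)

print("fuse events:", clusters.fuse_events)
```

In this toy run the merge of c and d is refused because it would create a four-unit cluster, so the pair is recorded as a fuse event instead of silently growing the cluster.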
Why BGI is different from common alternatives
| Capability | LSP / SCIP index | Call-graph + generic community detection | BGI |
|---|---|---|---|
| Fast symbol lookup | Strong | Medium | Available (Phase 6 index) |
| Behavioral token model | No | Usually no | Yes |
| Hard-bounded clustering | No | Usually no | Yes |
| First-class boundary artifact | No | Usually no | Yes (fuse-graph.json) |
| Scope-constrained edge generation | Limited | Rare | Yes (spectral masks) |
Evidence (current, verifiable)
Large-repo scale evidence
Comparable kubernetes sample (go comparable mode, 162,917 units):
- Gate 1: 141.964s
- Gate 2: 67.261s (historical comparable baseline: 138.869s)
- Gate 3: 9.359s
- Total: 218.584s
- Max cluster: 1.113%
- Fuse events: 0
Artifact: output/validation/kubernetes-optionb-controlled-median-v21.json
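The figures above can be cross-checked with a few lines of arithmetic. The sketch uses only the numbers reported in this section; the interpretation of "Max cluster" as the largest cluster's share of all units is an assumption.

```python
# Timings and counts copied from the Kubernetes sample above.
gate_seconds = {"gate1": 141.964, "gate2": 67.261, "gate3": 9.359}
reported_total = 218.584
gate2_baseline = 138.869
units = 162_917
max_cluster_pct = 1.113

total = sum(gate_seconds.values())
gate2_speedup = gate2_baseline / gate_seconds["gate2"]
# ASSUMPTION: the percentage is the largest cluster's share of all units.
max_cluster_units = units * max_cluster_pct / 100

print(f"total: {total:.3f}s (reported {reported_total}s)")
print(f"gate 2 speedup vs baseline: {gate2_speedup:.2f}x")
print(f"largest cluster: about {max_cluster_units:.0f} units")
```

The three gate timings sum to the reported total, Gate 2 runs at roughly 2.06x the historical baseline speed, and under the stated assumption the largest cluster holds about 1,813 units.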
Quality guard evidence (beyond raw speed)
- Gate 2 scope safety tests block invalid cross-scope merges (see tests/test_gate2.py).
- Gate 3 tests verify no legacy namespace over-merge without import evidence (see tests/test_gate3.py).
- Current full suite status: python3 -m pytest tests/ -x -q (project baseline target remains passing).
Evidence summary
- Current published validation set: 100 scored runs across 5 repos and 3 models.
- Full 20-run post-shipment benchmark refresh for BGI-TWIN context (task → COV → top-3 twins + seam + rubric) is complete: actionability 4.75/5 (p04 slice: 4.8/5), boundary 1.0, hallucinations 0.
- Independent-model replication is now complete on azure/gpt-4o (20 runs) and gemini/auto (20 runs): GPT-4o actionability 4.85/5, Gemini actionability 4.25/5, both with zero hallucinations; Gemini boundary 0.95 reflects one genuine django/p02 miss.
- Still missing: labeled precision/recall benchmark on an external corpus and head-to-head quantitative benchmark vs external tools on the same labeled dataset.
Language support tiers (explicit)
BGI does not treat all languages equally; support is tiered:
- Query-backed (.scm): python, typescript
- Tree-sitter scanner + rule path: javascript, java, go, rust, ruby, csharp, php, kotlin, c, scala, lua, elixir
- Generic regex fallback by extension: swift, r, dart, bash, nim, zig, haskell, ocaml, fsharp, clojure, erlang, matlab, vb, crystal, cobol, groovy
Use this as a reliability signal: query-backed and dedicated scanner tiers are stronger than generic fallback.
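If you want to act on the tiers in a script (for example, to flag scans that fell back to regex), a lookup table is enough. The language lists below are copied from this section; the tier labels, the support_tier helper, and the "unsupported" default are hypothetical names chosen for illustration, not part of BGI's API.

```python
# Language lists copied from the tiers above; tier labels are illustrative.
SUPPORT_TIERS = {
    "query-backed": ["python", "typescript"],
    "tree-sitter": [
        "javascript", "java", "go", "rust", "ruby", "csharp",
        "php", "kotlin", "c", "scala", "lua", "elixir",
    ],
    "regex-fallback": [
        "swift", "r", "dart", "bash", "nim", "zig", "haskell", "ocaml",
        "fsharp", "clojure", "erlang", "matlab", "vb", "crystal",
        "cobol", "groovy",
    ],
}

# Invert the table so a language name maps directly to its tier.
TIER_BY_LANGUAGE = {
    lang: tier for tier, langs in SUPPORT_TIERS.items() for lang in langs
}


def support_tier(language: str) -> str:
    """Return the tier label for a language, or 'unsupported' if unlisted."""
    return TIER_BY_LANGUAGE.get(language.lower(), "unsupported")


for name in ("python", "Go", "zig", "fortran"):
    print(f"{name}: {support_tier(name)}")
```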
Limitations and non-goals
- BGI is static analysis; it does not ingest runtime traces.
- Cross-file semantic resolution is heuristic and language-dependent.
- Cluster-size health is measured; full external precision/recall is not yet published.
- Shared-host benchmarking introduces variance; decisions should use controlled medians.
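As a small illustration of why medians are preferred on noisy hosts, the sketch below compares mean and median over repeated timings. The numbers are placeholders invented for the example, not measured BGI results, and controlled_median is a hypothetical helper.

```python
from statistics import mean, median


def controlled_median(timings_s):
    """Median of repeated timings; requires at least three runs."""
    if len(timings_s) < 3:
        raise ValueError("need at least three runs for a meaningful median")
    return median(timings_s)


# PLACEHOLDER timings in seconds (NOT real BGI measurements): four quiet runs
# and one run disturbed by another workload on a shared host.
runs = [70.0, 66.0, 95.0, 67.0, 68.0]

print(f"mean:   {mean(runs):.1f}s")
print(f"median: {controlled_median(runs):.1f}s")
```

A single disturbed run pulls the mean upward while the median stays with the typical runs, which is why comparisons between versions are safer on medians.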
Install
pip install -e .
Quickstart commands
# scan
bgi scan /path/to/repo --lang auto --out bgi-graph.json
# optional outputs
bgi scan /path/to/repo --lang auto \
--fuse-graph fuse-graph.json \
--routes routes.json \
--graphml graph.graphml \
--html
# incremental
bgi scan /path/to/repo --lang auto --incremental --cache .bgi-cache.json
# diff
bgi diff /path/before /path/after --lang auto --out diff.json
# run MCP server over generated artifacts
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json
Example MCP usage pattern (from your client prompt):
Use MCP tool twin_context for:
"Add endpoint that validates input and persists data."
Return top twin candidate, seam suggestion, and rubric checklist.
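For automation, the scan commands in the quickstart can be assembled from Python. This is a minimal sketch that uses only flags shown above (--lang, --out, --fuse-graph, --incremental, --cache); whether all of them may be combined in a single invocation is an assumption, and build_scan_command and execute are hypothetical helper names.

```python
import shlex
import subprocess


def build_scan_command(repo, out="bgi-graph.json", lang="auto",
                       fuse_graph=None, incremental_cache=None):
    """Build a `bgi scan` argument list from the quickstart flags.

    ASSUMPTION: the flags can be combined in one invocation.
    """
    cmd = ["bgi", "scan", repo, "--lang", lang, "--out", out]
    if fuse_graph:
        cmd += ["--fuse-graph", fuse_graph]
    if incremental_cache:
        cmd += ["--incremental", "--cache", incremental_cache]
    return cmd


def execute(cmd):
    """Run the command and raise if bgi exits with a non-zero status."""
    subprocess.run(cmd, check=True)


cmd = build_scan_command(
    "/path/to/repo",
    fuse_graph="fuse-graph.json",
    incremental_cache=".bgi-cache.json",
)
# Print the command instead of running it, so the sketch works without bgi.
print(shlex.join(cmd))
```

Passing an argument list rather than a shell string avoids quoting problems with repository paths that contain spaces; call execute(cmd) once bgi is installed.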
Current project status
- Phase 1 quality architecture: complete
- Phase 5 Water-Clock: complete
- Phase 6 interactive index/search: complete
- Phase 7 Option B (Gate 2 performance tuning): complete
- Phase 8 MCP + BGI-TWIN context packaging: shipped (full 20-run post-shipment refresh + GPT-4o + Gemini replication complete)
- Phase 9 public launch: in progress (local/public doc split complete; registry submission next)
Documentation map
- MEMORANDUM.md - design contracts and invariants
- docs/LANGUAGE_SUPPORT.md - language implementation details
- docs/CONTRIBUTING_LANGUAGES.md - language contribution guide
- docs/INDEX_SCHEMA.md - interactive index schema
- docs/QUERY_PLANNER.md - query planner scoring
- docs/MCP_SETUP.md - MCP server setup and usage
- https://bigindexer.com/validation - public validation evidence
- docs/MCP_QUICKSTART_DEMO.md - 5-minute demo walkthrough
- docs/MCP_EXAMPLE_TRANSCRIPTS.md - real-world MCP tool invocation examples
- docs/MCP_REAL_TRANSCRIPT.md - unedited transcript from FastAPI analysis
- scripts/mcp-demo.sh - automated demo script for multiple CLIs and repositories
License and Copyright
- License: Apache License 2.0 (LICENSE)
- Copyright holder: bigindexer.com
- Contributor terms: Developer Certificate of Origin (DCO) enforced on pull requests