Big Indexer — language-agnostic hierarchical code intelligence

BGI - Big Indexer

Copyright (c) 2026 bigindexer.com — Licensed under Apache-2.0.

BGI is a static architecture analysis tool for large codebases. It groups code units by behavioral role and emits explicit architectural boundaries. Project domain: bigindexer.com

What problem this solves

Most architecture graphs fail at scale in two ways:

too many noisy edges
giant clusters that collapse unrelated components together

BGI is built to keep both under control, so the output remains usable on large repos.

What you can do with it (practical outcomes)

Find probable component boundaries for refactoring and ownership.
Spot high-coupling seams between subsystems.
Generate machine-readable architecture artifacts (bgi-graph.json, fuse-graph.json) for automation and review.
Feed AI agents implementation-oriented MCP context (task_fingerprint, behavioral_twins, twin_context) so they start from proven in-repo patterns.

5-minute example

Run BGI on the included fixture repo:

python3 - <<'PY'
from pipeline import run_scan
run_scan("tests/fixtures", language="python", output="/tmp/bgi-example.json")
PY

Observed result on this repository:

units: 12
edges: 14
clusters: 2
max cluster in sample: 6 units

One produced edge looks like:

{
  "source": "auth_module.py::AuthService::__init__",
  "target": "auth_module.py::AuthService::__del__",
  "key": "COV.INIT",
  "lock": "COV.TEARDOWN",
  "type": "HARD"
}

Why this matters: instead of raw syntax references only, you get behavioral relationships plus cluster structure that can drive architecture decisions.
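
As a rough illustration of how such edges might be consumed, the sketch below counts edges by type and by key/lock pair, and picks out edges that cross file boundaries. It runs on an inline list shaped like the edge shown above. The helper names (summarize_edges, cross_file_edges) are hypothetical, the second edge is invented for illustration, and the layout of edges inside bgi-graph.json is not described here, so no file loading is shown.

```python
from collections import Counter

# Edge records shaped like the example above. The second record is
# invented for illustration only; it is not real BGI output.
edges = [
    {
        "source": "auth_module.py::AuthService::__init__",
        "target": "auth_module.py::AuthService::__del__",
        "key": "COV.INIT",
        "lock": "COV.TEARDOWN",
        "type": "HARD",
    },
    {
        "source": "store.py::load",
        "target": "store.py::save",
        "key": "COV.FETCH",
        "lock": "COV.PERSIST",
        "type": "HARD",
    },
]


def summarize_edges(edges):
    """Count edges per type and per (key, lock) pair (hypothetical helper)."""
    by_type = Counter(e["type"] for e in edges)
    by_pair = Counter((e["key"], e["lock"]) for e in edges)
    return by_type, by_pair


def cross_file_edges(edges):
    """Return edges whose source and target live in different files."""
    def file_of(unit_id):
        # Unit ids in the example start with the file path, then "::".
        return unit_id.split("::", 1)[0]

    return [e for e in edges if file_of(e["source"]) != file_of(e["target"])]


by_type, by_pair = summarize_edges(edges)
print(by_type["HARD"])                          # 2
print(by_pair[("COV.INIT", "COV.TEARDOWN")])    # 1
print(len(cross_file_edges(edges)))             # 0
```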

Plain-English glossary

BGI term	Plain meaning
COV token	A behavior label for a unit (for example: `FETCH`, `PERSIST`, `AUTHENTICATE`)
Key-Lock edge	A behavioral connection between two units with complementary roles
DRS cluster	A group of units likely belonging to one architectural component
Fuse edge / fuse event	A refused merge because cluster growth hit the cap; treated as boundary signal
Spectral masks	Scope rules that limit where matching is allowed (global, directory, file)

Architecture in one view

Source files
   ->
Gate 1: fingerprint unit behavior (COV tokens)
   ->
Gate 2: create behavioral edges with scoped matching
   ->
Gate 3: cluster with hard size cap + boundary emission
   ->
Artifacts: bgi-graph.json, fuse-graph.json, optional routes/graphml/html
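
Gate 3's hard size cap and its refused-merge ("fuse") record can be pictured with a capped union-find. This is a minimal sketch of the general idea only, not BGI's actual implementation: the class name CappedClusters, the cap value, and the unit names are all assumptions made for the example.

```python
class CappedClusters:
    """Union-find that refuses any merge that would exceed a size cap."""

    def __init__(self, units, cap):
        self.parent = {u: u for u in units}
        self.size = {u: 1 for u in units}
        self.cap = cap
        self.fuse_events = []  # refused merges, kept as boundary signals

    def find(self, u):
        # Walk to the root, halving the path along the way.
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]
            u = self.parent[u]
        return u

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if self.size[ra] + self.size[rb] > self.cap:
            # Merge refused: record the edge instead of growing the cluster.
            self.fuse_events.append((a, b))
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

    def clusters(self):
        groups = {}
        for u in self.parent:
            groups.setdefault(self.find(u), []).append(u)
        return sorted(sorted(g) for g in groups.values())


units = ["a", "b", "c", "d", "e"]
cc = CappedClusters(units, cap=3)
for src, dst in [("a", "b"), ("b", "c"), ("d", "e"), ("c", "d")]:
    cc.union(src, dst)

print(cc.clusters())    # [['a', 'b', 'c'], ['d', 'e']]
print(cc.fuse_events)   # [('c', 'd')]
```

In this toy run the last edge would create a five-unit cluster, so it is refused and recorded; the recorded pair plays the role of a boundary between the two clusters.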

Core approach:

TOKEN-CENSUS - classify token frequency per repo.
SPECTRAL-MASKS - restrict match scope by token frequency.
FUSE-MAP - cap cluster growth and record refused merges.
MASK-4-GATE-3 - use import proximity as clustering signal.
WATER-CLOCK + .scm - single-pass query extraction path in Gate 1.
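
To make the first two items concrete, here is a small sketch of a token census feeding scope masks. It assumes that rarer tokens are allowed wider match scopes and very common tokens are confined to a single file; that direction, the thresholds, the unit layout, and the function names are all assumptions for illustration, not BGI's documented rules.

```python
from collections import Counter
from pathlib import PurePosixPath


def token_census(units):
    """Return, per token, the fraction of units that carry it."""
    counts = Counter(tok for unit in units for tok in set(unit["tokens"]))
    return {tok: n / len(units) for tok, n in counts.items()}


def mask_for(freq, rare_max=0.25, common_max=0.5):
    """Pick a match scope from a token's frequency (thresholds are made up)."""
    if freq <= rare_max:
        return "global"      # rare token: may match anywhere in the repo
    if freq <= common_max:
        return "directory"   # mid-frequency token: same directory only
    return "file"            # very common token: same file only


def in_scope(a, b, mask):
    """True if units a and b may be matched under the given mask."""
    pa, pb = PurePosixPath(a["path"]), PurePosixPath(b["path"])
    if mask == "file":
        return pa == pb
    if mask == "directory":
        return pa.parent == pb.parent
    return True


# Hypothetical units; token names follow the glossary examples.
units = [
    {"path": "api/users.py", "tokens": ["FETCH", "PERSIST"]},
    {"path": "api/orders.py", "tokens": ["FETCH", "PERSIST"]},
    {"path": "db/store.py", "tokens": ["PERSIST"]},
    {"path": "auth/login.py", "tokens": ["AUTHENTICATE", "PERSIST"]},
]

census = token_census(units)
masks = {tok: mask_for(freq) for tok, freq in census.items()}

print(masks["AUTHENTICATE"], masks["FETCH"], masks["PERSIST"])
print(in_scope(units[0], units[1], masks["FETCH"]))     # same directory
print(in_scope(units[0], units[2], masks["PERSIST"]))   # different files
```

The point of the sketch is that a token present on almost every unit would otherwise connect everything to everything, which is one way noisy edges arise.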

Why BGI is different from common alternatives

Capability	LSP / SCIP index	Call-graph + generic community detection	BGI
Fast symbol lookup	Strong	Medium	Available (Phase 6 index)
Behavioral token model	No	Usually no	Yes
Hard-bounded clustering	No	Usually no	Yes
First-class boundary artifact	No	Usually no	Yes (`fuse-graph.json`)
Scope-constrained edge generation	Limited	Rare	Yes (spectral masks)

Evidence (current, verifiable)

Large-repo scale evidence

Comparable Kubernetes sample (Go comparable mode, 162,917 units):

Gate 1: 141.964s
Gate 2: 67.261s (historical comparable baseline: 138.869s)
Gate 3: 9.359s
Total: 218.584s
Max cluster: 1.113%
Fuse events: 0

Artifact: output/validation/kubernetes-optionb-controlled-median-v21.json
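
The figures above can be cross-checked with a few lines of arithmetic. The sketch uses only the numbers reported in this section; the unit count for the largest cluster assumes that the max-cluster percentage is a share of all units, which is not stated explicitly.

```python
# Timings (seconds) and counts copied from the Kubernetes sample above.
gate_seconds = {"gate1": 141.964, "gate2": 67.261, "gate3": 9.359}
gate2_baseline = 138.869
units = 162_917
max_cluster_pct = 1.113

total = round(sum(gate_seconds.values()), 3)
speedup = gate2_baseline / gate_seconds["gate2"]
throughput = units / total

# Assumption: the percentage is measured against the total unit count.
max_cluster_units = round(units * max_cluster_pct / 100)

print(total)                  # 218.584, matching the reported total
print(f"{speedup:.2f}x")      # 2.06x Gate 2 speedup over the baseline
print(round(throughput))      # about 745 units per second end to end
print(max_cluster_units)      # about 1813 units, under the stated assumption
```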

Quality guard evidence (beyond raw speed)

Gate 2 scope safety tests block invalid cross-scope merges (see tests/test_gate2.py).
Gate 3 tests verify no legacy namespace over-merge without import evidence (see tests/test_gate3.py).
Full suite: run python3 -m pytest tests/ -x -q. The project baseline is that this suite passes.

Evidence summary

Current published validation set: 100 scored runs across 5 repos and 3 models.
The full 20-run post-shipment benchmark refresh for BGI-TWIN context (task → COV → top-3 twins + seam + rubric) is complete: actionability 4.75/5 (p04 slice: 4.8/5), boundary 1.0, hallucinations 0.
Independent-model replication is complete on azure/gpt-4o (20 runs) and gemini/auto (20 runs): GPT-4o actionability 4.85/5, Gemini actionability 4.25/5, both with zero hallucinations. Gemini's boundary score of 0.95 reflects one genuine django/p02 miss.
Still missing: a labeled precision/recall benchmark on an external corpus, and a head-to-head quantitative benchmark against external tools on the same labeled dataset.

Language support tiers (explicit)

BGI does not treat all languages equally; support is tiered:

Query-backed (.scm): python, typescript
Tree-sitter scanner + rule path: javascript, java, go, rust, ruby, csharp, php, kotlin, c, scala, lua, elixir
Generic regex fallback by extension: swift, r, dart, bash, nim, zig, haskell, ocaml, fsharp, clojure, erlang, matlab, vb, crystal, cobol, groovy

Use this as a reliability signal: query-backed and dedicated scanner tiers are stronger than generic fallback.
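
For tooling that wants to act on this signal, the tiers can be captured as a lookup table. The language lists are copied from above; the short tier labels ("query", "scanner", "regex") and the function name support_tier are hypothetical and are not identifiers used by BGI.

```python
# Tier labels are illustrative names, not BGI identifiers.
TIERS = {
    "query": ("python", "typescript"),
    "scanner": (
        "javascript", "java", "go", "rust", "ruby", "csharp",
        "php", "kotlin", "c", "scala", "lua", "elixir",
    ),
    "regex": (
        "swift", "r", "dart", "bash", "nim", "zig", "haskell", "ocaml",
        "fsharp", "clojure", "erlang", "matlab", "vb", "crystal",
        "cobol", "groovy",
    ),
}

# Invert the table once so lookups are a single dict access.
TIER_BY_LANGUAGE = {
    lang: tier for tier, langs in TIERS.items() for lang in langs
}


def support_tier(language):
    """Return the tier label for a language, or None if it is not listed."""
    return TIER_BY_LANGUAGE.get(language.lower())


print(support_tier("Python"))    # query
print(support_tier("go"))        # scanner
print(support_tier("cobol"))     # regex
print(support_tier("fortran"))   # None
```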

Limitations and non-goals

BGI is static analysis; it does not ingest runtime traces.
Cross-file semantic resolution is heuristic and language-dependent.
Cluster-size health is measured; full external precision/recall is not yet published.
Shared-host benchmarking introduces variance; decisions should use controlled medians.

Install

pip install -e .

Quickstart commands

# scan
bgi scan /path/to/repo --lang auto --out bgi-graph.json

# optional outputs
bgi scan /path/to/repo --lang auto \
  --fuse-graph fuse-graph.json \
  --routes routes.json \
  --graphml graph.graphml \
  --html

# incremental
bgi scan /path/to/repo --lang auto --incremental --cache .bgi-cache.json

# diff
bgi diff /path/before /path/after --lang auto --out diff.json

# run MCP server over generated artifacts
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json

Example MCP usage pattern (from your client prompt):

Use MCP tool twin_context for:
"Add endpoint that validates input and persists data."
Return top twin candidate, seam suggestion, and rubric checklist.
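
For automation, the scan commands above can be assembled from Python. This sketch only builds the argument list from flags shown in this section and prints it; the function name build_scan_command is hypothetical, and passing --out together with --fuse-graph or --incremental in one invocation is an assumption that has not been checked against the CLI.

```python
import shlex


def build_scan_command(repo, out="bgi-graph.json", lang="auto",
                       fuse_graph=None, cache=None):
    """Build an argv list for `bgi scan` using flags shown in the quickstart."""
    cmd = ["bgi", "scan", repo, "--lang", lang, "--out", out]
    if fuse_graph:
        cmd += ["--fuse-graph", fuse_graph]
    if cache:
        # Incremental mode pairs --incremental with a cache file.
        cmd += ["--incremental", "--cache", cache]
    return cmd


cmd = build_scan_command("/path/to/repo", fuse_graph="fuse-graph.json")
print(shlex.join(cmd))

# To actually run the scan (requires bgi to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```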

Current project status

Phase 1 quality architecture: complete
Phase 5 Water-Clock: complete
Phase 6 interactive index/search: complete
Phase 7 Option B (Gate 2 performance tuning): complete
Phase 8 MCP + BGI-TWIN context packaging: shipped (full 20-run post-shipment refresh + GPT-4o + Gemini replication complete)
Phase 9 public launch: in progress (local/public doc split complete; registry submission next)

Documentation map

MEMORANDUM.md - design contracts and invariants
docs/LANGUAGE_SUPPORT.md - language implementation details
docs/CONTRIBUTING_LANGUAGES.md - language contribution guide
docs/INDEX_SCHEMA.md - interactive index schema
docs/QUERY_PLANNER.md - query planner scoring
docs/MCP_SETUP.md - MCP server setup and usage
https://bigindexer.com/validation - public validation evidence
docs/MCP_QUICKSTART_DEMO.md - 5-minute demo walkthrough
docs/MCP_EXAMPLE_TRANSCRIPTS.md - real-world MCP tool invocation examples
docs/MCP_REAL_TRANSCRIPT.md - unedited transcript from FastAPI analysis
scripts/mcp-demo.sh - automated demo script for multiple CLIs and repositories

License and Copyright

License: Apache License 2.0 (LICENSE)
Copyright holder: bigindexer.com
Contributor terms: Developer Certificate of Origin (DCO) enforced on pull requests

Project details

Release history

0.1.2 (this version): May 13, 2026
0.1.1: May 13, 2026

Download files

Source Distribution

bigindexer-0.1.2.tar.gz (482.4 kB), uploaded May 13, 2026, Source

Built Distribution

bigindexer-0.1.2-py3-none-any.whl (622.3 kB), uploaded May 13, 2026, Python 3

File details

Details for the file bigindexer-0.1.2.tar.gz.

File metadata

Download URL: bigindexer-0.1.2.tar.gz
Upload date: May 13, 2026
Size: 482.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.16

File hashes

Hashes for bigindexer-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`357cd031ab2afdf83c7fc7c8c3e48df094fd521f880297714013e9a3973ab0e7`
MD5	`2f36079253197f73afc39fc421fc2442`
BLAKE2b-256	`0efb2a74f5d027593435b0f95a04a9bc86de33e23be65b7b37a9d540c2070e8c`

File details

Details for the file bigindexer-0.1.2-py3-none-any.whl.

File metadata

Download URL: bigindexer-0.1.2-py3-none-any.whl
Upload date: May 13, 2026
Size: 622.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.16

File hashes

Hashes for bigindexer-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b1592e02bff38f18141df1eed5254485b5ae05382cd18fa58ebce8da2c8a9ce4`
MD5	`e1222d4288fc4019121f16e4852dab52`
BLAKE2b-256	`2b1a2add2707fce6b114f2b7e7aa27719de47233d572efc06b1b188355830bcc`
