Generate an AI-ready knowledge graph (KNOWLEDGE_GRAPH.md + kg.json) of any codebase: modules, dependency edges, branches, PRs, ops surface.
repokg
Generate an AI-ready knowledge graph of any codebase — so an AI agent (or a new developer) can read one file and start building immediately.
repokg extracts everything that can be known deterministically about a repo —
module inventory, internal import graph, every branch classified against every PR
(merged / squash-merged / abandoned / stale), contributor stats, CI/Docker/Helm/Make
surface — and renders it as:
- KNOWLEDGE_GRAPH.md: a single human/AI-readable document with a mermaid architecture graph, module tables, branch & PR catalog, timeline, and ops inventory.
- .repokg/kg.json: the same graph, machine-readable.
The semantic layer (module purposes, data-flow narratives, project eras, gotchas)
can't be produced by static analysis without guessing — so repokg is
agent-first: it emits .repokg/prompts/enrich.md, a rigorous prompt any AI coding
agent (Claude Code, Cursor, Copilot Workspace…) executes to verify-and-fill the
narrative sections, writing .repokg/narratives.json. Re-render and the knowledge graph is
complete. No API keys, no LLM dependency in the tool itself.
Install
pipx install repokg # or: pip install repokg
# from source:
pipx install git+https://github.com/NehharShah/repokg
Requirements: Python ≥ 3.9, git. Optional: gh (logged
in) for the PR/branch cross-reference — without it the knowledge graph still builds, minus PR data.
Usage
cd your-repo
repokg # = generate: scan + prompts + render
Output:
.repokg/kg.json # machine-readable knowledge graph
.repokg/prompts/enrich.md # hand this to your AI agent
KNOWLEDGE_GRAPH.md # the knowledge graph document
Then, in your AI agent of choice:
Follow the instructions in .repokg/prompts/enrich.md
The agent explores the code, writes .repokg/narratives.json, and runs
repokg render — KNOWLEDGE_GRAPH.md now carries verified purposes, data flows,
timeline eras, and gotchas alongside the deterministic structure.
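The generate, enrich, render loop above can be sketched as a small helper that looks at which artifacts exist and suggests the next step. This is only an illustration, not part of repokg: the function name `next_step` is hypothetical, and it assumes the default output locations listed above.

```python
from pathlib import Path


def next_step(repo: Path) -> str:
    """Suggest the next action based on which default artifacts exist."""
    out = repo / ".repokg"  # default output directory (assumed unchanged)
    if not (out / "kg.json").is_file():
        return "repokg generate"  # nothing has been scanned yet
    if not (out / "narratives.json").is_file():
        # Structure exists, but the agent has not written narratives.
        return "run agent on .repokg/prompts/enrich.md"
    return "repokg render"  # narratives exist; re-render the document
```

For example, `next_step(Path("."))` in a freshly cloned repo would suggest `repokg generate`.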
Commands
| Command | Effect |
|---|---|
| repokg scan [path] | Extract structure → .repokg/kg.json |
| repokg prompts [path] | Write the enrichment prompt |
| repokg render [path] | kg.json (+ narratives.json) → KNOWLEDGE_GRAPH.md |
| repokg generate [path] | All three (default) |
| repokg inject [path] | Wire the knowledge graph into CLAUDE.md / AGENTS.md / Cursor rules (--diff for dry run) |
| repokg audit [path] | Show every inferred conclusion with confidence + evidence (--json for machines) |
| repokg clean [path] | Remove everything repokg authored; never touches your content (--diff for dry run) |
| repokg check [path] | Exit 1 if the knowledge graph is stale vs HEAD (CI-friendly) |
Flags: --out DIR (default <repo>/.repokg), --md FILE (default <repo>/KNOWLEDGE_GRAPH.md),
--no-github, --pr-limit N, --diff, --json.
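Because `repokg check` signals staleness through its exit status, it is easy to wrap from a script. The sketch below is a hypothetical helper (the name `graph_is_stale` and the injectable `run` parameter are not part of repokg); it relies only on the documented behaviour that the command exits 1 when the graph is stale, and treats any other exit status as "not stale".

```python
import subprocess
from typing import Callable


def graph_is_stale(
    repo_path: str = ".",
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when `repokg check` exits with status 1 (stale vs HEAD)."""
    # `run` is injectable so the helper can be exercised without repokg installed.
    result = run(["repokg", "check", repo_path], check=False)
    return result.returncode == 1
```

A pre-commit hook or release script could call `graph_is_stale()` and decide whether to warn or fail.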
Honesty layer
Most of the graph is measured fact. The parts that are heuristics are labeled
as findings with confidence and evidence, surfaced by repokg audit:
[git]
trunk = master high detected via origin/HEAD symref
integration = staging medium matched a well-known integration branch name
[modules]
4 flagged generated low path-name heuristic; verify before excluding
Agent-written narratives.json is schema-validated before rendering — malformed
enrichment fails loudly with errors precise enough for the agent to self-correct.
(Findings/confidence design inspired by RepoCanon.)
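To make the findings idea concrete, here is a minimal sketch of a finding record and a filter for the ones worth a human look. The `Finding` class, its field names, and `needs_review` are hypothetical and do not reflect repokg's actual `--json` schema; the sample values are copied from the audit output above.

```python
from dataclasses import dataclass
from typing import List

# Ordering of the three confidence labels shown in the audit sample.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Finding:
    section: str     # e.g. "git" or "modules"
    claim: str       # the inferred conclusion
    confidence: str  # "low", "medium", or "high"
    evidence: str    # why the tool believes it


def needs_review(findings: List[Finding], threshold: str = "medium") -> List[Finding]:
    """Return findings whose confidence is at or below the threshold."""
    limit = CONFIDENCE_RANK[threshold]
    return [f for f in findings if CONFIDENCE_RANK[f.confidence] <= limit]


sample = [
    Finding("git", "trunk = master", "high", "detected via origin/HEAD symref"),
    Finding("git", "integration = staging", "medium",
            "matched a well-known integration branch name"),
    Finding("modules", "4 flagged generated", "low",
            "path-name heuristic; verify before excluding"),
]
```

With the sample above, `needs_review(sample)` keeps the medium and low findings and drops the high-confidence trunk detection.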
Agent integration
repokg inject adds a managed block (delimited by
<!-- repokg:begin/end -->, idempotent, never touches your hand-written
content) pointing agents at KNOWLEDGE_GRAPH.md:
- CLAUDE.md (Claude Code): updated if present
- AGENTS.md (the cross-tool agent standard): updated if present, created if no agent file exists at all
- .github/copilot-instructions.md (Copilot): updated if present
- .cursor/rules/repokg.mdc (Cursor, with alwaysApply: true): created if .cursor/rules/ exists; falls back to legacy .cursorrules
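The idempotent managed-block idea can be sketched as follows. This is not repokg's implementation: the function name `upsert_block` is hypothetical, and the exact marker spellings `<!-- repokg:begin -->` and `<!-- repokg:end -->` are an assumption based on the shorthand above.

```python
import re

BEGIN = "<!-- repokg:begin -->"  # assumed marker spelling
END = "<!-- repokg:end -->"      # assumed marker spelling
_BLOCK = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)


def upsert_block(text: str, body: str) -> str:
    """Insert or replace the managed block; text outside the markers is kept."""
    block = f"{BEGIN}\n{body}\n{END}"
    if _BLOCK.search(text):
        # Replace the existing block in place. A lambda is used so that
        # backslashes in `body` are not interpreted as regex escapes.
        return _BLOCK.sub(lambda _m: block, text, count=1)
    if not text:
        return block + "\n"
    # Append after the hand-written content, separated by one blank line
    # (only trailing newlines are normalized).
    return text.rstrip("\n") + "\n\n" + block + "\n"
```

Running it twice with the same body yields the same file, which is the property that makes re-running an inject step safe.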
Keep it fresh in CI:
- run: pipx run repokg check . || echo "::warning::KNOWLEDGE_GRAPH.md is stale"
KNOWLEDGE_GRAPH.md itself also lists any agent-context files it found, so an agent landing on the knowledge graph discovers your rules — and vice versa.
What gets extracted (all verified, never guessed)
| Area | How |
|---|---|
| Branch classification | git for-each-ref + --merged ancestry vs the integration branch (auto-detects staging/develop), cross-referenced with every PR's head ref via gh — distinguishes true merges from squash-merges from abandoned work |
| PR catalog | gh pr list --state all — open / merged / closed-unmerged, full appendix table |
| Module inventory | Filesystem walk with LOC per directory, language detection, generated-code flagging |
| Import graph | Go: import blocks resolved against go.mod module paths · Python: stdlib ast incl. relative imports · JS/TS: relative import/require resolution. Directory→directory edges with counts |
| Ops surface | CI workflow names, Dockerfiles, compose files, Helm charts, Makefile targets, config/docs/test/migration dirs |
| Timeline | Merged PRs grouped by month with conventional-commit scope frequencies (replaced by agent-written eras after enrichment) |
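The branch classification row combines two signals: git ancestry and the state of the PR whose head ref matches the branch. A rough sketch of that decision is below. The function name, the state strings, and especially the catch-all treatment of open or PR-less branches as "stale" are assumptions for illustration; repokg's real rules may differ (for instance, staleness likely also depends on age).

```python
from typing import Optional


def classify_branch(is_ancestor: bool, pr_state: Optional[str]) -> str:
    """Classify a branch from git ancestry plus the state of its matching PR.

    is_ancestor: the branch tip is reachable from the integration branch
                 (the condition `git branch --merged` tests).
    pr_state:    "merged", "closed", "open", or None when no PR has this head ref.
    """
    if is_ancestor:
        return "merged"         # true merge: commits are in the integration history
    if pr_state == "merged":
        return "squash-merged"  # PR merged, but the original commits are not ancestors
    if pr_state == "closed":
        return "abandoned"      # PR closed without merging
    return "stale"              # assumed catch-all: no merge evidence found
```

The key point is the second branch: ancestry alone would misreport a squash-merged branch as unmerged, so the PR state is needed to tell the two apart.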
Why agent-first instead of calling an LLM API?
Because the enrichment quality depends on reading the code, and your coding agent
already has the repo open, tools to search it, and your permission model. A prompt it
can execute beats a second LLM integration with its own keys, costs, and context limits.
The contract between tool and agent is one JSON file (narratives.json) with a fixed
schema — everything else stays deterministic and reproducible.
Known limitations
- JS/TS: only relative imports are resolved; alias imports (@/…, tsconfig paths) are ignored.
- Fork PRs: a fork PR whose head branch name matches a local branch will be linked to it (GitHub's API reports bare head refs).
- Python: packages are discovered at the repo root and under src/; deeper monorepo layouts (packages/*/src/…) get file-level edges only.
- Branch ahead counts use one batched git call on git ≥ 2.41, with a per-branch fallback on older git.
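The git ≥ 2.41 threshold in the last limitation implies a version gate of roughly the following shape. This is a sketch with hypothetical function names, not repokg's code. The source does not name the batched call; it is presumably the `ahead-behind` atom that `git for-each-ref` gained in 2.41, but treat that as an assumption.

```python
import re
from typing import Tuple


def parse_git_version(output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the output of `git --version`."""
    match = re.search(r"git version (\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized git version string: {output!r}")
    return int(match.group(1)), int(match.group(2))


def supports_batched_ahead(output: str) -> bool:
    """True when git is at least 2.41, the threshold named above."""
    return parse_git_version(output) >= (2, 41)
```

A caller would feed it the text of `git --version` and pick the batched path or the per-branch fallback accordingly.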
Roadmap
- Rust / Java / Kotlin import graphs
- --exclude glob patterns
- llms.txt emission alongside KNOWLEDGE_GRAPH.md
- tsconfig paths alias resolution
- PyPI release + prebuilt GitHub Action
Development
pip install -e .
python -m unittest discover -s tests -v
No runtime dependencies — stdlib only.
License
MIT