Generate an AI-ready knowledge graph (KNOWLEDGE_GRAPH.md + kg.json) of any codebase: modules, dependency edges, branches, PRs, ops surface.

repokg

Generate an AI-ready knowledge graph of any codebase — so an AI agent (or a new developer) can read one file and start building immediately.

repokg extracts everything that can be known deterministically about a repo — module inventory, internal import graph, every branch classified against every PR (merged / squash-merged / abandoned / stale), contributor stats, CI/Docker/Helm/Make surface — and renders it as:

  • KNOWLEDGE_GRAPH.md — a single human/AI-readable document with a mermaid architecture graph, module tables, branch & PR catalog, timeline, and ops inventory.
  • .repokg/kg.json — the same graph, machine-readable.
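Since kg.json is plain JSON, a consumer can inspect it with nothing beyond the standard library. The sketch below assumes nothing about the schema (which is not documented here): it only lists whatever top-level keys the file happens to contain. The function name summarize_kg is illustrative, not part of repokg.

```python
import json
from pathlib import Path


def summarize_kg(path):
    """Return {top-level key: short description of its value} for a JSON file.

    No repokg schema is assumed; this only reports what is actually present.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return {"<root>": type(data).__name__}
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary
```

Calling summarize_kg(".repokg/kg.json") after a scan gives a quick overview of what the graph contains before writing any schema-specific tooling.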

The semantic layer (module purposes, data-flow narratives, project eras, gotchas) can't be produced by static analysis without guessing, so repokg is agent-first. It emits .repokg/prompts/enrich.md, a rigorous prompt that any AI coding agent (Claude Code, Cursor, Copilot Workspace…) can execute to verify and fill in the narrative sections, writing the result to .repokg/narratives.json. Run repokg render again and the knowledge graph is complete. No API keys, no LLM dependency in the tool itself.

Install

pipx install repokg        # or: pip install repokg
# from source:
pipx install git+https://github.com/NehharShah/repokg

Requirements: Python ≥ 3.9, git. Optional: gh (logged in) for the PR/branch cross-reference — without it the knowledge graph still builds, minus PR data.

Usage

cd your-repo
repokg                      # = generate: scan + prompts + render

Output:

.repokg/kg.json               # machine-readable knowledge graph
.repokg/prompts/enrich.md     # hand this to your AI agent
KNOWLEDGE_GRAPH.md            # the knowledge graph document

Then, in your AI agent of choice:

Follow the instructions in .repokg/prompts/enrich.md

The agent explores the code, writes .repokg/narratives.json, and runs repokg render — KNOWLEDGE_GRAPH.md now carries verified purposes, data flows, timeline eras, and gotchas alongside the deterministic structure.

Commands

Command                  Effect
repokg scan [path]       Extract structure → .repokg/kg.json
repokg prompts [path]    Write the enrichment prompt
repokg render [path]     kg.json (+ narratives.json) → KNOWLEDGE_GRAPH.md
repokg generate [path]   All three (default)
repokg inject [path]     Wire the knowledge graph into CLAUDE.md / AGENTS.md / Cursor rules (--diff for dry run)
repokg audit [path]      Show every inferred conclusion with confidence + evidence (--json for machines)
repokg clean [path]      Remove everything repokg authored; never touches your content (--diff for dry run)
repokg check [path]      Exit 1 if the knowledge graph is stale vs HEAD (CI-friendly)

Flags: --out DIR (default <repo>/.repokg), --md FILE (default <repo>/KNOWLEDGE_GRAPH.md), --no-github, --pr-limit N, --diff, --json.
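Because repokg check signals staleness through its exit status, it is easy to wrap in a script such as a pre-commit hook. The sketch below is a generic helper, not part of repokg. It relies on the documented "exit 1 if stale" behaviour and additionally assumes that exit 0 means fresh and that any other status is an unexpected failure.

```python
import subprocess


def is_stale(command):
    """Run a staleness-check command and return True when it exits with 1.

    Assumption: 0 means "fresh" and anything other than 0 or 1 is an error.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode not in (0, 1):
        raise RuntimeError(
            f"check failed unexpectedly ({result.returncode}): "
            f"{result.stderr.strip()}"
        )
    return result.returncode == 1
```

With repokg on PATH, is_stale(["repokg", "check", "."]) returns True when KNOWLEDGE_GRAPH.md needs to be regenerated.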

Honesty layer

Most of the graph is measured fact. The parts that are heuristics are labeled as findings with confidence and evidence, surfaced by repokg audit:

[git]
  trunk = master          high    detected via origin/HEAD symref
  integration = staging   medium  matched a well-known integration branch name
[modules]
  4 flagged generated     low     path-name heuristic; verify before excluding

Agent-written narratives.json is schema-validated before rendering — malformed enrichment fails loudly with errors precise enough for the agent to self-correct. (Findings/confidence design inspired by RepoCanon.)
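To illustrate what "errors precise enough for the agent to self-correct" can look like, here is a minimal validation sketch. The schema is hypothetical: the real narratives.json schema is defined by repokg and is not reproduced here. The keys modules and gotchas, and the function validate_narratives, are invented for illustration only. The point is the error style, where each message names the exact path that failed and what was expected.

```python
def validate_narratives(doc):
    """Return a list of error strings; an empty list means the document passed.

    HYPOTHETICAL schema (not repokg's): "modules" maps a module path to a
    non-empty purpose string, and optional "gotchas" is a list of strings.
    """
    if not isinstance(doc, dict):
        return [f"$: expected object, got {type(doc).__name__}"]
    errors = []
    modules = doc.get("modules")
    if not isinstance(modules, dict):
        errors.append(f"$.modules: expected object, got {type(modules).__name__}")
    else:
        for path, purpose in modules.items():
            if not isinstance(purpose, str) or not purpose.strip():
                errors.append(f"$.modules[{path!r}]: expected non-empty string")
    gotchas = doc.get("gotchas", [])
    if not isinstance(gotchas, list):
        errors.append(f"$.gotchas: expected array, got {type(gotchas).__name__}")
    else:
        for index, item in enumerate(gotchas):
            if not isinstance(item, str):
                errors.append(
                    f"$.gotchas[{index}]: expected string, got {type(item).__name__}"
                )
    return errors
```

Collecting every error instead of stopping at the first lets an agent fix the whole file in one pass.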

Agent integration

repokg inject adds a managed block (delimited by <!-- repokg:begin/end -->, idempotent, never touches your hand-written content) pointing agents at KNOWLEDGE_GRAPH.md:

  • CLAUDE.md (Claude Code) — updated if present
  • AGENTS.md (the cross-tool agent standard) — updated if present, created if no agent file exists at all
  • .github/copilot-instructions.md (Copilot) — updated if present
  • .cursor/rules/repokg.mdc (Cursor, with alwaysApply: true) — created if .cursor/rules/ exists; falls back to legacy .cursorrules
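The managed-block idea can be sketched as follows. This is an illustration of the general technique, not repokg's actual code. It assumes the delimiters are spelled exactly `<!-- repokg:begin -->` and `<!-- repokg:end -->`, and the function name upsert_block is hypothetical.

```python
import re

# Assumed marker spellings, based on the "repokg:begin/end" description.
BEGIN = "<!-- repokg:begin -->"
END = "<!-- repokg:end -->"
_BLOCK = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)


def upsert_block(text, body):
    """Insert or refresh the managed block, leaving all other text untouched."""
    block = f"{BEGIN}\n{body.strip()}\n{END}"
    if _BLOCK.search(text):
        # Replace in place. A function is used as the replacement so that
        # backslashes in the body are not interpreted as regex escapes.
        return _BLOCK.sub(lambda _match: block, text, count=1)
    if not text:
        return block + "\n"
    # No block yet: append it after the existing content, one blank line apart.
    separator = "" if text.endswith("\n") else "\n"
    return f"{text}{separator}\n{block}\n"
```

Running upsert_block twice with the same body yields identical output, which is what makes the operation idempotent. Everything outside the markers is passed through byte for byte.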

Keep it fresh in CI:

- run: pipx run repokg check . || echo "::warning::KNOWLEDGE_GRAPH.md is stale"

KNOWLEDGE_GRAPH.md itself also lists any agent-context files it found, so an agent landing on the knowledge graph discovers your rules — and vice versa.

What gets extracted (all verified, never guessed)

Area                   How
Branch classification  git for-each-ref + --merged ancestry vs the integration branch (auto-detects staging/develop), cross-referenced with every PR's head ref via gh; distinguishes true merges from squash-merges from abandoned work
PR catalog             gh pr list --state all: open / merged / closed-unmerged, full appendix table
Module inventory       Filesystem walk with LOC per directory, language detection, generated-code flagging
Import graph           Go: import blocks resolved against go.mod module paths · Python: stdlib ast incl. relative imports · JS/TS: relative import/require resolution. Directory→directory edges with counts
Ops surface            CI workflow names, Dockerfiles, compose files, Helm charts, Makefile targets, config/docs/test/migration dirs
Timeline               Merged PRs grouped by month with conventional-commit scope frequencies (replaced by agent-written eras after enrichment)
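As an illustration of the Python row of the import graph, the sketch below uses the stdlib ast module to collect imports from a source string and resolve relative imports against the file's package. It shows the general technique only; repokg's own resolver may differ, and the function name and the current_package parameter are assumptions made for this example.

```python
import ast


def imported_modules(source, current_package=""):
    """Return the sorted absolute module names imported by Python source.

    current_package is the dotted package that contains the file,
    e.g. "app.api" for app/api/views.py.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0:
                found.add(node.module)  # plain absolute "from x import y"
                continue
            # Relative import: level 1 is the current package, and each
            # additional level climbs one package higher.
            parts = current_package.split(".") if current_package else []
            keep = len(parts) - (node.level - 1)
            if keep < 1:
                continue  # climbs past the top-level package; unresolvable
            base = parts[:keep]
            if node.module:
                found.add(".".join(base + node.module.split(".")))
            else:
                # "from . import x": x may be a submodule or just a name.
                found.update(".".join(base + [a.name]) for a in node.names)
    return sorted(found)
```

Mapping each resolved module name back to its directory and counting occurrences would then give directory-to-directory edges with counts.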

Why agent-first instead of calling an LLM API?

Because the enrichment quality depends on reading the code, and your coding agent already has the repo open, tools to search it, and your permission model. A prompt it can execute beats a second LLM integration with its own keys, costs, and context limits. The contract between tool and agent is one JSON file (narratives.json) with a fixed schema — everything else stays deterministic and reproducible.

Known limitations

  • JS/TS: only relative imports are resolved; alias imports (@/…, tsconfig paths) are ignored.
  • Fork PRs: a fork PR whose head branch name matches a local branch will be linked to it (GitHub's API reports bare head refs).
  • Python: packages are discovered at the repo root and under src/; deeper monorepo layouts (packages/*/src/…) get file-level edges only.
  • Branch ahead counts use one batched git call on git ≥ 2.41, with a per-branch fallback on older git.
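The git version split in the last item can be sketched like this. Git 2.41 added the `%(ahead-behind:<ref>)` format atom to git for-each-ref, which reports ahead and behind counts for every ref in a single call. Older versions need one git rev-list --count call per branch. Whether repokg uses exactly these commands is an assumption, and the function names are hypothetical.

```python
import re
import subprocess


def parse_git_version(text):
    """Extract (major, minor) from `git --version` output."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if not match:
        raise ValueError(f"unrecognized git version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def _git(repo, *args):
    """Run git inside repo and return its stdout, raising on failure."""
    return subprocess.run(["git", "-C", repo, *args], check=True,
                          capture_output=True, text=True).stdout


def ahead_counts(repo, base):
    """Map each local branch to the number of commits it is ahead of base."""
    version = subprocess.run(["git", "--version"], check=True,
                             capture_output=True, text=True).stdout
    if parse_git_version(version) >= (2, 41):
        # Batched: each output line is "<branch> <ahead> <behind>".
        out = _git(repo, "for-each-ref", "refs/heads",
                   f"--format=%(refname:short) %(ahead-behind:{base})")
        rows = (line.split() for line in out.splitlines() if line)
        return {name: int(ahead) for name, ahead, _behind in rows}
    # Fallback for older git: one rev-list call per branch.
    branches = _git(repo, "for-each-ref", "refs/heads",
                    "--format=%(refname:short)").split()
    return {branch: int(_git(repo, "rev-list", "--count", f"{base}..{branch}"))
            for branch in branches}
```

For example, ahead_counts(".", "master") returns a dict keyed by branch name. On a repository with many branches, the batched path avoids spawning one process per branch.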

Roadmap

  • Rust / Java / Kotlin import graphs
  • --exclude glob patterns
  • llms.txt emission alongside KNOWLEDGE_GRAPH.md
  • tsconfig paths alias resolution
  • PyPI release + prebuilt GitHub Action

Development

pip install -e .
python -m unittest discover -s tests -v

No runtime dependencies — stdlib only.

License

MIT
