Build a code Knowledge Graph from any repository (Python/JS/TS/Java/Go/Rust/C/C++/Ruby/C#) for token-efficient AI agent context. Supports JSON, TOML, and TOON output formats. Agent-friendly: generates CLAUDE.md and CODEBASE.md, and supports zero-dependency queries.

repo2kg

Turn any repository into a Knowledge Graph that AI agents can query instead of reading every file.

Instead of feeding entire codebases to Claude Code, Codex, or Copilot, repo2kg builds a searchable graph of your codebase — functions, classes, call edges, and semantic embeddings — so agents query exactly what they need.

v0.5.0 adds support for 10 languages and a new TOON format (Token-Oriented Object Notation) that stores the KG in ~40% fewer tokens than JSON.

Without repo2kg                      With repo2kg
────────────────────                 ────────────────────
Agent reads 177 files                Agent reads CODEBASE.md
→ 43,000 tokens                      → 8,900 tokens (80% savings)
→ Slow, hits context limits          → Fast, precise, relevant
→ New session = start over           → KG persists forever

Languages Supported

Language      Extensions
──────────    ────────────────────────
Python        .py
JavaScript    .js, .jsx, .mjs
TypeScript    .ts, .tsx, .mts
Java          .java
Go            .go
Rust          .rs
C             .c, .h
C++           .cpp, .cc, .cxx, .hpp
Ruby          .rb
C#            .cs

All parsers extract: classes, functions, methods, signatures, docstrings, 8-line body previews, call relationships, and imports.

Output Formats

Format    Extension    Best For
──────    ─────────    ───────────────────────────────────────
JSON      .json        Universal, programmatic access
TOML      .toml        Human-readable editing
TOON      .toon        LLM context windows (~40% fewer tokens)

Format is auto-detected from the output file extension.
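
Extension-based detection can be pictured with a few lines of Python. This is only a sketch of the idea: the FORMATS mapping and the detect_format helper are hypothetical names, not part of repo2kg, and the real tool may handle unknown extensions differently.

```python
from pathlib import Path

# Hypothetical mapping; the suffixes come from the Output Formats table.
FORMATS = {".json": "json", ".toml": "toml", ".toon": "toon"}


def detect_format(out_path: str) -> str:
    """Return the format name implied by the output file's extension."""
    suffix = Path(out_path).suffix.lower()
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported KG extension: {suffix!r}") from None


print(detect_format("kg.toon"))
```

Keying on the suffix keeps the CLI free of a separate format flag, at the cost of rejecting files with unexpected names.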

How It Works

┌──────────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Multi-Language       │────▶│  Knowledge Graph  │────▶│  FAISS + Embed  │
│  Parsers (10 langs)  │     │  Nodes + Edges   │     │  Semantic Index  │
└──────────────────────┘     └──────────────────┘     └─────────────────┘
                                      │
             ┌────────────────────────┴────────────────────────┐
             │                                                  │
        Layer 1: Graph                                  Layer 2: RAG
        ─ Functions, Classes, Methods                   ─ Sentence embeddings
        ─ CALLS / CALLED_BY edges                       ─ FAISS vector search
        ─ File locations, signatures                    ─ Semantic entry points
        ─ 8-line body previews

  1. Parse: Walks every source file across all supported languages and extracts functions, classes, methods, signatures, docstrings, and 8-line body previews
  2. Link: Resolves call edges between nodes (who calls whom)
  3. Embed: Encodes each node into a vector with all-MiniLM-L6-v2 and indexes the vectors with FAISS
  4. Query: Semantic search finds entry points → graph traversal expands related code → token-minimal output
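
The Parse and Link steps can be sketched for Python sources using only the standard ast module. This is an illustration under simplifying assumptions, not repo2kg's parser: parse_nodes and link are hypothetical helpers, only plain function calls such as decode(x) are captured, and names are resolved globally rather than per file.

```python
import ast

SOURCE = '''
def verify(token):
    """Check a token."""
    return decode(token)

def decode(token):
    return token[::-1]
'''


def parse_nodes(source: str, file: str) -> dict:
    """Step 1 (Parse): one node per function, with the names it calls."""
    nodes = {}
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            called = sorted({
                c.func.id for c in ast.walk(fn)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            })
            nodes[f"{file}::{fn.name}"] = {
                "name": fn.name,
                "docstring": ast.get_docstring(fn) or "",
                "calls": called,   # raw names for now, resolved in link()
                "callers": [],
            }
    return nodes


def link(nodes: dict) -> None:
    """Step 2 (Link): turn called names into node ids and fill callers."""
    by_name = {n["name"]: node_id for node_id, n in nodes.items()}
    for node_id, n in nodes.items():
        n["calls"] = [by_name[c] for c in n["calls"] if c in by_name]
        for target in n["calls"]:
            nodes[target]["callers"].append(node_id)


nodes = parse_nodes(SOURCE, "auth.py")
link(nodes)
print(nodes["auth.py::verify"]["calls"])
```

Calls to names that have no node (builtins, third-party functions) are simply dropped in this sketch.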

Installation

pip install repo2kg

Or from source:

git clone https://github.com/shreeramktm2004-dev/repo2kg.git
cd repo2kg
pip install -e .

Requirements: Python 3.10+. Dependencies are installed automatically: numpy, faiss-cpu, sentence-transformers.


Complete Workflow

Step 1 — One-time global setup

repo2kg user-setup

Installs agent instructions globally:

  • ~/.claude/CLAUDE.md — Claude Code reads this at the start of every session
  • ~/.codex/AGENTS.md — Codex equivalent
  • ~/.repo2kg/GLOBAL_AGENTS.md — reference doc

Step 2 — Register all your projects at once

repo2kg scan                          # scans your entire home directory
repo2kg scan --root ~/projects        # or scope to a specific folder

Finds every KG file (by detecting companion .faiss files) and registers them all in ~/.repo2kg/registry.json. Re-run anytime to pick up new projects.
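
Sidecar-based discovery might look roughly like the sketch below. The find_kg_files helper is hypothetical, and the naming rule it uses (kg.json paired with kg.faiss, sharing the same stem) is an assumption; the source only says that a companion .faiss file is detected.

```python
import tempfile
from pathlib import Path

KG_SUFFIXES = {".json", ".toml", ".toon"}


def find_kg_files(root: Path) -> list:
    """Return KG candidates under root that have a .faiss sidecar.

    Assumption: the sidecar shares the KG file's stem (kg.json -> kg.faiss).
    """
    found = []
    for path in root.rglob("*"):
        if (path.suffix in KG_SUFFIXES and path.is_file()
                and path.with_suffix(".faiss").is_file()):
            found.append(path)
    return sorted(found)


# Demo in a throwaway directory so nothing in the home directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "myproject"
    project.mkdir()
    (project / "kg.json").write_text("{}")
    (project / "kg.faiss").write_bytes(b"")
    (project / "package.json").write_text("{}")   # no sidecar, so ignored
    print([p.name for p in find_kg_files(Path(tmp))])
```

Requiring the sidecar is what keeps unrelated .json files, such as package.json, out of the registry.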

Step 3 — Build + set up a project

cd /your/project
repo2kg build --repo . --out kg.json        # JSON (default)
repo2kg build --repo . --out kg.toon        # TOON (token-efficient)
repo2kg agent-setup --kg kg.json --dir .
git add CLAUDE.md CODEBASE.md AGENTS.md kg.json
git commit -m "Add repo2kg knowledge graph"

agent-setup generates five files and auto-registers the project:

File                               Who reads it                   Purpose
───────────────────────────────    ───────────────────────────    ────────────────────────────────────────────────────────────
CODEBASE.md                        Any agent / human              Full overview: file map, classes, call graph, all signatures
CLAUDE.md                          Claude Code (auto-detected)    Instructions to use KG instead of reading files
.copilot-instructions.md           GitHub Copilot                 Same, Copilot format
.github/copilot-instructions.md    Copilot agent mode             Same, agent mode format
AGENTS.md                          Multi-agent systems            Node schema + query strategies

Step 4 — Just ask the agent

Open any session in a registered project and ask naturally:

"Add a password reset endpoint"
"Fix the double-charge bug in payment flow"
"Where is rate limiting applied?"
"Refactor the database connection pooling"

The agent finds the KG automatically via ~/.claude/CLAUDE.md, reads CODEBASE.md, queries kg.json with stdlib Python, then reads only the 2-3 source files it actually needs.


CLI Reference

# Build (all supported languages)
repo2kg build [--repo PATH] [--out kg.json|kg.toml|kg.toon] [--exclude PATTERN ...]

# Query (requires FAISS + sentence-transformers)
repo2kg query QUESTION [--kg FILE] [--k INT] [--depth INT] [--format text|json]

# Query (zero dependencies — keyword matching)
repo2kg query-lite KEYWORDS [--kg FILE] [--k INT] [--depth INT] [--format text|json]

# Export standalone markdown for agents
repo2kg export [--kg FILE] [--out FILE]

# Generate all agent instruction files for a project
repo2kg agent-setup [--kg FILE] [--dir PATH] [--no-register]

# Global setup (run once ever)
repo2kg user-setup

# Register all KG files under a directory tree
repo2kg scan [--root PATH]

# Register one project manually
repo2kg register [--kg FILE] [--project PATH]

# List registered projects
repo2kg list

# Show KG statistics
repo2kg stats [--kg FILE]

# Machine-readable tool info for agent discovery
repo2kg info

Flag        Default    Description
────────    ───────    ────────────────────────────────────────────────────────────
--repo      .          Repository root to scan
--out       kg.json    Output path for the KG (.json, .toml, or .toon)
--kg        kg.json    Path to a saved KG
--k         5          Number of search results
--depth     1          Graph traversal depth (1 = direct calls, 2 = calls-of-calls)
--format    text       Output format: text or json
--dir       .          Target directory for generated agent files
--root      ~          Root directory for scan
-v          off        Verbose/debug logging
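
To make the --depth flag concrete, here is a minimal sketch of depth-limited expansion over the "calls" edges of the node schema. The expand function is hypothetical and only follows outgoing calls; whether repo2kg also follows callers during expansion is not stated in this document.

```python
def expand(kg: dict, entry_ids: list, depth: int = 1) -> list:
    """Breadth-first walk along "calls" edges, at most `depth` hops."""
    seen = list(dict.fromkeys(entry_ids))   # keeps order, drops duplicates
    frontier = list(seen)
    for _ in range(depth):
        next_frontier = []
        for node_id in frontier:
            for callee in kg.get(node_id, {}).get("calls", []):
                if callee not in seen:
                    seen.append(callee)
                    next_frontier.append(callee)
        frontier = next_frontier
    return seen


kg = {
    "a.py::login": {"calls": ["a.py::verify"]},
    "a.py::verify": {"calls": ["b.py::decode"]},
    "b.py::decode": {"calls": []},
}
print(expand(kg, ["a.py::login"], depth=1))   # entry point + direct calls
print(expand(kg, ["a.py::login"], depth=2))   # adds calls-of-calls
```

Each extra level of depth can multiply the number of returned nodes, which is presumably why the default stays at 1.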

Default Exclusions

Automatically skipped during build:

__pycache__  .git  node_modules  .tox  .venv  venv  env
.mypy_cache  .pytest_cache  dist  build  site-packages

Add more: repo2kg build --repo . --exclude migrations fixtures
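
Directory exclusion during a build can be sketched with os.walk. The source_files helper is hypothetical, and it assumes that each exclusion matches a directory name exactly; the CLI describes the argument as a PATTERN, so the real matching rule may be broader.

```python
import os

DEFAULT_EXCLUDES = {
    "__pycache__", ".git", "node_modules", ".tox", ".venv", "venv", "env",
    ".mypy_cache", ".pytest_cache", "dist", "build", "site-packages",
}


def source_files(repo: str, extra_excludes=()):
    """Yield file paths under repo, skipping excluded directory names."""
    excludes = DEFAULT_EXCLUDES | set(extra_excludes)
    for dirpath, dirnames, filenames in os.walk(repo):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excludes]
        for name in filenames:
            yield os.path.join(dirpath, name)


# Example: list files in the current directory, also skipping "migrations".
for path in source_files(".", extra_excludes=["migrations"]):
    pass  # a real build would hand each path to the matching parser
```

Pruning at the directory level means large trees such as node_modules are never traversed at all, rather than being filtered file by file.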


TOON Format

TOON (Token-Oriented Object Notation) is a line-oriented format designed for LLM context windows, using ~40% fewer tokens than JSON.

repo2kg build --repo . --out kg.toon     # Save as TOON
repo2kg query-lite "auth" --kg kg.toon   # Query TOON directly

Example TOON output for a single node:

nodes[788]:
  -
    id: "src/auth.ts::AuthService"
    name: AuthService
    kind: class
    file: src/auth.ts
    signature: export class AuthService
    docstring: Handles JWT authentication
    calls[2]: verifyToken,refreshToken
    callers[0]:

All commands (build, query, query-lite, export, agent-setup, stats) work with .toon files identically to .json.
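
The line-oriented layout shown in the example can be imitated with a short function. This is not the TOON encoder: node_to_toon and toon_value are hypothetical, and the quoting rule (quote any value containing a colon) is inferred from the single example above, so the actual format rules are likely richer.

```python
def toon_value(value) -> str:
    """Quote values containing ':' so the 'key: value' split stays unambiguous."""
    text = str(value)
    return f'"{text}"' if ":" in text else text


def node_to_toon(node: dict, indent: str = "    ") -> str:
    """Render one node as indented 'key: value' lines."""
    lines = []
    for key, value in node.items():
        if isinstance(value, list):
            # Lists carry their length in the key, e.g. calls[2]: a,b
            joined = ",".join(value)
            lines.append(f"{indent}{key}[{len(value)}]: {joined}".rstrip())
        else:
            lines.append(f"{indent}{key}: {toon_value(value)}")
    return "\n".join(lines)


node = {
    "id": "src/auth.ts::AuthService",
    "name": "AuthService",
    "kind": "class",
    "calls": ["verifyToken", "refreshToken"],
    "callers": [],
}
print(node_to_toon(node))
```

Compared with JSON, this layout drops braces, most quotes, and list brackets, which is where a token reduction would come from.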


How Agents Use It (Zero Dependencies)

The global ~/.claude/CLAUDE.md tells the agent to run this at the start of every task:

import json
from pathlib import Path

# Load the global registry written by `repo2kg scan` / `repo2kg register`
registry_path = Path.home() / ".repo2kg" / "registry.json"
registry = json.loads(registry_path.read_text(encoding="utf-8"))

# Walk up from cwd to find the closest registered project
kg = None
for check in (Path.cwd(), *Path.cwd().parents):
    entry = registry["projects"].get(str(check))
    if entry:
        # json.loads only handles a .json KG; .toon/.toml need their own reader
        kg = json.loads(Path(entry["kg"]).read_text(encoding="utf-8"))
        break
if kg is None:
    raise SystemExit("No registered repo2kg project found at or above cwd")

# Now search: no FAISS, no embeddings, pure stdlib
matches = [n for n in kg.values() if "auth" in n["name"].lower()]
for n in matches[:10]:
    print(n["signature"], "-", (n.get("docstring") or "")[:100])
    print("calls:", [c.split("::")[-1] for c in n.get("calls", [])])

The agent only reads actual source files when the 8-line body_preview is not enough.


Commands In Detail

repo2kg query — Semantic search

repo2kg query "how does authentication work" --kg kg.json
repo2kg query "database connection pooling" --kg kg.json --depth 2
repo2kg query "auth flow" --kg kg.json --format json

Output:

# Query: how does authentication work
# Nodes returned: 12  |  ~340 tokens
# Entry points: verify_token, authenticate_user, JWTMiddleware
# ─────────────────────────────────────────────

# METHOD: verify_token
# file: auth/jwt.py  class: JWTService
def JWTService.verify_token(self, token: str) -> dict:
"""Verify and decode a JWT token."""
    decoded = jwt.decode(token, self.secret, algorithms=["HS256"])
# calls: decode, get_user_from_cache
# called_by: authenticate_user, AuthMiddleware.process_request

repo2kg query-lite — Keyword search, zero heavy deps

No FAISS, no embeddings, instant startup. Uses keyword + graph expansion.

repo2kg query-lite "auth" --kg kg.json
repo2kg query-lite "database connection" --kg kg.toon --format json
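
A keyword ranking of this kind could be as simple as the sketch below. The keyword_search function and its scoring rule (count how many query terms occur in the name, signature, or docstring) are assumptions made for illustration; repo2kg's actual scoring is not described here.

```python
def keyword_search(kg: dict, keywords: str, k: int = 5) -> list:
    """Rank node ids by how many keywords appear in name, signature, docstring."""
    terms = keywords.lower().split()
    scored = []
    for node_id, node in kg.items():
        haystack = " ".join(
            str(node.get(field) or "")
            for field in ("name", "signature", "docstring")
        ).lower()
        score = sum(term in haystack for term in terms)
        if score:
            scored.append((score, node_id))
    # Highest score first; ties broken by node id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [node_id for _, node_id in scored[:k]]


kg = {
    "db/pool.py::connect": {
        "name": "connect",
        "signature": "def connect(dsn)",
        "docstring": "Open a database connection.",
    },
    "auth/jwt.py::verify_token": {
        "name": "verify_token",
        "signature": "def verify_token(token)",
        "docstring": "Verify a JWT.",
    },
    "db/pool.py::close": {
        "name": "close",
        "signature": "def close()",
        "docstring": "Close the connection.",
    },
}
print(keyword_search(kg, "database connection"))
```

The ids returned here would serve as entry points, with graph expansion then pulling in their neighbours.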

repo2kg export — Standalone CODEBASE.md

Generates a single markdown file any agent can read without any tooling:

repo2kg export --kg kg.json --out CODEBASE.md

Contains: overview table, file map, architecture grouping, class/method tables, call graph, all signatures.

repo2kg info — Agent discovery

Prints a machine-readable tool description, intended for agents that would otherwise run repo2kg --help:

repo2kg info    # instant, no heavy deps loaded

repo2kg scan — Register everything at once

repo2kg scan                       # scans ~ (home directory)
repo2kg scan --root ~/work         # scans a specific root

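Walks the directory tree, finds every KG file (.json, .toml, or .toon with a .faiss sidecar), and registers each one in ~/.repo2kg/registry.json. Safe to re-run: existing entries are updated, not duplicated.

Scanning /home/user for KG files...
  + /home/user/myproject
      KG: /home/user/myproject/kg.json
  ~ /home/user/other-project (updated)

Done: 1 new, 1 updated — 2 total registered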

repo2kg list

Registered projects (2):
  Project                            Registered
  ──────────────────────────────     ─────────────────────
  /home/user/myproject           ✓   2026-04-16T15:37:13
  /home/user/other-project       ✓   2026-04-16T15:46:34

Python API

# Full mode (requires FAISS + sentence-transformers)
from repo2kg import RepoKG

kg = RepoKG().build("./my_project")
kg.save("kg.json")           # JSON
kg.save("kg.toon")           # TOON (token-efficient)
kg.save("kg.toml")           # TOML (human-readable)

kg = RepoKG.load("kg.json")  # auto-detects format
print(kg.query("payment processing", k=5, depth=2))        # text
print(kg.query_json("auth", k=3))                          # structured dict

# Lightweight mode (stdlib only — safe for agent environments)
from repo2kg import RepoKGLite

kg = RepoKGLite("kg.toon")   # works with any format
print(kg.query("payment processing", k=5, depth=1))        # text
result = kg.query_json("auth", k=3)                        # structured dict
callers = kg.get_callers("auth/service.py::AuthService.login")

Node Schema

Every node in kg.json / kg.toon / kg.toml:

{
  "auth/jwt.py::JWTService.verify_token": {
    "id": "auth/jwt.py::JWTService.verify_token",
    "name": "verify_token",
    "kind": "method",
    "file": "auth/jwt.py",
    "parent_class": "JWTService",
    "signature": "def JWTService.verify_token(self, token: str) -> dict:",
    "docstring": "Verify and decode a JWT token.",
    "body_preview": "    decoded = jwt.decode(token, self.secret, ...)\n    ...",
    "calls": ["auth/jwt.py::JWTService.decode", "cache/redis.py::get_user_from_cache"],
    "callers": ["auth/service.py::authenticate_user"],
    "imports": ["jwt", "datetime"]
  }
}
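
For code that consumes the KG, the schema can be written down as a typed structure. The sketch below is an assumption-laden reading of the example: the Node type and the callers_index helper are hypothetical, and treating parent_class as optional for top-level functions is a guess. The helper also shows that "callers" is simply the inverse of "calls".

```python
from typing import Dict, List, Optional, TypedDict


class Node(TypedDict):
    id: str
    name: str
    kind: str                      # e.g. "method" or "class"
    file: str
    parent_class: Optional[str]    # assumed empty for top-level functions
    signature: str
    docstring: str
    body_preview: str
    calls: List[str]
    callers: List[str]
    imports: List[str]


def callers_index(kg: Dict[str, Node]) -> Dict[str, List[str]]:
    """Recompute callers by inverting every calls edge."""
    index: Dict[str, List[str]] = {node_id: [] for node_id in kg}
    for node_id, node in kg.items():
        for callee in node["calls"]:
            index.setdefault(callee, []).append(node_id)
    return index


# Tiny demo; only the field the helper reads is filled in.
demo = {
    "auth/service.py::authenticate_user": {"calls": ["auth/jwt.py::verify_token"]},
    "auth/jwt.py::verify_token": {"calls": []},
}
print(callers_index(demo))
```

Comparing the recomputed index against the stored "callers" lists is a cheap consistency check after loading a KG.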

When to Use repo2kg

Project Size    Files      Token Savings    Recommendation
────────────    ───────    ─────────────    ────────────────────
Small           < 20       10–20%           Optional
Medium          20–100     40–70%           Recommended
Large           100–500    70–90%           Strongly recommended
Monorepo        500+       85–95%           Essential

Value scales with interconnectedness — more call edges = more graph traversal advantage.

Real example: The Aviary platform (mixed TS/JS/Python) produced 788 nodes from 177 files across 3 languages in a single build.


Limitations

  • Static analysis — Captures AST/regex calls, not dynamic dispatch or runtime-generated calls.
  • Name-based resolution — Call edges resolved by function name (same-file preferred). No full type inference.
  • No incremental updates — Rebuilds the full graph each time. Large repos take 30–60 seconds.
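
The name-based resolution limitation is easier to see in code. This is a hedged sketch, not the tool's resolver: resolve_call is a hypothetical helper that picks a same-file node when one exists and otherwise falls back to the first node with a matching name.

```python
def resolve_call(name: str, caller_file: str, kg: dict):
    """Return the id of the node called `name`, preferring the caller's file."""
    candidates = [nid for nid, n in kg.items() if n["name"] == name]
    same_file = [nid for nid in candidates if kg[nid]["file"] == caller_file]
    chosen = same_file or candidates
    return chosen[0] if chosen else None


kg = {
    "auth/jwt.py::decode": {"name": "decode", "file": "auth/jwt.py"},
    "utils/b64.py::decode": {"name": "decode", "file": "utils/b64.py"},
}
print(resolve_call("decode", "utils/b64.py", kg))   # same-file match wins
print(resolve_call("decode", "api/views.py", kg))   # ambiguous: first match
```

Without type inference, the second call cannot tell which decode is meant, so the edge may point at the wrong node.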

Contributing

Contributions welcome. Open an issue first for major changes.

git clone https://github.com/shreeramktm2004-dev/repo2kg.git
cd repo2kg
pip install -e .

License

MIT

