repo2kg

Build a code Knowledge Graph from any repository (Python/JS/TS/Java/Go/Rust/C/C++/Ruby/C#) for token-efficient AI agent context. Supports JSON and TOON output formats. Agent-friendly: generates CLAUDE.md, CODEBASE.md, and supports zero-dep queries.

  ██████╗ ███████╗██████╗  ██████╗ ██████╗ ██╗  ██╗ ██████╗
  ██╔══██╗██╔════╝██╔══██╗██╔═══██╗╚════██╗██║ ██╔╝██╔════╝
  ██████╔╝█████╗  ██████╔╝██║   ██║ █████╔╝█████╔╝ ██║  ███╗
  ██╔══██╗██╔══╝  ██╔═══╝ ██║   ██║██╔═══╝ ██╔═██╗ ██║   ██║
  ██║  ██║███████╗██║      ╚██████╔╝███████╗██║  ██╗╚██████╔╝
  ╚═╝  ╚═╝╚══════╝╚═╝       ╚═════╝ ╚══════╝╚═╝  ╚═╝ ╚═════╝

Turn any repository into a Knowledge Graph for token-efficient AI agent context.

The Problem

Every time you ask Claude Code, Copilot, or Codex to work on your project, it reads hundreds of files — burning through your context window before it even starts.

Without repo2kg                      With repo2kg
────────────────────                 ─────────────────────
Agent reads 229 files                Agent reads CODEBASE.md
→ ~53,000+ tokens used               → ~4,000 tokens (83–92% saved)
→ Slow, hits context limits          → Fast, precise, relevant
→ New session = start over           → KG persists forever

repo2kg builds a searchable graph of your entire codebase — functions, classes, call edges — so agents query exactly what they need instead of everything.

Token Efficiency: Real Numbers

Tested on a project with 229 source files, 2,909 nodes, and 4,615 call edges. The agent understood the full architecture without opening a single raw source file.

With repo2kg

The agent ran five commands:

repo2kg list
repo2kg stats --kg kg.json
repo2kg query-lite "main app entry" --kg kg.json
repo2kg query-lite "authentication login jwt token" --kg kg.json
repo2kg export --kg kg.json

Step	Tokens
stats output	~100
4 × query-lite results	~2,400
CODEBASE.md overview	~1,500
Total	~4,000 tokens

Without repo2kg

# Agent must search and read files manually
glob file listing → grep functions/classes → read source files

Step	Tokens
File listing	~3,000
Reading 10–20 source files	~15,000–40,000
Grep results	~5,000–10,000
Total	~23,000–53,000 tokens

Result

Method	Tokens	Quality
repo2kg	~4,000	High — structured graph, signatures, call edges
Normal search	~23,000–53,000	Medium — raw code scanning

Reduction:  83–92% fewer tokens during codebase discovery
Full session savings (including messages):  ~40–60%
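
The 83–92% range follows from the totals in the two tables above. A quick check of the arithmetic, using only the figures quoted in this section (the function name `reduction` is just for illustration):

```python
# Totals quoted in the tables above (approximate token counts).
WITH_REPO2KG = 4_000
WITHOUT_LOW, WITHOUT_HIGH = 23_000, 53_000


def reduction(baseline: int, actual: int) -> float:
    """Return the percentage of tokens saved relative to the baseline."""
    return 100 * (baseline - actual) / baseline


low = reduction(WITHOUT_LOW, WITH_REPO2KG)
high = reduction(WITHOUT_HIGH, WITH_REPO2KG)
print(f"{low:.0f}% to {high:.0f}% fewer tokens")  # 83% to 92% fewer tokens
```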

The larger and more interconnected your repo, the bigger the gain:

Project Size	Files	Token Savings
Small	< 20	10–20%
Medium	20–100	40–70%
Large	100–500	70–90%
Monorepo	500+	85–95%

Quick Start

pip install repo2kg

# 1. One-time global setup (runs in seconds, no heavy deps)
repo2kg user-setup

# 2. Build a KG for your project
cd /your/project
repo2kg build --repo . --out kg.json

# 3. Generate agent instruction files
repo2kg agent-setup --kg kg.json --dir .

# 4. Query it
repo2kg query "how does authentication work" --kg kg.json

# 5. Generate an interactive visual graph
repo2kg visualize --kg kg.json --out kg_graph.html

Installation

pip install repo2kg

Requirements: Python 3.10+

Dependencies auto-installed:

Package	Size	When Used
`numpy`	~20 MB	build, query
`faiss-cpu`	~60 MB	build, query (CPU only, no GPU/CUDA needed)
`sentence-transformers`	~2 GB (PyTorch)	build, query
`tree-sitter` + 9 grammars	~10 MB	build (AST parsing)

Note: query-lite, user-setup, export, agent-setup, list, and register all start instantly — they never load PyTorch or FAISS. Only build and query load the heavy dependencies.
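
One common way to get this behaviour in a Python CLI is to defer heavy imports into the handlers that need them. The sketch below shows that general pattern only; it is not taken from repo2kg's source, the handler names are hypothetical, and `json` stands in for a heavy dependency such as FAISS.

```python
import argparse
import importlib


def cmd_query_lite(args: argparse.Namespace) -> str:
    # Stdlib only: nothing heavy is imported on this path.
    return f"keyword search for {args.text!r}"


def cmd_query(args: argparse.Namespace) -> str:
    # Deferred import: the module is loaded only when this command runs.
    # `json` is a stand-in for a heavy dependency.
    heavy = importlib.import_module("json")
    return f"semantic search for {args.text!r} using {heavy.__name__}"


def main(argv: list[str]) -> str:
    parser = argparse.ArgumentParser(prog="demo")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, handler in (("query-lite", cmd_query_lite), ("query", cmd_query)):
        p = sub.add_parser(name)
        p.add_argument("text")
        p.set_defaults(handler=handler)
    args = parser.parse_args(argv)
    return args.handler(args)


print(main(["query-lite", "auth"]))
print(main(["query", "auth"]))
```

Because the import sits inside `cmd_query`, running `query-lite` never pays the load cost of the heavy module.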

How It Works

Your Repository
       │
       ▼
┌──────────────────────────────────────┐
│  Parse all source files              │  ← tree-sitter (accurate) or regex (fallback)
│  Python/JS/TS/Java/Go/Rust/C/C++    │
│  Ruby/C#  (10 languages)            │
└──────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│  Build Knowledge Graph               │
│  • Nodes: functions, classes, methods│
│  • Edges: who calls whom             │
│  • Data: signatures, docstrings,     │
│    8-line body previews              │
└──────────────────────────────────────┘
       │                    │
       ▼                    ▼
  kg.json / kg.toon     FAISS index
  (structured data)     (semantic vectors)
       │                    │
       └────────┬───────────┘
                ▼
         query / query-lite
         → token-minimal context for your agent

Two query modes

Command	How it searches	Deps needed	Speed
`query`	Semantic similarity (FAISS + embeddings)	PyTorch + FAISS	Slower start
`query-lite`	Keyword matching + graph expansion	None (stdlib only)	Instant

Both do graph expansion — they find entry-point nodes, then follow call edges to return related code automatically.
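
Graph expansion can be pictured as a bounded breadth-first walk along call edges. The following is a minimal sketch under assumptions: the toy graph borrows ids from the Node Schema section, and `expand` is a hypothetical helper, not repo2kg's actual implementation.

```python
from collections import deque

# Toy graph: node id -> node with a "calls" list, as in the Node Schema.
GRAPH = {
    "auth/service.py::authenticate_user": {"calls": ["auth/jwt.py::JWTService.verify_token"]},
    "auth/jwt.py::JWTService.verify_token": {"calls": ["cache/redis.py::get_user"]},
    "cache/redis.py::get_user": {"calls": []},
}


def expand(graph: dict, entry_points: list[str], depth: int = 1) -> list[str]:
    """Breadth-first walk along call edges, up to `depth` hops from the entry points."""
    seen = list(dict.fromkeys(entry_points))  # keeps order, drops duplicates
    queue = deque((node_id, 0) for node_id in seen)
    while queue:
        node_id, dist = queue.popleft()
        if dist == depth:
            continue  # do not follow edges beyond the requested depth
        for callee in graph.get(node_id, {}).get("calls", []):
            if callee not in seen:
                seen.append(callee)
                queue.append((callee, dist + 1))
    return seen


print(expand(GRAPH, ["auth/service.py::authenticate_user"], depth=1))
print(expand(GRAPH, ["auth/service.py::authenticate_user"], depth=2))
```

With `depth=1` the walk returns the entry point and its direct callee; with `depth=2` it also reaches `cache/redis.py::get_user`.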

Two output formats

Format	Extension	Best for
JSON	`.json`	Universal, default
TOON	`.toon`	LLM context windows (~40% fewer tokens)

Complete Workflow

Step 1 — Global setup (once ever)

repo2kg user-setup

Writes agent instructions to:

~/.claude/CLAUDE.md — Claude Code reads this at the start of every session
~/.codex/AGENTS.md — Codex equivalent
~/.repo2kg/GLOBAL_AGENTS.md — reference doc

From this point on, any registered project is automatically discovered by your agent.

Step 2 — Discover & register existing projects

repo2kg scan                     # scans your entire home directory
repo2kg scan --root ~/projects   # or a specific folder

Finds every KG file (by detecting companion .faiss files) and registers them in ~/.repo2kg/registry.json. Re-run any time.
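
The detection rule can be sketched as follows. This is an illustration under assumptions, not the real `scan` code: it assumes the KG file name is the `.faiss` name with that suffix removed (as in `kg.json` / `kg.json.faiss`), it mirrors the `registry["projects"][dir]["kg"]` layout used by the agent snippet later in this README, and `find_kgs` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def find_kgs(root: Path) -> dict[str, dict[str, str]]:
    """Map each project directory to its KG file by looking for `<kg>.faiss` companions."""
    projects = {}
    for faiss_file in sorted(root.rglob("*.faiss")):
        kg_file = faiss_file.with_suffix("")  # kg.json.faiss -> kg.json
        if kg_file.is_file():
            projects[str(kg_file.parent)] = {"kg": str(kg_file)}
    return projects


# Demo on a throwaway directory tree instead of a real home directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "projects" / "demo"
    project.mkdir(parents=True)
    (project / "kg.json").write_text("{}")
    (project / "kg.json.faiss").write_bytes(b"")
    registry = {"projects": find_kgs(Path(tmp))}
    print(json.dumps(registry, indent=2))
```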

Step 3 — Build a KG for your project

cd /your/project

repo2kg build --repo . --out kg.json     # JSON (default)
repo2kg build --repo . --out kg.toon     # TOON (40% fewer tokens for LLMs)

Output:

KG ready: 421 nodes, 892 call edges from 63 files
Languages: javascript: 38, python: 25
Saved to kg.json [JSON] (421 nodes)

Three files are created: kg.json, kg.json.faiss, kg.json.idx

Step 4 — Generate agent files

repo2kg agent-setup --kg kg.json --dir .

Creates five files:

File	Purpose
`CODEBASE.md`	Full overview — file map, classes, call graph, all signatures
`CLAUDE.md`	Tells Claude Code to use the KG instead of reading files
`.copilot-instructions.md`	Same, for GitHub Copilot
`.github/copilot-instructions.md`	Same, for Copilot agent mode
`AGENTS.md`	Node schema + query strategies for multi-agent systems

# Commit everything
git add CLAUDE.md CODEBASE.md AGENTS.md .copilot-instructions.md .github/copilot-instructions.md kg.json kg.json.faiss kg.json.idx
git commit -m "Add repo2kg knowledge graph"

Step 5 — Query it

# Keyword search — zero deps, instant
repo2kg query-lite "authentication" --kg kg.json
repo2kg query-lite "database connection" --kg kg.json --k 10 --depth 2

# Semantic search — more accurate, needs FAISS
repo2kg query "how does auth work" --kg kg.json
repo2kg query "payment processing flow" --kg kg.json --depth 2

# Structured JSON output (for scripts/agents)
repo2kg query-lite "auth" --kg kg.json --format json

Sample output:

# Query: authentication
# Nodes returned: 8  |  ~210 tokens
# Entry points: verify_token, JWTService, authenticate_user
# ─────────────────────────────────────────────

# METHOD: verify_token
# file: auth/jwt.py  class: JWTService
def JWTService.verify_token(self, token: str) -> dict:
"""Verify and decode a JWT token."""
    decoded = jwt.decode(token, self.secret, algorithms=["HS256"])
    return self.get_user_from_cache(decoded["sub"])
# calls: decode, get_user_from_cache
# called_by: authenticate_user, AuthMiddleware.process_request

Tree-Sitter Parsing

repo2kg uses tree-sitter for all non-Python languages — installed automatically with pip install repo2kg. No extra steps needed.

Advantage over regex	Detail
Multi-line signatures	Captured correctly
Nested classes	Exact parent class
Code in strings/comments	Ignored (no false matches)
Generic types `List<T>`	Handled correctly
Arrow functions in classes	Correct method detection

Python always uses Python's own ast module. If tree-sitter fails for any file, repo2kg silently falls back to regex.
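
To give a feel for what parsing with the `ast` module yields, here is a small sketch that pulls function names, docstrings, and called names out of Python source. It uses only the standard library; `extract_functions` is a hypothetical helper and is not meant to mirror repo2kg's internal parser.

```python
import ast

SOURCE = '''
class JWTService:
    def verify_token(self, token):
        """Verify and decode a JWT token."""
        decoded = jwt.decode(token, self.secret, algorithms=["HS256"])
        return self.get_user_from_cache(decoded["sub"])
'''


def extract_functions(source: str) -> list[dict]:
    """Collect each function's name, docstring, and the names it calls."""
    nodes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = []
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    func = child.func
                    # foo(...) is an ast.Name; obj.foo(...) is an ast.Attribute
                    if isinstance(func, ast.Name):
                        calls.append(func.id)
                    elif isinstance(func, ast.Attribute):
                        calls.append(func.attr)
            nodes.append({
                "name": node.name,
                "docstring": ast.get_docstring(node),
                "calls": sorted(set(calls)),
            })
    return nodes


print(extract_functions(SOURCE))
```

The calls come out as bare names (`decode`, `get_user_from_cache`), which is why resolving them to specific nodes is a separate, name-based step.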

TOON Format

TOON (Token-Oriented Object Notation) is a compact line-oriented format optimised for LLM context windows. It uses ~40% fewer tokens than JSON.

repo2kg build --repo . --out kg.toon     # Save as TOON
repo2kg query-lite "auth" --kg kg.toon   # Query TOON directly

Example TOON output for a single node:

nodes[788]:
  -
    id: "src/auth.ts::AuthService"
    name: AuthService
    kind: class
    file: src/auth.ts
    signature: export class AuthService
    docstring: Handles JWT authentication
    calls[2]: verifyToken,refreshToken
    callers[0]:

All commands (build, query, query-lite, export, agent-setup, stats) work with .toon files identically to .json.

How Agents Use It (Zero Dependencies)

The global ~/.claude/CLAUDE.md (installed by user-setup) tells the agent to run this at the start of every task:

import json
from pathlib import Path

# Load the registry of known projects
registry = json.loads((Path.home() / ".repo2kg" / "registry.json").read_text())

# Walk up from cwd to find the closest registered project
kg = None
for check in (Path.cwd(), *Path.cwd().parents):
    entry = registry["projects"].get(str(check))
    if entry:
        kg = json.loads(Path(entry["kg"]).read_text())
        break
if kg is None:
    raise SystemExit("No registered repo2kg project found above the current directory")

# Search — no FAISS, no embeddings, pure stdlib
matches = [n for n in kg.values() if "auth" in n["name"].lower()]
for n in matches[:10]:
    # `or ""` guards against a missing or null docstring
    print(n["signature"], "—", (n.get("docstring") or "")[:100])
    print("calls:", [c.split("::")[-1] for c in n.get("calls", [])])

The agent only reads actual source files when the 8-line body_preview is not enough.

CLI Reference

repo2kg build        Build KG from a repository
repo2kg query        Semantic search (requires FAISS)
repo2kg query-lite   Keyword search (zero dependencies)
repo2kg export       Export as CODEBASE.md
repo2kg visualize    Generate interactive HTML graph (no browser server needed)
repo2kg agent-setup  Generate all agent instruction files
repo2kg user-setup   Install global agent instructions (run once)
repo2kg scan         Auto-discover and register all KGs under a directory
repo2kg register     Register a single project manually
repo2kg list         Show all registered projects
repo2kg stats        Show KG node/edge statistics
repo2kg info         Print machine-readable tool info (for agents)

Flags

Flag	Default	Description
`--repo`	`.`	Repository root to scan
`--out` (visualize)	`kg_graph.html`	Output HTML path for `visualize`
`--max-nodes`	`800`	Max symbol nodes in the HTML graph
`--out`	`kg.json`	Output path (`.json` or `.toon`)
`--kg`	`kg.json`	Path to a saved KG file
`--k`	`5`	Number of top results
`--depth`	`1`	Graph traversal depth (1 = direct calls, 2 = calls of calls)
`--format`	`text`	Output format: `text` or `json`
`--dir`	`.`	Target directory for `agent-setup`
`--root`	`~`	Root directory for `scan`
`-v`	off	Verbose/debug logging

Auto-excluded directories

__pycache__  .git  .hg  .svn  .tox  .nox  .mypy_cache  .pytest_cache
node_modules  vendor  bower_components  site-packages
dist  build  out  coverage  .next  .nuxt  .output  .vite  __sapper__
venv  .venv  env  .env

Auto-excluded file patterns

Files matching these patterns are skipped even inside user source directories:

*.min.js   *.min.css   *.bundle.js   *.chunk.js
*.generated.*  chunk-*.js   *_pb2.py   *.pb.go
*.g.dart   *.freezed.dart

Add more: repo2kg build --repo . --exclude migrations fixtures
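
The effect of these rules can be sketched with `fnmatch` from the standard library. This is an illustration only: it uses a small subset of the lists above, `is_excluded` is a hypothetical helper, and the `extra_dirs` argument models the assumption that `--exclude` adds directory names to the built-in list.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Small subsets of the lists above, for illustration.
EXCLUDED_DIRS = {"node_modules", ".git", "dist", "venv", "__pycache__"}
EXCLUDED_PATTERNS = ["*.min.js", "*.bundle.js", "*_pb2.py", "*.generated.*"]


def is_excluded(path: str, extra_dirs: frozenset[str] = frozenset()) -> bool:
    """Return True if a parent directory or the file name matches an exclusion."""
    p = PurePosixPath(path)
    if any(part in EXCLUDED_DIRS | extra_dirs for part in p.parts[:-1]):
        return True
    return any(fnmatch(p.name, pattern) for pattern in EXCLUDED_PATTERNS)


files = [
    "src/app.py",
    "src/static/app.min.js",
    "node_modules/lodash/index.js",
    "api/schema_pb2.py",
    "db/migrations/0001_init.py",
]
print([f for f in files if not is_excluded(f)])
print([f for f in files if not is_excluded(f, frozenset({"migrations"}))])
```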

Python API

# Full mode (FAISS + sentence-transformers)
from repo2kg import RepoKG

kg = RepoKG().build("./my_project")
kg.save("kg.json")                                # JSON
kg.save("kg.toon")                               # TOON (fewer tokens)

kg = RepoKG.load("kg.json")
print(kg.query("auth flow", k=5, depth=2))        # text output
data = kg.query_json("payment", k=3)              # structured dict

# Lightweight mode — stdlib only, safe for agents
from repo2kg import RepoKGLite

kg = RepoKGLite("kg.json")
print(kg.query("auth", k=5, depth=1))             # text output
result = kg.query_json("database", k=3)           # structured dict
callers = kg.get_callers("auth/service.py::AuthService.login")
callees = kg.get_callees("auth/service.py::AuthService.login")

Node Schema

Every node in kg.json / kg.toon:

{
  "auth/jwt.py::JWTService.verify_token": {
    "id":           "auth/jwt.py::JWTService.verify_token",
    "name":         "verify_token",
    "kind":         "method",
    "file":         "auth/jwt.py",
    "parent_class": "JWTService",
    "signature":    "def JWTService.verify_token(self, token: str) -> dict:",
    "docstring":    "Verify and decode a JWT token.",
    "body_preview": "    decoded = jwt.decode(token, self.secret, ...)\n    ...",
    "calls":        ["auth/jwt.py::JWTService.decode", "cache/redis.py::get_user"],
    "callers":      ["auth/service.py::authenticate_user"],
    "imports":      ["jwt", "datetime"]
  }
}
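
In this schema `callers` is the inverse of `calls`. As a sketch of that relationship (not repo2kg's own code; `build_callers` is a hypothetical helper and the nodes are trimmed to the one field used here), the caller lists can be rebuilt from the call edges alone:

```python
from collections import defaultdict

# Two nodes trimmed to the "calls" field; ids follow the "file::Qualified.name" pattern.
kg = {
    "auth/service.py::authenticate_user": {
        "calls": ["auth/jwt.py::JWTService.verify_token"],
    },
    "auth/jwt.py::JWTService.verify_token": {
        "calls": ["cache/redis.py::get_user"],
    },
}


def build_callers(graph: dict) -> dict[str, list[str]]:
    """Invert the `calls` edges: callee id -> list of caller ids."""
    callers = defaultdict(list)
    for caller_id, node in graph.items():
        for callee_id in node.get("calls", []):
            callers[callee_id].append(caller_id)
    return dict(callers)


callers = build_callers(kg)
print(callers["auth/jwt.py::JWTService.verify_token"])
```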

Supported Languages

Language	Extensions	Parser
Python	`.py`	`ast` module (always accurate)
JavaScript	`.js` `.jsx` `.mjs`	tree-sitter / regex
TypeScript	`.ts` `.tsx` `.mts`	tree-sitter / regex
Java	`.java`	tree-sitter / regex
Go	`.go`	tree-sitter / regex
Rust	`.rs`	tree-sitter / regex
C	`.c` `.h`	tree-sitter / regex
C++	`.cpp` `.cc` `.cxx` `.hpp`	tree-sitter / regex
Ruby	`.rb`	tree-sitter / regex
C#	`.cs`	tree-sitter / regex

Interactive Graph Visualization

Generate a fully self-contained HTML file — no server, no CDN, works offline:

repo2kg visualize --kg kg.json --out kg_graph.html
# open kg_graph.html in any browser

What you see

The graph uses a file-hub layout:

Large indigo circles = your source files, sized by how many symbols they contain
Small pill nodes = functions, classes, and methods inside each file
Dashed lines = containment (file → its symbols)
Solid arrows = call edges between symbols
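
The data behind this layout is simple to derive from the node list. A minimal sketch, assuming nodes shaped like the Node Schema (only `name`, `kind`, and `file` are used; the variable names are illustrative):

```python
from collections import Counter

nodes = [
    {"name": "JWTService", "kind": "class", "file": "auth/jwt.py"},
    {"name": "verify_token", "kind": "method", "file": "auth/jwt.py"},
    {"name": "authenticate_user", "kind": "function", "file": "auth/service.py"},
]

# Hub size: number of symbols defined in each file.
symbols_per_file = Counter(node["file"] for node in nodes)

# Containment edges (the dashed lines): file -> symbol.
containment_edges = [(node["file"], node["name"]) for node in nodes]

print(symbols_per_file.most_common())
print(containment_edges)
```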

What's included

Only user-written code is rendered. Third-party and bundled files are automatically excluded from the HTML graph (but remain in the KG file for queries):

Excluded path pattern	Example
`.vite/deps/`	Vite-bundled npm packages
`node_modules/`	Raw npm packages
`site-packages/`	Python pip packages
`/dist/`	Build output
`/.cache/`	Build caches

This typically reduces a 1500-node cluttered graph down to just your 100–300 meaningful nodes.

Interactions

Action	Result
Click a file node	Sidebar lists all its classes, functions, methods
Click a symbol node	Sidebar shows signature, docstring, calls, callers
Toggle Classes / Functions / Methods	Show/hide node kinds
Search box	Highlights matching nodes across files
Fit button	Auto-zooms to show the full graph
Drag any node	Pin it in place

Limitations

Static analysis only — captures defined calls, not dynamic dispatch or runtime-generated calls
Name-based call resolution — same-file preferred; no full type inference
No incremental updates — rebuilds the full graph each time (30–60s for large repos)
Visualization shows user code only — third-party files (.vite/deps/, node_modules/, site-packages/) are kept in the KG file but excluded from the HTML graph

Contributing

Contributions welcome! Open an issue first for major changes.

git clone https://github.com/mugenGH/repo2kg.git
cd repo2kg
pip install -e .

License

MIT
