Skip to main content

Build a code Knowledge Graph from any repository (Python/JS/TS/Java/Go/Rust/C/C++/Ruby/C#) for token-efficient AI agent context. Supports JSON and TOON output formats. Agent-friendly: generates CLAUDE.md, CODEBASE.md, and supports zero-dep queries.

Project description

  ██████╗ ███████╗██████╗  ██████╗ ██████╗ ██╗  ██╗ ██████╗
  ██╔══██╗██╔════╝██╔══██╗██╔═══██╗╚════██╗██║ ██╔╝██╔════╝
  ██████╔╝█████╗  ██████╔╝██║   ██║ █████╔╝█████╔╝ ██║  ███╗
  ██╔══██╗██╔══╝  ██╔═══╝ ██║   ██║██╔═══╝ ██╔═██╗ ██║   ██║
  ██║  ██║███████╗██║      ╚██████╔╝███████╗██║  ██╗╚██████╔╝
  ╚═╝  ╚═╝╚══════╝╚═╝       ╚═════╝ ╚══════╝╚═╝  ╚═╝ ╚═════╝

Turn any repository into a Knowledge Graph for token-efficient AI agent context.

PyPI version Python License: MIT


The Problem

Every time you ask Claude Code, Copilot, or Codex to work on your project, it reads hundreds of files — burning through your context window before it even starts.

Without repo2kg                      With repo2kg
────────────────────                 ─────────────────────
Agent reads 229 files                Agent reads CODEBASE.md
→ ~53,000+ tokens used               → ~4,000 tokens (85%-95% saved)
→ Slow, hits context limits          → Fast, precise, relevant
→ New session = start over           → KG persists forever

repo2kg builds a searchable graph of your entire codebase — functions, classes, call edges — so agents query exactly what they need instead of everything.


Token Efficiency: Real Numbers

Tested on a project with 229 source files, 2,909 nodes, and 4,615 call edges. The agent understood the full architecture without opening a single raw source file.

With repo2kg

The agent ran four commands:

repo2kg list
repo2kg stats --kg kg.json
repo2kg query-lite "main app entry" --kg kg.json
repo2kg query-lite "authentication login jwt token" --kg kg.json
repo2kg export --kg kg.json
Step Tokens
stats output ~100
4 × query-lite results ~2,400
CODEBASE.md overview ~1,500
Total ~4,000 tokens

Without repo2kg

# Agent must search and read files manually
glob file listing  grep functions/classes  read source files
Step Tokens
File listing ~3,000
Reading 10–20 source files ~15,000–40,000
Grep results ~5,000–10,000
Total ~23,000–53,000 tokens

Result

Method Tokens Quality
repo2kg ~4,000 High — structured graph, signatures, call edges
Normal search ~23,000–53,000 Medium — raw code scanning
Reduction:  83–92% fewer tokens during codebase discovery
Full session savings (including messages):  ~40–60%

The larger and more interconnected your repo, the bigger the gain:

Project Size Files Token Savings
Small < 20 10–20%
Medium 20–100 40–70%
Large 100–500 70–90%
Monorepo 500+ 85–95%

Quick Start

pip install repo2kg

# 1. One-time global setup (runs in seconds, no heavy deps)
repo2kg user-setup

# 2. Build a KG for your project
cd /your/project
repo2kg build --repo . --out kg.json

# 3. Generate agent instruction files
repo2kg agent-setup --kg kg.json --dir .

# 4. Query it
repo2kg query "how does authentication work" --kg kg.json

Installation

pip install repo2kg

Requirements: Python 3.10+

Dependencies auto-installed:

Package Size When Used
numpy ~20 MB build, query
faiss-cpu ~60 MB build, query (CPU only, no GPU/CUDA needed)
sentence-transformers ~2 GB (PyTorch) build, query
tree-sitter + 9 grammars ~10 MB build (AST parsing)

Note: query-lite, user-setup, export, agent-setup, list, and register all start instantly — they never load PyTorch or FAISS. Only build and query load the heavy dependencies.


How It Works

Your Repository
       │
       ▼
┌──────────────────────────────────────┐
│  Parse all source files              │  ← tree-sitter (accurate) or regex (fallback)
│  Python/JS/TS/Java/Go/Rust/C/C++    │
│  Ruby/C#  (10 languages)            │
└──────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│  Build Knowledge Graph               │
│  • Nodes: functions, classes, methods│
│  • Edges: who calls whom             │
│  • Data: signatures, docstrings,     │
│    8-line body previews              │
└──────────────────────────────────────┘
       │                    │
       ▼                    ▼
  kg.json / kg.toon     FAISS index
  (structured data)     (semantic vectors)
       │                    │
       └────────┬───────────┘
                ▼
         query / query-lite
         → token-minimal context for your agent

Two query modes

Command How it searches Deps needed Speed
query Semantic similarity (FAISS + embeddings) PyTorch + FAISS Slower start
query-lite Keyword matching + graph expansion None (stdlib only) Instant

Both do graph expansion — they find entry-point nodes, then follow call edges to return related code automatically.

Two output formats

Format Extension Best for
JSON .json Universal, default
TOON .toon LLM context windows (~40% fewer tokens)

Complete Workflow

Step 1 — Global setup (once ever)

repo2kg user-setup

Writes agent instructions to:

  • ~/.claude/CLAUDE.md — Claude Code reads this at the start of every session
  • ~/.codex/AGENTS.md — Codex equivalent
  • ~/.repo2kg/GLOBAL_AGENTS.md — reference doc

From this point on, any registered project is automatically discovered by your agent.


Step 2 — Discover & register existing projects

repo2kg scan                     # scans your entire home directory
repo2kg scan --root ~/projects   # or a specific folder

Finds every KG file (by detecting companion .faiss files) and registers them in ~/.repo2kg/registry.json. Re-run any time.


Step 3 — Build a KG for your project

cd /your/project

repo2kg build --repo . --out kg.json     # JSON (default)
repo2kg build --repo . --out kg.toon     # TOON (40% fewer tokens for LLMs)

Output:

KG ready: 421 nodes, 892 call edges from 63 files
Languages: javascript: 38, python: 25
Saved to kg.json [JSON] (421 nodes)

Three files are created: kg.json, kg.json.faiss, kg.json.idx


Step 4 — Generate agent files

repo2kg agent-setup --kg kg.json --dir .

Creates five files:

File Purpose
CODEBASE.md Full overview — file map, classes, call graph, all signatures
CLAUDE.md Tells Claude Code to use the KG instead of reading files
.copilot-instructions.md Same, for GitHub Copilot
.github/copilot-instructions.md Same, for Copilot agent mode
AGENTS.md Node schema + query strategies for multi-agent systems
# Commit everything
git add CLAUDE.md CODEBASE.md AGENTS.md kg.json kg.json.faiss kg.json.idx
git commit -m "Add repo2kg knowledge graph"

Step 5 — Query it

# Keyword search — zero deps, instant
repo2kg query-lite "authentication" --kg kg.json
repo2kg query-lite "database connection" --kg kg.json --k 10 --depth 2

# Semantic search — more accurate, needs FAISS
repo2kg query "how does auth work" --kg kg.json
repo2kg query "payment processing flow" --kg kg.json --depth 2

# Structured JSON output (for scripts/agents)
repo2kg query-lite "auth" --kg kg.json --format json

Sample output:

# Query: authentication
# Nodes returned: 8  |  ~210 tokens
# Entry points: verify_token, JWTService, authenticate_user
# ─────────────────────────────────────────────

# METHOD: verify_token
# file: auth/jwt.py  class: JWTService
def JWTService.verify_token(self, token: str) -> dict:
"""Verify and decode a JWT token."""
    decoded = jwt.decode(token, self.secret, algorithms=["HS256"])
    return self.get_user_from_cache(decoded["sub"])
# calls: decode, get_user_from_cache
# called_by: authenticate_user, AuthMiddleware.process_request

Tree-Sitter Parsing

repo2kg uses tree-sitter for all non-Python languages — installed automatically with pip install repo2kg. No extra steps needed.

Advantage over regex Detail
Multi-line signatures Captured correctly
Nested classes Exact parent class
Code in strings/comments Ignored (no false matches)
Generic types List<T> Handled correctly
Arrow functions in classes Correct method detection

Python always uses Python's own ast module. If tree-sitter fails for any file, repo2kg silently falls back to regex.


TOON Format

TOON (Token-Oriented Object Notation) is a compact line-oriented format optimised for LLM context windows. It uses ~40% fewer tokens than JSON.

repo2kg build --repo . --out kg.toon     # Save as TOON
repo2kg query-lite "auth" --kg kg.toon   # Query TOON directly

Example TOON output for a single node:

nodes[788]:
  -
    id: "src/auth.ts::AuthService"
    name: AuthService
    kind: class
    file: src/auth.ts
    signature: export class AuthService
    docstring: Handles JWT authentication
    calls[2]: verifyToken,refreshToken
    callers[0]:

All commands (build, query, query-lite, export, agent-setup, stats) work with .toon files identically to .json.


How Agents Use It (Zero Dependencies)

The global ~/.claude/CLAUDE.md (installed by user-setup) tells the agent to run this at the start of every task:

import json
from pathlib import Path

# Walk up from cwd to find the closest registered project
registry = json.load(open(Path.home() / ".repo2kg" / "registry.json"))
check = Path.cwd()
while check != check.parent:
    if str(check) in registry["projects"]:
        kg = json.load(open(registry["projects"][str(check)]["kg"]))
        break
    check = check.parent

# Search — no FAISS, no embeddings, pure stdlib
matches = [n for n in kg.values() if "auth" in n["name"].lower()]
for n in matches[:10]:
    print(n["signature"], "—", n.get("docstring", "")[:100])
    print("calls:", [c.split("::")[-1] for c in n.get("calls", [])])

The agent only reads actual source files when the 8-line body_preview is not enough.


CLI Reference

repo2kg build        Build KG from a repository
repo2kg query        Semantic search (requires FAISS)
repo2kg query-lite   Keyword search (zero dependencies)
repo2kg export       Export as CODEBASE.md
repo2kg agent-setup  Generate all agent instruction files
repo2kg user-setup   Install global agent instructions (run once)
repo2kg scan         Auto-discover and register all KGs under a directory
repo2kg register     Register a single project manually
repo2kg list         Show all registered projects
repo2kg stats        Show KG node/edge statistics
repo2kg info         Print machine-readable tool info (for agents)

Flags

Flag Default Description
--repo . Repository root to scan
--out kg.json Output path (.json or .toon)
--kg kg.json Path to a saved KG file
--k 5 Number of top results
--depth 1 Graph traversal depth (1 = direct calls, 2 = calls of calls)
--format text Output format: text or json
--dir . Target directory for agent-setup
--root ~ Root directory for scan
-v off Verbose/debug logging

Auto-excluded directories

__pycache__  .git  node_modules  .tox  .venv  venv  env
.mypy_cache  .pytest_cache  dist  build  site-packages

Add more: repo2kg build --repo . --exclude migrations fixtures


Python API

# Full mode (FAISS + sentence-transformers)
from repo2kg import RepoKG

kg = RepoKG().build("./my_project")
kg.save("kg.json")                                # JSON
kg.save("kg.toon")                               # TOON (fewer tokens)

kg = RepoKG.load("kg.json")
print(kg.query("auth flow", k=5, depth=2))        # text output
data = kg.query_json("payment", k=3)              # structured dict
# Lightweight mode — stdlib only, safe for agents
from repo2kg import RepoKGLite

kg = RepoKGLite("kg.json")
print(kg.query("auth", k=5, depth=1))             # text output
result = kg.query_json("database", k=3)           # structured dict
callers = kg.get_callers("auth/service.py::AuthService.login")
callees = kg.get_callees("auth/service.py::AuthService.login")

Node Schema

Every node in kg.json / kg.toon:

{
  "auth/jwt.py::JWTService.verify_token": {
    "id":           "auth/jwt.py::JWTService.verify_token",
    "name":         "verify_token",
    "kind":         "method",
    "file":         "auth/jwt.py",
    "parent_class": "JWTService",
    "signature":    "def JWTService.verify_token(self, token: str) -> dict:",
    "docstring":    "Verify and decode a JWT token.",
    "body_preview": "    decoded = jwt.decode(token, self.secret, ...)\n    ...",
    "calls":        ["auth/jwt.py::JWTService.decode", "cache/redis.py::get_user"],
    "callers":      ["auth/service.py::authenticate_user"],
    "imports":      ["jwt", "datetime"]
  }
}

Supported Languages

Language Extensions Parser
Python .py ast module (always accurate)
JavaScript .js .jsx .mjs tree-sitter / regex
TypeScript .ts .tsx .mts tree-sitter / regex
Java .java tree-sitter / regex
Go .go tree-sitter / regex
Rust .rs tree-sitter / regex
C .c .h tree-sitter / regex
C++ .cpp .cc .cxx .hpp tree-sitter / regex
Ruby .rb tree-sitter / regex
C# .cs tree-sitter / regex

Limitations

  • Static analysis only — captures defined calls, not dynamic dispatch or runtime-generated calls
  • Name-based call resolution — same-file preferred; no full type inference
  • No incremental updates — rebuilds the full graph each time (30–60s for large repos)

Contributing

Contributions welcome! Open an issue first for major changes.

git clone https://github.com/mugenGH/repo2kg.git
cd repo2kg
pip install -e .

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

repo2kg-0.5.11.tar.gz (43.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

repo2kg-0.5.11-py3-none-any.whl (43.7 kB view details)

Uploaded Python 3

File details

Details for the file repo2kg-0.5.11.tar.gz.

File metadata

  • Download URL: repo2kg-0.5.11.tar.gz
  • Upload date:
  • Size: 43.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for repo2kg-0.5.11.tar.gz
Algorithm Hash digest
SHA256 38f8baa2a0544fb2be62ec47e3e8da6a4fbe6abd1c617aa7f645d415785d8135
MD5 8c28ff5572269dd0ed78b81e0e0c82fb
BLAKE2b-256 1d81325e8e1d0533be62f1f8ff11f492415102c6114051800f161c4b10b7c330

See more details on using hashes here.

File details

Details for the file repo2kg-0.5.11-py3-none-any.whl.

File metadata

  • Download URL: repo2kg-0.5.11-py3-none-any.whl
  • Upload date:
  • Size: 43.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for repo2kg-0.5.11-py3-none-any.whl
Algorithm Hash digest
SHA256 4ab192dbd32cce08c84507c4ae2d201939ee7a789cf931c7385be7576b9895e0
MD5 0f23b52a624a527eb969dd59843a2655
BLAKE2b-256 b5fb4a000840eba01a863d4ce1c298e6c54d6d5809b58d01b814e8fe7a0859de

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page