Extract knowledge graphs from source code repositories. Rank relevant nodes with Personalized PageRank for LLM context. No LLM dependency — bring your own model.

code2graph

Turn a source code repository into a queryable knowledge graph — no LLM required.

code2graph statically extracts the structure of a codebase — files, modules, functions, classes, calls, dependencies, schemas, infrastructure — as a typed graph of nodes and edges. Rank the most relevant nodes for any query with Personalized PageRank and pass focused context to any LLM.

Pure Python. No LLM dependency. Bring your own model.


Quick start

pip install codebase2graph

# Extract full graph from a repo
codebase2graph /path/to/repo --graph all --output repo.graph.json

# Call graph only
codebase2graph /path/to/repo --graph call --output calls.graph.json

# With actionable summary
codebase2graph /path/to/repo --graph all \
  --output repo.graph.json \
  --summary-output repo.summary.json

from code2graph import build_graph

graph = build_graph("/path/to/repo", graph_type="all")
# graph.nodes — list of Node objects
# graph.edges — list of Edge objects

Why graph-based code context?

| Approach | What you lose |
| --- | --- |
| Dump entire codebase into prompt | Token budget, focus |
| Embed + search file chunks | Call relationships, module structure, dependency chains |
| code2graph | Nothing; relationships are explicit, labeled edges |

The graph knows that auth.login() calls db.query(), which imports connection_pool, which depends on config.DATABASE_URL. Flat file chunks don't.
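To illustrate what explicit edges make possible, the chain above can be followed with a short traversal. This is a minimal sketch over hand-written edge dicts in the `from`/`to`/`label` shape of the graph JSON. The node IDs for `connection_pool` and `config.DATABASE_URL`, the `depends_on` label, and the `reachable` helper are assumptions made for the example, not output captured from code2graph.

```python
from collections import deque

# Hypothetical edges mirroring the chain described above.
edges = [
    {"from": "function:auth.login", "to": "function:db.query", "label": "calls"},
    {"from": "function:db.query", "to": "module:connection_pool", "label": "imports"},
    {"from": "module:connection_pool", "to": "constant:config.DATABASE_URL", "label": "depends_on"},
    {"from": "function:auth.logout", "to": "function:session.clear", "label": "calls"},
]


def reachable(edges, start):
    """Return node IDs reachable from `start`, in breadth-first order."""
    adjacency = {}
    for e in edges:
        adjacency.setdefault(e["from"], []).append(e["to"])

    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


chain = reachable(edges, "function:auth.login")
print(chain)
```

The unrelated `auth.logout` edge never shows up in the result, which is the kind of focus that flat file chunks cannot provide on their own.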


Graph types

| Type | What it extracts |
| --- | --- |
| folder | Repo, folder, file nodes with contains edges |
| call | Functions/methods with calls and defines edges (Python, JS, TS) |
| entity | Classes, functions, constants with defines and imports edges |
| schema | Database tables, columns, foreign keys (SQL, ORM models) |
| workflow | CI/CD pipelines, GitHub Actions, Makefile targets |
| infra | Dockerfiles, docker-compose, Terraform, Kubernetes manifests |
| security | Hardcoded secrets patterns, dangerous function calls, exposed endpoints |
| web | React/Vue components, routes, API endpoints |
| android | Activities, services, permissions from AndroidManifest.xml |
| decision | ADR-style architecture decisions |
| all | Merged graph from all applicable extractors |

codebase2graph /path/to/repo --graph call   --output call.graph.json
codebase2graph /path/to/repo --graph schema --output schema.graph.json
codebase2graph /path/to/repo --graph infra  --output infra.graph.json
codebase2graph /path/to/repo --graph all    --output full.graph.json

Installation

pip install codebase2graph

No extra dependencies required — all graph types work with the standard install.


Python API

Build a graph

from code2graph import build_graph, Graph, Node, Edge

# Full graph
graph: Graph = build_graph("/path/to/repo", graph_type="all")

# Specific type
call_graph = build_graph("/path/to/repo", graph_type="call")
schema_graph = build_graph("/path/to/repo", graph_type="schema")

Inspect results

print(f"{len(graph.nodes)} nodes, {len(graph.edges)} edges")

# Filter by kind
functions = [n for n in graph.nodes if n.attributes.get("kind") == "function"]
calls = [e for e in graph.edges if e.label == "calls"]

Export

import json

# To dict
d = {"nodes": [vars(n) for n in graph.nodes], "edges": [vars(e) for e in graph.edges]}

# Use a context manager so the file is flushed and closed
with open("graph.json", "w") as f:
    json.dump(d, f, indent=2)

Graph output format

{
  "nodes": [
    {
      "id": "function:auth.login",
      "label": "login",
      "attributes": {
        "kind": "function",
        "module": "auth",
        "file": "src/auth.py",
        "line": 42
      },
      "content": "def login(username, password): ..."
    }
  ],
  "edges": [
    {
      "id": "edge:auth.login:calls:db.query",
      "from": "function:auth.login",
      "to": "function:db.query",
      "label": "calls"
    }
  ],
  "current_node_id": "repo"
}
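Because the output is plain JSON, ranking can also be done outside the library. The block below is a minimal sketch of Personalized PageRank by power iteration over a dict in the format above. It is a generic textbook formulation, not code2graph's own ranking code (which this README does not show). The function name, its parameters, and the sample node IDs are all assumptions for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Rank node IDs by Personalized PageRank, restarting at `seeds`."""
    nodes = [n["id"] for n in graph["nodes"]]
    out_links = {n: [] for n in nodes}
    for e in graph["edges"]:
        if e["from"] in out_links and e["to"] in out_links:
            out_links[e["from"]].append(e["to"])

    seed_set = {s for s in seeds if s in out_links}
    if not seed_set:
        raise ValueError("no seed matches a node id in the graph")

    # Restart distribution: uniform over the seed nodes.
    restart = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}
    rank = dict(restart)

    for _ in range(iterations):
        # Mass sitting on nodes without outgoing edges goes back to the seeds.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        nxt = {n: ((1.0 - damping) + damping * dangling) * restart[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
        rank = nxt

    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical graph: a short call chain plus one unrelated node.
sample = {
    "nodes": [
        {"id": "function:auth.login"},
        {"id": "function:db.query"},
        {"id": "module:connection_pool"},
        {"id": "function:billing.invoice"},
    ],
    "edges": [
        {"from": "function:auth.login", "to": "function:db.query", "label": "calls"},
        {"from": "function:db.query", "to": "module:connection_pool", "label": "imports"},
    ],
}

ranked = personalized_pagerank(sample, seeds=["function:auth.login"])
for node_id, score in ranked:
    print(f"{score:.4f}  {node_id}")
```

Seeding with the nodes that match a query biases the ranking toward their neighborhood, so nodes that cannot be reached from the seeds (here `billing.invoice`) receive no score and can be dropped from the LLM context.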

CLI reference

codebase2graph <repo> [options]

Arguments:
  repo                    Path to the repository root

Options:
  --graph TYPE            Graph type: folder, call, entity, schema, workflow,
                          infra, security, web, android, decision, all
                          (default: all)
  --output PATH           Write graph JSON to this file (default: stdout)
  --pretty                Pretty-print JSON output
  --summary-output PATH   Write graph summary JSON (entrypoints, fan-in/out nodes)
  --update-existing PATH  Update an existing graph JSON in place
  --update-summary-output PATH
                          Write update diff summary JSON
  -h, --help              Show help
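The summary written by `--summary-output` is described as covering entrypoints and fan-in/fan-out nodes. As a rough illustration of what those terms mean on an edge list, here is a sketch that computes them from edges in the graph JSON shape. The actual summary format is not documented here, so the helper names and the definition of an entrypoint (a caller that nothing else calls) are assumptions.

```python
from collections import Counter


def fan_counts(edges, label="calls"):
    """Count incoming and outgoing edges per node for one edge label."""
    fan_in, fan_out = Counter(), Counter()
    for e in edges:
        if e["label"] == label:
            fan_out[e["from"]] += 1
            fan_in[e["to"]] += 1
    return fan_in, fan_out


def entrypoints(edges, label="calls"):
    """Nodes that call something but are never called themselves."""
    fan_in, fan_out = fan_counts(edges, label)
    return sorted(n for n in fan_out if fan_in[n] == 0)


# Hypothetical edges for illustration.
edges = [
    {"from": "function:cli.main", "to": "function:auth.login", "label": "calls"},
    {"from": "function:cli.main", "to": "function:db.query", "label": "calls"},
    {"from": "function:auth.login", "to": "function:db.query", "label": "calls"},
    {"from": "module:auth", "to": "function:auth.login", "label": "defines"},
]

fan_in, fan_out = fan_counts(edges)
print("highest fan-in:", fan_in.most_common(1))
print("highest fan-out:", fan_out.most_common(1))
print("entrypoints:", entrypoints(edges))
```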

Update mode

Rebuild a graph from the current repository state while preserving stable node IDs and custom attributes added outside code2graph:

codebase2graph /path/to/repo --graph all \
  --update-existing repo.graph.json \
  --update-summary-output repo.update.json

Update mode removes stale nodes/edges for deleted or changed code, adds new nodes/edges, and keeps stable IDs for nodes that haven't changed. Custom attributes on existing nodes are preserved.
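The merge behavior described above can be pictured with a small sketch. This is an illustration of the stated semantics on plain node dicts, not the library's implementation. The `merge_update` helper, the `owner` custom attribute, and the rule that freshly extracted attributes override old ones on conflict are all assumptions.

```python
def merge_update(existing_nodes, fresh_nodes):
    """Merge a fresh extraction into an existing node list, keyed by node ID."""
    old_by_id = {n["id"]: n for n in existing_nodes}
    fresh_ids = {n["id"] for n in fresh_nodes}
    merged = []
    summary = {"added": [], "removed": [], "kept": []}

    for node in fresh_nodes:
        old = old_by_id.get(node["id"])
        if old is None:
            summary["added"].append(node["id"])
            merged.append(node)
            continue
        # Fresh attributes win; keys present only on the old node are
        # treated as custom and carried over.
        attrs = {**old.get("attributes", {}), **node.get("attributes", {})}
        merged.append({**node, "attributes": attrs})
        summary["kept"].append(node["id"])

    # Anything no longer produced by the extractor is stale.
    summary["removed"] = [i for i in old_by_id if i not in fresh_ids]
    return merged, summary


existing = [
    {"id": "function:auth.login", "attributes": {"line": 42, "owner": "team-auth"}},
    {"id": "function:auth.legacy_login", "attributes": {"line": 90}},
]
fresh = [
    {"id": "function:auth.login", "attributes": {"line": 50}},
    {"id": "function:auth.signup", "attributes": {"line": 75}},
]

merged, summary = merge_update(existing, fresh)
print(summary)
```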


Use cases

  • Code review — extract call graph before/after a PR to see what changed structurally
  • LLM code assistance — pass ranked subgraph as context instead of dumping whole files
  • Dependency analysis — find all callers of a function, all modules depending on a service
  • Security audit — detect hardcoded secrets, dangerous API patterns, exposed endpoints
  • Architecture docs — extract infra + schema + decision graphs for living documentation
  • Onboarding — give a new developer a ranked subgraph of the most important entry points
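For the code review case, one possible approach is to extract a call graph on each side of a PR and compare the edge sets. The sketch below assumes two graphs already loaded as dicts in the graph JSON format. The helper names and the sample edges are hypothetical.

```python
def edge_keys(graph, label="calls"):
    """Set of (from, to) pairs for edges with the given label."""
    return {(e["from"], e["to"]) for e in graph["edges"] if e["label"] == label}


def diff_calls(before, after):
    """Report call edges added and removed between two graphs."""
    old, new = edge_keys(before), edge_keys(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical graphs extracted before and after a change.
before = {"edges": [
    {"from": "function:auth.login", "to": "function:db.query", "label": "calls"},
    {"from": "function:auth.login", "to": "function:log.info", "label": "calls"},
]}
after = {"edges": [
    {"from": "function:auth.login", "to": "function:db.query", "label": "calls"},
    {"from": "function:auth.login", "to": "function:cache.get", "label": "calls"},
]}

changes = diff_calls(before, after)
print(changes)
```

Comparing by endpoint pair rather than by edge `id` keeps the diff independent of how edge IDs are generated.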

Design principles

  • Pure Python — no LLM, no cloud, no database required
  • Deterministic — same repository state always produces the same graph
  • Static analysis only — no code execution, safe to run on any codebase
  • Works with any model — output is plain JSON; pass to GPT-4, Claude, Llama, or any other model
  • Companion to docs2graph — same node/edge schema, combine code and documentation graphs

Related projects

| Package | What it does |
| --- | --- |
| docs2graph | Documents → knowledge graph (same node/edge schema) |
| graph2sql | Graph-based schema analysis for text-to-SQL |

Contributing

See CONTRIBUTING.md.

git clone https://github.com/jw-open/code2graph
cd code2graph
pip install -e ".[dev]"
pytest tests/ -v

License

Apache-2.0 — see LICENSE
