A tool to build a searchable knowledge graph from Python repositories

PyCodeKG — A Knowledge Graph for Python Codebases

PyCodeKG turns a Python codebase into a deterministic, queryable knowledge graph — and uses it to produce architectural analyses you can act on, with or without an LLM in the loop.

It walks the AST of every module, class, function, and method in your repo, extracts the typed relationships that actually hold the code together (CONTAINS, CALLS, IMPORTS, INHERITS, RESOLVES_TO), and stores the result in SQLite. A LanceDB vector index sits alongside the graph so that "authentication flow" and "verify_jwt" both find the right place to start exploring. From there you can rank functions by structural importance, trace fan-in across import aliases, detect circular imports and dead code, render the call graph in 3D, snapshot metrics for diffing across releases, or hand the whole thing to Claude over MCP.

The original motivation was simple: produce thorough, defensible analyses of Python codebases that don't depend on inference. Every result is computed from the AST and the graph — no model is asked to guess. When an LLM is present, it consumes the same grounded output as a structured context pack, and the hallucinations that plague "embed-the-repo" tools largely disappear.

Everything runs on your laptop. No cloud APIs, no quotas, no source code leaving the machine.

Technical Paper (PDF) · Author: Eric G. Suchanek, PhD — Flux-Frontiers, Liberty TWP, OH


Sister projects

PyCodeKG is part of a growing family of knowledge-graph systems that share the same hybrid semantic-plus-structural design — each one applies it to a different kind of corpus:

  • DocKG — Markdown and prose. Indexes PyCodeKG's own documentation, so the docs you're reading are themselves a queryable graph.
  • MetaboKG — metabolic pathway data (KEGG, SBML, BioPAX), with FBA / ODE simulation on top of the graph.
  • DiaryKG — personal journals and diary corpora; semantic search and graph traversal over a writer's body of work.
  • FTreeKG — filesystem trees as a queryable graph of directories, files, and contents.
  • AgentKG — conversational memory as a knowledge graph: turns, decisions, commitments, preferences, and the relationships between them.

Together they form KGRAG, a federated retrieval layer where one query can span code, documentation, journals, filesystems, agent memory, and domain data simultaneously.


Two ways to use it

PyCodeKG is designed to be useful at both ends — as a standalone command-line analysis tool, and as a structured context layer for AI agents.

1. Standalone — pycodekg analyze

This is the bread and butter. One command, one repo, one architectural report:

pycodekg build --repo .                              # one-time index
pycodekg analyze .                                   # the report

analyze walks the graph and produces:

  • Complexity hotspots — high fan-in (broadly depended on, breaking-change risk) and high fan-out (orchestrators, refactoring candidates) functions, with risk levels
  • Docstring coverage — broken down by module, class, function, method
  • Circular import cycles — module loops that cause hard-to-debug failures
  • Orphaned functions — dead-code candidates with line counts (with caveats about entry points and reflection)
  • Module coupling — the import graph, with the most tightly coupled pairs called out
  • Issues and strengths — high-level callouts suitable for a design review or release note
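As an illustration of the circular-import check listed above, here is a minimal sketch. It is not PyCodeKG's implementation: it runs Tarjan's strongly-connected-components algorithm over a hypothetical module-to-imports mapping, and the function name find_import_cycles and the sample module names are assumptions made for the example.

```python
def find_import_cycles(imports: dict[str, set[str]]) -> list[list[str]]:
    """Return groups of modules that import each other (Tarjan's SCC algorithm)."""
    index: dict[str, int] = {}   # discovery order of each module
    low: dict[str, int] = {}     # lowest discovery index reachable from each module
    stack: list[str] = []
    on_stack: set[str] = set()
    cycles: list[list[str]] = []
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        index[node] = low[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in sorted(imports.get(node, ())):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:
            # node is the root of a strongly connected component: pop it off
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            # keep only real loops: several modules, or a module importing itself
            if len(component) > 1 or node in imports.get(node, ()):
                cycles.append(sorted(component))

    for module in sorted(imports):
        if module not in index:
            visit(module)
    return cycles


# Hypothetical import graph: auth and session import each other.
imports = {
    "app.api": {"app.auth", "app.models"},
    "app.auth": {"app.models", "app.session"},
    "app.session": {"app.auth"},
    "app.models": set(),
}
print(find_import_cycles(imports))  # [['app.auth', 'app.session']]
```

Reporting whole components rather than individual paths keeps the output short when several loops share modules.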

analyze writes a Markdown report for humans and a timestamped JSON snapshot for tooling, CI gates, and trend tracking. Reach for it before any non-trivial refactor, at every release, and whenever you inherit an unfamiliar codebase. Full reference: docs/Analyze.md.

pycodekg analyze --quiet --json ~/.claude/pycodekg_analysis_latest.json
jq '.docstring_coverage.total' ~/.claude/pycodekg_analysis_latest.json
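A JSON snapshot like this can back a CI gate. The sketch below relies only on what the jq call above shows, namely that the snapshot has a docstring_coverage.total field. That the value is numeric, the threshold of 80, and the helper names coverage_gate and main are assumptions for illustration; check docs/Analyze.md for the actual schema.

```python
import json
import sys
from pathlib import Path


def coverage_gate(snapshot: dict, minimum: float) -> bool:
    """Return True when the snapshot's total docstring coverage meets the minimum.

    Assumes docstring_coverage.total is a number in the same unit as `minimum`.
    """
    total = float(snapshot["docstring_coverage"]["total"])
    return total >= minimum


def main(path: str, minimum: float) -> int:
    """Load a snapshot file and return a process exit code (0 = pass, 1 = fail)."""
    snapshot = json.loads(Path(path).expanduser().read_text(encoding="utf-8"))
    if coverage_gate(snapshot, minimum):
        return 0
    print(f"docstring coverage below {minimum}", file=sys.stderr)
    return 1


# In-memory example with made-up numbers, so the sketch runs without a snapshot file.
example = {"docstring_coverage": {"total": 82.5}}
print(coverage_gate(example, minimum=80.0))  # True
print(coverage_gate(example, minimum=90.0))  # False
```

In a pipeline, main would be called with the path passed to --json and its return value handed to sys.exit.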

2. Agentic — MCP server for grounded AI workflows

Run pycodekg mcp and Claude (or any MCP-aware client) gets nineteen tools backed by the same graph: graph_stats, query_codebase, pack_snippets, get_node, list_nodes, callers, explain, centrality, bridge_centrality, framework_nodes, analyze_repo, snapshot_list / show / diff, and more. Setup for Claude Code, Claude Desktop, Kilo Code, Copilot, and Cline is a single line — see docs/MCP.md and docs/INSTALLATION.md.

The agent benefit isn't subtle. Tools like pack_snippets return actual source with line numbers and surrounding context; callers returns the real fan-in resolved across import aliases, not a regex's best guess. The agent stops fabricating function signatures and starts citing them. Multi-step workflows — "find the auth path, list its callers, summarize what changes if I rename it" — collapse from dozens of greps and file reads into a handful of source-grounded calls.

Independent assessments tend to put it the same way:

"PyCodeKG compresses a multi-step workflow — semantic search, graph expansion, caller tracing, snippet retrieval, and architectural summarization — into a small set of tools that are fast to invoke and easy to chain. In practice, it let me move from broad orientation to intent-driven discovery and then to structural validation without dropping down into manual grep or repeated file reads." — GPT-5 (via Cline)

"What sets it apart from 'search the repo with embeddings' tools is the structural layer… Verdict: 4.5/5 — recommend without reservation for any non-trivial Python codebase." — Claude Opus 4.7

"PyCodeKG is dramatically more effective than traditional grep/file-reading workflows. Unique value: hybrid search combining natural-language intent with precise structural relationships." — Claude Haiku 4.5

Full reports in assessments/.


Get started in 60 seconds

Requirements: Python ≥ 3.12, < 3.14

pip install 'pycode-kg[viz,viz3d]'        # base + Streamlit + 3-D viewer

cd /path/to/your/repo
pycodekg init --repo .                    # download model, build graph, install hooks, snapshot
pycodekg analyze .                        # the architectural report

That's the recommended path. Variants (minimal install, MCP-only, contributor setup) are in docs/INSTALLATION.md. Every CLI subcommand is also exposed as a script alias (pycodekg-analyze, pycodekg-build, pycodekg-mcp, …) for use in Makefiles and Poetry projects.


How retrieval works

Search is hybrid by design. A query like "authentication flow" runs in two phases:

  1. Vector phase: the query is embedded with a local sentence-transformer (cached after first download) and LanceDB returns the k closest functions, classes, and modules by cosine similarity.
  2. Graph expansion phase: each seed hit is expanded by up to hop breadth-first (BFS) steps along the typed edges (CONTAINS, CALLS, IMPORTS, INHERITS, RESOLVES_TO), so call chains and module relationships surface alongside the names that matched.
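The two phases can be sketched in plain Python. This is a toy model under stated assumptions, not PyCodeKG's code: hand-written three-dimensional vectors stand in for sentence-transformer embeddings, a dictionary stands in for LanceDB, and edges are followed in both directions during expansion (the source does not say which direction the real traversal uses). The names vector_seeds and expand and all node names are hypothetical.

```python
import math
from collections import deque


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_seeds(query_vec: list[float], embeddings: dict[str, list[float]], k: int) -> list[str]:
    """Phase 1: return the k nodes whose embeddings are closest to the query."""
    ranked = sorted(embeddings, key=lambda n: cosine(query_vec, embeddings[n]), reverse=True)
    return ranked[:k]


def expand(seeds: list[str], edges: list[tuple[str, str, str]], hop: int) -> dict[str, int]:
    """Phase 2: BFS from the seeds, at most `hop` steps; returns node -> distance."""
    neighbors: dict[str, list[str]] = {}
    for src, _rel, dst in edges:
        # Assumption: expansion ignores edge direction.
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == hop:
            continue  # hop budget used up on this branch
        for nxt in neighbors.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


# Toy data: made-up embeddings and typed edges.
embeddings = {
    "auth.verify_jwt": [0.9, 0.1, 0.0],
    "auth.login": [0.8, 0.3, 0.1],
    "billing.charge": [0.0, 0.2, 0.9],
    "api.handle_request": [0.3, 0.8, 0.2],
}
edges = [
    ("auth", "CONTAINS", "auth.verify_jwt"),
    ("auth", "CONTAINS", "auth.login"),
    ("auth.login", "CALLS", "auth.verify_jwt"),
    ("api.handle_request", "CALLS", "auth.login"),
    ("api.handle_request", "CALLS", "billing.charge"),
]

seeds = vector_seeds([1.0, 0.2, 0.0], embeddings, k=1)
print(seeds)                                  # ['auth.verify_jwt']
print(sorted(expand(seeds, edges, hop=1)))    # ['auth', 'auth.login', 'auth.verify_jwt']
```

With hop=2 the expansion also reaches api.handle_request, which never matched the query text but calls into the seed's neighborhood.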

Structure is treated as ground truth; the embeddings are strictly an acceleration layer. When the graph and the vector index disagree, the graph wins. This is why fan-in lookups are accurate even for same-named symbols across modules — RESOLVES_TO edges bridge call sites through their import aliases, and callers() does a two-phase reverse traversal that grep simply cannot replicate.
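To make the alias-bridging idea concrete, here is a small sketch of a two-phase reverse lookup. The edge layout is an assumption made for the example, not the documented schema: a call made through an imported name is modeled as a CALLS edge to an alias node, and a RESOLVES_TO edge links that alias node to the real definition. The node id format ("api::check") and the function below are hypothetical; see docs/CHEATSHEET.md for the actual edge semantics.

```python
def callers(target: str, edges: list[tuple[str, str, str]]) -> list[str]:
    """Return every node that calls `target`, directly or through an import alias."""
    # Phase 1: direct callers, i.e. reverse CALLS edges that end at the definition.
    direct = {src for src, rel, dst in edges if rel == "CALLS" and dst == target}
    # Phase 2: alias nodes that resolve to the definition, then their callers.
    aliases = {src for src, rel, dst in edges if rel == "RESOLVES_TO" and dst == target}
    via_alias = {src for src, rel, dst in edges if rel == "CALLS" and dst in aliases}
    return sorted(direct | via_alias)


# Hypothetical graph. In api.py: `from auth import verify_jwt as check`.
edges = [
    ("auth.login", "CALLS", "auth.verify_jwt"),          # direct call
    ("api.handle_request", "CALLS", "api::check"),       # call through the alias
    ("api::check", "RESOLVES_TO", "auth.verify_jwt"),    # alias -> definition
    ("billing.charge", "CALLS", "billing.verify_jwt"),   # same name, different function
]

print(callers("auth.verify_jwt", edges))  # ['api.handle_request', 'auth.login']
```

A text search for "verify_jwt" would miss api.handle_request (it only ever writes check) and would wrongly include billing.charge, which is the gap the paragraph above describes.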

The graph is built around four node kinds (module, class, function, method) and five edge relations. Schema and edge semantics are documented in docs/CHEATSHEET.md.
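For a feel of how such a graph can be derived, the following is a minimal sketch using only the standard library's ast and sqlite3 modules. It is an illustration, not PyCodeKG's visitor or schema: the single edges table, the qualified-name format, and the helper names extract_edges and store are assumptions. It derives CONTAINS, CALLS, IMPORTS, and INHERITS edges from one module; RESOLVES_TO is left out because it needs cross-module name resolution.

```python
import ast
import sqlite3

SOURCE = '''
import json
from auth import verify_jwt as check

class Base:
    pass

class Handler(Base):
    def handle(self, token):
        return check(token)

def main():
    Handler().handle("t")
'''


def extract_edges(module_name: str, source: str) -> list[tuple[str, str, str]]:
    """Walk the AST once and emit (src, relation, dst) triples."""
    edges: list[tuple[str, str, str]] = []

    def walk(node: ast.AST, owner: str) -> None:
        for child in ast.iter_child_nodes(node):
            scope = owner
            if isinstance(child, ast.Import):
                edges.extend((module_name, "IMPORTS", alias.name) for alias in child.names)
            elif isinstance(child, ast.ImportFrom):
                edges.append((module_name, "IMPORTS", child.module or "."))
            elif isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                scope = f"{owner}.{child.name}"  # nested definitions get a dotted name
                edges.append((owner, "CONTAINS", scope))
                if isinstance(child, ast.ClassDef):
                    for base in child.bases:
                        edges.append((scope, "INHERITS", ast.unparse(base)))
            elif isinstance(child, ast.Call):
                # The callee is recorded as written in the source (unresolved).
                edges.append((owner, "CALLS", ast.unparse(child.func)))
            walk(child, scope)

    walk(ast.parse(source), module_name)
    return edges


def store(edges: list[tuple[str, str, str]]) -> sqlite3.Connection:
    """Persist the triples in an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT, rel TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", edges)
    conn.commit()
    return conn


conn = store(extract_edges("app", SOURCE))
for rel, count in conn.execute("SELECT rel, COUNT(*) FROM edges GROUP BY rel ORDER BY rel"):
    print(rel, count)
# CALLS 3
# CONTAINS 4
# IMPORTS 2
# INHERITS 1
```

Because the edges come straight from the syntax tree, the same input always yields the same rows, which is the deterministic property the rest of this document leans on.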


What you can actually do with it

| If you want to…                               | Reach for                            | Detail                     |
|-----------------------------------------------|--------------------------------------|----------------------------|
| Get a thorough architectural report           | pycodekg analyze                     | docs/Analyze.md            |
| Generate a coherent architecture description  | pycodekg architecture                | docs/Architecture_usage.md |
| Track metrics across releases                 | pycodekg snapshot save / list / diff | docs/SNAPSHOTS.md          |
| Identify the most structurally important code | pycodekg centrality (SIR PageRank)   | docs/CODERANK.md           |
| Pull source-grounded context for an LLM       | pycodekg pack "..." --format md      | docs/CHEATSHEET.md         |
| Run a hybrid semantic + structural query      | pycodekg query "..."                 | docs/CHEATSHEET.md         |
| Browse the graph interactively                | pycodekg viz (Streamlit)             | docs/INSTALLATION.md       |
| See call graphs in 3-D                        | pycodekg viz3d --layout funnel       | docs/VIZ3D.md              |
| Wire it into Claude / Copilot / Cline         | pycodekg mcp                         | docs/MCP.md                |

If you only read one doc after this one, read docs/Analyze.md — that's where most of the day-to-day value lives.


Architecture

src/pycode_kg/
├── visitor.py                       # AST extraction (three-pass: structure, calls, dataflow)
├── graph.py                         # GraphBuilder: file discovery + dispatch
├── store.py                         # SQLite persistence + canonical edges
├── index.py                         # LanceDB semantic index
├── pycodekg.py                      # Public façade
├── pycodekg_query.py                # Hybrid query
├── pycodekg_snippet_packer.py       # Source-grounded packs
├── pycodekg_thorough_analysis.py    # `analyze` engine
├── architecture.py                  # `architecture` description generator
├── ranking/                         # PageRank, bridge centrality, framework nodes
├── snapshots.py                     # Temporal metric snapshots
├── analysis/                        # Coupling, cycles, orphans, hotspots
├── cli/                             # All `pycodekg-*` entry points
├── mcp_server.py                    # MCP server (nineteen tools)
├── app.py                           # Streamlit web app
├── viz3d.py / layout3d.py           # PyVista/PyQt5 3-D viewer
└── viz3d_timeline.py                # Metric history timeline

The MCP server, the CLI, and the Streamlit app are thin wrappers over the same store + index + ranking core — there is exactly one code path for each capability. The latest architectural deep-dive is in docs/analysis_v0.19.0.md, produced (of course) by pycodekg analyze against this very repo.


Documentation map

| Doc                        | What it covers |
|----------------------------|----------------|
| docs/INSTALLATION.md       | All install variants, MCP setup, contributor setup, troubleshooting |
| docs/Analyze.md            | The analyze command: every metric, every flag, interpretation guide |
| docs/Architecture_usage.md | Generating coherent architecture descriptions |
| docs/SNAPSHOTS.md          | Temporal metric snapshots, diffing across releases |
| docs/CODERANK.md           | SIR PageRank, bridge centrality, framework hubs |
| docs/MCP.md                | MCP server setup for Claude / Kilo / Copilot / Cline, tool reference |
| docs/CHEATSHEET.md         | Every CLI flag and every MCP tool, on one page |
| docs/VIZ3D.md              | The 3-D PyVista viewer and layouts |
| CHANGELOG.md               | Release history |

Citation

If you use PyCodeKG in your research or project, please cite it:

Suchanek, E. G. (2026). PyCodeKG: A Knowledge Graph for Python Codebases (Version 0.19.0) [Software]. Flux-Frontiers. https://doi.org/10.5281/zenodo.19834777

@software{suchanek_pycode_kg,
  author    = {Suchanek, Eric G.},
  title     = {{PyCodeKG}: A Knowledge Graph for Python Codebases},
  version   = {0.19.0},
  year      = {2026},
  publisher = {Flux-Frontiers},
  url       = {https://github.com/Flux-Frontiers/pycode_kg},
  doi       = {10.5281/zenodo.19834777},
}

License

Elastic License 2.0 — free for non-commercial and internal use; commercial redistribution or hosting requires a license from Flux-Frontiers.


Support & acknowledgments

  • Issues: GitHub Issues
  • Sister projects DocKG and MetaboKG
  • LanceDB, sentence-transformers, PyVista, Streamlit, and FastMCP for the foundations

Built for Python developers and AI agents that work alongside them — egs · Last updated May 2026
