Python library & AI coding-assistant skill: turn folders of code, docs, papers, images, or video into a queryable knowledge graph (Claude Code, Codex, OpenCode, Cursor, and more).

stratum

An AI coding assistant skill. Type /stratum in Claude Code, Codex, OpenCode, Cursor, Gemini CLI, GitHub Copilot CLI, Aider, OpenClaw, Factory Droid, or Trae - it reads your files, builds a knowledge graph, and gives you back structure you didn't know was there. Understand a codebase faster. Find the "why" behind architectural decisions.

Fully multimodal. Drop in code, PDFs, markdown, screenshots, diagrams, whiteboard photos, images in other languages, or video and audio files - stratum extracts concepts and relationships from all of it and connects them into one graph. Videos are transcribed with Whisper using a domain-aware prompt derived from your corpus. 20 languages supported via tree-sitter AST (Python, JS, TS, Go, Rust, Java, C, C++, Ruby, C#, Kotlin, Scala, PHP, Swift, Lua, Zig, PowerShell, Elixir, Objective-C, Julia).

Andrej Karpathy keeps a /raw folder where he drops papers, tweets, screenshots, and notes. stratum is built for that kind of folder: it turns the pile into a graph that costs 71.5x fewer tokens per query than reading the raw files, persists across sessions, and is honest about what it found versus what it guessed.

/stratum .                        # works on any folder - your codebase, notes, papers, anything

stratum-out/
├── graph.html       interactive graph - click nodes, search, filter by community
├── GRAPH_REPORT.md  god nodes, surprising connections, suggested questions
├── graph.json       persistent graph - query weeks later without re-reading
└── cache/           SHA256 cache - re-runs only process changed files
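
As a rough illustration of how a SHA256 cache lets re-runs skip unchanged files, here is a minimal sketch. The manifest file and the `changed_files` helper are assumptions made for this example; stratum's actual cache layout under stratum-out/cache/ may differ.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(root: Path, manifest_path: Path) -> list:
    """Return files under root whose hash differs from the stored manifest,
    then rewrite the manifest with the current hashes."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new, changed = {}, []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.resolve() == manifest_path.resolve():
            continue  # never hash the manifest itself
        rel = path.relative_to(root).as_posix()
        new[rel] = file_sha256(path)
        if old.get(rel) != new[rel]:
            changed.append(path)  # new file or changed content
    manifest_path.write_text(json.dumps(new, indent=2))
    return changed
```

On a first run every file counts as changed; on an immediate second run the list is empty, so nothing needs re-extraction.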

Add a .stratumignore file to exclude folders you don't want in the graph:

# .stratumignore
vendor/
node_modules/
dist/
*.generated.py

Same syntax as .gitignore. Patterns match against file paths relative to the folder you run stratum on.
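
To make the matching rule concrete, here is a deliberately simplified sketch of how patterns like the ones above could be applied to relative paths. It only handles `dir/` entries and simple globs, not the full .gitignore grammar (no negation, anchoring, or `**`), and the helper names are hypothetical rather than stratum's own.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def load_patterns(text: str) -> list:
    """Drop blank lines and # comments from an ignore file's text."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(rel_path: str, patterns: list) -> bool:
    """Simplified matcher: 'name/' directory entries and glob patterns only."""
    path = PurePosixPath(rel_path)
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore anything below a folder with this name.
            if pattern.rstrip("/") in path.parts[:-1]:
                return True
        elif fnmatch(path.name, pattern) or fnmatch(rel_path, pattern):
            return True
    return False


patterns = load_patterns("# .stratumignore\nvendor/\nnode_modules/\ndist/\n*.generated.py\n")
print(is_ignored("vendor/lib/util.py", patterns))       # True
print(is_ignored("src/models.generated.py", patterns))  # True
print(is_ignored("src/models.py", patterns))            # False
```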

How it works

stratum runs in three passes. First, a deterministic AST pass extracts structure from code files (classes, functions, imports, call graphs, docstrings, rationale comments) with no LLM needed. Second, video and audio files are transcribed locally with faster-whisper using a domain-aware prompt derived from corpus god nodes — transcripts are cached so re-runs are instant. Third, Claude subagents run in parallel over docs, papers, images, and transcripts to extract concepts, relationships, and design rationale. The results are merged into a NetworkX graph, clustered with Leiden community detection, and exported as interactive HTML, queryable JSON, and a plain-language audit report.
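
The first pass can be pictured with a small sketch. It uses Python's standard `ast` module in place of tree-sitter, so it only covers Python source and is not stratum's extractor; the `extract_structure` name and the returned dictionary shape are assumptions made for illustration.

```python
import ast


def extract_structure(source: str) -> dict:
    """Collect classes, functions, imports, and directly called names."""
    found = {"classes": [], "functions": [], "imports": [], "calls": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            found["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            found["imports"].append(node.module or ".")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only plain-name calls such as parse(...); attribute calls are skipped.
            found["calls"].append(node.func.id)
    return found


SAMPLE = '''
import json
from pathlib import Path

class Loader:
    def load(self, path):
        return parse(Path(path).read_text())

def parse(text):
    return json.loads(text)
'''

structure = extract_structure(SAMPLE)
print(sorted(structure["functions"]))  # ['load', 'parse']
```

Because this step is purely syntactic, it is deterministic and needs no model call, which is the property the first pass relies on.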

Clustering is graph-topology-based — no embeddings. Leiden finds communities by edge density. The semantic similarity edges that Claude extracts (semantically_similar_to, marked INFERRED) are already in the graph, so they influence community detection directly. The graph structure is the similarity signal — no separate embedding step or vector database needed.

Every relationship is tagged EXTRACTED (found directly in source), INFERRED (reasonable inference, with a confidence score), or AMBIGUOUS (flagged for review). You always know what was found vs guessed.
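
One way to model these tags is sketched below. The `Edge` class, its field layout, and the `trusted` helper are hypothetical and do not reflect the schema stratum writes to graph.json; only the three tag names, the 0.0-1.0 score range, and the rule that EXTRACTED edges score 1.0 come from this document. The "calls" relation name is also an assumption.

```python
from dataclasses import dataclass

TAGS = {"EXTRACTED", "INFERRED", "AMBIGUOUS"}


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str
    tag: str
    confidence_score: float = 1.0

    def __post_init__(self) -> None:
        if self.tag not in TAGS:
            raise ValueError(f"unknown tag: {self.tag}")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be within 0.0-1.0")
        if self.tag == "EXTRACTED" and self.confidence_score != 1.0:
            raise ValueError("EXTRACTED edges always carry a score of 1.0")


def trusted(edges: list, min_confidence: float = 0.8) -> list:
    """Keep EXTRACTED edges plus INFERRED edges at or above the threshold."""
    return [
        e for e in edges
        if e.tag == "EXTRACTED"
        or (e.tag == "INFERRED" and e.confidence_score >= min_confidence)
    ]
```

Filtering like this lets a consumer decide how much guessed structure to accept, while AMBIGUOUS edges stay out until someone reviews them.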

Install

Requires: Python 3.10+ and one of: Claude Code, Codex, OpenCode, Cursor, Gemini CLI, GitHub Copilot CLI, Aider, OpenClaw, Factory Droid, or Trae

pip install stratum-graph && stratum install

PyPI: The distribution is stratum-graph (pip install stratum-graph). The Python import package and the CLI remain stratum after install (stratum install, /stratum, …). The canonical repo is Abhijeetsingh610/stratum. The package named stratum on PyPI is a different, unrelated project.

Platform support

Platform	Install command
Claude Code (Linux/Mac)	`stratum install`
Claude Code (Windows)	`stratum install` (auto-detected) or `stratum install --platform windows`
Codex	`stratum install --platform codex`
OpenCode	`stratum install --platform opencode`
GitHub Copilot CLI	`stratum install --platform copilot`
Aider	`stratum install --platform aider`
OpenClaw	`stratum install --platform claw`
Factory Droid	`stratum install --platform droid`
Trae	`stratum install --platform trae`
Trae CN	`stratum install --platform trae-cn`
Gemini CLI	`stratum install --platform gemini`
Cursor	`stratum cursor install`

Codex users also need multi_agent = true under [features] in ~/.codex/config.toml for parallel extraction. Factory Droid uses the Task tool for parallel subagent dispatch. OpenClaw and Aider use sequential extraction (parallel agent support is still early on those platforms). Trae uses the Agent tool for parallel subagent dispatch and does not support PreToolUse hooks — AGENTS.md is the always-on mechanism.

Then open your AI coding assistant and type:

/stratum .

Note: Codex uses $ instead of / for skill calling, so type $stratum . instead.

Make your assistant always use the graph (recommended)

After building a graph, run this once in your project:

Platform	Command
Claude Code	`stratum claude install`
Codex	`stratum codex install`
OpenCode	`stratum opencode install`
GitHub Copilot CLI	`stratum copilot install`
Aider	`stratum aider install`
OpenClaw	`stratum claw install`
Factory Droid	`stratum droid install`
Trae	`stratum trae install`
Trae CN	`stratum trae-cn install`
Cursor	`stratum cursor install`
Gemini CLI	`stratum gemini install`

Claude Code does two things: writes a CLAUDE.md section telling Claude to read stratum-out/GRAPH_REPORT.md before answering architecture questions, and installs a PreToolUse hook (settings.json) that fires before every Glob and Grep call. If a knowledge graph exists, Claude sees: "stratum: Knowledge graph exists. Read GRAPH_REPORT.md for god nodes and community structure before searching raw files." — so Claude navigates via the graph instead of grepping through every file.

Codex writes to AGENTS.md and also installs a PreToolUse hook in .codex/hooks.json that fires before every Bash tool call — same always-on mechanism as Claude Code.

OpenCode writes to AGENTS.md and also installs a tool.execute.before plugin (.opencode/plugins/stratum.js + opencode.json registration) that fires before bash tool calls and injects the graph reminder into tool output when the graph exists.

Cursor writes .cursor/rules/stratum.mdc with alwaysApply: true — Cursor includes it in every conversation automatically, no hook needed.

Gemini CLI copies the skill to ~/.gemini/skills/stratum/SKILL.md, writes a GEMINI.md section, and installs a BeforeTool hook in .gemini/settings.json that fires before file-read tool calls — same always-on mechanism as Claude Code.

Aider, OpenClaw, Factory Droid, and Trae write the same rules to AGENTS.md in your project root. These platforms don't support tool hooks, so AGENTS.md is the always-on mechanism.

GitHub Copilot CLI copies the skill to ~/.copilot/skills/stratum/SKILL.md. Run stratum copilot install to set it up.

Uninstall with the matching uninstall command (e.g. stratum claude uninstall).

Always-on vs explicit trigger — what's the difference?

The always-on hook surfaces GRAPH_REPORT.md — a one-page summary of god nodes, communities, and surprising connections. Your assistant reads this before searching files, so it navigates by structure instead of keyword matching. That covers most everyday questions.

/stratum query, /stratum path, and /stratum explain go deeper: they traverse the raw graph.json hop by hop, trace exact paths between nodes, and surface edge-level detail (relation type, confidence score, source location). Use them when you want a specific question answered from the graph rather than a general orientation.

Think of it this way: the always-on hook gives your assistant a map. The /stratum commands let it navigate the map precisely.
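
The hop-by-hop traversal can be sketched as a depth-limited breadth-first search. This is an illustration under assumptions, not stratum's query engine: the adjacency-dictionary input and the `bfs_subgraph` name are hypothetical, and the example edges between the httpx class names are invented.

```python
from collections import deque


def bfs_subgraph(adjacency: dict, start: str, max_depth: int) -> dict:
    """Return {node: hop distance} for every node within max_depth hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth


# Hypothetical edges, for illustration only.
adjacency = {
    "DigestAuth": ["Client"],
    "Client": ["HTTPTransport"],
    "HTTPTransport": ["ConnectionPool"],
}
print(bfs_subgraph(adjacency, "DigestAuth", 2))
# {'DigestAuth': 0, 'Client': 1, 'HTTPTransport': 2}
```

Capping the depth is what keeps the returned subgraph small enough to hand to a model.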

Using `graph.json` with an LLM

graph.json is not meant to be pasted into a prompt all at once. The useful workflow is:

1. Start with stratum-out/GRAPH_REPORT.md for the high-level overview.
2. Use stratum query to pull a smaller subgraph for the specific question you want to answer.
3. Give that focused output to your assistant instead of dumping the full raw corpus.

For example, after running stratum on a project:

stratum query "show the auth flow" --graph stratum-out/graph.json
stratum query "what connects DigestAuth to Response?" --graph stratum-out/graph.json

The output includes node labels, edge types, confidence tags, source files, and source locations. That makes it a good intermediate context block for an LLM:

Use this graph query output to answer the question. Prefer the graph structure
over guessing, and cite the source files when possible.
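
If you script this workflow, it might look like the sketch below. `run_query` shells out to the `stratum query ... --graph` command shown above and assumes the stratum CLI is on your PATH; the helper names and the exact prompt layout are assumptions, not part of stratum.

```python
import subprocess

PROMPT_HEADER = (
    "Use this graph query output to answer the question. Prefer the graph structure\n"
    "over guessing, and cite the source files when possible."
)


def run_query(question: str, graph_path: str = "stratum-out/graph.json") -> str:
    """Run `stratum query` and return its text output (requires the CLI)."""
    result = subprocess.run(
        ["stratum", "query", question, "--graph", graph_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_prompt(question: str, query_output: str) -> str:
    """Combine the instruction, the focused subgraph text, and the question."""
    return (
        f"{PROMPT_HEADER}\n\n"
        f"Graph query output:\n{query_output.strip()}\n\n"
        f"Question: {question.strip()}\n"
    )
```

A call such as `build_prompt(q, run_query(q))` then yields one context block to paste or send to your assistant.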

If your assistant supports tool calling or MCP, use the graph directly instead of pasting text. stratum can expose graph.json as an MCP server:

python -m stratum.serve stratum-out/graph.json

That gives the assistant structured graph access through tools such as query_graph, get_node, get_neighbors, and shortest_path, which suits repeated queries.
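
To give a feel for what a tool like shortest_path computes, here is a standalone sketch over a node-link JSON document. The schema (a "nodes" list with "id" fields and a "links" or "edges" list with "source" and "target") is an assumption borrowed from the common NetworkX node-link layout; stratum's graph.json and its MCP tool signatures may differ.

```python
import json
from collections import deque
from typing import Optional


def load_adjacency(graph_json: str) -> dict:
    """Build an undirected adjacency map from node-link JSON (assumed schema)."""
    data = json.loads(graph_json)
    adjacency = {node["id"]: set() for node in data.get("nodes", [])}
    for edge in data.get("links") or data.get("edges", []):
        adjacency.setdefault(edge["source"], set()).add(edge["target"])
        adjacency.setdefault(edge["target"], set()).add(edge["source"])
    return adjacency


def shortest_path(adjacency: dict, start: str, goal: str) -> Optional[list]:
    """Breadth-first search; returns the node list or None if unreachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk parents back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None
```

Neighbors are visited in sorted order so that ties between equally short paths resolve the same way on every run.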

Manual install (curl)

mkdir -p ~/.claude/skills/stratum
curl -fsSL https://raw.githubusercontent.com/Abhijeetsingh610/stratum/v4/stratum/skill.md \
  > ~/.claude/skills/stratum/SKILL.md

Add to ~/.claude/CLAUDE.md:

- **stratum** (`~/.claude/skills/stratum/SKILL.md`) - any input to knowledge graph. Trigger: `/stratum`
When the user types `/stratum`, invoke the Skill tool with `skill: "stratum"` before doing anything else.

Usage

/stratum                          # run on current directory
/stratum ./raw                    # run on a specific folder
/stratum ./raw --mode deep        # more aggressive INFERRED edge extraction
/stratum ./raw --update           # re-extract only changed files, merge into existing graph
/stratum ./raw --directed         # build directed graph (preserves edge direction: source→target)
/stratum ./raw --cluster-only     # rerun clustering on existing graph, no re-extraction
/stratum ./raw --no-viz           # skip HTML, just produce report + JSON
/stratum ./raw --obsidian                          # also generate Obsidian vault (opt-in)
/stratum ./raw --obsidian --obsidian-dir ~/vaults/myproject  # write vault to a specific directory

/stratum add https://arxiv.org/abs/1706.03762        # fetch a paper, save, update graph
/stratum add https://x.com/karpathy/status/...       # fetch a tweet
/stratum add <video-url>                              # download audio, transcribe, add to graph
/stratum add https://... --author "Name"             # tag the original author
/stratum add https://... --contributor "Name"        # tag who added it to the corpus

/stratum query "what connects attention to the optimizer?"
/stratum query "what connects attention to the optimizer?" --dfs   # trace a specific path
/stratum query "what connects attention to the optimizer?" --budget 1500  # cap at N tokens
/stratum path "DigestAuth" "Response"
/stratum explain "SwinTransformer"

/stratum ./raw --watch            # auto-sync via skill (background)
stratum watch .                   # same idea from the terminal: auto-rebuild on code saves (needs pip install stratum-graph[watch])
/stratum ./raw --wiki             # build agent-crawlable wiki (index.md + article per community)
/stratum ./raw --svg              # export graph.svg
/stratum ./raw --graphml          # export graph.graphml (Gephi, yEd)
/stratum ./raw --neo4j            # generate cypher.txt for Neo4j
/stratum ./raw --neo4j-push bolt://localhost:7687    # push directly to a running Neo4j instance
/stratum ./raw --mcp              # start MCP stdio server

# git hooks - platform-agnostic, rebuild graph on commit and branch switch
stratum hook install
stratum hook uninstall
stratum hook status

# always-on assistant instructions - platform-specific
stratum claude install            # CLAUDE.md + PreToolUse hook (Claude Code)
stratum claude uninstall
stratum codex install             # AGENTS.md + PreToolUse hook (Codex)
stratum opencode install          # AGENTS.md + tool.execute.before plugin (OpenCode)
stratum cursor install            # .cursor/rules/stratum.mdc (Cursor)
stratum cursor uninstall
stratum gemini install            # GEMINI.md + BeforeTool hook (Gemini CLI)
stratum gemini uninstall
stratum copilot install           # skill file (GitHub Copilot CLI)
stratum copilot uninstall
stratum aider install             # AGENTS.md (Aider)
stratum aider uninstall
stratum claw install              # AGENTS.md (OpenClaw)
stratum droid install             # AGENTS.md (Factory Droid)
stratum trae install              # AGENTS.md (Trae)
stratum trae uninstall
stratum trae-cn install           # AGENTS.md (Trae CN)
stratum trae-cn uninstall

# query the graph directly from the terminal (no AI assistant needed)
stratum query "what connects attention to the optimizer?"
stratum query "show the auth flow" --dfs
stratum query "what is CfgNode?" --budget 500
stratum query "..." --graph path/to/graph.json

Works with any mix of file types:

Type	Extensions	Extraction
Code	`.py .ts .js .jsx .tsx .go .rs .java .c .cpp .rb .cs .kt .scala .php .swift .lua .zig .ps1 .ex .exs .m .mm .jl`	AST via tree-sitter + call-graph + docstring/comment rationale
Docs	`.md .txt .rst`	Concepts + relationships + design rationale via Claude
Office	`.docx .xlsx`	Converted to markdown then extracted via Claude (requires `pip install stratum-graph[office]`)
Papers	`.pdf`	Citation mining + concept extraction
Images	`.png .jpg .webp .gif`	Claude vision - screenshots, diagrams, any language
Video / Audio	`.mp4 .mov .mkv .webm .avi .m4v .mp3 .wav .m4a .ogg`	Transcribed locally with faster-whisper, transcript fed into Claude extraction (requires `pip install stratum-graph[video]`)
YouTube / URLs	any video URL	Audio downloaded via yt-dlp, then same Whisper pipeline (requires `pip install stratum-graph[video]`)
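
The table above amounts to a dispatch on file extension. A minimal sketch of that routing follows; the category labels, the `classify` helper, and the case-insensitive matching are assumptions made for this example, while the extension lists are copied from the table.

```python
from pathlib import PurePath
from typing import Optional

EXTENSIONS = {
    "code": ".py .ts .js .jsx .tsx .go .rs .java .c .cpp .rb .cs .kt .scala "
            ".php .swift .lua .zig .ps1 .ex .exs .m .mm .jl",
    "docs": ".md .txt .rst",
    "office": ".docx .xlsx",
    "papers": ".pdf",
    "images": ".png .jpg .webp .gif",
    "media": ".mp4 .mov .mkv .webm .avi .m4v .mp3 .wav .m4a .ogg",
}

# Invert the table into a lookup from extension to category.
KIND_BY_EXT = {
    ext: kind for kind, exts in EXTENSIONS.items() for ext in exts.split()
}


def classify(path: str) -> Optional[str]:
    """Return the category for a path, or None if the extension is not listed."""
    return KIND_BY_EXT.get(PurePath(path).suffix.lower())


print(classify("src/app.py"))   # code
print(classify("talk.mp4"))     # media
print(classify("data.csv"))     # None
```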

Video and audio corpus

Drop video or audio files into your corpus folder alongside your code and docs — stratum picks them up automatically:

pip install 'stratum-graph[video]'   # one-time setup
/stratum ./my-corpus            # transcribes any video/audio files it finds

Add a YouTube video (or any public video URL) directly:

/stratum add <video-url>

yt-dlp downloads audio-only (fast, small), Whisper transcribes it locally, and the transcript is fed into the same extraction pipeline as your other docs. Transcripts are cached in stratum-out/transcripts/ so re-runs skip already-transcribed files.

For better accuracy on technical content, use a larger model:

/stratum ./my-corpus --whisper-model medium

Audio never leaves your machine. All transcription runs locally.

What you get

God nodes - highest-degree concepts (what everything connects through)

Surprising connections - ranked by composite score. Code-paper edges rank higher than code-code. Each result includes a plain-English why.

Suggested questions - 4-5 questions the graph is uniquely positioned to answer

The "why" - docstrings, inline comments (# NOTE:, # IMPORTANT:, # HACK:, # WHY:), and design rationale from docs are extracted as rationale_for nodes. Not just what the code does - why it was written that way.

Confidence scores - every INFERRED edge has a confidence_score (0.0-1.0). You know not just what was guessed but how confident the model was. EXTRACTED edges are always 1.0.

Semantic similarity edges - cross-file conceptual links with no structural connection. Two functions solving the same problem without calling each other, a class in code and a concept in a paper describing the same algorithm.

Hyperedges - group relationships connecting 3+ nodes that pairwise edges can't express. All classes implementing a shared protocol, all functions in an auth flow, all concepts from a paper section forming one idea.

Token benchmark - printed automatically after every run. On a mixed corpus (Karpathy repos + papers + images): 71.5x fewer tokens per query vs reading raw files. The first run extracts and builds the graph (this costs tokens). Every subsequent query reads the compact graph instead of raw files — that's where the savings compound. The SHA256 cache means re-runs only re-process changed files.

Auto-sync - In the assistant, /stratum <path> --watch runs the skill's watcher. From the shell, stratum watch . (after pip install 'stratum-graph[watch]') watches the directory and rebuilds on code saves (AST only). Changes to docs or images still need a full /stratum --update pass for LLM extraction.

Git hooks (stratum hook install) - installs post-commit and post-checkout hooks. Graph rebuilds automatically after every commit and every branch switch. If a rebuild fails, the hook exits with a non-zero code so git surfaces the error instead of silently continuing. No background process needed.

Wiki (--wiki) - Wikipedia-style markdown articles per community and god node, with an index.md entry point. Point any agent at index.md and it can navigate the knowledge base by reading files instead of parsing JSON.

Benchmarks

Why this matters

An AI coding assistant that doesn't have a graph has to either (a) dump the entire codebase into the context window every session, or (b) re-discover the structure via grep/read on every question. Both are expensive and lossy. With stratum, the LLM queries a compact, typed graph and gets back only the subgraph relevant to the question — the same knowledge, a fraction of the tokens.

Headline numbers

Corpus	Files	Naive baseline	Per-query cost	Reduction
Karpathy repos + 5 papers + 4 images	52	see note A	subgraph BFS depth=3	71.5x
`encode/httpx` (real open-source repo)	23	36,632 tokens (real word count)	3,082 tokens	11.9x
`encode/httpx` (char-based token est.)	23	71,100 tokens (bytes ÷ 4)	3,082 tokens	23.1x
stratum source + Transformer paper	4	see note A	subgraph BFS depth=3	5.4x

Note A — methodology: the built-in stratum benchmark command estimates the corpus size from the graph itself (nodes × 50 words, matching the industry-standard reference implementation). This is the number reported in the auto-printed summary after every build. For the httpx row we also measured against the actual raw source — both numbers are valid; they answer different questions. The synthetic number is reproducible on any graph; the real-corpus number is reproducible when you have the original files.
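
The arithmetic behind the httpx rows can be reproduced with a few lines. The function names are hypothetical; the synthetic estimate follows the nodes × 50 words rule from Note A, and how the built-in benchmark converts words to tokens is not specified here, so that step is left out.

```python
def synthetic_corpus_words(node_count: int, words_per_node: int = 50) -> int:
    """Estimate corpus size from the graph itself (Note A: nodes x 50 words)."""
    return node_count * words_per_node


def reduction(baseline_tokens: float, per_query_tokens: float) -> float:
    """Ratio of the naive baseline cost to the per-query cost."""
    if per_query_tokens <= 0:
        raise ValueError("per_query_tokens must be positive")
    return baseline_tokens / per_query_tokens


# httpx rows from the table above
print(round(reduction(36_632, 3_082), 1))  # 11.9 (real word count baseline)
print(round(reduction(71_100, 3_082), 1))  # 23.1 (bytes / 4 baseline)
```

The two httpx figures differ only in the baseline estimate; the per-query cost is the same 3,082 tokens in both rows.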

Run it yourself:

git clone --depth 1 https://github.com/encode/httpx.git
cd httpx
# Build a graph first (e.g. `/stratum .` in your assistant, optional --wiki)
stratum benchmark stratum-out/graph.json

Head-to-head vs graphify

Both tools started from the same fork point and share the query engine, so single-shot query output is identical (29,170 vs 29,171 bytes across 5 test questions on the same httpx graph). Where they diverge is what happens around the query:

Dimension	graphify	stratum	Measured on
Markdown files produced	12	113 (9.4× more)	httpx (144-node graph)
Per-entity wiki pages	0	99	one per real symbol
`synthesis.md`, `log.md`, `<!-- manual -->` blocks	❌	✓	compounding knowledge
Warm rebuild on unchanged corpus	rewrites all pages (7.6 ms)	0 disk writes (48 ms)	`.render_cache.json`
1-node change rebuild	rewrites all pages	rewrites only affected pages	fingerprint invalidation
MCP tools for agents	7	15	includes `file_answer`, `evolve_synthesis`, `lint_wiki`, `get_contradictions`
Embedding fallback when keywords miss	❌	✓	`[embed]` extra (`pip install stratum-graph[embed]`)
Karpathy gist coverage (3 layers + 4 ops)	~60%	~95%	see `PLAN.md`

What the 71.5x actually means in practice

Suppose an LLM agent is asked "how does httpx handle digest auth and how does it interact with the connection pool?":

Without the graph — the agent has to read _auth.py (12 KB), _client.py (66 KB), _transports/default.py (14 KB), and trace imports into _config.py (9 KB). That's ~103 KB (~25,800 tokens) of raw source dumped into context, and it still has to parse the structure itself.
With stratum — the agent calls query_graph("digest auth connection pool"), gets back a BFS subgraph with 12–15 relevant nodes (DigestAuth, Client, HTTPTransport, ConnectionPool, plus the typed edges between them), then can pull the specific entity pages for the ones it needs. ~3,000 tokens. The relationships come pre-computed — the LLM doesn't re-derive them every query.

That's ~8.6× on a single multi-file question, 11.9× averaged across a question set on httpx, and 71.5× on the Karpathy corpus with docs + papers + images mixed in. The real ergonomic win: agents that would otherwise exhaust their context in three turns now stay productive across a full debugging session.

Worked examples

Corpus	Files	Reduction	Output
Karpathy repos + 5 papers + 4 images	52	71.5x	`worked/karpathy-repos/`
stratum source + Transformer paper	4	5.4x	`worked/mixed-corpus/`
`encode/httpx` (this benchmark)	23	11.9x (real words) / 23.1x (chars)	`bench/`
httpx (synthetic Python library)	6	~1x	`worked/httpx/`

Token reduction scales with corpus size. Six files fit in a context window anyway, so the graph's value there is structural clarity, not compression. At 52 files (code + papers + images) you get 71x+. Each worked/ folder has the raw input files and the actual output (GRAPH_REPORT.md, graph.json) so you can run it yourself and verify the numbers.

Privacy

stratum sends file contents to your AI coding assistant's underlying model API for semantic extraction of docs, papers, and images — Anthropic (Claude Code), OpenAI (Codex), or whichever provider your platform uses. Code files are processed locally via tree-sitter AST — no file contents leave your machine for code. Video and audio files are transcribed locally with faster-whisper — audio never leaves your machine. No telemetry, usage tracking, or analytics of any kind. The only network calls are to your platform's model API during extraction, using your own API key.

Tech stack

NetworkX + Leiden (graspologic) + tree-sitter + vis.js. Semantic extraction via Claude (Claude Code), GPT-4 (Codex), or whichever model your platform runs. No Neo4j and no server required: the graph is built and stored locally, and the only remote calls are the model API requests described under Privacy.

What we are building next

stratum is the graph layer. We are building Penpax on top of it — an on-device digital twin that connects your meetings, browser history, files, emails, and code into one continuously updating knowledge graph. No cloud, no training on your data. Join the waitlist.

Python library & PyPI

This repo is a normal Python package. Install from PyPI:

pip install stratum-graph                # PyPI distribution; provides `import stratum` + `stratum` CLI
pip install "stratum-graph[watch,mcp]"   # optional extras (watchdog, MCP server)

Maintain: build release artifacts

pip install build twine
python -m build                    # dist/stratum_graph-*.tar.gz and .whl
twine check dist/*

Publish (requires a PyPI account and API token):

twine upload dist/*

Use TestPyPI first if you want a dry run: twine upload --repository testpypi dist/*.

Contributing

Worked examples are the most trust-building contribution. Run /stratum on a real corpus, save output to worked/{slug}/, write an honest review.md evaluating what the graph got right and wrong, submit a PR.

Extraction bugs - open an issue with the input file, the cache entry (stratum-out/cache/), and what was missed or invented.

See ARCHITECTURE.md for module responsibilities and how to add a language.

Project details

This version

0.4.3

May 14, 2026

Download files

Source Distribution

stratum_graph-0.4.3.tar.gz (284.4 kB)

Uploaded May 14, 2026 Source

Built Distribution

stratum_graph-0.4.3-py3-none-any.whl (251.4 kB)

Uploaded May 14, 2026 Python 3

File details

Details for the file stratum_graph-0.4.3.tar.gz.

File metadata

Download URL: stratum_graph-0.4.3.tar.gz
Upload date: May 14, 2026
Size: 284.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for stratum_graph-0.4.3.tar.gz
Algorithm	Hash digest
SHA256	`8fe1bc1a1467b56f1d886f84445f8b06b8f4e3804477c6f95cf7923ed16d160d`
MD5	`909decc9d81f5874606feada3f932fe9`
BLAKE2b-256	`41c393e00dd1bb89a05788b7660d20e2215909de3535897429ad6c64a2a169c4`

File details

Details for the file stratum_graph-0.4.3-py3-none-any.whl.

File metadata

Download URL: stratum_graph-0.4.3-py3-none-any.whl
Upload date: May 14, 2026
Size: 251.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for stratum_graph-0.4.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`40ca3819ef48132f4312123910ab7c3030579e7bfa2b29a72c921b35b8232444`
MD5	`c355c9a27170f14400c738c1a43e0c1e`
BLAKE2b-256	`18b9eca728b8d95f52e7461f00cd0bb1b410ee1c4ac215e07712f05895c01593`
