Skip to main content

Generate OKF v0.1 knowledge bundles from codebases — Claude skill + OpenCode integration

Project description

okf-generator

PyPI Downloads Python Tests Last commit MIT Claude Skill PRs Welcome Site

Map any codebase into an interactive knowledge graph — for AI agents, local SLMs, and human architectural review.

Installation · Quick Start · Architecture · Agents · Local AI · CI/CD · Languages · FAQ


Visual Showcase

okf-generator demo

okf generate scans any repo using tree-sitter AST parsers, resolves cross-references across 10 languages, and outputs a structured knowledge graph. Explore it interactively or consume it programmatically — no LLM required.

# Generate a knowledge bundle from any codebase
okf generate ./my_project ./okf_bundle

# Explore as an interactive HTML dashboard
okf visualize ./okf_bundle

# Browse via local HTTP
okf serve ./okf_bundle --open

# Look up any concept in milliseconds
okf lookup WorldBankConnector

Quick Start

# Install
pip install okf-generator

# Generate a bundle from your project
okf generate ./my_project ./okf_bundle

# Look up a concept (zero LLM, instant)
okf lookup WorldBankConnector

# List all dependencies
okf lookup --deps

# Interactive bundle setup wizard
okf init

# Visualize as interactive HTML
okf visualize ./okf_bundle

Installation

# One-liner (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/UmairBaig8/okf-generator/main/scripts/install.sh | bash

# Or via pip
pip install okf-generator                        # core (offline extraction)
pip install "okf-generator[llm]"                  # with LLM enrichment + training pairs

Why — Code-Level Knowledge Graphs

AI coding agents waste enormous amounts of context re-reading entire files to find one function signature or dependency version. Cloud models with 200K token windows mask this cost; local SLMs (Gemma, Llama, Phi) on a MacBook run out of memory immediately.

okf-generator solves this by converting source code into a deterministic, cross-referenced knowledge graph. Using tree-sitter AST parsers across 10 languages, every function, class, module, and dependency becomes a structured node with typed edges (calls, called-by, imports, depends-on).

# Before touching WorldBankConnector, get its full graph context
okf lookup WorldBankConnector
CLASS: WorldBankConnector
Source      : StockAI/RnD/python/connectors/economic_data.py  line 51
Description : Fetches World Bank development indicators via wbdata API.
Methods     : get_indicator, search
Signature   : class WorldBankConnector
Calls       : [wbdata.get_indicator, pandas.DataFrame]
Called-by   : [DataPipeline.fetch_economic]

No re-reading the file. No guessing. No LLM call required.

Before and after comparison


How It Works

okf-generator pipeline

1. Scan — tree-sitter AST parsers extract every function, class, method, and module with signature, params, docstring, and return types across 10 languages.

2. Link — the cross-reference linker resolves two edge types:

  • Imports → Dependencies — module imports matched against the dependency index.
  • Calls → Callees — function call sites resolved to concept IDs.

3. Write — outputs an OKF v0.1 bundle: structured markdown files (one per concept) mirroring the source tree.

4. Consume — 8 commands: lookup, pairs, diff, visualize, mcp, serve, init, summarize.

LLM enrichment is optional, resumable, and works with any OpenAI-compatible endpoint (Claude, Ollama, llama.cpp). Extraction itself is fully deterministic and offline-capable.

Used by / Built for

okf-generator was originally built to index a large, multi-domain codebase (StockAI/TrainLLMs) spanning Python data connectors, ML pipelines, and SQL schemas — the kind of project where giving an agent the whole repo as context is both slow and unaffordable in tokens. If you are working in a sprawling codebase and tired of re-explaining your own code to your AI agent every session, this is the tool that problem was built to solve.


Bundle at a Glance

The output mirrors your source tree — dependencies get their own organized namespace:

okf_bundle/
├── SUMMARY.md                        ← bird's-eye view for AI agents
├── index.md                          ← root navigation
├── log.md                            ← generation history
├── _dependencies/                    ← all dependency concepts
│   ├── index.md                      ← lists ecosystems: pip, npm, cargo, ...
│   ├── pip/
│   │   ├── index.md
│   │   ├── requests.md               ← Dependency concept
│   │   └── flask.md
│   └── npm/
│       ├── index.md
│       ├── express.md
│       └── react.md
└── StockAI/
    └── RnD/
        └── python/
            └── connectors/
                ├── index.md          ← lists all concepts in this folder
                ├── economic_data.md  ← Module concept
                └── economic_data/
                    ├── WorldBankConnector.md   ← Class
                    ├── get_indicator.md        ← Function
                    └── search.md               ← Function

Each file is OKF v0.1 conformant with YAML frontmatter:

---
type: Class
title: WorldBankConnector
description: Fetches World Bank development indicators via wbdata API.
resource: StockAI/RnD/python/connectors/economic_data.py
tags:
  - lang:python
  - type:Class
  - module:StockAI
  - domain:RnD
  - git:branch:main
  - git:repo:TrainLLMs
timestamp: '2026-05-23T09:01:21Z'
---

Interactive Visualization

okf visualize generates a self-contained HTML dashboard — no server, no installation, works offline:

okf visualize ./okf_bundle ./viz.html
# Open viz.html in any browser

The visualization uses D3.js with:

  • Force-directed graph — color-coded nodes by concept type (Class, Function, Module, Dependency)
  • Relationship edges — calls, called-by, imports, related
  • Search/filter — by name, type, ecosystem
  • Tooltip on hover — description + resource location
  • Pan/zoom — navigate large graphs
  • Dark/light theme — toggle at runtime

Multi-bundle monorepo support

If your bundle contains sub-bundles (detected by SUMMARY.md in subdirectories), the viz adds a bundle selector dropdown in the topbar to filter by project. Each sub-bundle's dependencies and source files are scoped under its own namespace.

# Combined viz with bundle switcher (cross-bundle edges preserved)
okf visualize ./okf_bundle

# Standalone viz per sub-bundle (smaller, faster)
okf visualize ./okf_bundle/AgentBox agentbox.html
okf visualize ./okf_bundle/StockAI stockai.html

The bundled viz is ideal for exploring relationships across projects; per-bundle viz files are better for focused navigation on a single project.


For AI Agents

Every concept in the bundle is deterministic, typed, and cross-referenced — agents get surgical precision without burning context:

Capability How
Manifest coverage 19 formats incl. Dockerfile, Containerfile, docker-compose.yml
Smart config okf config / .okfconfig — global + per-section settings, no env vars
Quick setup wizard okf init — interactive prompts for source, bundle, LLM enrichment
Pre-commit hook Auto-regenerates bundle on commit when source files change
Docker image ghcr.io/umairbaig8/okf-generator/okf-generator — CI-ready
Zero-LLM lookups okf lookup <Name> returns full concept detail in milliseconds
Fuzzy / camelCase search okf lookup repo finds UserRepository; okf lookup ur matches acronyms
Type filters `okf lookup --type Function
Ecosystem queries okf lookup --tag ecosystem:pip
Source file queries okf lookup --file path/to/file.py
JSON output okf lookup --json <Name> for programmatic agent use
MCP protocol okf mcp ./okf_bundle exposes via Model Context Protocol
Summary map cat ./okf_bundle/SUMMARY.md primes full context

Quick setup for any agent:

Add to your agent instructions or custom rules:

This project has an OKF knowledge bundle at ./okf_bundle/.
- Use `okf lookup <Name>` for full concept context.
- Use `okf lookup --type <Type>` to filter by type.
- Read `SUMMARY.md` for the full knowledge map.

Token efficiency

Optimization Agent impact
Incremental access — one concept, not whole files Saves 80-95% token cost vs reading source
Structured metadata in YAML frontmatter Agent extracts info without parsing code
Cross-reference edges (calls/called-by) Multi-hop reasoning without grep
Deterministic types Agent filters by type precisely

Full agent integration guide — OpenCode commands, Cursor rules, Copilot instructions, MCP setup: docs/agent-integration.md

Automated agent setup — okf install claude, okf install opencode, okf install cursor, etc: see Agent Installation.


For Local AI & SLMs

Cloud models have massive context windows. Local SLMs (Gemma 3 4B, Llama 3.2, Phi-3) running on a MacBook Pro or Air do not — they run out of memory if you try to feed an entire repository.

okf lookup solves this with exact-symbol retrieval: the agent sends a 50-token query and gets back a 200-token concept card. No embeddings, no vector DB, no RAG pipeline. This makes local coding assistants viable for enterprise-scale codebases.

# Enrichment with a local llama.cpp server (MacBook-friendly)
OKF_ENRICH=1 \
OKF_BASE_URL="http://localhost:8080/v1" \
OKF_API_KEY="llamabarn" \
OKF_MODEL="ggml-org/gemma-3-4b-it-qat-GGUF:Q4_0" \
OKF_MAX_WORKERS=2 \
okf generate ./my_project ./okf_bundle

Enrichment works with any OpenAI-compatible endpoint — Ollama, llama.cpp, vLLM, or cloud APIs (Claude, GPT). It is resumable: interrupt and rerun freely, already-enriched concepts are skipped.


For CI/CD Pipelines

Deterministic + fully offline = ideal for automated pipelines:

# .github/workflows/okf-bundle.yml
name: Generate OKF Bundle
on:
  push:
    branches: [main]
jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install okf-generator
      - run: okf generate ./src ./okf_bundle
      - uses: actions/upload-artifact@v4
        with:
          name: okf-bundle
          path: ./okf_bundle

Push bundles to S3/GCS/Azure for centralized multi-tenant access. Serve them as static websites for zero-infrastructure browsing.

Full CI/CD guide — GitLab, pre-commit hooks, S3 static hosting, monorepo strategies: docs/ci-cd.md


Language & Manifest Coverage

Code Languages (12)

Language Parser Extracts
Python stdlib ast Functions, classes, methods, params, return types, docstrings, decorators, inheritance, type params
JavaScript / TypeScript tree-sitter Functions, arrow fns, classes, methods, JSDoc, generics, heritage (extends/implements)
Go tree-sitter Funcs, methods, structs, interfaces, GoDoc, type params (Go 1.18+)
Java tree-sitter Classes, methods, constructors, Javadoc, generics, inheritance (extends/implements), annotations
Rust tree-sitter Fns, structs, enums, traits, impl blocks, ///, generics, attributes
Swift tree-sitter Classes, structs, enums, protocols, generics, methods, properties, doc comments
Kotlin tree-sitter Classes, data classes, objects, enums, interfaces, generics, functions, constructor params
Ruby tree-sitter Defs, classes, modules, # comments, superclass
C tree-sitter Functions, structs with /** doc comments
C++ tree-sitter Functions, classes, structs, methods, templates, base classes
C# tree-sitter Classes, methods, generics, attributes, base types
SQL tree-sitter Tables, views, functions, indexes, types, triggers

Manifest / Build Formats (17)

requirements.txt · pyproject.toml · package.json · Cargo.toml · Cargo.lock · yarn.lock · pnpm-lock.yaml · go.mod · go.sum · poetry.lock · composer.json · pom.xml · Gemfile · build.gradle / .kts · Package.swift · project.clj · mix.exs · Dockerfile / Containerfile · docker-compose.yml

Full table with parser details + architectural query examples: docs/languages-and-manifests.md

Architectural query example — find every microservice depending on a deprecated Rust crate:

okf lookup --type Dependency --tag ecosystem:cargo --compact
okf lookup --type Dependency openssl

Same logic works for pip, npm, go, maven — any of the 17 supported formats. Pin a vulnerable package version across every service in seconds.


CLI Reference

okf --help              Show available commands
okf <command> --help    Show options for a specific command
okf --version           Show version
Command Usage
generate okf generate <source_dir> [output_dir]
lookup okf lookup <query>
diff okf diff <old_bundle> <new_bundle>
pairs okf pairs <bundle_dir> [output_file]
summarize okf summarize <bundle_dir>
install okf install [claude | opencode | copilot | cursor | windsurf | cline]
init okf init [dir]
visualize okf visualize <bundle_dir> [output.html]
mcp okf mcp <bundle_dir>
serve okf serve [dir] [--port] [--open]

Full options, environment variables, and examples: docs/cli-reference.md


Training Data

Convert your OKF bundle into JSONL training pairs for fine-tuning:

# 5 pair types: codegen, qa, doc, summarize, crosslink
okf pairs ./okf_bundle ./train.jsonl

Each pair is in chat format compatible with most fine-tuning pipelines.

  • Static pairs (no LLM): SKIP_SYNTH=1 okf pairs ...
  • LLM-synthesized pairs: set SYNTH_MODEL, QA_PER_CONCEPT, PAIR_TYPES

Python API

from okf.generator import scan_codebase, write_bundle, write_summary
from okf.lookup import load_bundle, search

concepts = scan_codebase("./my_project")
write_bundle(concepts, "./okf_bundle", "my_project", ["initial generation"])
write_summary("my_project", concepts, "./okf_bundle", {})

bundle = load_bundle("./okf_bundle")
results = search(bundle, tokens=["WorldBankConnector"])

Full API reference with Concept dataclass: docs/python-api.md


Agent Installation

Install integration for any AI agent in one command:

# Install for all detected agents
okf install all

# Or pick specific agents
okf install claude      # Claude Code skill
okf install opencode    # OpenCode /lookup command
okf install copilot     # GitHub Copilot instructions
okf install cursor      # Cursor rules
okf install windsurf    # Windsurf rules
okf install cline       # Cline rules

What each install does:

Agent Files created Effect
Claude Code ~/.config/opencode/skills/okf-generator/SKILL.md Auto-triggers on phrases like "index my codebase"
OpenCode .opencode/commands/lookup.md /lookup NAME=<ConceptName>
Copilot .github/copilot-instructions.md Auto-loaded in VS Code
Cursor .cursorrules Auto-loaded by Cursor
Windsurf .windsurfrules Auto-loaded by Windsurf
Cline .clinerules Auto-loaded by Cline

How It Compares

okf-generator Other OKF producers
Language coverage 12 languages (Python, JS/TS, Go, Java, Rust, Swift, Kotlin, Ruby, SQL, C, C++, C#) Usually 1 language or doc-only
Cross-reference linking Imports → dependencies, function calls → caller/callee across all languages Not typically supported
Dependency/manifest parsing 17 formats (pip, npm, cargo, go, maven, gradle, composer, rubygems, swiftpm, clojars, hex, +7) Not typically supported
Extraction Zero-LLM, deterministic, offline Often LLM-required for every concept
Optional enrichment Any OpenAI-compatible endpoint (Claude, local llama.cpp, Ollama) Often locked to one vendor
Training data export Built-in JSONL pair generator (5 pair types) Not typically included
Agent compatibility Any agent that can run a CLI (Claude Code, Cursor, Windsurf, Copilot, OpenCode, Cline) Often single-agent focused

If you are choosing between OKF producers: pick okf-generator when you want broad language + dependency coverage with zero mandatory LLM cost, and you want the bundle to double as a fine-tuning data source.


FAQ

Does this require an API key or internet connection? No. Core extraction (okf generate) is fully offline and deterministic — no LLM call is made unless you explicitly enable OKF_ENRICH=1.

How is this different from RAG / vector search? RAG retrieves chunks by semantic similarity, which is approximate and can miss exact symbols. okf lookup is exact: it indexes real functions, classes, modules, and dependencies by name and resolves to the precise concept, with zero embedding/vector infrastructure required.

What happens if my language is not supported? Unsupported files are skipped, not dropped silently — log.md records what was scanned. Adding a new language is a self-contained tree-sitter grammar mapping; see CONTRIBUTING.md — it is a listed good-first-issue.

Does this work on monorepos / very large codebases? Yes — the bundle mirrors your source tree, so scanning is linear in file count. For very large repos, scope okf generate to a subdirectory if you only need part of the codebase indexed.

Can I use this without any LLM at all, ever? Yes. okf generate + okf lookup together form a complete, zero-LLM workflow. LLM enrichment and okf pairs synthesis are optional layers on top.

Is the bundle safe to commit to git? Yes, and that is the intended workflow — bundles are plain markdown, diff cleanly, and version alongside the code they describe.


Contributing

git clone https://github.com/UmairBaig8/okf-generator
cd okf-generator
pip install -e ".[dev]"
pytest tests/

Good first issues: adding a new language parser, improving fuzzy search scoring, adding incremental/diff-based regeneration.

See CONTRIBUTING.md for full guidelines.


Acknowledgments

okf-generator is an independent, third-party implementation of the Open Knowledge Format (OKF) v0.1, a knowledge-representation spec introduced by Google Cloud in June 2026. See the full v0.1 specification for the conformance rules this generator targets.

This project is not built, maintained, or endorsed by Google.


License

MIT — Copyright © 2026 Umair Baig

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

okf_generator-0.1.34.tar.gz (171.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

okf_generator-0.1.34-py3-none-any.whl (108.5 kB view details)

Uploaded Python 3

File details

Details for the file okf_generator-0.1.34.tar.gz.

File metadata

  • Download URL: okf_generator-0.1.34.tar.gz
  • Upload date:
  • Size: 171.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for okf_generator-0.1.34.tar.gz
Algorithm Hash digest
SHA256 077316be43ed6ded7e612f5282472f5ea762ff53e955a55ade49cf3e90770505
MD5 45d431d83ac6b1bc78e2c7c0e2fad1d4
BLAKE2b-256 c93153c2cbfebed4f941301827b06dbf8e2fc85ffa5608a405fcaa2dd74d236e

See more details on using hashes here.

Provenance

The following attestation bundles were made for okf_generator-0.1.34.tar.gz:

Publisher: publish.yml on UmairBaig8/okf-generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file okf_generator-0.1.34-py3-none-any.whl.

File metadata

  • Download URL: okf_generator-0.1.34-py3-none-any.whl
  • Upload date:
  • Size: 108.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for okf_generator-0.1.34-py3-none-any.whl
Algorithm Hash digest
SHA256 466b345fd14e1b5b118577ebc7e55c1c3750b26477f5e09d2444daa86fcae247
MD5 dcec77072ebb6a8e129dc1bc855d1962
BLAKE2b-256 8e8e258f032c51384eb549cc08002905c396597b45ec4de898a61fb368342abe

See more details on using hashes here.

Provenance

The following attestation bundles were made for okf_generator-0.1.34-py3-none-any.whl:

Publisher: publish.yml on UmairBaig8/okf-generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page