Skip to main content

Generate OKF v0.1 knowledge bundles from codebases — Claude skill + OpenCode integration

Project description

okf-generator banner

PyPI version Downloads Python Tests Last commit License: MIT OKF v0.1 Claude Skill PRs Welcome

Index any codebase into a structured OKF v0.1 knowledge bundle — then look up exact concepts for any AI coding agent.

Why this exists · Demo · Installation · Quick Start · How it compares · CLI Reference · AI Agent Integration · FAQ · Contributing · Acknowledgments


Why this exists

#why-this-exists

AI coding agents waste enormous amounts of context re-reading entire files to find one function, class, or dependency version. Ask an agent "what does WorldBankConnector do?" and it either guesses from a stale memory of your codebase, or burns thousands of tokens reading the whole file to find a 12-line answer.

okf-generator solves this by converting your source code into the Open Knowledge Format (OKF) v0.1 — a knowledge-representation spec introduced by Google Cloud in June 2026 (full v0.1 spec) — a directory of small, structured markdown files, one per concept (function, class, module, dependency). An agent then asks a surgical question and gets a surgical answer:

# Before touching WorldBankConnector, look it up
okf lookup WorldBankConnector

# CLASS: WorldBankConnector
# Source      : StockAI/RnD/python/connectors/economic_data.py  line 51
# Description : Fetches World Bank development indicators via wbdata API.
# Methods     : get_indicator, search
# Signature   : class WorldBankConnector

No re-reading the file. No guessing. No LLM call required to get the answer.

Before and after comparison

Demo

#demo

demo

How it compares

#how-it-compares

The OKF ecosystem is moving fast — here's where okf-generator sits relative to other producers:

okf-generator Other OKF producers
Language coverage 7 languages (Python, JS/TS, Go, Java, Rust, Ruby, SQL) Usually 1 language or doc-only
Cross-reference linking Imports → dependencies, function calls → caller/callee across all languages Not typically supported
Dependency/manifest parsing 12 formats (pip, npm, cargo, go, maven, gradle, composer, rubygems, swiftpm, clojars, hex, +1) Not typically supported
Extraction Zero-LLM, deterministic, offline Often LLM-required for every concept
Optional enrichment Any OpenAI-compatible endpoint (Claude, local llama.cpp, Ollama) Often locked to one vendor
Training data export Built-in JSONL pair generator (5 pair types) Not typically included
Agent compatibility Any agent that can run a CLI (Claude Code, Cursor, Windsurf, Copilot, OpenCode, Cline) Often single-agent focused

If you're choosing between OKF producers: pick okf-generator when you want broad language + dependency coverage with zero mandatory LLM cost, and you want the bundle to double as a fine-tuning data source.

Used by / Built for

#used-by--built-for

okf-generator was originally built to index a large, multi-domain codebase (StockAI/TrainLLMs) spanning Python data connectors, ML pipelines, and SQL schemas — the kind of project where giving an agent the whole repo as context is both slow and unaffordable in tokens. If you're working in a sprawling codebase and tired of re-explaining your own code to your AI agent every session, this is the tool that problem was built to solve.

Installation

#installation

One-liner — paste into any terminal:

curl -fsSL https://raw.githubusercontent.com/UmairBaig8/okf-generator/main/scripts/install.sh | bash

This installs okf-generator[llm] + the Claude Code skill in one shot. Requirements: Python 3.11+ with pip.

Or manually:

# Core (extraction only — no LLM required)
pip install okf-generator

# With LLM enrichment + training pair generation
pip install "okf-generator[llm]"

Quick Start

#quick-start

# 1. Generate a knowledge bundle from your codebase
okf generate ./my_project ./okf_bundle

# 2. Look up a concept (works instantly, zero LLM)
okf lookup WorldBankConnector

# 3. Find all concepts from one file
okf lookup --file src/connectors/economic_data.py

# 4. List all dependencies for a given ecosystem
okf lookup --type Dependency --tag ecosystem:pip

# 5. Generate training pairs from the bundle
okf pairs ./okf_bundle ./train.jsonl

# 6. Regenerate SUMMARY.md after enrichment
okf summarize ./okf_bundle

How it works

#how-it-works

flowchart LR
    A[Your codebase] -->|okf generate| B[Scanners<br/>AST · tree-sitter · regex]
    B --> C[Concepts<br/>Function · Class · Module · Dependency]
    C --> D[OKF Bundle<br/>markdown + YAML frontmatter]
    D -->|okf lookup| E[AI Agent]
    D -->|okf pairs| F[JSONL training data]

Extraction is fully deterministic and offline-capable — no LLM call is required to produce a usable bundle. LLM enrichment is an optional second pass that improves descriptions, and it's resumable: interrupt it anytime and rerun without redoing work already done.

Bundle Layout

#bundle-layout

The output mirrors your source tree — dependencies get their own organized namespace:

okf_bundle/
├── SUMMARY.md                        ← bird's-eye view for AI agents
├── index.md                          ← root navigation
├── log.md                            ← generation history
├── _dependencies/                    ← all dependency concepts
│   ├── index.md                      ← lists ecosystems: pip, npm, cargo, ...
│   ├── pip/
│   │   ├── index.md
│   │   ├── requests.md               ← Dependency concept
│   │   └── flask.md
│   └── npm/
│       ├── index.md
│       ├── express.md
│       └── react.md
└── StockAI/
    └── RnD/
        └── python/
            └── connectors/
                ├── index.md          ← lists all concepts in this folder
                ├── economic_data.md  ← Module concept
                └── economic_data/
                    ├── WorldBankConnector.md   ← Class
                    ├── get_indicator.md        ← Function
                    └── search.md               ← Function

Each file is OKF v0.1 conformant:

---
type: Class
title: WorldBankConnector
description: Fetches World Bank development indicators via wbdata API.
resource: StockAI/RnD/python/connectors/economic_data.py
tags:
  - lang:python
  - type:Class
  - module:StockAI
  - domain:RnD
  - git:branch:main
  - git:repo:TrainLLMs
timestamp: '2026-05-23T09:01:21Z'
---

# WorldBankConnector

...signature, docstring, params, returns, methods, related concepts...

CLI Reference

#cli-reference

okf generate

#okf-generate

okf generate <source_dir> [output_dir]

Options:
  --summarize <bundle_dir>   Regenerate SUMMARY.md only (no re-scan)

Environment variables (LLM enrichment):
  OKF_ENRICH=1               Enable LLM enrichment
  OKF_BASE_URL               OpenAI-compat base URL (default: https://api.anthropic.com/v1)
  OKF_API_KEY                API key
  OKF_MODEL                  Model name (default: claude-sonnet-4-6)
  OKF_MAX_WORKERS            Parallel workers (default: 2)

okf lookup

#okf-lookup

okf lookup [query] [options]

Options:
  --bundle PATH     Bundle directory (default: ./okf_bundle)
  --file PATH       Filter by source file
  --type TYPE       Filter by concept type: Function | Class | Module | Dependency
  --tag TAG         Filter by tag, repeatable: --tag lang:python or --tag ecosystem:npm
  --limit N         Max results (default: 10)
  --compact         One-line output per result
  --json            JSON output for programmatic use
  --full            Raw .md file content
  --min-score N     Minimum relevance score 0-1 (default: 0.1)
  --no-cache        Bypass and skip writing the lookup cache

okf pairs

#okf-pairs

okf pairs <bundle_dir> [output_file]

Environment variables:
  SKIP_SYNTH=1          Static pairs only (no LLM)
  SYNTH_BASE_URL        API endpoint
  SYNTH_API_KEY         API key
  SYNTH_MODEL           Model name
  MAX_WORKERS           Parallel workers (default: 3)
  QA_PER_CONCEPT        Q&A pairs per concept (default: 3)
  PAIR_TYPES            Comma-separated: codegen,qa,doc,summarize,crosslink

Supported Languages & Manifests

#supported-languages--manifests

Code Languages

#code-languages

Language Parser Extracts
Python stdlib ast Functions, classes, params, return types, docstrings
JavaScript / TypeScript tree-sitter Functions, arrow fns, classes, JSDoc
Go tree-sitter Funcs, methods, structs, interfaces, GoDoc
Java tree-sitter Classes, methods, constructors, Javadoc
Rust tree-sitter Fns, structs, enums, traits, impl blocks, ///
Ruby tree-sitter Defs, classes, modules, # comments
SQL regex (dialect-tolerant) CREATE TABLE/VIEW/FUNCTION/PROCEDURE/INDEX, preceding --//* */ comments

Manifest / Build Files

#manifest--build-files

Format Parser Extracts
requirements.txt regex pip package names + version constraints
pyproject.toml tomllib PEP 621 deps + optional-dependencies + Poetry legacy
package.json json npm/Node dependencies + devDependencies
Cargo.toml tomllib Rust crate deps + dev/build-dependencies
go.mod regex Go module deps + // indirect flag
composer.json json PHP packages (skips php/ext-* platform entries)
pom.xml xml.etree.ElementTree Maven dependencies + test/provided scope → dev
Gemfile regex Ruby gems + group :test/:development → dev
build.gradle / .kts regex Gradle deps (Groovy + Kotlin DSL) + testImplementation → dev
Package.swift regex SwiftPM packages from .package(url:from:)
project.clj regex Clojars deps + :dev profile
mix.exs regex Hex packages + only: :dev/:test → dev

LLM Enrichment

#llm-enrichment

Works with any OpenAI-compatible endpoint — Claude, Ollama, llama.cpp, etc:

# Using a local llama.cpp server
OKF_ENRICH=1 \
OKF_BASE_URL="http://localhost:8080/v1" \
OKF_API_KEY="llamabarn" \
OKF_MODEL="ggml-org/gemma-3-4b-it-qat-GGUF:Q4_0" \
OKF_MAX_WORKERS=2 \
okf generate ./my_project ./okf_bundle

Enrichment is resumable — interrupt and rerun freely. Already-enriched concepts are skipped.

AI Agent Integration

#ai-agent-integration

okf-generator works with any AI coding agent — the output is plain markdown files that every agent can read.

OpenCode / Claude Code

#opencode--claude-code

# Tell your agent about the bundle
cat >> AGENTS.md << 'EOF'
## OKF Knowledge Bundle
Before working on any class or function, look it up:
  okf lookup --bundle ./okf_bundle <ConceptName>
EOF

# Add a custom command (OpenCode)
mkdir -p .opencode/commands
echo "RUN okf lookup --bundle ./okf_bundle \$NAME" > .opencode/commands/lookup.md

Then: /lookup NAME=WorldBankConnector

Cursor / Windsurf / Cline

#cursor--windsurf--cline

Add to .cursorrules or agent instructions:

Before editing a function or class, run:
  okf lookup --bundle ./okf_bundle <Name>
To see dependencies:
  okf lookup --bundle ./okf_bundle --type Dependency

GitHub Copilot

#github-copilot

Reference OKF bundle files in your /.github/copilot-instructions.md:

Project knowledge is indexed in ./okf_bundle/
  - okf lookup <Name> returns full concept context
  - okf lookup --type Dependency returns dependency info

Any agent with RUN capability

#any-agent-with-run-capability

# Prime full context
cat ./okf_bundle/SUMMARY.md

# Look up a specific concept
okf lookup --bundle ./okf_bundle WorldBankConnector

# List dependencies
okf lookup --bundle ./okf_bundle --type Dependency

# JSON for programmatic agent use
okf lookup --bundle ./okf_bundle --json WorldBankConnector

See docs/opencode-integration.md for full OpenCode setup.

Python API

#python-api

from okf.generator import scan_codebase, write_bundle, write_summary
from okf.lookup import load_bundle, search

# Generate bundle
concepts = scan_codebase("./my_project")
write_bundle(concepts, "./okf_bundle", "my_project", ["initial generation"])
write_summary("my_project", concepts, "./okf_bundle", {})

# Search concepts
bundle = load_bundle("./okf_bundle")
results = search(bundle, tokens=["WorldBankConnector"])
print(results[0]["description"])

Training Data

#training-data

Convert your OKF bundle into JSONL training pairs for fine-tuning:

# 5 pair types: codegen, qa, doc, summarize, crosslink
okf pairs ./okf_bundle ./train.jsonl

Each pair is in chat format compatible with most fine-tuning pipelines.

Claude Skill

#claude-skill

Install the skill in one step:

curl -fsSL https://raw.githubusercontent.com/UmairBaig8/okf-generator/main/scripts/install.sh | bash

Or via pip:

pip install okf-generator && okf install-skill

Once installed, Claude Code automatically triggers the skill on phrases like:

"Index my codebase" → generates OKF bundle "Look up WorldBankConnector" → returns exact concept "Generate training pairs from my bundle" → outputs JSONL

The same .md output works with any agent — no vendor lock-in. Point Cursor, Windsurf, Cline, or Copilot at your bundle and they get the same structured knowledge.

FAQ

#faq

Does this require an API key or internet connection? No. Core extraction (okf generate) is fully offline and deterministic — no LLM call is made unless you explicitly enable OKF_ENRICH=1.

How is this different from RAG / vector search? RAG retrieves chunks by semantic similarity, which is approximate and can miss exact symbols. okf lookup is exact: it indexes real functions, classes, modules, and dependencies by name and resolves to the precise concept, with zero embedding/vector infrastructure required.

What happens if my language isn't supported? Unsupported files are skipped, not dropped silently from the bundle log — log.md records what was scanned. Adding a new language is a self-contained tree-sitter grammar mapping; see CONTRIBUTING.md for a starting point — it's a listed good-first-issue.

Does this work on monorepos / very large codebases? Yes — the bundle mirrors your source tree, so scanning is linear in file count. For very large repos, scope okf generate to a subdirectory if you only need part of the codebase indexed.

Can I use this without any LLM at all, ever? Yes. okf generate + okf lookup together form a complete, zero-LLM workflow. LLM enrichment and okf pairs synthesis are optional layers on top.

Is the bundle safe to commit to git? Yes, and that's the intended workflow — bundles are plain markdown, diff cleanly, and version alongside the code they describe.

Contributing

#contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

git clone https://github.com/UmairBaig8/okf-generator
cd okf-generator
pip install -e ".[dev]"
pytest tests/

Good first issues: adding a new language parser, improving fuzzy search scoring, adding incremental/diff-based regeneration.

Acknowledgments

#acknowledgments

okf-generator is an independent, third-party implementation of the Open Knowledge Format (OKF) v0.1, a knowledge-representation spec introduced by Google Cloud in June 2026. See the full v0.1 specification for the conformance rules this generator targets. This project is not built, maintained, or endorsed by Google.

License

#license

MIT — Copyright © 2026 Umair Baig

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

okf_generator-0.1.16.tar.gz (63.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

okf_generator-0.1.16-py3-none-any.whl (54.9 kB view details)

Uploaded Python 3

File details

Details for the file okf_generator-0.1.16.tar.gz.

File metadata

  • Download URL: okf_generator-0.1.16.tar.gz
  • Upload date:
  • Size: 63.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for okf_generator-0.1.16.tar.gz
Algorithm Hash digest
SHA256 5ddb7ed8c8c45ce74bd87d9676423504d20f0b62981462b3bda1f32296285438
MD5 7972fce1802bf5178589099c6554b67f
BLAKE2b-256 adae059509fbdcacd06d9890a0b2457a8c287464f8443f6732e226941eddf54e

See more details on using hashes here.

Provenance

The following attestation bundles were made for okf_generator-0.1.16.tar.gz:

Publisher: publish.yml on UmairBaig8/okf-generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file okf_generator-0.1.16-py3-none-any.whl.

File metadata

  • Download URL: okf_generator-0.1.16-py3-none-any.whl
  • Upload date:
  • Size: 54.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for okf_generator-0.1.16-py3-none-any.whl
Algorithm Hash digest
SHA256 d7f992dbc3114a51b0dec0246c184c1c04899d7c8e791abbf3ceed8ebc44786e
MD5 7abde611eb5b8e2c74842fa2aec3c748
BLAKE2b-256 d39f0e4823164861cf580a91b3619fec2c364905507d9fe2a5990e4922d60c3e

See more details on using hashes here.

Provenance

The following attestation bundles were made for okf_generator-0.1.16-py3-none-any.whl:

Publisher: publish.yml on UmairBaig8/okf-generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page