A standalone, extensible RAG/MCP library for building AI-powered documentation search.
A standalone, extensible RAG/MCP pipeline for building AI-powered documentation search. Fetch docs from GitHub, generate llms-full.txt bundles, chunk and embed them, index into Milvus, and serve via an MCP server — all from one CLI.
Features
- Flexible RAG pipeline: run the full flow (fetch → generate llms-full.txt → chunk → embed → index → serve) or use only the steps you need
- MCP server: exposes search tools consumable by Claude, Cursor, and any MCP-compatible client
- Extensible: subclass OpenCraneConfig to add custom fence types, chunking strategies, and YAML tree walkers
- CLI: every pipeline step is a subcommand; works in CI/CD and non-Python projects
Credits
OpenCrane was born from a real-world use case at Cennso — building AI-powered search over telco product documentation.
This project stands on the shoulders of some excellent open-source work:
- Milvus — vector database powering similarity search
- Docling — document parsing and chunking
- sentence-transformers — embedding generation
- rank-bm25 — BM25 keyword search that complements vector similarity search
- Model Context Protocol — MCP server standard that makes the search tools consumable by AI clients
Quick start
Scaffold a new project without installing anything:
uvx --from "opencrane @ git+https://github.com/derberg/OpenCrane.git" opencrane init
This creates .opencrane/, Dockerfile, and docker-compose.yml in the current directory. Edit .opencrane/sources.yaml to point at your docs, then run docker compose up.
Installation
# with pip
pip install git+https://github.com/derberg/OpenCrane.git
# with uv
uv pip install git+https://github.com/derberg/OpenCrane.git
Usage
CLI
All commands accept --config myproject.config:MyConfig to load a custom OpenCraneConfig subclass.
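The `module:Class` form of the --config value follows the common Python entry-point convention. As a rough sketch of how such a string can be resolved (the helper name load_config_class is hypothetical, and this is not necessarily how OpenCrane does it internally):

```python
import importlib
from typing import Any


def load_config_class(spec: str) -> Any:
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the sketch runs anywhere;
# a real project would pass something like "myproject.config:MyConfig".
ordered_dict_cls = load_config_class("collections:OrderedDict")
```

The module part must be importable from the working directory or the installed environment, which is why the examples in this README use a dotted path rather than a file path.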
opencrane init — scaffold a new project
opencrane init [--podman] [--force]
Creates the .opencrane/ directory and container files in the current directory:
| Generated file | Description |
|---|---|
| .opencrane/config.py | OpenCraneConfig subclass template with commented extension points |
| .opencrane/sources.yaml | Source mapping template with commented remote and local examples |
| .opencrane/README.md | Quick reference for the .opencrane/ directory |
| Dockerfile | Multi-stage build: deps → model download → Milvus index → runtime |
| docker-compose.yml | Builds and runs the MCP server on port 8000 |

| Flag | Description |
|---|---|
| --podman | Generate Containerfile instead of Dockerfile; README uses podman commands |
| --force | Overwrite existing files (default: skip) |
Convention: OpenCrane auto-discovers .opencrane/config.py as the project config, so no --config flag or OPENCRANE_CONFIG env var is needed when using the .opencrane/ layout.
opencrane build — full pipeline
opencrane build [--config CLASS] [--sources-dir PATH]... [--llmstxt-dir PATH]
[--chunks-file PATH] [--embeddings-file PATH]
Runs all steps in sequence: fetch → llms → chunk → embed → index.
| Flag | Description |
|---|---|
| --sources-dir PATH | Source directory to process; repeat for multiple dirs (overrides AI_DOCS_SOURCES_DIRS env var) |
| --llmstxt-dir PATH | Output directory for llms-full.txt files, and input directory for the chunk step (overrides AI_DOCS_LLMSTXT_DIR env var) |
| --chunks-file PATH | Output path for chunks JSON, and input for the embed step (overrides AI_DOCS_CHUNKS_FILE env var) |
| --embeddings-file PATH | Output path for embeddings JSON (overrides AI_DOCS_EMBEDDINGS_FILE env var) |
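Since build runs the same steps that are available as individual subcommands, the sequence can be reproduced step by step, for example to insert custom work between steps in CI. The sketch below only assembles the command lines; the helper name build_commands is hypothetical, and it assumes each step reads its paths from the defaults or environment variables described in this README.

```python
import shlex
from typing import List, Optional

# Order taken from the build description: fetch → llms → chunk → embed → index.
PIPELINE_STEPS = ["fetch", "llms", "chunk", "embed", "index"]


def build_commands(config: Optional[str] = None) -> List[List[str]]:
    """Return one argv list per pipeline step, in build order."""
    commands = []
    for step in PIPELINE_STEPS:
        argv = ["opencrane", step]
        if config:
            # Every one of these subcommands accepts --config CLASS.
            argv += ["--config", config]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in build_commands("myproject.config:MyConfig"):
        print(shlex.join(argv))
```

To execute the steps rather than print them, each argv list can be passed to subprocess.run(argv, check=True) so that a failing step stops the sequence.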
opencrane fetch — fetch docs from GitHub
opencrane fetch [--config CLASS] [--org NAME] [--repo PATH_KEY]
| Flag | Description |
|---|---|
| --org NAME | GitHub organisation to fetch from (overrides ORG_NAME env var) |
| --repo PATH_KEY | Fetch only this one repo by its path key in .opencrane/sources.yaml, e.g. external-sources/my-repo (overrides FETCH_REPO env var) |
opencrane llms — generate llms-full.txt bundles
opencrane llms [--config CLASS] [--sources-dir PATH]... [--llmstxt-dir PATH] [--force]
| Flag | Description |
|---|---|
| --sources-dir PATH | Source directory to process; repeat for multiple dirs (overrides AI_DOCS_SOURCES_DIRS env var) |
| --llmstxt-dir PATH | Output directory for llms-full.txt files (overrides AI_DOCS_LLMSTXT_DIR env var) |
| --force | Regenerate even if no git changes are detected in source directories |
opencrane tokens — token count report
opencrane tokens [--source-dir PATH] [--output-file PATH]
| Flag | Description |
|---|---|
| --source-dir PATH | Directory containing llmstxt output to count (overrides TOKEN_SOURCE_DIR env var) |
| --output-file PATH | Output path for the markdown report (overrides TOKEN_OUTPUT_FILE env var) |
opencrane chunk — chunk docs into .opencrane/chunks.json
opencrane chunk [--config CLASS] [--llmstxt-dir PATH] [--chunks-file PATH]
| Flag | Description |
|---|---|
| --llmstxt-dir PATH | Directory containing llms-full.txt (overrides AI_DOCS_LLMSTXT_DIR env var) |
| --chunks-file PATH | Output path for chunks JSON (overrides AI_DOCS_CHUNKS_FILE env var) |
opencrane embed — generate embeddings
opencrane embed [--config CLASS] [--chunks-file PATH] [--embeddings-file PATH]
| Flag | Description |
|---|---|
| --chunks-file PATH | Input chunks JSON file (overrides AI_DOCS_CHUNKS_FILE env var) |
| --embeddings-file PATH | Output embeddings JSON file (overrides AI_DOCS_EMBEDDINGS_FILE env var) |
opencrane index — load into Milvus
opencrane index [--config CLASS]
opencrane serve — start MCP server
opencrane serve [--config CLASS] [--transport stdio|http]
| Flag | Description |
|---|---|
| --transport stdio | (default) stdio transport for local MCP clients. Prints integration instructions for Claude Code, Cursor, Windsurf, VS Code, Zed, and Docker/Podman on startup |
| --transport http | HTTP transport on port 8000 (Streamable HTTP, stateless). Used inside Docker/Podman containers. Port configurable via MCP_HTTP_PORT env var |
opencrane inspect — launch MCP Inspector
opencrane inspect [--config CLASS]
Launches the MCP Inspector web UI connected to the server via stdio — no Docker required. Requires npx (Node.js).
Web UI available at http://localhost:5173.
Default file and directory names
OpenCrane uses these defaults for all pipeline output. Override them with CLI flags (one-off) or environment variables (persistent):
| File / directory | Default | CLI flag | Env var |
|---|---|---|---|
| llms-full.txt output dir | .opencrane/llmstxt | --llmstxt-dir | AI_DOCS_LLMSTXT_DIR |
| Chunks file | .opencrane/chunks.json | --chunks-file | AI_DOCS_CHUNKS_FILE |
| Embeddings file | .opencrane/embeddings.json | --embeddings-file | AI_DOCS_EMBEDDINGS_FILE |
| Token report output | .opencrane/llmstxt/README.md | --output-file | TOKEN_OUTPUT_FILE |
| Source mapping file | .opencrane/sources.yaml | n/a | MAPPING_FILE |
| Milvus database file (Lite mode) | (server mode) | n/a | MILVUS_DB_PATH |
Environment variables
CLI flags take precedence over environment variables. Use env vars for persistent defaults (e.g. in CI/CD), and flags for one-off overrides.
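The precedence rule (flag, then environment variable, then built-in default) can be pictured with a small resolver. This is a minimal sketch, not OpenCrane's code: the function name resolve_setting is hypothetical, and treating an empty environment variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    flag_value: Optional[str],
    env_var: str,
    default: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Pick a setting: CLI flag first, then env var, then the default."""
    if flag_value is not None:
        return flag_value
    env_value = environ.get(env_var)
    if env_value:  # assumption: an empty string counts as "not set"
        return env_value
    return default


# A one-off flag wins over a persistent CI/CD env var.
chunks_file = resolve_setting(
    flag_value="out/chunks.json",
    env_var="AI_DOCS_CHUNKS_FILE",
    default=".opencrane/chunks.json",
    environ={"AI_DOCS_CHUNKS_FILE": "ci/chunks.json"},
)
```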
fetch and llms steps — shared configuration for source tracking:
| Variable | Default | Description |
|---|---|---|
| MAPPING_FILE | .opencrane/sources.yaml | Path to the source mapping file used by fetch (to record cloned repos) and llms (to embed source links) |
fetch step — only needed if you use opencrane fetch to pull docs from GitHub:
| Variable | Default | Description |
|---|---|---|
| ORG_NAME | (empty) | GitHub organisation to fetch repositories from (see also --org flag) |
| FETCH_REPO | (empty) | Restrict fetch to a single repo by path key (see also --repo flag) |
| GITHUB_TOKEN | (empty) | GitHub API token for authenticated requests |
| DOCS_TOPIC | documentation | GitHub topic used to discover repositories automatically within the org |
| AUTO_DISCOVERY_ORGS | (empty) | Whitelist of orgs where topic-based auto-discovery is enabled |
| TARGET_DIR | external-sources | Local directory where fetched docs are stored |
llms step — only needed if you use opencrane llms to generate llms-full.txt bundles:
| Variable | Default | Description |
|---|---|---|
| AI_DOCS_SOURCES_DIRS | TARGET_DIR | Required when not using opencrane fetch. Comma-separated list of source directories to process (see also --sources-dir flag) |
| AI_DOCS_LLMSTXT_DIR | .opencrane/llmstxt | Output directory for generated llms-full.txt files (see also --llmstxt-dir flag) |
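AI_DOCS_SOURCES_DIRS is list-valued: the flag is repeated, the environment variable is comma-separated, and the fallback is the fetch target directory. A sketch of how those three inputs could be combined is shown below; the function name resolve_source_dirs is hypothetical, and trimming whitespace around entries is an assumption.

```python
from typing import List, Mapping, Sequence


def resolve_source_dirs(
    flag_values: Sequence[str], environ: Mapping[str, str]
) -> List[str]:
    """Return the source directories the llms step would process."""
    # Repeated --sources-dir flags override the environment.
    if flag_values:
        return list(flag_values)
    # Otherwise split the comma-separated env var, ignoring blank entries.
    raw = environ.get("AI_DOCS_SOURCES_DIRS", "")
    dirs = [part.strip() for part in raw.split(",") if part.strip()]
    if dirs:
        return dirs
    # Fall back to TARGET_DIR, whose documented default is external-sources.
    return [environ.get("TARGET_DIR") or "external-sources"]


dirs_from_env = resolve_source_dirs(
    [], {"AI_DOCS_SOURCES_DIRS": "docs, external-sources/my-product"}
)
```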
tokens step — only needed if you use opencrane tokens:
| Variable | Default | Description |
|---|---|---|
| TOKEN_SOURCE_DIR | .opencrane/llmstxt | Directory containing llmstxt output to count (see also --source-dir flag) |
| TOKEN_OUTPUT_FILE | .opencrane/llmstxt/README.md | Output path for the markdown report (see also --output-file flag) |
chunk step — only needed if you use opencrane chunk:
| Variable | Default | Description |
|---|---|---|
| AI_DOCS_LLMSTXT_DIR | .opencrane/llmstxt | Directory containing llms-full.txt (see also --llmstxt-dir flag) |
| AI_DOCS_CHUNKS_FILE | .opencrane/chunks.json | Output path for the generated chunks (see also --chunks-file flag) |
embed step — only needed if you use opencrane embed:
| Variable | Default | Description |
|---|---|---|
| AI_DOCS_CHUNKS_FILE | .opencrane/chunks.json | Input chunks JSON file (see also --chunks-file flag) |
| AI_DOCS_EMBEDDINGS_FILE | .opencrane/embeddings.json | Output path for the generated embeddings (see also --embeddings-file flag) |
| EMBEDDING_MODEL | nomic-ai/nomic-embed-text-v1.5 | HuggingFace embedding model to use |
index and serve steps — needed when loading into Milvus or running the MCP server:
OpenCrane supports two Milvus modes. Set MILVUS_DB_PATH to use Milvus Lite (a local file, no server needed — good for local dev). Leave it unset to connect to a Milvus server via MILVUS_HOST and MILVUS_PORT.
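The mode switch comes down to which connection target is derived from the environment. The sketch below illustrates that decision only; the function name milvus_uri is hypothetical and OpenCrane's internal connection code may differ. The returned value has the shape that pymilvus's MilvusClient accepts as its uri argument: a local file path for Milvus Lite, or an http URL for a server.

```python
from typing import Mapping


def milvus_uri(environ: Mapping[str, str]) -> str:
    """Choose the Milvus connection target from environment settings."""
    db_path = environ.get("MILVUS_DB_PATH", "")
    if db_path:
        # Lite mode: a local database file; host and port are ignored.
        return db_path
    # Server mode: fall back to the documented host and port defaults.
    host = environ.get("MILVUS_HOST", "localhost")
    port = environ.get("MILVUS_PORT", "19530")
    return f"http://{host}:{port}"


lite_target = milvus_uri({"MILVUS_DB_PATH": "./milvus.db", "MILVUS_HOST": "ignored"})
server_target = milvus_uri({})
```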
| Variable | Default | Description |
|---|---|---|
| MILVUS_DB_PATH | (empty) | Path to a local Milvus Lite database file (e.g. ./milvus.db). When set, MILVUS_HOST and MILVUS_PORT are ignored |
| MILVUS_HOST | localhost | Milvus server host (server mode only) |
| MILVUS_PORT | 19530 | Milvus server port (server mode only) |
| MILVUS_COLLECTION | ai_docs_chunks_v1 | Milvus collection name |
| HYBRID_ALPHA | 0.6 | Weight of vector search vs keyword search (1.0 = pure vector, 0.0 = pure BM25) |
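One common way to read a weight like HYBRID_ALPHA is as a linear blend of the two scores. The exact fusion formula OpenCrane uses is not stated here, so the following is only an illustration under that assumption, with both scores assumed to be normalised to the same 0 to 1 range; the function name hybrid_score is hypothetical.

```python
def hybrid_score(vector_score: float, bm25_score: float, alpha: float = 0.6) -> float:
    """Blend a vector-similarity score with a BM25 keyword score.

    alpha = 1.0 keeps only the vector score, alpha = 0.0 only the BM25 score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return alpha * vector_score + (1.0 - alpha) * bm25_score


# With the default weight, the vector score contributes 60% of the result.
blended = hybrid_score(vector_score=0.8, bm25_score=0.2)
```

Raising the weight favours semantic matches; lowering it favours exact keyword hits such as product names or error codes.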
Source mapping file (.opencrane/sources.yaml)
OpenCrane maintains a file called .opencrane/sources.yaml that records where each documentation source lives and where its content can be found online. It is used by the fetch step (to track cloned repos) and by the llms step (to embed source links in llms-full.txt). The fetch step populates it automatically; for manually managed sources you can edit it directly.
Each entry supports the following fields:
| Field | Required | Description |
|---|---|---|
| github_url | Yes (for fetch) | GitHub repository URL, used by opencrane fetch to clone the repo and as a fallback source link in llms-full.txt |
| docs_path | No | Path within the repo where docs are stored (e.g. docs) |
| docs_url | No | Base URL of the published documentation site (e.g. https://docs.example.com/product). When set, this is used instead of github_url when embedding source links in llms-full.txt, which lets AI agents point users to rendered docs rather than raw GitHub files. If neither is set, no source links are embedded. |
| manual | No | When true, the entry is user-managed and will not be overwritten by opencrane fetch auto-discovery |
Example:
sources:
  external-sources/my-product:
    github_url: https://github.com/myorg/my-product
    docs_path: docs
    docs_url: https://docs.myorg.com/my-product
    manual: true
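The link fallback described for docs_url and github_url can be summarised in a few lines. This is a sketch of the documented rule rather than OpenCrane's own code; the function name source_base_url is hypothetical, and the entry is shown as a plain dict as it would look after the YAML is parsed.

```python
from typing import Mapping, Optional


def source_base_url(entry: Mapping[str, object]) -> Optional[str]:
    """Pick the base URL used for source links in llms-full.txt."""
    docs_url = entry.get("docs_url")
    if docs_url:
        # Published docs site wins when present.
        return str(docs_url)
    github_url = entry.get("github_url")
    if github_url:
        # Otherwise fall back to the repository URL.
        return str(github_url)
    # Neither set: no source links are embedded.
    return None


entry = {
    "github_url": "https://github.com/myorg/my-product",
    "docs_path": "docs",
    "docs_url": "https://docs.myorg.com/my-product",
    "manual": True,
}
base = source_base_url(entry)
```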
Extending OpenCrane
Subclass OpenCraneConfig to register project-specific extensions:
# myproject/config.py
from opencrane import OpenCraneConfig
from opencrane.fences import CodeFenceConfig
from opencrane.rag.services.yaml_chunker import YamlChunkingStrategy
from opencrane.rag.services.code_chunker import CodeChunkingStrategy
from opencrane.rag.services.prose_chunker import ProseChunkingStrategy
from myproject.strategies.custom import CustomChunkingStrategy
from myproject.walkers.terraform import TerraformTreeWalker

def my_openapi_handler(content: str) -> str:
    # content is the raw text inside the ```openapi ... ``` block
    # process it however you like and return the replacement string
    return f"```yaml\n{content}\n```\n"


class MyConfig(OpenCraneConfig):
    fence_types = {
        "openapi": CodeFenceConfig(fence_type="openapi", handler=my_openapi_handler),
    }
    chunking_strategies = [
        YamlChunkingStrategy(),
        CustomChunkingStrategy(),
        CodeChunkingStrategy(),
        ProseChunkingStrategy(),
    ]
    yaml_tree_walkers = [
        *OpenCraneConfig.yaml_tree_walkers,  # keep CRD, OpenAPI, JSON Schema
        TerraformTreeWalker,
    ]
Then use it:
opencrane build --config myproject.config:MyConfig
Extension points
| Extension point | Pipeline step | What it does |
|---|---|---|
| fence_types | llms | Register custom fence language identifiers and control how matching blocks are transformed during llms-full.txt generation |
| chunking_strategies | chunk | Add or replace chunking strategies for different content types |
| yaml_tree_walkers | chunk | Add walkers for custom YAML document formats |
Built-in YAML tree walkers
- K8sCRDTreeWalker: Kubernetes CustomResourceDefinitions
- OpenAPITreeWalker: OpenAPI 3.x specs
- JsonSchemaTreeWalker: JSON Schema documents
Writing a custom fence type
Register a fence language identifier and provide a handler function. When a ```my-type ... ``` block is encountered during llms generation, OpenCrane calls your handler with the raw block content plus the file context, and replaces the block with the returned string.
from pathlib import Path
from opencrane.fences import CodeFenceConfig

def my_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    # content      - raw text inside the fence block
    # file_path    - path of the markdown file containing the block
    # project_dir  - root directory of the project being processed
    # project_name - name of the project (used for source URL building)
    # return the full replacement string
    return f"```yaml\n# processed\n{content}\n```\n"


fence_types = {
    "my-type": CodeFenceConfig(fence_type="my-type", handler=my_handler),
}
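To make the replace-the-block behaviour concrete, here is a standalone sketch of fence substitution using only the standard library. It is not OpenCrane's implementation: the function name apply_fence_handlers is hypothetical, and the handlers take only the block content (no file context) to keep the example short.

```python
import re
from typing import Callable, Dict

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable

# Matches an opening fence with a language identifier, the block body,
# and the closing fence on its own line.
FENCE_RE = re.compile(
    rf"^{FENCE}([\w-]+)\n(.*?)\n{FENCE}[ \t]*\n?", re.DOTALL | re.MULTILINE
)


def apply_fence_handlers(markdown: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    """Replace each registered fence block with its handler's return value."""

    def replace(match: "re.Match[str]") -> str:
        language, content = match.group(1), match.group(2)
        handler = handlers.get(language)
        if handler is None:
            return match.group(0)  # unregistered fence types stay untouched
        return handler(content)

    return FENCE_RE.sub(replace, markdown)


def to_yaml_block(content: str) -> str:
    return f"{FENCE}yaml\n{content}\n{FENCE}\n"


doc = f"Intro\n{FENCE}openapi\na: 1\n{FENCE}\nOutro\n"
result = apply_fence_handlers(doc, {"openapi": to_yaml_block})
```

Because the handler's return value replaces the whole block, including both fence lines, a handler that wants the output to remain a code block has to emit its own fences, as the examples in this section do.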
To inline a file referenced by path inside the block, use get_github_url from opencrane.fences to add a source annotation:
from pathlib import Path
from opencrane.fences import CodeFenceConfig, get_github_url

def inline_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    target = (file_path.parent / content.strip()).resolve()
    language = "json" if target.suffix == ".json" else "yaml"
    gh_url = get_github_url(Path(project_name) / target.relative_to(project_dir), project_name)
    file_content = target.read_text(encoding="utf-8").rstrip("\n")
    if gh_url:
        return f"```{language}\n# Source: {gh_url}\n{file_content}\n```\n"
    return f"```{language}\n{file_content}\n```\n"


fence_types = {
    "my-type": CodeFenceConfig(fence_type="my-type", handler=inline_handler),
}
Writing a custom YAML tree walker
from opencrane.walkers.base import YamlTreeWalker

class TerraformTreeWalker(YamlTreeWalker):
    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return "terraform" in doc

    def walk(self):
        # return List[Chunk]
        ...
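The can_handle classmethod suggests that walkers are tried against each parsed YAML document until one claims it. The sketch below illustrates that kind of dispatch with stand-in classes; the first-match-wins order, the select_walker helper, and the CRDWalker check are all assumptions made for illustration, not OpenCrane's documented behaviour.

```python
from typing import Optional, Sequence, Type


class YamlTreeWalker:
    """Stand-in for opencrane.walkers.base.YamlTreeWalker (assumed shape)."""

    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        raise NotImplementedError


class CRDWalker(YamlTreeWalker):
    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return doc.get("kind") == "CustomResourceDefinition"


class TerraformTreeWalker(YamlTreeWalker):
    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return "terraform" in doc


def select_walker(
    doc: dict, walkers: Sequence[Type[YamlTreeWalker]]
) -> Optional[Type[YamlTreeWalker]]:
    """Return the first walker class that claims the document, if any."""
    for walker in walkers:
        if walker.can_handle(doc):
            return walker
    return None


walkers = [CRDWalker, TerraformTreeWalker]
chosen = select_walker({"terraform": {"required_version": ">= 1.5"}}, walkers)
```

Under this assumption, keeping can_handle checks narrow matters: a walker that matches too broadly and sits early in the list would shadow the ones after it.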
Development
git clone https://github.com/derberg/OpenCrane.git
cd OpenCrane
# with pip
pip install -e ".[dev]"
# with uv
uv sync --extra dev
pytest
License
Apache-2.0