opencrane

A standalone, extensible RAG/MCP library for building AI-powered documentation search.

A standalone, extensible RAG/MCP pipeline for building AI-powered documentation search. Fetch docs from GitHub, generate llms-full.txt bundles, chunk and embed them, index into Milvus, and serve via an MCP server — all from one CLI.

- Features
- Credits
- Quick start
- Installation
- Usage
  - CLI
    - init
    - build
    - fetch
    - llms
    - tokens
    - chunk
    - embed
    - index
    - serve
    - inspect
  - Default file and directory names
  - Environment variables
  - Source mapping file
- Extending OpenCrane
- Development
- License

Features

- Flexible RAG pipeline: run the full flow (fetch → generate llms-full.txt → chunk → embed → index → serve) or use only the steps you need
- MCP server: exposes search tools consumable by Claude, Cursor, and any MCP-compatible client
- Extensible: subclass OpenCraneConfig to add custom fence types, chunking strategies, and YAML tree walkers
- CLI: every pipeline step is a subcommand; works in CI/CD and non-Python projects

Credits

OpenCrane was born from a real-world use case at Cennso — building AI-powered search over telco product documentation.

This project stands on the shoulders of some excellent open-source work:

Milvus — vector database powering similarity search
Docling — document parsing and chunking
sentence-transformers — embedding generation
rank-bm25 — BM25 keyword search that complements vector similarity search
Model Context Protocol — MCP server standard that makes the search tools consumable by AI clients

Quick start

Scaffold a new project without installing anything:

uvx --from "opencrane @ git+https://github.com/derberg/OpenCrane.git" opencrane init

This creates .opencrane/, Dockerfile, and docker-compose.yml in the current directory. Edit .opencrane/sources.yaml to point at your docs, then run docker compose up.

Installation

# with pip
pip install git+https://github.com/derberg/OpenCrane.git

# with uv
uv pip install git+https://github.com/derberg/OpenCrane.git

Usage

CLI

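Commands that run a pipeline step or the server (build, fetch, llms, chunk, embed, index, serve, inspect) accept --config myproject.config:MyConfig to load a custom OpenCraneConfig subclass; the init and tokens synopses below list no --config option.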

`opencrane init` — scaffold a new project

opencrane init [--podman] [--force]

Creates the .opencrane/ directory and container files in the current directory:

Generated file	Description
`.opencrane/config.py`	`OpenCraneConfig` subclass template with commented extension points
`.opencrane/sources.yaml`	Source mapping template with commented remote and local examples
`.opencrane/README.md`	Quick reference for the `.opencrane/` directory
`Dockerfile`	Multi-stage build: deps → model download → Milvus index → runtime
`docker-compose.yml`	Builds and runs the MCP server on port 8000

Flag	Description
`--podman`	Generate `Containerfile` instead of `Dockerfile`; README uses `podman` commands
`--force`	Overwrite existing files (default: skip)

Convention: OpenCrane auto-discovers .opencrane/config.py as the project config, so no --config flag or OPENCRANE_CONFIG env var is needed when using the .opencrane/ layout.
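
The lookup order implied above can be sketched in a few lines of Python. This is an illustration, not OpenCrane's actual loader: `resolve_config_spec` and `load_class` are hypothetical names, and trying the `.opencrane/config.py` convention last is an assumption. The only ordering this README states is that CLI flags take precedence over environment variables.

```python
import importlib
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_spec(flag_value: Optional[str], env: Mapping[str, str],
                        project_dir: Path) -> Optional[str]:
    """Return a config reference, or None to fall back to built-in defaults."""
    if flag_value:                                  # 1. explicit --config flag
        return flag_value
    if env.get("OPENCRANE_CONFIG"):                 # 2. OPENCRANE_CONFIG env var
        return env["OPENCRANE_CONFIG"]
    conventional = project_dir / ".opencrane" / "config.py"
    if conventional.is_file():                      # 3. conventional location (assumed last)
        return str(conventional)
    return None


def load_class(spec: str):
    """Import a 'package.module:ClassName' reference and return the class."""
    module_name, _, class_name = spec.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    return getattr(importlib.import_module(module_name), class_name)
```

With this sketch, `load_class("myproject.config:MyConfig")` would import `myproject.config` and return its `MyConfig` attribute, provided the module is importable from the current working directory.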

`opencrane build` — full pipeline

opencrane build [--config CLASS] [--sources-dir PATH]... [--llmstxt-dir PATH]
                [--chunks-file PATH] [--embeddings-file PATH]

Runs all steps in sequence: fetch → llms → chunk → embed → index.

Flag	Description
`--sources-dir PATH`	Source directory to process; repeat for multiple dirs (overrides `AI_DOCS_SOURCES_DIRS` env var)
`--llmstxt-dir PATH`	Output directory for llms-full.txt files, and input directory for the chunk step (overrides `AI_DOCS_LLMSTXT_DIR` env var)
`--chunks-file PATH`	Output path for chunks JSON, and input for the embed step (overrides `AI_DOCS_CHUNKS_FILE` env var)
`--embeddings-file PATH`	Output path for embeddings JSON (overrides `AI_DOCS_EMBEDDINGS_FILE` env var)
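
When you need to run something between steps (for example a custom validation), the build sequence described above can be driven step by step from Python. This is a minimal sketch that assumes running the five subcommands individually has the same effect as `opencrane build`; `pipeline_commands` and `run_pipeline` are hypothetical helpers, not part of OpenCrane.

```python
import subprocess
from typing import List, Optional

# Step order as documented for `opencrane build`.
STEPS = ["fetch", "llms", "chunk", "embed", "index"]


def pipeline_commands(config: Optional[str] = None) -> List[List[str]]:
    """Build one argv list per step, optionally passing --config to each."""
    extra = ["--config", config] if config else []
    return [["opencrane", step, *extra] for step in STEPS]


def run_pipeline(config: Optional[str] = None) -> None:
    """Run the steps in order and stop at the first failing step."""
    for argv in pipeline_commands(config):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(argv, check=True)
```

Calling `run_pipeline("myproject.config:MyConfig")` would then execute `opencrane fetch --config myproject.config:MyConfig` first and continue through `index`.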

`opencrane fetch` — fetch docs from GitHub

opencrane fetch [--config CLASS] [--org NAME] [--repo PATH_KEY]

Flag	Description
`--org NAME`	GitHub organisation to fetch from (overrides `ORG_NAME` env var)
`--repo PATH_KEY`	Fetch only this one repo by its path key in `.opencrane/sources.yaml`, e.g. `external-sources/my-repo` (overrides `FETCH_REPO` env var)

`opencrane llms` — generate llms-full.txt bundles

opencrane llms [--config CLASS] [--sources-dir PATH]... [--llmstxt-dir PATH] [--force]

Flag	Description
`--sources-dir PATH`	Source directory to process; repeat for multiple dirs (overrides `AI_DOCS_SOURCES_DIRS` env var)
`--llmstxt-dir PATH`	Output directory for llms-full.txt files (overrides `AI_DOCS_LLMSTXT_DIR` env var)
`--force`	Regenerate even if no git changes are detected in source directories

`opencrane tokens` — token count report

opencrane tokens [--source-dir PATH] [--output-file PATH]

Flag	Description
`--source-dir PATH`	Directory containing llmstxt output to count (overrides `TOKEN_SOURCE_DIR` env var)
`--output-file PATH`	Output path for the markdown report (overrides `TOKEN_OUTPUT_FILE` env var)

`opencrane chunk` — chunk docs into .opencrane/chunks.json

opencrane chunk [--config CLASS] [--llmstxt-dir PATH] [--chunks-file PATH]

Flag	Description
`--llmstxt-dir PATH`	Directory containing llms-full.txt (overrides `AI_DOCS_LLMSTXT_DIR` env var)
`--chunks-file PATH`	Output path for chunks JSON (overrides `AI_DOCS_CHUNKS_FILE` env var)

`opencrane embed` — generate embeddings

opencrane embed [--config CLASS] [--chunks-file PATH] [--embeddings-file PATH]

Flag	Description
`--chunks-file PATH`	Input chunks JSON file (overrides `AI_DOCS_CHUNKS_FILE` env var)
`--embeddings-file PATH`	Output embeddings JSON file (overrides `AI_DOCS_EMBEDDINGS_FILE` env var)

`opencrane index` — load into Milvus

opencrane index [--config CLASS]

`opencrane serve` — start MCP server

opencrane serve [--config CLASS] [--transport stdio|http]

Flag	Description
`--transport stdio`	(default) stdio transport for local MCP clients. Prints integration instructions for Claude Code, Cursor, Windsurf, VS Code, Zed, and Docker/Podman on startup
`--transport http`	HTTP transport on port 8000 (Streamable HTTP, stateless). Used inside Docker/Podman containers. Port configurable via `MCP_HTTP_PORT` env var
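
With the stdio transport, the MCP client launches the server itself, usually from a JSON configuration entry. The sketch below generates such an entry. It assumes the `mcpServers` layout used by clients such as Claude Desktop and Cursor, and the server name `opencrane-docs` is a placeholder; the integration instructions printed by `opencrane serve` remain the authoritative reference for each client.

```python
import json
from typing import Optional


def mcp_client_entry(name: str = "opencrane-docs", config: Optional[str] = None) -> dict:
    """Describe how an MCP client should start the server over stdio."""
    args = ["serve", "--transport", "stdio"]
    if config:
        args += ["--config", config]
    return {"mcpServers": {name: {"command": "opencrane", "args": args}}}


# Print the entry so it can be pasted into a client's MCP settings file.
print(json.dumps(mcp_client_entry(), indent=2))
```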

`opencrane inspect` — launch MCP Inspector

opencrane inspect [--config CLASS]

Launches the MCP Inspector web UI connected to the server via stdio — no Docker required. Requires npx (Node.js).

Web UI available at http://localhost:5173.

Default file and directory names

OpenCrane uses these defaults for all pipeline output. Override them with CLI flags (one-off) or environment variables (persistent):

File / directory	Default	CLI flag	Env var
llms-full.txt output dir	`.opencrane/llmstxt`	`--llmstxt-dir`	`AI_DOCS_LLMSTXT_DIR`
Chunks file	`.opencrane/chunks.json`	`--chunks-file`	`AI_DOCS_CHUNKS_FILE`
Embeddings file	`.opencrane/embeddings.json`	`--embeddings-file`	`AI_DOCS_EMBEDDINGS_FILE`
Token report output	`.opencrane/llmstxt/README.md`	`--output-file`	`TOKEN_OUTPUT_FILE`
Source mapping file	`.opencrane/sources.yaml`	(none)	`MAPPING_FILE`
Milvus database file (Lite mode)	unset (server mode is used)	(none)	`MILVUS_DB_PATH`

Environment variables

CLI flags take precedence over environment variables. Use env vars for persistent defaults (e.g. in CI/CD), and flags for one-off overrides.
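
The precedence rule can be expressed as a small helper. This is a sketch of the rule as stated, not OpenCrane's code: `resolve_setting` is a hypothetical name, and treating an empty environment variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_setting(flag_value: Optional[str], env_var: str, default: str,
                    env: Mapping[str, str] = os.environ) -> str:
    """CLI flag first, then environment variable, then the built-in default."""
    if flag_value is not None:
        return flag_value
    # Assumption: an empty environment variable counts as "not set".
    return env.get(env_var) or default
```

For example, `resolve_setting(None, "AI_DOCS_CHUNKS_FILE", ".opencrane/chunks.json")` returns the environment value when one is exported and the documented default otherwise.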

fetch and llms steps — shared configuration for source tracking:

Variable	Default	Description
`MAPPING_FILE`	`.opencrane/sources.yaml`	Path to the source mapping file used by `fetch` (to record cloned repos) and `llms` (to embed source links)

fetch step — only needed if you use opencrane fetch to pull docs from GitHub:

Variable	Default	Description
`ORG_NAME`	(unset)	GitHub organisation to fetch repositories from (see also `--org` flag)
`FETCH_REPO`	(unset)	Restrict fetch to a single repo by path key (see also `--repo` flag)
`GITHUB_TOKEN`	(unset)	GitHub API token for authenticated requests
`DOCS_TOPIC`	`documentation`	GitHub topic used to discover repositories automatically within the org
`AUTO_DISCOVERY_ORGS`	(unset)	Whitelist of orgs where topic-based auto-discovery is enabled
`TARGET_DIR`	`external-sources`	Local directory where fetched docs are stored
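
Topic-based discovery of this kind can be done with GitHub's repository search API, which supports `org:` and `topic:` qualifiers. The sketch below only builds the request and does not send it. Whether OpenCrane uses this endpoint is an assumption, and `discovery_request` is a hypothetical helper.

```python
import urllib.parse
import urllib.request
from typing import Optional


def discovery_request(org: str, topic: str = "documentation",
                      token: Optional[str] = None) -> urllib.request.Request:
    """Build a GitHub search request for repos in `org` tagged with `topic`."""
    query = urllib.parse.urlencode({"q": f"org:{org} topic:{topic}", "per_page": 100})
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests get a higher rate limit and can see private repos.
        headers["Authorization"] = f"Bearer {token}"
    url = f"https://api.github.com/search/repositories?{query}"
    return urllib.request.Request(url, headers=headers)
```

Passing the result to `urllib.request.urlopen` would return a JSON document whose `items` array lists the matching repositories.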

llms step — only needed if you use opencrane llms to generate llms-full.txt bundles:

Variable	Default	Description
`AI_DOCS_SOURCES_DIRS`	value of `TARGET_DIR`	Comma-separated list of source directories to process (see also `--sources-dir` flag). Set it explicitly when the docs were not pulled with `opencrane fetch`
`AI_DOCS_LLMSTXT_DIR`	`.opencrane/llmstxt`	Output directory for generated llms-full.txt files (see also `--llmstxt-dir` flag)

tokens step — only needed if you use opencrane tokens:

Variable	Default	Description
`TOKEN_SOURCE_DIR`	`.opencrane/llmstxt`	Directory containing llmstxt output to count (see also `--source-dir` flag)
`TOKEN_OUTPUT_FILE`	`.opencrane/llmstxt/README.md`	Output path for the markdown report (see also `--output-file` flag)

chunk step — only needed if you use opencrane chunk:

Variable	Default	Description
`AI_DOCS_LLMSTXT_DIR`	`.opencrane/llmstxt`	Directory containing llms-full.txt (see also `--llmstxt-dir` flag)
`AI_DOCS_CHUNKS_FILE`	`.opencrane/chunks.json`	Output path for the generated chunks (see also `--chunks-file` flag)

embed step — only needed if you use opencrane embed:

Variable	Default	Description
`AI_DOCS_CHUNKS_FILE`	`.opencrane/chunks.json`	Input chunks JSON file (see also `--chunks-file` flag)
`AI_DOCS_EMBEDDINGS_FILE`	`.opencrane/embeddings.json`	Output path for the generated embeddings (see also `--embeddings-file` flag)
`EMBEDDING_MODEL`	`nomic-ai/nomic-embed-text-v1.5`	HuggingFace embedding model to use

index and serve steps — needed when loading into Milvus or running the MCP server:

OpenCrane supports two Milvus modes. Set MILVUS_DB_PATH to use Milvus Lite (a local file, no server needed — good for local dev). Leave it unset to connect to a Milvus server via MILVUS_HOST and MILVUS_PORT.
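
The mode selection can be sketched as a function that turns the three variables into a single connection URI. The sketch assumes the pymilvus `MilvusClient` convention, where a local file path selects Milvus Lite and an `http://host:port` URI selects a server; how OpenCrane itself connects is not shown in this README, and `milvus_uri` is a hypothetical helper.

```python
from typing import Mapping


def milvus_uri(env: Mapping[str, str]) -> str:
    """Pick Milvus Lite when MILVUS_DB_PATH is set, otherwise a server URI."""
    db_path = env.get("MILVUS_DB_PATH")
    if db_path:
        return db_path                      # Lite mode: host and port are ignored
    host = env.get("MILVUS_HOST", "localhost")
    port = env.get("MILVUS_PORT", "19530")
    return f"http://{host}:{port}"


# With pymilvus installed, the result could be used like this:
#   import os
#   from pymilvus import MilvusClient
#   client = MilvusClient(uri=milvus_uri(os.environ))
```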

Variable	Default	Description
`MILVUS_DB_PATH`	(unset)	Path to a local Milvus Lite database file (e.g. `./milvus.db`). When set, `MILVUS_HOST` and `MILVUS_PORT` are ignored
`MILVUS_HOST`	`localhost`	Milvus server host (server mode only)
`MILVUS_PORT`	`19530`	Milvus server port (server mode only)
`MILVUS_COLLECTION`	`ai_docs_chunks_v1`	Milvus collection name
`HYBRID_ALPHA`	`0.6`	Weight of vector search vs keyword search (1.0 = pure vector, 0.0 = pure BM25)
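
A common way to apply a weight like `HYBRID_ALPHA` is a linear blend of the two scores. The sketch below assumes both scores are already normalized to the range 0 to 1; the exact fusion formula OpenCrane uses is not documented here, so treat `hybrid_score` as an illustration of the endpoints described in the table.

```python
def hybrid_score(vector_score: float, bm25_score: float, alpha: float = 0.6) -> float:
    """Blend a vector-similarity score with a BM25 score.

    alpha = 1.0 keeps only the vector score, alpha = 0.0 only the BM25 score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return alpha * vector_score + (1.0 - alpha) * bm25_score
```

With the default of 0.6, a chunk that scores 0.5 on vector similarity and 1.0 on BM25 ends up at 0.6 × 0.5 + 0.4 × 1.0 = 0.7.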

Source mapping file (`.opencrane/sources.yaml`)

OpenCrane maintains a file called .opencrane/sources.yaml that records where each documentation source lives and where its content can be found online. It is used by the fetch step (to track cloned repos) and by the llms step (to embed source links in llms-full.txt). The fetch step populates it automatically; for manually managed sources you can edit it directly.

Each entry supports the following fields:

Field	Required	Description
`github_url`	Yes (for `fetch`)	GitHub repository URL — used by `opencrane fetch` to clone the repo and as a fallback source link in llms-full.txt
`docs_path`	No	Path within the repo where docs are stored (e.g. `docs`)
`docs_url`	No	Base URL of the published documentation site (e.g. `https://docs.example.com/product`). When set, this is used instead of `github_url` when embedding source links in llms-full.txt — lets AI agents point users to rendered docs rather than raw GitHub files. If neither is set, no source links are embedded.
`manual`	No	When `true`, the entry is user-managed and will not be overwritten by `opencrane fetch` auto-discovery

Example:

sources:
  external-sources/my-product:
    github_url: https://github.com/myorg/my-product
    docs_path: docs
    docs_url: https://docs.myorg.com/my-product
    manual: true
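
The two rules from the field table (prefer `docs_url` over `github_url` for source links, and leave `manual` entries alone during auto-discovery) can be sketched as follows. The entry is written as a plain dict mirroring the YAML above; `source_link_base` and `fetch_may_overwrite` are hypothetical helpers, not OpenCrane functions.

```python
from typing import Optional


def source_link_base(entry: dict) -> Optional[str]:
    """Return the base URL for source links, or None if none can be built."""
    return entry.get("docs_url") or entry.get("github_url") or None


def fetch_may_overwrite(entry: dict) -> bool:
    """Entries marked manual are user-managed and must be left untouched."""
    return not entry.get("manual", False)


# The same entry as in the YAML example above.
sources = {
    "external-sources/my-product": {
        "github_url": "https://github.com/myorg/my-product",
        "docs_path": "docs",
        "docs_url": "https://docs.myorg.com/my-product",
        "manual": True,
    }
}
```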

Extending OpenCrane

Subclass OpenCraneConfig to register project-specific extensions:

# myproject/config.py
from pathlib import Path

from opencrane import OpenCraneConfig
from opencrane.fences import CodeFenceConfig
from opencrane.rag.services.yaml_chunker import YamlChunkingStrategy
from opencrane.rag.services.code_chunker import CodeChunkingStrategy
from opencrane.rag.services.prose_chunker import ProseChunkingStrategy
from myproject.strategies.custom import CustomChunkingStrategy
from myproject.walkers.terraform import TerraformTreeWalker

def my_openapi_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    # content is the raw text inside the ```openapi ... ``` block; the other
    # arguments describe the file being processed (see "Writing a custom fence type")
    # process it however you like and return the replacement string
    return f"```yaml\n{content}\n```\n"

class MyConfig(OpenCraneConfig):
    fence_types = {
        "openapi": CodeFenceConfig(fence_type="openapi", handler=my_openapi_handler),
    }
    chunking_strategies = [
        YamlChunkingStrategy(),
        CustomChunkingStrategy(),
        CodeChunkingStrategy(),
        ProseChunkingStrategy(),
    ]
    yaml_tree_walkers = [
        *OpenCraneConfig.yaml_tree_walkers,  # keep CRD, OpenAPI, JSON Schema
        TerraformTreeWalker,
    ]

Then use it:

opencrane build --config myproject.config:MyConfig

Extension points

Extension point	Pipeline step	What it does
`fence_types`	`llms`	Register custom fence language identifiers and control how matching blocks are transformed during llms-full.txt generation
`chunking_strategies`	`chunk`	Add or replace chunking strategies for different content types
`yaml_tree_walkers`	`chunk`	Add walkers for custom YAML document formats

Built-in YAML tree walkers

K8sCRDTreeWalker — Kubernetes CustomResourceDefinitions
OpenAPITreeWalker — OpenAPI 3.x specs
JsonSchemaTreeWalker — JSON Schema documents

Writing a custom fence type

Register a fence language identifier and provide a handler function. When a ```my-type ... ``` block is encountered during llms generation, OpenCrane calls your handler with the raw block content plus the file context, and replaces the block with the returned string.

from pathlib import Path
from opencrane.fences import CodeFenceConfig

def my_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    # content      — raw text inside the fence block
    # file_path    — path of the markdown file containing the block
    # project_dir  — root directory of the project being processed
    # project_name — name of the project (used for source URL building)
    # return the full replacement string
    return f"```yaml\n# processed\n{content}\n```\n"

fence_types = {
    "my-type": CodeFenceConfig(fence_type="my-type", handler=my_handler),
}
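
To make the handler contract concrete, here is a sketch of how a fence block could be located and handed to a registered handler. It is not OpenCrane's implementation: `apply_fence_handlers` is a hypothetical function, handlers are looked up in a plain dict rather than through `CodeFenceConfig`, and the regular expression ignores indented, nested, and longer-than-three-backtick fences.

```python
import re
from pathlib import Path
from typing import Callable, Dict

Handler = Callable[[str, Path, Path, str], str]

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_RE = re.compile(
    rf"^{FENCE}([\w-]+)[ \t]*\n(.*?)^{FENCE}[ \t]*\n?",
    re.DOTALL | re.MULTILINE,
)


def apply_fence_handlers(markdown: str, handlers: Dict[str, Handler],
                         file_path: Path, project_dir: Path, project_name: str) -> str:
    """Replace each registered fence block with its handler's return value."""
    def replace(match: re.Match) -> str:
        handler = handlers.get(match.group(1))
        if handler is None:
            return match.group(0)            # unregistered fence types pass through
        content = match.group(2).rstrip("\n")
        return handler(content, file_path, project_dir, project_name)

    return FENCE_RE.sub(replace, markdown)
```

Because the handler's return value replaces the whole block, including both fence lines, a handler that wants the output to stay a code block has to emit its own fences, as `my_handler` does above.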

To inline a file referenced by path inside the block, use get_github_url from opencrane.fences to add a source annotation:

from pathlib import Path
from opencrane.fences import CodeFenceConfig, get_github_url

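def inline_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    # The block body is a path relative to the markdown file that contains it
    target = (file_path.parent / content.strip()).resolve()
    language = "json" if target.suffix == ".json" else "yaml"
    # Resolve project_dir too, so relative_to() compares two absolute paths
    relative = target.relative_to(project_dir.resolve())
    gh_url = get_github_url(Path(project_name) / relative, project_name)
    file_content = target.read_text(encoding="utf-8").rstrip("\n")
    # Annotate the inlined file with its source URL when one is available
    if gh_url:
        return f"```{language}\n# Source: {gh_url}\n{file_content}\n```\n"
    return f"```{language}\n{file_content}\n```\n"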

fence_types = {
    "my-type": CodeFenceConfig(fence_type="my-type", handler=inline_handler),
}

Writing a custom YAML tree walker

from opencrane.walkers.base import YamlTreeWalker

class TerraformTreeWalker(YamlTreeWalker):
    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return "terraform" in doc

    def walk(self):
        # return List[Chunk]
        ...
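
The role of `can_handle` is easiest to see in a dispatch loop. The sketch below is self-contained and uses a stand-in base class, so several details are assumptions rather than OpenCrane's API: that a walker is constructed with the parsed document, that the first matching walker wins, and that plain strings stand in for `Chunk` objects.

```python
from typing import List, Optional, Type


class YamlTreeWalker:
    """Stand-in for opencrane.walkers.base.YamlTreeWalker (assumed shape)."""

    def __init__(self, doc: dict):
        self.doc = doc

    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        raise NotImplementedError

    def walk(self) -> list:
        raise NotImplementedError


class TerraformTreeWalker(YamlTreeWalker):
    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return "terraform" in doc

    def walk(self) -> list:
        # One string per top-level key; a real walker would return Chunk objects.
        return [f"{key}: {value!r}" for key, value in self.doc.items()]


def pick_walker(doc: dict, walkers: List[Type[YamlTreeWalker]]) -> Optional[YamlTreeWalker]:
    """Return an instance of the first walker that accepts the document."""
    for walker_cls in walkers:
        if walker_cls.can_handle(doc):
            return walker_cls(doc)
    return None
```

Under the first-match assumption, the order of `yaml_tree_walkers` matters: a walker with a broad `can_handle` placed early would shadow more specific walkers listed after it.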

Development

git clone https://github.com/derberg/OpenCrane.git
cd OpenCrane

# with pip
pip install -e ".[dev]"

# with uv
uv sync --extra dev

pytest

License

Apache-2.0

Project details

Release history

0.18.1

May 22, 2026

0.18.0

May 22, 2026

0.17.4

May 14, 2026

0.17.2

May 14, 2026

0.17.1

May 11, 2026

0.17.0

May 11, 2026

0.16.1

May 7, 2026

0.16.0

May 5, 2026

0.15.0

Apr 23, 2026

0.14.0

Apr 14, 2026

0.13.0

Apr 7, 2026

0.12.0

Apr 2, 2026

0.11.2

Apr 2, 2026

0.11.1

Apr 1, 2026

0.11.0

Apr 1, 2026

0.10.1

Mar 31, 2026

0.10.0

Mar 31, 2026

0.9.7

Mar 30, 2026

0.9.6

Mar 30, 2026

0.9.5

Mar 30, 2026

0.9.4

Mar 30, 2026

0.9.3

Mar 30, 2026

0.9.2

Mar 25, 2026

0.9.1

Mar 25, 2026

0.9.0

Mar 25, 2026

0.8.0

Mar 25, 2026

0.7.6

Mar 24, 2026

0.7.5

Mar 24, 2026

0.7.4

Mar 24, 2026

0.7.3

Mar 24, 2026

0.7.2

Mar 24, 2026

0.7.0

Mar 24, 2026

0.6.3

Mar 24, 2026

0.6.2

Mar 24, 2026

0.6.1

Mar 24, 2026

0.6.0

Mar 24, 2026

0.4.0

Mar 24, 2026

0.3.4

Mar 24, 2026

0.3.3

Mar 24, 2026

0.3.2

Mar 24, 2026

0.3.1

Mar 24, 2026

0.3.0

Mar 23, 2026

0.2.0

Mar 23, 2026

This version

0.1.0

Mar 23, 2026

Download files

Source Distribution

opencrane-0.1.0.tar.gz (84.1 kB)

Uploaded Mar 23, 2026 Source

Built Distribution

opencrane-0.1.0-py3-none-any.whl (105.4 kB)

Uploaded Mar 23, 2026 Python 3

File details

Details for the file opencrane-0.1.0.tar.gz.

File metadata

Download URL: opencrane-0.1.0.tar.gz
Upload date: Mar 23, 2026
Size: 84.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for opencrane-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6b3867530456d91382ae3ae2ae40e0734ef848f62f18db2491ad152f8df000fb`
MD5	`5cf9a87b603d3a1113e8af2f9bea3de9`
BLAKE2b-256	`d437eb54d681ba0dbbc2ef3e914bc59af935244669fe9600b093d0527405cdf4`

Provenance

The following attestation bundles were made for opencrane-0.1.0.tar.gz:

Publisher: publish-pypi.yml on derberg/OpenCrane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: opencrane-0.1.0.tar.gz
- Subject digest: 6b3867530456d91382ae3ae2ae40e0734ef848f62f18db2491ad152f8df000fb
- Sigstore transparency entry: 1161358606
- Sigstore integration time: Mar 23, 2026
Source repository:
- Permalink: derberg/OpenCrane@6be3733e959bd79b52b9ac70d6f68ac4c5433988
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/derberg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@6be3733e959bd79b52b9ac70d6f68ac4c5433988
- Trigger Event: release

File details

Details for the file opencrane-0.1.0-py3-none-any.whl.

File metadata

Download URL: opencrane-0.1.0-py3-none-any.whl
Upload date: Mar 23, 2026
Size: 105.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for opencrane-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`65a8a18dd3613aee29f28f67bbc16a8b68c5a8724a40a42e2e6a0b67589be30f`
MD5	`570bb96ccda9b5329cb0d5bd47990c2a`
BLAKE2b-256	`7417c0ed4a78e211a22698d1257e9946cde5fa4940d1ff2f196b31fc3c366e88`

Provenance

The following attestation bundles were made for opencrane-0.1.0-py3-none-any.whl:

Publisher: publish-pypi.yml on derberg/OpenCrane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: opencrane-0.1.0-py3-none-any.whl
- Subject digest: 65a8a18dd3613aee29f28f67bbc16a8b68c5a8724a40a42e2e6a0b67589be30f
- Sigstore transparency entry: 1161358664
- Sigstore integration time: Mar 23, 2026
Source repository:
- Permalink: derberg/OpenCrane@6be3733e959bd79b52b9ac70d6f68ac4c5433988
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/derberg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@6be3733e959bd79b52b9ac70d6f68ac4c5433988
- Trigger Event: release
