A standalone, extensible RAG/MCP library for building AI-powered documentation search.

A standalone, extensible RAG/MCP pipeline for building AI-powered documentation search. Fetch docs from GitHub, generate llms-full.txt bundles, chunk and embed them, index into Milvus, and serve via an MCP server — all from one CLI.

Features

  • Flexible RAG pipeline: run the full flow (fetch → generate llms-full.txt → chunk → embed → index → serve) or use only the steps you need
  • MCP server: exposes search tools consumable by Claude, Cursor, and any MCP-compatible client
  • Extensible: subclass OpenCraneConfig to add custom fence types, chunking strategies, and YAML tree walkers
  • CLI: every pipeline step is a subcommand; works in CI/CD and non-Python projects

Credits

OpenCrane was born from a real-world use case at Cennso — building AI-powered search over telco product documentation.

This project stands on the shoulders of some excellent open-source work:

Quick start

Scaffold a new project without installing anything:

uvx opencrane init

This creates .opencrane/, Dockerfile, and docker-compose.yml in the current directory and walks you through adding documentation sources interactively. Then run opencrane build and opencrane serve.

Installation

# with pip
pip install opencrane

# with uv
uv pip install opencrane

# with uvx (no install needed)
uvx opencrane <command>

Usage

CLI

All commands accept --config myproject.config:MyConfig to load a custom OpenCraneConfig subclass.
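
The `module.path:ClassName` form resembles the entry-point notation used across Python tooling. As a rough sketch of how such a spec can be resolved (the helper name `load_object` is hypothetical, and this is not necessarily how OpenCrane loads its config):

```python
import importlib


def load_object(spec: str):
    """Resolve a 'package.module:Attribute' spec to the named object."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:Attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstrated with a standard-library class so the snippet runs anywhere;
# with OpenCrane the spec would look like "myproject.config:MyConfig".
ordered_dict_cls = load_object("collections:OrderedDict")
print(ordered_dict_cls.__name__)
```

The module part must be importable from the directory where the command runs, which is worth checking first if a custom config is not picked up.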

opencrane init — scaffold a new project

opencrane init [--podman] [--force] [--no-add]

Creates the .opencrane/ directory and container files in the current directory:

| Generated file | Description |
| --- | --- |
| .opencrane/config.py | OpenCraneConfig subclass template with commented extension points |
| .opencrane/sources.yaml | Source mapping template with commented remote and local examples |
| .opencrane/README.md | Quick reference for the .opencrane/ directory |
| Dockerfile | Multi-stage build: deps → model download → Milvus index → runtime |
| docker-compose.yml | Builds and runs the MCP server on port 8000 |

| Flag | Description |
| --- | --- |
| --podman | Generate Containerfile instead of Dockerfile; README uses podman commands |
| --force | Overwrite existing files (default: skip) |
| --no-add | Skip the interactive source addition prompt (useful for CI/scripts) |

Convention: OpenCrane auto-discovers .opencrane/config.py as the project config, so no --config flag or OPENCRANE_CONFIG env var is needed when using the .opencrane/ layout.

After scaffolding, init prompts you to add documentation sources interactively (same flow as opencrane add). Use --no-add to skip the prompt.

opencrane add — add documentation sources

opencrane add

Interactively add documentation sources to your project. The command loops, and on each pass you choose one of two source types:

  1. GitHub repository — adds an entry to .opencrane/sources.yaml with the repo URL, docs path, and optional published docs URL. The fetch step will clone it on the next opencrane build.
  2. Existing llms.txt file — provide a URL or local file path. OpenCrane downloads/copies it into .opencrane/llmstxt/<name>/llms-full.txt, ready for chunking. No fetch or llms step needed for these sources.

After each source, you're asked whether to add another or finish.
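
For the second source type with a local file, the resulting layout can be sketched as follows. This is an illustration of the documented target path only; the function name `import_llms_file` is hypothetical and URL downloads are not covered.

```python
import shutil
import tempfile
from pathlib import Path


def import_llms_file(source: Path, name: str, llmstxt_dir: Path) -> Path:
    """Copy an existing llms.txt-style file to <llmstxt_dir>/<name>/llms-full.txt."""
    target = llmstxt_dir / name / "llms-full.txt"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target


# Demo in a throwaway directory so nothing in a real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "llms.txt"
    source.write_text("# My product docs\n", encoding="utf-8")
    target = import_llms_file(source, "my-product", root / ".opencrane" / "llmstxt")
    print(target.relative_to(root).as_posix())
```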

opencrane build — full pipeline

opencrane build [--config CLASS] [--sources-dir PATH]... [--llmstxt-dir PATH]
                [--chunks-file PATH] [--embeddings-file PATH]

Runs all steps in sequence: fetch → llms → chunk → embed → index.

| Flag | Description |
| --- | --- |
| --sources-dir PATH | Source directory to process; repeat for multiple dirs (overrides AI_DOCS_SOURCES_DIRS env var) |
| --llmstxt-dir PATH | Output directory for llms-full.txt files, and input directory for the chunk step (overrides AI_DOCS_LLMSTXT_DIR env var) |
| --chunks-file PATH | Output path for chunks JSON, and input for the embed step (overrides AI_DOCS_CHUNKS_FILE env var) |
| --embeddings-file PATH | Output path for embeddings JSON (overrides AI_DOCS_EMBEDDINGS_FILE env var) |
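
Because every step is also its own subcommand, the same sequence can be driven step by step, for example from a CI script. The sketch below only builds the argument lists; `pipeline_commands` is a hypothetical helper, not part of OpenCrane.

```python
import shlex

# Step order as listed for `opencrane build`.
PIPELINE_STEPS = ["fetch", "llms", "chunk", "embed", "index"]


def pipeline_commands(config=None):
    """Return one argv list per pipeline step, optionally with --config."""
    commands = []
    for step in PIPELINE_STEPS:
        argv = ["opencrane", step]
        if config:
            argv += ["--config", config]
        commands.append(argv)
    return commands


for argv in pipeline_commands("myproject.config:MyConfig"):
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` so that a failing step stops the run before later steps consume stale files.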

opencrane fetch — fetch docs from GitHub

opencrane fetch [--config CLASS] [--org NAME] [--repo PATH_KEY]

| Flag | Description |
| --- | --- |
| --org NAME | GitHub organisation to fetch from (overrides ORG_NAME env var) |
| --repo PATH_KEY | Fetch only this one repo by its path key in .opencrane/sources.yaml, e.g. external-sources/my-repo (overrides FETCH_REPO env var) |

opencrane llms — generate llms-full.txt bundles

opencrane llms [--config CLASS] [--sources-dir PATH]... [--llmstxt-dir PATH] [--force]

| Flag | Description |
| --- | --- |
| --sources-dir PATH | Source directory to process; repeat for multiple dirs (overrides AI_DOCS_SOURCES_DIRS env var) |
| --llmstxt-dir PATH | Output directory for llms-full.txt files (overrides AI_DOCS_LLMSTXT_DIR env var) |
| --force | Regenerate even if no git changes are detected in source directories |

opencrane tokens — token count report

opencrane tokens [--source-dir PATH] [--output-file PATH]

| Flag | Description |
| --- | --- |
| --source-dir PATH | Directory containing llmstxt output to count (overrides TOKEN_SOURCE_DIR env var) |
| --output-file PATH | Output path for the markdown report (overrides TOKEN_OUTPUT_FILE env var) |

opencrane chunk — chunk docs into .opencrane/chunks.json

opencrane chunk [--config CLASS] [--llmstxt-dir PATH] [--chunks-file PATH]

| Flag | Description |
| --- | --- |
| --llmstxt-dir PATH | Directory containing llms-full.txt (overrides AI_DOCS_LLMSTXT_DIR env var) |
| --chunks-file PATH | Output path for chunks JSON (overrides AI_DOCS_CHUNKS_FILE env var) |

opencrane embed — generate embeddings

opencrane embed [--config CLASS] [--chunks-file PATH] [--embeddings-file PATH]

| Flag | Description |
| --- | --- |
| --chunks-file PATH | Input chunks JSON file (overrides AI_DOCS_CHUNKS_FILE env var) |
| --embeddings-file PATH | Output embeddings JSON file (overrides AI_DOCS_EMBEDDINGS_FILE env var) |

opencrane index — load into Milvus

opencrane index [--config CLASS]

opencrane serve — start MCP server

opencrane serve [--config CLASS] [--transport stdio|http]

| Flag | Description |
| --- | --- |
| --transport stdio | Default. stdio transport for local MCP clients. Prints integration instructions for Claude Code, Cursor, Windsurf, VS Code, Zed, and Docker/Podman on startup |
| --transport http | HTTP transport on port 8000 (Streamable HTTP, stateless). Used inside Docker/Podman containers. Port configurable via MCP_HTTP_PORT env var |
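
Many MCP clients register stdio servers through a JSON file with an `mcpServers` mapping. The sketch below generates such an entry; the server name `my-docs` and the helper `stdio_server_entry` are illustrative assumptions, and the exact file location and schema depend on the client, so treat the instructions printed by opencrane serve as the authority.

```python
import json


def stdio_server_entry(name, config=None):
    """Build an `mcpServers`-style entry that launches the server over stdio."""
    args = ["serve", "--transport", "stdio"]
    if config:
        args += ["--config", config]
    return {"mcpServers": {name: {"command": "opencrane", "args": args}}}


print(json.dumps(stdio_server_entry("my-docs"), indent=2))
```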

opencrane pack — package for distribution

opencrane pack [--name NAME] [--output PATH] [--version VERSION]

Packages the built MCP server and data into a standalone Python package that others can run via uvx. After packing, share a one-liner:

# From PyPI (after publishing)
claude mcp add my-docs -- uvx my-docs-mcp

# From GitHub
claude mcp add my-docs -- uvx --from "git+https://github.com/you/my-docs-mcp" my-docs-mcp

# From local path
claude mcp add my-docs -- uvx --from .opencrane/pack/my-docs-mcp my-docs-mcp

The generated package includes the Milvus database and chunk index — recipients don't need to rebuild anything. The embedding model is downloaded automatically on first use.

Run opencrane build before packing. Use --version to bump the version when re-packing updated docs (so uvx pulls the new version instead of serving its cache).

Install the optional build dependency for wheel generation: pip install opencrane[pack].

opencrane inspect — launch MCP Inspector

opencrane inspect [--config CLASS]

Launches the MCP Inspector web UI connected to the server via stdio — no Docker required. Requires npx (Node.js).

Web UI available at http://localhost:5173.

Debugging

Enable verbose logging for any command:

LOG_LEVEL=DEBUG opencrane build
LOG_LEVEL=DEBUG opencrane add

Default file and directory names

OpenCrane uses these defaults for all pipeline output. Override them with CLI flags (one-off) or environment variables (persistent):

| File / directory | Default | CLI flag | Env var |
| --- | --- | --- | --- |
| llms-full.txt output dir | .opencrane/llmstxt | --llmstxt-dir | AI_DOCS_LLMSTXT_DIR |
| Chunks file | .opencrane/chunks.json | --chunks-file | AI_DOCS_CHUNKS_FILE |
| Embeddings file | .opencrane/embeddings.json | --embeddings-file | AI_DOCS_EMBEDDINGS_FILE |
| Token report output | .opencrane/llmstxt/README.md | --output-file | TOKEN_OUTPUT_FILE |
| Source mapping file | .opencrane/sources.yaml | | MAPPING_FILE |
| Milvus database file (Lite mode) | (server mode) | | MILVUS_DB_PATH |

Environment variables

CLI flags take precedence over environment variables. Use env vars for persistent defaults (e.g. in CI/CD), and flags for one-off overrides.
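
The precedence rule can be sketched as a three-level lookup. This is an illustration of the documented behaviour, not OpenCrane's own code; `resolve_setting` and its `env` parameter are hypothetical.

```python
import os


def resolve_setting(flag_value, env_var, default, env=None):
    """Return the flag value if given, else the env var, else the default."""
    env = os.environ if env is None else env
    if flag_value is not None:
        return flag_value
    return env.get(env_var, default)


# A plain dict stands in for the process environment in this demo.
fake_env = {"AI_DOCS_CHUNKS_FILE": "ci/chunks.json"}
default = ".opencrane/chunks.json"
print(resolve_setting("tmp/chunks.json", "AI_DOCS_CHUNKS_FILE", default, fake_env))  # flag wins
print(resolve_setting(None, "AI_DOCS_CHUNKS_FILE", default, fake_env))  # env var used
print(resolve_setting(None, "AI_DOCS_CHUNKS_FILE", default, {}))  # falls back to default
```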

fetch and llms steps — shared configuration for source tracking:

| Variable | Default | Description |
| --- | --- | --- |
| MAPPING_FILE | .opencrane/sources.yaml | Path to the source mapping file used by fetch (to record cloned repos) and llms (to embed source links) |

fetch step — only needed if you use opencrane fetch to pull docs from GitHub:

| Variable | Default | Description |
| --- | --- | --- |
| ORG_NAME | (empty) | GitHub organisation to fetch repositories from (see also --org flag) |
| FETCH_REPO | (empty) | Restrict fetch to a single repo by path key (see also --repo flag) |
| GITHUB_TOKEN | (empty) | GitHub API token for authenticated requests |
| DOCS_TOPIC | documentation | GitHub topic used to discover repositories automatically within the org |
| AUTO_DISCOVERY_ORGS | (empty) | Whitelist of orgs where topic-based auto-discovery is enabled |
| TARGET_DIR | external-sources | Local directory where fetched docs are stored |

llms step — only needed if you use opencrane llms to generate llms-full.txt bundles:

| Variable | Default | Description |
| --- | --- | --- |
| AI_DOCS_SOURCES_DIRS | value of TARGET_DIR | Required when not using opencrane fetch. Comma-separated list of source directories to process (see also --sources-dir flag) |
| AI_DOCS_LLMSTXT_DIR | .opencrane/llmstxt | Output directory for generated llms-full.txt files (see also --llmstxt-dir flag) |

tokens step — only needed if you use opencrane tokens:

| Variable | Default | Description |
| --- | --- | --- |
| TOKEN_SOURCE_DIR | .opencrane/llmstxt | Directory containing llmstxt output to count (see also --source-dir flag) |
| TOKEN_OUTPUT_FILE | .opencrane/llmstxt/README.md | Output path for the markdown report (see also --output-file flag) |

chunk step — only needed if you use opencrane chunk:

| Variable | Default | Description |
| --- | --- | --- |
| AI_DOCS_LLMSTXT_DIR | .opencrane/llmstxt | Directory containing llms-full.txt (see also --llmstxt-dir flag) |
| AI_DOCS_CHUNKS_FILE | .opencrane/chunks.json | Output path for the generated chunks (see also --chunks-file flag) |

embed step — only needed if you use opencrane embed:

| Variable | Default | Description |
| --- | --- | --- |
| AI_DOCS_CHUNKS_FILE | .opencrane/chunks.json | Input chunks JSON file (see also --chunks-file flag) |
| AI_DOCS_EMBEDDINGS_FILE | .opencrane/embeddings.json | Output path for the generated embeddings (see also --embeddings-file flag) |
| EMBEDDING_MODEL | nomic-ai/nomic-embed-text-v1.5 | HuggingFace embedding model to use |

index and serve steps — needed when loading into Milvus or running the MCP server:

OpenCrane supports two Milvus modes. Set MILVUS_DB_PATH to use Milvus Lite (a local file, no server needed — good for local dev). Leave it unset to connect to a Milvus server via MILVUS_HOST and MILVUS_PORT.
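
The mode selection described above can be sketched as a small function over the environment. The function name `milvus_target` and the shape of the returned dict are assumptions made for illustration; only the variable names and defaults come from this README.

```python
import os


def milvus_target(env=None):
    """Pick Milvus Lite when MILVUS_DB_PATH is set, otherwise server mode."""
    env = os.environ if env is None else env
    db_path = env.get("MILVUS_DB_PATH")
    if db_path:
        # Lite mode: host and port are ignored.
        return {"mode": "lite", "uri": db_path}
    host = env.get("MILVUS_HOST", "localhost")
    port = int(env.get("MILVUS_PORT", "19530"))
    return {"mode": "server", "host": host, "port": port}


print(milvus_target({"MILVUS_DB_PATH": "./milvus.db", "MILVUS_HOST": "ignored"}))
print(milvus_target({}))
```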

| Variable | Default | Description |
| --- | --- | --- |
| MILVUS_DB_PATH | (empty) | Path to a local Milvus Lite database file (e.g. ./milvus.db). When set, MILVUS_HOST and MILVUS_PORT are ignored |
| MILVUS_HOST | localhost | Milvus server host (server mode only) |
| MILVUS_PORT | 19530 | Milvus server port (server mode only) |
| MILVUS_COLLECTION | ai_docs_chunks_v1 | Milvus collection name |
| HYBRID_ALPHA | 0.6 | Weight of vector search vs keyword search (1.0 = pure vector, 0.0 = pure BM25) |
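
One common way to read a weight like HYBRID_ALPHA is as a linear blend of the two scores. The sketch below assumes that interpretation and assumes both scores are already normalised to the same range; the README does not specify how OpenCrane combines them internally.

```python
def hybrid_score(vector_score, keyword_score, alpha=0.6):
    """Blend a vector-similarity score with a BM25 score using weight alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


print(hybrid_score(0.8, 0.2, alpha=1.0))  # pure vector
print(hybrid_score(0.8, 0.2, alpha=0.0))  # pure BM25
print(hybrid_score(0.8, 0.2))             # default weighting
```

Under this reading, raising the value favours semantic matches, while lowering it favours exact keyword hits such as identifiers or error codes.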

Source mapping file (.opencrane/sources.yaml)

OpenCrane maintains a file called .opencrane/sources.yaml that records where each documentation source lives and where its content can be found online. It is used by the fetch step (to track cloned repos) and by the llms step (to embed source links in llms-full.txt). The fetch step populates it automatically; for manually managed sources you can edit it directly.

Each entry supports the following fields:

| Field | Required | Description |
| --- | --- | --- |
| url | Yes (for fetch) | GitHub repository URL, used by opencrane fetch to clone the repo and as a fallback source link in llms-full.txt |
| docs_path | No | Path within the repo where docs are stored (e.g. docs) |
| docs_url | No | Base URL of the published documentation site (e.g. https://docs.example.com/product). When set, this is used instead of url when embedding source links in llms-full.txt, so AI agents can point users to rendered docs rather than raw GitHub files. If neither is set, no source links are embedded. |
| manual | No | When true, the entry is user-managed and will not be overwritten by opencrane fetch auto-discovery |

Example:

sources:
  external-sources/my-product:
    url: https://github.com/myorg/my-product
    docs_path: docs
    docs_url: https://docs.myorg.com/my-product
    manual: true
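
The fallback between docs_url and url can be sketched as follows, with the entry written as a plain Python dict mirroring the YAML above. The function name `source_link_base` is hypothetical.

```python
def source_link_base(entry):
    """Return docs_url if set, else url, else None (no source links)."""
    return entry.get("docs_url") or entry.get("url") or None


entry = {
    "url": "https://github.com/myorg/my-product",
    "docs_path": "docs",
    "docs_url": "https://docs.myorg.com/my-product",
    "manual": True,
}
print(source_link_base(entry))
print(source_link_base({"url": "https://github.com/myorg/my-product"}))
print(source_link_base({"docs_path": "docs"}))
```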

Extending OpenCrane

Subclass OpenCraneConfig to register project-specific extensions:

# myproject/config.py
from pathlib import Path

from opencrane import OpenCraneConfig
from opencrane.fences import CodeFenceConfig
from opencrane.rag.services.yaml_chunker import YamlChunkingStrategy
from opencrane.rag.services.code_chunker import CodeChunkingStrategy
from opencrane.rag.services.prose_chunker import ProseChunkingStrategy
from myproject.strategies.custom import CustomChunkingStrategy
from myproject.walkers.terraform import TerraformTreeWalker

def my_openapi_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    # content is the raw text inside the ```openapi ... ``` block;
    # the remaining arguments describe the file and project being processed
    # process it however you like and return the replacement string
    return f"```yaml\n{content}\n```\n"

class MyConfig(OpenCraneConfig):
    fence_types = {
        "openapi": CodeFenceConfig(fence_type="openapi", handler=my_openapi_handler),
    }
    chunking_strategies = [
        YamlChunkingStrategy(),
        CustomChunkingStrategy(),
        CodeChunkingStrategy(),
        ProseChunkingStrategy(),
    ]
    yaml_tree_walkers = [
        *OpenCraneConfig.yaml_tree_walkers,  # keep CRD, OpenAPI, JSON Schema
        TerraformTreeWalker,
    ]

Then use it:

opencrane build --config myproject.config:MyConfig

Extension points

| Extension point | Pipeline step | What it does |
| --- | --- | --- |
| fence_types | llms | Register custom fence language identifiers and control how matching blocks are transformed during llms-full.txt generation |
| chunking_strategies | chunk | Add or replace chunking strategies for different content types |
| yaml_tree_walkers | chunk | Add walkers for custom YAML document formats |

Built-in YAML tree walkers

  • K8sCRDTreeWalker — Kubernetes CustomResourceDefinitions
  • OpenAPITreeWalker — OpenAPI 3.x specs
  • JsonSchemaTreeWalker — JSON Schema documents

Writing a custom fence type

Register a fence language identifier and provide a handler function. When a ```my-type ... ``` block is encountered during llms generation, OpenCrane calls your handler with the raw block content plus the file context, and replaces the block with the returned string.

from pathlib import Path
from opencrane.fences import CodeFenceConfig

def my_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    # content      — raw text inside the fence block
    # file_path    — path of the markdown file containing the block
    # project_dir  — root directory of the project being processed
    # project_name — name of the project (used for source URL building)
    # return the full replacement string
    return f"```yaml\n# processed\n{content}\n```\n"

fence_types = {
    "my-type": CodeFenceConfig(fence_type="my-type", handler=my_handler),
}
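
To make the replacement step concrete, here is a minimal, self-contained sketch of how a fence block could be located and swapped for the handler's return value. The regex-based `replace_fences` function is an assumption made for illustration and is not OpenCrane's implementation; it only shows the contract between a block and its handler.

```python
import re
from pathlib import Path


def replace_fences(markdown, fence_type, handler, file_path, project_dir, project_name):
    """Replace every fenced block of the given type with the handler's output."""
    pattern = re.compile(
        r"^```" + re.escape(fence_type) + r"[ \t]*\n(.*?)\n```[ \t]*\n?",
        re.DOTALL | re.MULTILINE,
    )

    def _substitute(match):
        # group(1) is the raw text between the opening and closing fence.
        return handler(match.group(1), file_path, project_dir, project_name)

    return pattern.sub(_substitute, markdown)


def my_handler(content, file_path, project_dir, project_name):
    return f"```yaml\n# processed\n{content}\n```\n"


doc = "Intro\n\n```my-type\nkey: value\n```\n\nOutro\n"
result = replace_fences(
    doc, "my-type", my_handler, Path("docs/index.md"), Path("."), "my-product"
)
print(result)
```

Because the handler returns the full replacement, including its own fence markers, it can change the fence language or drop the fence entirely.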

To inline a file referenced by path inside the block, use get_source_url from opencrane.fences to add a source annotation:

from pathlib import Path
from opencrane.fences import CodeFenceConfig, get_source_url

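def inline_handler(content: str, file_path: Path, project_dir: Path, project_name: str) -> str:
    # The block body is a path relative to the markdown file that contains it
    target = (file_path.parent / content.strip()).resolve()
    language = "json" if target.suffix == ".json" else "yaml"
    # Resolve project_dir as well so relative_to() compares two absolute paths
    relative = target.relative_to(project_dir.resolve())
    gh_url = get_source_url(Path(project_name) / relative, project_name)
    file_content = target.read_text(encoding="utf-8").rstrip("\n")
    # Annotate the inlined file with its source URL when one is available
    if gh_url:
        return f"```{language}\n# Source: {gh_url}\n{file_content}\n```\n"
    return f"```{language}\n{file_content}\n```\n"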

fence_types = {
    "my-type": CodeFenceConfig(fence_type="my-type", handler=inline_handler),
}

Writing a custom YAML tree walker

from opencrane.walkers.base import YamlTreeWalker

class TerraformTreeWalker(YamlTreeWalker):
    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return "terraform" in doc

    def walk(self):
        # return List[Chunk]
        ...
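
The `can_handle` classmethod suggests that walkers are tried against each YAML document until one claims it. The sketch below illustrates that idea with a stand-in base class so it runs without OpenCrane installed; the constructor signature, the `pick_walker` helper, and the use of plain strings in place of real Chunk objects are all assumptions.

```python
class YamlTreeWalker:
    """Stand-in for opencrane.walkers.base.YamlTreeWalker (constructor is assumed)."""

    def __init__(self, doc: dict):
        self.doc = doc

    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return False

    def walk(self) -> list:
        return []


class TerraformTreeWalker(YamlTreeWalker):
    @classmethod
    def can_handle(cls, doc: dict) -> bool:
        return "terraform" in doc

    def walk(self) -> list:
        # One plain-string "chunk" per top-level key under `terraform`.
        return [f"terraform.{key}: {value!r}" for key, value in self.doc["terraform"].items()]


def pick_walker(doc: dict, walkers: list):
    """Return the first walker class that claims the document, or None."""
    for walker_cls in walkers:
        if walker_cls.can_handle(doc):
            return walker_cls
    return None


doc = {"terraform": {"required_version": ">= 1.5"}}
walker_cls = pick_walker(doc, [TerraformTreeWalker])
print(walker_cls(doc).walk())
```

If dispatch does work first-match like this, the position of a walker in yaml_tree_walkers matters whenever two walkers could accept the same document.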

Development

git clone https://github.com/derberg/OpenCrane.git
cd OpenCrane

# with pip
pip install -e ".[dev]"

# with uv
uv sync --extra dev

pytest

License

Apache-2.0
