MCP Server for intelligent documentation scraping, vectorization, and semantic search with dynamic ontology extraction

MCP Doc Builder

Intelligent Documentation Scraping, Vectorization, and Semantic Search for AI Coding Assistants.

Overview

MCP Doc Builder is a Model Context Protocol (MCP) server that provides:

Intelligent Web Scraping: LLM-guided crawler that decides which documentation pages to index
Semantic Vectorization: Gemini text-embedding-004 for semantic search across documentation
Dynamic Ontology: Automatically extracts concepts and relationships from documentation
Knowledge Graph: Neo4j-based storage with full graph traversal capabilities
Hybrid Search: Combined vector similarity and fulltext search for optimal results

Features

Intelligent Crawling

LLM-powered link evaluation decides which pages to follow
Respects rate limits to avoid overwhelming documentation servers
Configurable depth (1-5 hops from root URL)
Smart content extraction with trafilatura
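
The crawl limits above (depth, page count, rate limit) can be pictured as a breadth-first traversal. This is a minimal sketch, not the project's actual spider: `crawl`, `fetch_links`, and `should_follow` are hypothetical names, and `should_follow` stands in for the LLM link evaluation.

```python
import time
from collections import deque
from typing import Callable, Iterable


def crawl(
    root: str,
    fetch_links: Callable[[str], Iterable[str]],
    should_follow: Callable[[str], bool],
    max_depth: int = 2,
    max_pages: int = 500,
    rate_limit: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Breadth-first crawl bounded by depth, page count, and request rate."""
    visited: list[str] = []
    seen = {root}
    queue = deque([(root, 0)])  # (url, hops from the root URL)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited:
            sleep(rate_limit)  # pause between requests, not before the first
        links = fetch_links(url)  # hypothetical: fetch the page, return its links
        visited.append(url)
        if depth >= max_depth:
            continue  # pages at the depth limit are indexed but not expanded
        for link in links:
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays.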

Semantic Search

Gemini text-embedding-004 for 768-dimensional vectors
Neo4j Vector Index for fast similarity search
Fulltext search with Lucene
Hybrid search combining both methods
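
The README does not say how the vector and fulltext results are merged. One common approach is reciprocal rank fusion, sketched below as an illustration only; the function name and the constant `k = 60` are assumptions, and the project may combine scores differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per item, so items ranked highly
    by both the vector and the fulltext search rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c1", "c2", "c3"]    # hypothetical vector search results
fulltext_hits = ["c2", "c4", "c1"]  # hypothetical fulltext search results
merged = reciprocal_rank_fusion([vector_hits, fulltext_hits])
```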

Dynamic Ontology

Automatic concept extraction (APIs, patterns, entities)
Relationship inference (uses, extends, requires, etc.)
Chunk-to-concept linking
Concept co-occurrence analysis
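
Concept co-occurrence can be illustrated by counting how often two concepts are mentioned in the same chunk. This is a simplified sketch with made-up chunk IDs and concepts; the function name is hypothetical and the project's own analysis may weight or filter pairs differently.

```python
from collections import Counter
from itertools import combinations


def concept_cooccurrence(chunk_concepts: dict[str, set[str]]) -> Counter:
    """Count, for every concept pair, the number of chunks mentioning both."""
    pairs: Counter = Counter()
    for concepts in chunk_concepts.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(concepts), 2):
            pairs[(a, b)] += 1
    return pairs


chunks = {
    "chunk1": {"useState", "useEffect"},
    "chunk2": {"useState", "useEffect", "useRef"},
    "chunk3": {"useRef"},
}
counts = concept_cooccurrence(chunks)
```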

MCP Integration

6 tools for complete documentation management
Resources for graph exploration
Workflow prompts for common tasks

Quick Start

1. Prerequisites

Python 3.11+
Docker (for Neo4j)
LiteLLM Gateway or Gemini API key

2. Installation

You can install doc-builder-mcp globally using pipx (recommended) or in a local virtual environment.

Option 1: One-Line Install (Recommended)

# Install the package
pipx install doc-builder-mcp

# Run the interactive Setup Wizard
doc-mcp-setup

The wizard will:

Check for Docker and Neo4j.
Ask for your LiteLLM / Gemini Credentials.
Configure the LLM Mode (LiteLLM vs Gemini Direct).
Generate a secure .env file.

Don't have pipx? Install it first:

macOS:

brew install pipx
pipx ensurepath

Windows:

winget install pipx
pipx ensurepath

Linux (Debian/Ubuntu):

sudo apt install pipx
pipx ensurepath

Restart your terminal after installing pipx.

Alternative: Standard Pip

If you prefer not to use pipx:

pip install doc-builder-mcp
doc-mcp-setup

Option 2: Manual Development Setup

If you want to contribute or modify the code:

git clone https://github.com/Hexecu/mcp-doc-builder.git
cd mcp-doc-builder
make full-setup

3. Setup

Run the interactive setup wizard:

doc-mcp-setup

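Or configure manually (the relative paths assume you run this from the server/ directory, since .env.example sits at the repository root):

cp ../.env.example ../.env
# Edit .env with your configuration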

4. Start Neo4j

Start the Neo4j database with Docker directly, or use the provided Makefile target:

make neo4j-up

This uses the docker-compose.yml to start the Neo4j instance.

5. Run the Server

# STDIO mode (for IDE integration)
make server-stdio

# HTTP mode (for API access)
make server

Configuration

Environment Variables

Variable	Description	Default
`NEO4J_URI`	Neo4j connection URI	`bolt://localhost:7688`
`NEO4J_USERNAME`	Neo4j username	`neo4j`
`NEO4J_PASSWORD`	Neo4j password	-
`LLM_MODE`	`litellm`, `gemini_direct`, or `both`	`litellm`
`LITELLM_BASE_URL`	LiteLLM Gateway URL	-
`LITELLM_API_KEY`	LiteLLM API key	-
`LITELLM_MODEL`	Model name	`gemini-2.5-flash`
`CRAWLER_MAX_DEPTH`	Maximum crawl depth	`2`
`CRAWLER_RATE_LIMIT`	Seconds between requests	`1.0`
`CRAWLER_MAX_PAGES`	Max pages per source	`500`

MCP Tools

doc_ingest

Ingest and index a documentation website.

{
  "url": "https://nextjs.org/docs",
  "name": "Next.js Docs",
  "max_depth": 2
}

doc_search

Search indexed documentation.

{
  "query": "how to use React hooks",
  "limit": 10,
  "search_mode": "hybrid"
}

doc_context

Get comprehensive context for a topic.

{
  "topic": "authentication in Next.js",
  "include_related": true
}

doc_sources

List all indexed documentation sources.

doc_refresh

Refresh/re-index a documentation source.

{
  "source_id": "source_abc123",
  "force": false
}

doc_ontology

Explore extracted concepts and relationships.

{
  "source_id": "source_abc123",
  "concept": "useState"
}

IDE Integration

You can use this MCP server with any compatible AI coding assistant.

Antigravity (Google DeepMind)

Open Antigravity Settings or locate your configuration file.
Add the MCP server configuration:

{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}

Recommended Antigravity Custom Rule

To maximize the utility of doc-builder, add this specific rule to your Antigravity global profile or project guidelines (e.g., in .gemini/rules.md or global settings):

Documentation Strategy Rule: "Whenever you encounter a new, unknown, or recently updated library, framework, or API that is not fully covered by your base training, you MUST proactively use the doc_ingest MCP tool to scrape and index its official documentation. Before writing complex implementation code for unfamiliar tools, always query doc_search or doc_context to understand the best practices and latest syntax."

Cursor

Cursor supports MCP natively. To add the server:

Open Cursor Settings (Cmd/Ctrl + Shift + J) > Features > MCP.
Click + Add new MCP server.
Set the Type to command.
Set the Name to doc-builder.
Set the Command to doc-mcp (assuming you installed via pipx).
Add the necessary environment variables (NEO4J_PASSWORD, LITELLM_API_KEY, etc.) directly in the Cursor UI environment section.

VS Code (with Claude Dev / Roo Code)

If you use Claude Dev, Roo Code, or similar MCP clients in VS Code:

Open the MCP configuration file (usually found at ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json on Mac).
Add the server entry:

{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
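
If the configuration file already lists other MCP servers, the doc-builder entry should be merged in rather than pasted over them. A minimal sketch of that merge, assuming the file uses the `mcpServers` layout shown above; the function name `add_doc_builder` is hypothetical.

```python
from copy import deepcopy


def add_doc_builder(config: dict, neo4j_password: str, litellm_api_key: str) -> dict:
    """Return a copy of an MCP client config with the doc-builder entry added."""
    updated = deepcopy(config)  # leave the caller's config untouched
    servers = updated.setdefault("mcpServers", {})
    servers["doc-builder"] = {
        "command": "doc-mcp",
        "args": [],
        "env": {
            "NEO4J_URI": "bolt://localhost:7688",
            "NEO4J_PASSWORD": neo4j_password,
            "LITELLM_API_KEY": litellm_api_key,
        },
    }
    return updated


existing = {"mcpServers": {"other-server": {"command": "other", "args": []}}}
merged_config = add_doc_builder(existing, "your-password", "your-key")
```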

Architecture

mcp-doc-builder/
├── docker-compose.yml        # Neo4j container
├── .env.example              # Configuration template
└── server/
    ├── pyproject.toml        # Python package
    └── src/doc_builder/
        ├── main.py           # MCP server entry
        ├── config.py         # Settings
        ├── cli/              # Setup wizard & status
        ├── crawler/          # Web scraping
        │   ├── spider.py     # Async crawler
        │   ├── parser.py     # HTML parsing
        │   └── agent.py      # LLM link evaluation
        ├── vector/           # Vectorization
        │   ├── embedder.py   # Gemini embeddings
        │   ├── chunker.py    # Smart chunking
        │   └── indexer.py    # Neo4j vector index
        ├── ontology/         # Knowledge extraction
        │   ├── extractor.py  # Concept extraction
        │   ├── metatag.py    # Metatag processing
        │   └── linker.py     # Relationship building
        ├── kg/               # Neo4j graph
        │   ├── neo4j.py      # Async client
        │   ├── repo.py       # Query repository
        │   └── schema.cypher # Database schema
        ├── llm/              # LLM integration
        │   ├── client.py     # LiteLLM wrapper
        │   └── prompts/      # Prompt templates
        ├── mcp/              # MCP protocol
        │   ├── tools.py      # Tool definitions
        │   ├── resources.py  # Resource handlers
        │   └── prompts.py    # Workflow prompts
        └── security/         # Auth & validation

Graph Schema

Nodes (Doc* prefixed for namespace separation)

DocSource: Documentation root (URL, name, status)
DocPage: Individual pages with metadata
DocChunk: Vectorized content chunks with embeddings
DocConcept: Extracted concepts (APIs, patterns, entities)
DocMetatag: Page metatags (og:, twitter:, etc.)
DocCrawlJob: Crawl job tracking

Relationships

(DocSource)-[:CONTAINS]->(DocPage)
(DocPage)-[:LINKS_TO]->(DocPage)
(DocPage)-[:HAS_CHUNK]->(DocChunk)
(DocChunk)-[:MENTIONS]->(DocConcept)
(DocConcept)-[:RELATES_TO]->(DocConcept)
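
To show how these relationships chain together, the sketch below builds a Cypher query that walks from a source to the chunks mentioning a concept. The labels and relationship types come from the schema above, but the property names (`name`, `url`, `text`) and the function name are assumptions, so check schema.cypher before relying on them.

```python
def chunks_for_concept_query(concept: str, limit: int = 10) -> tuple[str, dict]:
    """Return a parameterized Cypher query and its parameters."""
    query = (
        "MATCH (s:DocSource)-[:CONTAINS]->(p:DocPage)"
        "-[:HAS_CHUNK]->(c:DocChunk)-[:MENTIONS]->(k:DocConcept) "
        "WHERE k.name = $concept "  # assumed property: DocConcept.name
        "RETURN s.name AS source, p.url AS page, c.text AS chunk "  # assumed properties
        "LIMIT $limit"
    )
    return query, {"concept": concept, "limit": limit}


query, params = chunks_for_concept_query("useState", limit=5)
```

Keeping the concept name in a query parameter, rather than formatting it into the string, avoids Cypher injection and lets Neo4j reuse the query plan.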

CLI Commands

# Interactive setup
doc-mcp-setup

# Health check
doc-mcp-status --doctor

# Run server
doc-mcp

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Type checking
mypy src/

# Linting
ruff check src/

License

MIT

Related Projects

MCP KG Memory: Knowledge graph memory for AI coding assistants
Model Context Protocol: MCP specification

Release history

0.1.4: Mar 10, 2026 (this version)
0.1.3: Mar 10, 2026
0.1.2: Mar 10, 2026
0.1.1: Mar 10, 2026
0.1.0: Mar 9, 2026

Download files

Source Distribution

doc_builder_mcp-0.1.3.tar.gz (78.5 kB), uploaded Mar 10, 2026, Source

Built Distribution

doc_builder_mcp-0.1.3-py3-none-any.whl (96.8 kB), uploaded Mar 10, 2026, Python 3

File details

Details for the file doc_builder_mcp-0.1.3.tar.gz.

File metadata

Download URL: doc_builder_mcp-0.1.3.tar.gz
Upload date: Mar 10, 2026
Size: 78.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for doc_builder_mcp-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`623f80b30bb74d17e1ba49a91de2926e1d7360b3910a645cb0c3b04d399264c9`
MD5	`c34ea30851fada835a23002811cb0c46`
BLAKE2b-256	`a30cec1a178650a2080dbdd049f8d0a69630330c1f68e343cee9a3c07e481370`

File details

Details for the file doc_builder_mcp-0.1.3-py3-none-any.whl.

File metadata

Download URL: doc_builder_mcp-0.1.3-py3-none-any.whl
Upload date: Mar 10, 2026
Size: 96.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for doc_builder_mcp-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c909b692bf1061f06ed73a6ce0ce9756fb384e565fd06e671ea508fc8ceab008`
MD5	`95a302905d0b034bce08a6864c23d354`
BLAKE2b-256	`44f35eeef3a6cba6958f577c1d674fddc6b7b0a2718e0b21d2d07dd8042ebe5a`
