A Model Context Protocol (MCP) server for Confluence RAG with ChromaDB vector search


Confluence RAG Data Pipeline with MCP Protocol

A Model Context Protocol (MCP) server that provides relevant context from Confluence pages using RAG (Retrieval Augmented Generation).

PyPI version Python 3.9+ License: MIT

🚀 Quick Start

# Install from PyPI
pip install confluence-scraper-mcp

# Set environment variables
export CONFLUENCE_BASE_URL="https://your-domain.atlassian.net"
export CONFLUENCE_TOKEN="your-api-token"
export CONFLUENCE_SPACE_KEY="your-space-key"

# Run as MCP server
confluence-scraper-mcp

# Or run as web server
confluence-scraper-mcp --web

Features

  • 🔍 Semantic Search: Uses ChromaDB for vector-based document retrieval
  • 🔗 MCP Integration: Full Model Context Protocol implementation
  • 📚 Confluence Native: Direct integration with Confluence API
  • 🏷️ Smart Filtering: Filter by spaces, labels, and metadata
  • 📎 Rich Content: Handles attachments and comments
  • 🌐 Dual Mode: Run as MCP server or REST API
  • 📦 Easy Install: Available on PyPI
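To make the semantic search bullet concrete, here is a minimal sketch of vector-based retrieval in plain Python. It is not the package's code: the server delegates storage and search to ChromaDB, and the three-dimensional vectors, the page titles, and the `cosine_similarity` and `rank_documents` helpers below are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction, so treat it as unrelated.
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (title, score) pairs for the top_k most similar documents."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); real embedding vectors have hundreds of dimensions.
docs = {
    "Auth overview": [0.9, 0.1, 0.0],
    "Release checklist": [0.0, 0.2, 0.9],
    "API tokens": [0.8, 0.3, 0.1],
}
results = rank_documents([1.0, 0.2, 0.0], docs, top_k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

Pages whose vectors point in nearly the same direction as the query rank first, which is why a query can match a page that shares its meaning but not its exact wording.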

Requirements

  • Python 3.9 or higher
  • Confluence API access token
  • ChromaDB for vector storage

Installation

  1. Install from PyPI (Recommended):

    pip install confluence-scraper-mcp
    
  2. Install uv if you haven't already (only needed for the development setup in step 3):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  3. Clone and Setup Project (Development):

    git clone <repository-url>
    cd confluence-scraper-mcp
    # Create virtual environment
    uv venv .venv
    # Activate virtual environment
    source .venv/bin/activate
    # Install dependencies
    uv pip install -r requirements.txt
    
  4. Configure Environment:

    • Create a .env file in the project root:
    touch .env
    
    • Add the following configuration (adjust values as needed):
    # Required settings
    CONFLUENCE_BASE_URL=https://your-domain.atlassian.net
    CONFLUENCE_TOKEN=your-api-token
    # Optional: Confluence space key
    CONFLUENCE_SPACE_KEY=optional-space-key
    
    # Optional settings (with defaults)
    INITIAL_CRAWL=false
    CHROMA_PERSIST_DIR=./data/chroma
    EMBEDDING_MODEL="all-MiniLM-L6-v2"
    MAX_PAGES=1000
    INCLUDE_ATTACHMENTS=true
    INCLUDE_COMMENTS=true
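As an illustration of how these settings fit together, the following sketch reads them from the environment and applies the defaults listed above. It is not the package's actual loader: the `Settings` class and `load_settings` function are hypothetical names, and treating CONFLUENCE_SPACE_KEY as optional is an assumption based on its placeholder value.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    base_url: str
    token: str
    space_key: Optional[str]
    initial_crawl: bool
    chroma_persist_dir: str
    embedding_model: str
    max_pages: int
    include_attachments: bool
    include_comments: bool


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    required = ("CONFLUENCE_BASE_URL", "CONFLUENCE_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return Settings(
        base_url=env["CONFLUENCE_BASE_URL"].rstrip("/"),
        token=env["CONFLUENCE_TOKEN"],
        space_key=env.get("CONFLUENCE_SPACE_KEY") or None,  # assumed optional
        initial_crawl=_as_bool(env.get("INITIAL_CRAWL", "false")),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./data/chroma"),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        max_pages=int(env.get("MAX_PAGES", "1000")),
        include_attachments=_as_bool(env.get("INCLUDE_ATTACHMENTS", "true")),
        include_comments=_as_bool(env.get("INCLUDE_COMMENTS", "true")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration before starting the server, for example `load_settings({"CONFLUENCE_BASE_URL": "...", "CONFLUENCE_TOKEN": "..."})`.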
    

Usage

Command Line Interface (After PyPI Installation)

# Run as MCP server (stdio mode) - default
confluence-scraper-mcp

# Run as web server
confluence-scraper-mcp --web

Development Mode

  1. Using uvx (Recommended):

    # Development mode with auto-reload
    uvx uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
    
    # Run tests
    uvx pytest
    
    # Code formatting and checks
    uvx black .
    uvx isort .
    uvx mypy .
    
  2. Alternative: Using Virtual Environment:

    # Activate virtual environment
    source .venv/bin/activate
    
    # Then run commands as usual
    uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
    
  3. Initial Setup:

    # Start initial crawl of Confluence pages
    curl -X POST http://localhost:8000/crawl
    
    # Verify server health
    curl http://localhost:8000/health
    
  4. Use the MCP API:

    # Get context for an LLM query
    curl -X POST http://localhost:8000/mcp/context \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "Tell me about project X"}],
        "query": "project X documentation",
        "max_context_length": 1000
      }'
    
    # The response will include relevant context from your Confluence pages
    
  5. Monitor and Maintain:

    # View logs
    tail -f logs/app.log
    
    # Re-crawl Confluence (e.g., after updates)
    curl -X POST http://localhost:8000/crawl
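The /mcp/context request from step 4 can also be sent from Python. The sketch below uses only the standard library and mirrors the curl payload shown above; `build_context_request` and `fetch_context` are hypothetical helper names, and because the response schema is not documented here, the decoded JSON is returned as-is.

```python
import json
import urllib.request


def build_context_request(question, query, max_context_length=1000,
                          base_url="http://localhost:8000"):
    """Build the POST request for /mcp/context without sending it."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "query": query,
        "max_context_length": max_context_length,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/mcp/context",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_context(request, timeout=30):
    """Send the request and decode the JSON body; the web server must be running."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_context_request("Tell me about project X", "project X documentation")
# With the server started via `confluence-scraper-mcp --web`:
# context = fetch_context(request)
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.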
    

API Endpoints

  • GET /health: Health check endpoint
  • POST /crawl: Trigger Confluence crawl
  • POST /mcp/context: Get relevant context for a query

MCP (Model Context Protocol) Configuration

This server implements the Model Context Protocol (MCP) for seamless integration with AI assistants and LLM clients.

Quick MCP Setup

  1. Install the package:

    pip install confluence-scraper-mcp
    
  2. Copy the MCP configuration:

    # Copy the example configuration
    cp examples/mcp-client-config.json ~/.config/your-mcp-client/
    
  3. Update environment variables in the config:

    {
      "mcpServers": {
        "confluence-scraper-mcp": {
          "command": "confluence-scraper-mcp",
          "args": [],
          "env": {
            "CONFLUENCE_BASE_URL": "https://your-domain.atlassian.net",
            "CONFLUENCE_TOKEN": "your-api-token",
            "CONFLUENCE_SPACE_KEY": "your-space-key"
          }
        }
      }
    }
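Since the example configuration ships with placeholder values, a quick check can catch entries that were never updated. This is a small sketch, not part of the package: `unresolved_placeholders` is a hypothetical helper, and the placeholder set is simply copied from the template above.

```python
import json

# Template values from the example configuration that must be replaced.
PLACEHOLDERS = {
    "https://your-domain.atlassian.net",
    "your-api-token",
    "your-space-key",
}


def unresolved_placeholders(config):
    """Return 'server.VARIABLE' names whose value is still a template placeholder."""
    problems = []
    for server, spec in config.get("mcpServers", {}).items():
        for key, value in spec.get("env", {}).items():
            if value in PLACEHOLDERS:
                problems.append(f"{server}.{key}")
    return sorted(problems)


config = json.loads("""
{
  "mcpServers": {
    "confluence-scraper-mcp": {
      "command": "confluence-scraper-mcp",
      "args": [],
      "env": {
        "CONFLUENCE_BASE_URL": "https://example.atlassian.net",
        "CONFLUENCE_TOKEN": "your-api-token",
        "CONFLUENCE_SPACE_KEY": "DEV"
      }
    }
  }
}
""")
leftover = unresolved_placeholders(config)
print(leftover)
```

An empty list means every environment value in the file has been changed from the template.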
    

MCP Tools Available

The server provides three MCP tools:

  • confluence_search: Search Confluence pages using semantic search
  • confluence_get_page: Retrieve specific page content by ID or title
  • confluence_crawl: Trigger crawling and indexing of content

Example MCP Tool Usage

{
  "method": "tools/call",
  "params": {
    "name": "confluence_search",
    "arguments": {
      "query": "API authentication methods",
      "space_key": "DEV",
      "max_results": 3,
      "include_attachments": true
    }
  }
}
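MCP messages are JSON-RPC 2.0, so on the wire a call like the one above also carries `jsonrpc` and `id` fields. The following sketch builds such a message; `make_tool_call` is a hypothetical helper, and a real client would normally complete the MCP initialization handshake before sending it.

```python
import itertools
import json

# Monotonically increasing request ids, one per call.
_request_ids = itertools.count(1)


def make_tool_call(name, arguments):
    """Wrap a tool invocation in a JSON-RPC 2.0 request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


message = make_tool_call(
    "confluence_search",
    {
        "query": "API authentication methods",
        "space_key": "DEV",
        "max_results": 3,
        "include_attachments": True,
    },
)
# Serialized as a single line, which suits newline-delimited stdio transports.
line = json.dumps(message)
print(line)
```

The `id` lets the client match the server's response to this request when several calls are in flight.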

MCP Configuration Files

The package includes example configuration files:

  • examples/mcp.json: Complete MCP server specification
  • examples/mcp-client-config.json: Simple client configuration

See the MCP specification for more details on the protocol.

🤖 GitHub Copilot Integration

Quick Setup for Copilot

  1. Install the package:

    pip install confluence-scraper-mcp
    
  2. Configure VS Code Settings: Open VS Code settings (Cmd+, on macOS, Ctrl+, on Windows/Linux) and add to your settings.json:

    {
      "github.copilot.chat.mcpServers": {
        "confluence-rag": {
          "command": "confluence-scraper-mcp",
          "args": [],
          "env": {
            "CONFLUENCE_BASE_URL": "https://your-domain.atlassian.net",
            "CONFLUENCE_TOKEN": "your-api-token",
            "CONFLUENCE_SPACE_KEY": "your-space-key"
          }
        }
      }
    }
    
  3. Initial Setup:

    # Start server and crawl content
    confluence-scraper-mcp --web &
    curl -X POST http://localhost:8000/crawl
    
  4. Test with Copilot: Open Copilot Chat and ask: "How do we handle authentication in our system?"

Detailed Setup Guide

For complete setup instructions, see: 📖 Copilot Setup Guide

Using with Code Assistants

This MCP server specializes in Confluence documentation and uses RAG (Retrieval Augmented Generation) with ChromaDB.

Key Features:

  • 🔗 Confluence Integration: Direct API integration with page, attachment, and comment handling
  • 🔍 Semantic Search: ChromaDB vector search for meaning-based retrieval
  • 🏷️ Smart Filtering: Filter by space keys, labels, content types
  • 📊 Metadata Preservation: Maintains Confluence structure and relationships

Example endpoint configuration:

    {
      "endpoints": [
        {
          "name": "API Documentation",
          "url": "http://localhost:8000/mcp/context",
          "options": {
            "max_context_length": 2000,
            "filter": {
              "space_key": "API",
              "labels": ["technical-docs", "api-reference"],
              "include_comments": true,
              "include_attachments": false,
              "semantic_ranking": {
                "weight": 0.7,
                "model": "all-MiniLM-L6-v2"
              }
            }
          },
          "authentication": {
            "type": "none"
          }
        },
        {
          "name": "Architecture Docs",
          "url": "http://localhost:8000/mcp/context",
          "options": {
            "max_context_length": 3000,
            "filter": {
              "space_key": "ARCH",
              "labels": ["architecture", "design"],
              "include_comments": false,
              "include_attachments": true,
              "semantic_ranking": {
                "weight": 0.8,
                "model": "all-MiniLM-L6-v2"
              }
            }
          },
          "authentication": {
            "type": "none"
          }
        }
      ],
      "default_endpoint": "API Documentation"
    }

  • Add the path to this file in VS Code settings under "Copilot Chat: MCP Configuration File"
  • See examples/mcp.json for a full example with multiple endpoints and filtering options
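Because the endpoint configuration above refers to an endpoint by name in `default_endpoint`, a typo there is easy to miss. The sketch below checks a parsed configuration for that and a few related problems. `validate_endpoint_config` is a hypothetical helper, and the rule that `semantic_ranking.weight` lies between 0 and 1 is an assumption drawn from the example values, not a documented constraint.

```python
def validate_endpoint_config(config):
    """Return a list of problems found; an empty list means the file looks consistent."""
    problems = []
    endpoints = config.get("endpoints", [])
    names = [endpoint.get("name") for endpoint in endpoints]

    if len(names) != len(set(names)):
        problems.append("endpoint names must be unique")

    default = config.get("default_endpoint")
    if default not in names:
        problems.append(f"default_endpoint {default!r} does not match any endpoint name")

    for endpoint in endpoints:
        if not endpoint.get("url"):
            problems.append(f"endpoint {endpoint.get('name')!r} has no url")
        ranking = endpoint.get("options", {}).get("filter", {}).get("semantic_ranking", {})
        weight = ranking.get("weight")
        # Assumed range: the examples use 0.7 and 0.8.
        if weight is not None and not 0.0 <= weight <= 1.0:
            problems.append(f"endpoint {endpoint.get('name')!r} has weight {weight} outside 0..1")
    return problems


example = {
    "endpoints": [
        {
            "name": "API Documentation",
            "url": "http://localhost:8000/mcp/context",
            "options": {"filter": {"semantic_ranking": {"weight": 0.7}}},
        }
    ],
    "default_endpoint": "API Docs",  # deliberate mismatch
}
print(validate_endpoint_config(example))
```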
  1. Usage with Copilot:

    • In VS Code, open Copilot Chat (Cmd+I)
    • Your queries will now include relevant context from your Confluence pages
    • Example: "How do I implement feature X?" will include context from related Confluence documentation
    • You can also use /doc command in Copilot Chat to explicitly search documentation
  2. Tips for Better Results:

    • Keep Confluence pages well-organized and up-to-date
    • Use descriptive titles and labels in Confluence
    • Re-crawl after significant documentation updates:
      curl -X POST http://localhost:8000/crawl
      

Development

  1. Install Development Dependencies:

    uv pip install -r requirements.txt
    
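  2. Using uvx for Development: uv ships a command runner called uvx that runs Python tools without explicitly activating a virtual environment. uvx runs each tool in its own isolated environment, so if a command cannot import the project's dependencies, run it with uv run instead: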

    # Run the FastAPI server
    uvx uvicorn app.main:app --reload
    
    # Run tests
    uvx pytest
    
    # Code formatting
    uvx black .
    uvx isort .
    uvx mypy .
    
  3. Environment Configuration: The project uses environment variables for configuration. Copy .env.example to .env and update the values:

    CONFLUENCE_BASE_URL=https://your-domain.atlassian.net
    CONFLUENCE_TOKEN=your-api-token
    CONFLUENCE_SPACE_KEY=your-space-key
    CHROMA_PERSIST_DIR=data/chroma
    CHROMA_COLLECTION_NAME=confluence_docs
    EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
    CHUNK_SIZE=512
    CHUNK_OVERLAP=50
    TOP_K=3
    SIMILARITY_THRESHOLD=0.7
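The last four settings control how pages are split and how search hits are selected. The sketch below shows one plausible reading of them and is not the package's implementation: it assumes CHUNK_SIZE and CHUNK_OVERLAP count characters (the project may count tokens instead) and that scores are similarities where higher is better. `chunk_text` and `select_hits` are hypothetical names.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=50):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text.
    return chunks


def select_hits(scored_chunks, top_k=3, similarity_threshold=0.7):
    """Keep hits at or above the threshold, best first, limited to top_k."""
    kept = [(chunk, score) for chunk, score in scored_chunks
            if score >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


page = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(page)
print([len(chunk) for chunk in chunks])

hits = select_hits([("a", 0.9), ("b", 0.65), ("c", 0.75), ("d", 0.8), ("e", 0.71)])
print(hits)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.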
    

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes:
    • Use uvx black . and uvx isort . to format code
    • Use uvx mypy . for type checking
    • Add tests for new features
    • Update documentation as needed
  4. Run tests (uvx pytest)
  5. Commit your changes (git commit -m 'Add some amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

MIT License. See LICENSE for more information.
