A Model Context Protocol (MCP) server for Confluence RAG with ChromaDB vector search
Confluence RAG Data Pipeline with MCP Protocol
A Model Context Protocol (MCP) server that provides relevant context from Confluence pages using RAG (Retrieval Augmented Generation).
🚀 Quick Start
# Install from PyPI
pip install confluence-scraper-mcp
# Set environment variables
export CONFLUENCE_BASE_URL="https://your-domain.atlassian.net"
export CONFLUENCE_TOKEN="your-api-token"
export CONFLUENCE_SPACE_KEY="your-space-key"
# Run as MCP server
confluence-scraper-mcp
# Or run as web server
confluence-scraper-mcp --web
Features
- 🔍 Semantic Search: Uses ChromaDB for vector-based document retrieval
- 🔗 MCP Integration: Full Model Context Protocol implementation
- 📚 Confluence Native: Direct integration with Confluence API
- 🏷️ Smart Filtering: Filter by spaces, labels, and metadata
- 📎 Rich Content: Handles attachments and comments
- 🌐 Dual Mode: Run as MCP server or REST API
- 📦 Easy Install: Available on PyPI
Requirements
- Python 3.9 or higher
- Confluence API access token
- ChromaDB for vector storage
Installation
- Install from PyPI (Recommended):
  pip install confluence-scraper-mcp
- Install UV if you haven't already:
  curl -LsSf https://astral.sh/uv/install.sh | sh
- Clone and Setup Project (Development):
  git clone <repository-url>
  cd confluence-scraper-mcp
  # Create virtual environment
  uv venv .venv
  # Activate virtual environment
  source .venv/bin/activate
  # Install dependencies
  uv pip install -r requirements.txt
- Configure Environment:
  - Create a .env file in the project root:
    touch .env
  - Add the following configuration (adjust values as needed):
    # Required settings
    CONFLUENCE_BASE_URL=https://your-domain.atlassian.net
    CONFLUENCE_TOKEN=your-api-token
    CONFLUENCE_SPACE_KEY=optional-space-key
    # Optional settings (with defaults)
    INITIAL_CRAWL=false
    CHROMA_PERSIST_DIR=./data/chroma
    EMBEDDING_MODEL="all-MiniLM-L6-v2"
    MAX_PAGES=1000
    INCLUDE_ATTACHMENTS=true
    INCLUDE_COMMENTS=true
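The configuration above could be read on the Python side roughly as follows. This is a minimal sketch, not the package's actual loader: the `Settings` class, `load_settings`, and `_as_bool` are hypothetical names, and only the variable names and default values are taken from the listing above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings ("true", "1", "yes", "on") as True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the .env listing."""
    base_url: str
    token: str
    space_key: Optional[str]
    initial_crawl: bool
    chroma_persist_dir: str
    embedding_model: str
    max_pages: int
    include_attachments: bool
    include_comments: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, applying the listed defaults."""
    missing = [k for k in ("CONFLUENCE_BASE_URL", "CONFLUENCE_TOKEN") if not env.get(k)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return Settings(
        base_url=env["CONFLUENCE_BASE_URL"].rstrip("/"),
        token=env["CONFLUENCE_TOKEN"],
        # The space key is treated as optional in this sketch.
        space_key=env.get("CONFLUENCE_SPACE_KEY") or None,
        initial_crawl=_as_bool(env.get("INITIAL_CRAWL", "false")),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./data/chroma"),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        max_pages=int(env.get("MAX_PAGES", "1000")),
        include_attachments=_as_bool(env.get("INCLUDE_ATTACHMENTS", "true")),
        include_comments=_as_bool(env.get("INCLUDE_COMMENTS", "true")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation.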
Usage
Command Line Interface (After PyPI Installation)
# Run as MCP server (stdio mode) - default
confluence-scraper-mcp
# Run as web server
confluence-scraper-mcp --web
Development Mode
- Using uvx (Recommended):
  # Development mode with auto-reload
  uvx uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
  # Run tests
  uvx pytest
  # Code formatting and checks
  uvx black .
  uvx isort .
  uvx mypy .
- Alternative: Using Virtual Environment:
  # Activate virtual environment
  source .venv/bin/activate
  # Then run commands as usual
  uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
- Initial Setup:
  # Start initial crawl of Confluence pages
  curl -X POST http://localhost:8000/crawl
  # Verify server health
  curl http://localhost:8000/health
- Use the MCP API:
  # Get context for an LLM query
  curl -X POST http://localhost:8000/mcp/context \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{"role": "user", "content": "Tell me about project X"}],
      "query": "project X documentation",
      "max_context_length": 1000
    }'
  # The response will include relevant context from your Confluence pages
- Monitor and Maintain:
  # View logs
  tail -f logs/app.log
  # Re-crawl Confluence (e.g., after updates)
  curl -X POST http://localhost:8000/crawl
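The same context request can be issued from Python. The sketch below uses only the standard library and assumes the web server is running on `localhost:8000` as in the commands above; `build_context_request` and `fetch_context` are hypothetical helper names, and the shape of the response is not documented here, so it is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request


def build_context_request(question, query, max_context_length=1000,
                          base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /mcp/context endpoint."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "query": query,
        "max_context_length": max_context_length,
    }
    return urllib.request.Request(
        url=f"{base_url}/mcp/context",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_context(question, query, **kwargs):
    """Send the request and return the decoded JSON body (requires a running server)."""
    request = build_context_request(question, query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the server up, `fetch_context("Tell me about project X", "project X documentation")` would perform the same call as the curl example.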
API Endpoints
- GET /health: Health check endpoint
- POST /crawl: Trigger Confluence crawl
- POST /mcp/context: Get relevant context for a query
MCP (Model Context Protocol) Configuration
This server implements the Model Context Protocol (MCP) for seamless integration with AI assistants and LLM clients.
Quick MCP Setup
- Install the package:
  pip install confluence-scraper-mcp
- Copy the MCP configuration:
  # Copy the example configuration
  cp examples/mcp-client-config.json ~/.config/your-mcp-client/
- Update environment variables in the config:
  {
    "mcpServers": {
      "confluence-scraper-mcp": {
        "command": "confluence-scraper-mcp",
        "args": [],
        "env": {
          "CONFLUENCE_BASE_URL": "https://your-domain.atlassian.net",
          "CONFLUENCE_TOKEN": "your-api-token",
          "CONFLUENCE_SPACE_KEY": "your-space-key"
        }
      }
    }
  }
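If you prefer to generate the client configuration rather than edit it by hand, something like the following would produce the same structure. It is an illustrative sketch: `make_client_config` is a hypothetical helper, and where your MCP client expects the resulting file depends on that client.

```python
import json


def make_client_config(base_url, token, space_key=None,
                       server_name="confluence-scraper-mcp"):
    """Return a dict with the same layout as the example client configuration."""
    env = {
        "CONFLUENCE_BASE_URL": base_url,
        "CONFLUENCE_TOKEN": token,
    }
    if space_key:
        env["CONFLUENCE_SPACE_KEY"] = space_key
    return {
        "mcpServers": {
            server_name: {
                "command": "confluence-scraper-mcp",
                "args": [],
                "env": env,
            }
        }
    }


# Print a configuration with placeholder values; substitute your own.
config = make_client_config(
    "https://your-domain.atlassian.net", "your-api-token", "your-space-key"
)
print(json.dumps(config, indent=2))
```

Because the token ends up in a plain-text file, it is worth restricting that file's permissions.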
MCP Tools Available
The server provides several MCP tools:
- confluence_search: Search Confluence pages using semantic search
- confluence_get_page: Retrieve specific page content by ID or title
- confluence_crawl: Trigger crawling and indexing of content
Example MCP Tool Usage
{
"method": "tools/call",
"params": {
"name": "confluence_search",
"arguments": {
"query": "API authentication methods",
"space_key": "DEV",
"max_results": 3,
"include_attachments": true
}
}
}
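On the wire, MCP messages are JSON-RPC 2.0, so a client wraps the call above in an envelope with `jsonrpc` and `id` fields, and over the stdio transport each message is written as a single line of JSON. The sketch below only builds and serializes such a message; `tool_call` and `encode_line` are hypothetical helper names, and a real client must complete the protocol's `initialize` handshake before calling tools (MCP client libraries normally handle this for you).

```python
import itertools
import json

# Monotonically increasing request ids, as JSON-RPC requires a unique id per request.
_request_ids = itertools.count(1)


def tool_call(name, arguments):
    """Wrap a tool invocation in a JSON-RPC 2.0 request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def encode_line(message):
    """Serialize a message as one newline-terminated line for a stdio transport."""
    return json.dumps(message) + "\n"


request = tool_call(
    "confluence_search",
    {
        "query": "API authentication methods",
        "space_key": "DEV",
        "max_results": 3,
        "include_attachments": True,
    },
)
print(encode_line(request), end="")
```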
MCP Configuration Files
The package includes example configuration files:
- examples/mcp.json: Complete MCP server specification
- examples/mcp-client-config.json: Simple client configuration
See the MCP specification for more details on the protocol.
🤖 GitHub Copilot Integration
Quick Setup for Copilot
- Install the package:
  pip install confluence-scraper-mcp
- Configure VS Code Settings: Open VS Code settings (Cmd+,) and add to your settings.json:
  {
    "github.copilot.chat.mcpServers": {
      "confluence-rag": {
        "command": "confluence-scraper-mcp",
        "args": [],
        "env": {
          "CONFLUENCE_BASE_URL": "https://your-domain.atlassian.net",
          "CONFLUENCE_TOKEN": "your-api-token",
          "CONFLUENCE_SPACE_KEY": "your-space-key"
        }
      }
    }
  }
- Initial Setup:
  # Start server and crawl content
  confluence-scraper-mcp --web &
  curl -X POST http://localhost:8000/crawl
- Test with Copilot: Open Copilot Chat and ask: "How do we handle authentication in our system?"
Detailed Setup Guide
For complete setup instructions, see: 📖 Copilot Setup Guide
Using with Code Assistants
This MCP server specializes in Confluence documentation and uses RAG (Retrieval Augmented Generation) with ChromaDB:
Key Features:
- 🔗 Confluence Integration: Direct API integration with page, attachment, and comment handling
- 🔍 Semantic Search: ChromaDB vector search for meaning-based retrieval
- 🏷️ Smart Filtering: Filter by space keys, labels, content types
- 📊 Metadata Preservation: Maintains Confluence structure and relationships
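To make "meaning-based retrieval" concrete, the following toy sketch ranks documents by cosine similarity between embedding vectors, keeps only those above a threshold, and returns the best few. It is an illustration of the general idea, not the package's code: ChromaDB performs indexing and search internally, the vectors here are made up rather than produced by an embedding model, and whether the server uses cosine distance is an assumption. The default `top_k=3` and `threshold=0.7` simply echo the `TOP_K` and `SIMILARITY_THRESHOLD` settings that appear in the environment configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vec, docs, top_k=3, threshold=0.7):
    """Return up to top_k (doc_id, score) pairs scoring at least threshold, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    kept = [(score, doc_id) for score, doc_id in scored if score >= threshold]
    kept.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in kept[:top_k]]


# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
documents = {
    "auth-guide": [1.0, 0.0],
    "release-notes": [0.0, 1.0],
    "api-overview": [1.0, 1.0],
}
for doc_id, score in top_matches([1.0, 0.0], documents):
    print(doc_id, round(score, 3))
```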
{
  "endpoints": [
    {
      "name": "API Documentation",
      "url": "http://localhost:8000/mcp/context",
      "options": {
        "max_context_length": 2000,
        "filter": {
          "space_key": "API",
          "labels": ["technical-docs", "api-reference"],
          "include_comments": true,
          "include_attachments": false,
          "semantic_ranking": {
            "weight": 0.7,
            "model": "all-MiniLM-L6-v2"
          }
        }
      },
      "authentication": {
        "type": "none"
      }
    },
    {
      "name": "Architecture Docs",
      "url": "http://localhost:8000/mcp/context",
      "options": {
        "max_context_length": 3000,
        "filter": {
          "space_key": "ARCH",
          "labels": ["architecture", "design"],
          "include_comments": false,
          "include_attachments": true,
          "semantic_ranking": {
            "weight": 0.8,
            "model": "all-MiniLM-L6-v2"
          }
        }
      },
      "authentication": {
        "type": "none"
      }
    }
  ],
  "default_endpoint": "API Documentation"
}
- Add the path to this file in VS Code settings under "Copilot Chat: MCP Configuration File"
- See examples/mcp.json for a full example with multiple endpoints and filtering options
- Usage with Copilot:
  - In VS Code, open Copilot Chat (Cmd+I)
  - Your queries will now include relevant context from your Confluence pages
  - Example: "How do I implement feature X?" will include context from related Confluence documentation
  - You can also use the /doc command in Copilot Chat to explicitly search documentation
- Tips for Better Results:
  - Keep Confluence pages well-organized and up-to-date
  - Use descriptive titles and labels in Confluence
  - Re-crawl after significant documentation updates:
    curl -X POST http://localhost:8000/crawl
Development
- Install Development Dependencies:
  uv pip install -r requirements.txt
- Using uvx for Development: UV installs a command runner called uvx that can run Python scripts and modules without explicitly activating the virtual environment:
  # Run the FastAPI server
  uvx uvicorn app.main:app --reload
  # Run tests
  uvx pytest
  # Code formatting
  uvx black .
  uvx isort .
  uvx mypy .
- Environment Configuration: The project uses environment variables for configuration. Copy .env.example to .env and update the values:
  CONFLUENCE_BASE_URL=https://your-domain.atlassian.net
  CONFLUENCE_TOKEN=your-api-token
  CONFLUENCE_SPACE_KEY=your-space-key
  CHROMA_PERSIST_DIR=data/chroma
  CHROMA_COLLECTION_NAME=confluence_docs
  EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
  CHUNK_SIZE=512
  CHUNK_OVERLAP=50
  TOP_K=3
  SIMILARITY_THRESHOLD=0.7
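The `CHUNK_SIZE` and `CHUNK_OVERLAP` values control how page text is split before embedding. A sliding-window splitter along these lines shows how the two interact. This is a sketch under stated assumptions: `chunk_text` is a hypothetical function, and it counts characters, whereas the package may well measure chunks in tokens or split on sentence boundaries.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=50):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

A larger overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of storing more near-duplicate text.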
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Make your changes:
  - Use uvx black . and uvx isort . to format code
  - Use uvx mypy . for type checking
  - Add tests for new features
  - Update documentation as needed
- Run tests (uvx pytest)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
MIT License. See LICENSE for more information.
File details
Details for the file confluence_scraper_mcp-0.1.3.tar.gz.
File metadata
- Download URL: confluence_scraper_mcp-0.1.3.tar.gz
- Upload date:
- Size: 27.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 008852596dff892cce788369517f09f6cd18991da0773a26f502e5a42df6448a |
| MD5 | 0e92415607500eac67688b5e136e6c7a |
| BLAKE2b-256 | f0d6007b6235ebd4860f897761524a430981fb52c4afb6d62c49b9e15113fa1e |
File details
Details for the file confluence_scraper_mcp-0.1.3-py3-none-any.whl.
File metadata
- Download URL: confluence_scraper_mcp-0.1.3-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a49ea71bb38ea0244b1bfb8e4c9d6940ac76929a840c2d01e29b2265eed8208 |
| MD5 | 8c9ad20a6be9a5b037ae675980aabc46 |
| BLAKE2b-256 | 4f9d59279da34b459382ce49dc7c560e366b819735421a36d4564810cda2ed3a |