ContextNest 🦅
A cozy nest where all needed context is pre-assembled for the LLM. ContextNest provides a Model Context Protocol (MCP) server that enables AI assistants to access web scraping, knowledge base search, and document management capabilities.
Table of Contents
- Project Overview
- Architecture
- Installation
- Configuration
- Usage
- API Documentation
- Examples
- Detailed Module Documentation
- Contributing
- License
Project Overview
ContextNest is a Python-based MCP (Model Context Protocol) server that provides AI assistants with powerful context management capabilities. The system combines:
- Web Scraping: Automated content extraction from URLs using Playwright and BeautifulSoup
- Knowledge Base: Vector and full-text search capabilities with hybrid ranking
- Document Management: Storage and retrieval of documents in a DuckDB database
- MCP Integration: Seamless integration with AI assistants through the Model Context Protocol
Key Features
- Web Scraping: Extract content from web pages and convert to Markdown format
- Hybrid Search: Combine vector similarity search with full-text search (BM25) using Reciprocal Rank Fusion (RRF)
- CAPTCHA Handling: Automatic detection and handling of common CAPTCHA challenges
- Stealth Browsing: Anti-bot detection evasion techniques
- Resource Management: MCP resources for accessing stored files and metadata
- Logging: Comprehensive logging with structured messages using loguru
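As an illustration of the CAPTCHA-handling idea above, detection often starts with a simple heuristic scan of the page HTML. The sketch below is hypothetical and not ContextNest's actual `captcha_handler.py` logic; the marker strings are assumptions:

```python
# Hypothetical keyword-based CAPTCHA detector; the real captcha_handler.py
# may use different signals (DOM structure, iframes, response codes).

CAPTCHA_MARKERS = (
    "recaptcha",
    "hcaptcha",
    "cf-challenge",
    "verify you are human",
)

def looks_like_captcha(html: str) -> bool:
    """Return True if the page HTML contains a common CAPTCHA marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

print(looks_like_captcha('<div class="g-recaptcha"></div>'))  # True
print(looks_like_captcha("<p>Regular article text</p>"))      # False
```

A heuristic like this only decides *whether* a challenge is present; what to do next (retry, wait, or surface the failure) is a separate policy.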
Architecture
The ContextNest architecture is modular and consists of several key components:
```
contextnest/
├── mcp_server.py              # Main MCP server implementation
├── mcp_models.py              # Pydantic models for tool inputs
├── mcp_logger.py              # Structured logging implementation
├── server_tools.py            # Core tool logic implementations
├── web_scraper/               # Web scraping functionality
│   ├── scraper.py             # Main scraper class
│   ├── captcha_handler.py     # CAPTCHA detection and handling
│   └── markdown_converter.py  # HTML to Markdown conversion
└── micro_search/              # Search functionality
    ├── database.py            # DuckDB database management
    ├── insertion.py           # Document insertion logic
    ├── hybrid_search.py       # Vector + BM25 search with RRF
    └── db_preparation.py      # Database setup and preparation
```
Core Components
MCP Server (mcp_server.py)
The main server implementation using FastMCP that exposes tools and resources to AI assistants. It defines the MCP endpoints and handles the communication protocol.
Web Scraper (web_scraper/)
A comprehensive web scraping module that includes:
- Playwright-based browser automation
- CAPTCHA detection and handling
- Stealth techniques to avoid bot detection
- HTML to Markdown conversion
- Human-like behavior simulation
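To make the HTML-to-Markdown step concrete, here is a minimal sketch using only the standard library's `html.parser`. It is not ContextNest's `markdown_converter.py` (which builds on BeautifulSoup); it handles just headings, paragraphs, and links to show the shape of the transformation:

```python
from html.parser import HTMLParser

class TinyMarkdownConverter(HTMLParser):
    """Convert a small subset of HTML (h1, p, a) to Markdown."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self.out.append("\n\n")  # block elements end a Markdown block
        elif tag == "a":
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html: str) -> str:
    conv = TinyMarkdownConverter()
    conv.feed(html)
    return "".join(conv.out).strip()

print(html_to_markdown('<h1>Title</h1><p>See <a href="https://example.com">this</a>.</p>'))
```

A production converter additionally has to deal with nested lists, tables, code blocks, and malformed markup, which is why the real module delegates parsing to BeautifulSoup.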
Micro Search (micro_search/)
A powerful search module that provides:
- Vector similarity search using embeddings
- Full-text search with BM25 ranking
- Hybrid search combining both approaches with Reciprocal Rank Fusion
- DuckDB-based storage for documents and embeddings
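The fusion step can be sketched in a few lines. This is the standard Reciprocal Rank Fusion formula with the same `k`, `vector_weight`, and `fts_weight` parameters the `search` tool exposes; it is an illustration, not ContextNest's actual `hybrid_search.py`:

```python
# Illustrative Reciprocal Rank Fusion: each list contributes
# weight / (k + rank) per document, and fused results are sorted by the sum.

def rrf_fuse(vector_ranking, fts_ranking, k=60, vector_weight=1.0, fts_weight=1.0):
    """Fuse two ranked lists of doc IDs into one ranking."""
    scores = {}
    for weight, ranking in ((vector_weight, vector_ranking), (fts_weight, fts_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Doc "b" appears near the top of both lists, so it wins after fusion.
print(rrf_fuse(["a", "b", "c"], ["b", "c", "d"]))  # ['b', 'c', 'a', 'd']
```

The smoothing constant `k` dampens the advantage of rank-1 hits, so a document that is merely *good* in both rankings can beat one that is *best* in only one.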
MCP Logger (mcp_logger.py)
Specialized logging for MCP operations with structured, configurable output using loguru.
Installation
Prerequisites
- Python 3.13 or higher
- Node.js (for Playwright dependencies)
- System dependencies for Playwright (Chromium browser)
Setup
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd contextnest
   ```

2. Install dependencies using uv (recommended):

   ```bash
   uv sync
   ```

   Or using pip:

   ```bash
   pip install -r requirements.txt
   ```

3. Install Playwright browser dependencies:

   ```bash
   playwright install chromium
   ```

4. Set up the database:

   ```bash
   python -m contextnest.micro_search.db_preparation
   ```
Install from PyPI
You can also install ContextNest directly from PyPI:

```bash
pip install contextnest
```

Or use uvx to run the MCP server directly:

```bash
uvx contextnest
```
MCP Client Configuration
To use ContextNest with MCP-compatible clients (Claude Desktop, Cursor, etc.), add this to your MCP configuration:
```json
{
  "mcpServers": {
    "ContextNest": {
      "command": "uvx",
      "args": ["contextnest"]
    }
  }
}
```
Dependencies
ContextNest requires the following key dependencies:
- fastmcp: Model Context Protocol implementation
- playwright: Browser automation for web scraping
- duckdb: Database for document storage and search
- beautifulsoup4: HTML parsing
- loguru: Structured logging
- ollama: Optional LLM integration for embeddings
- google-genai: Google AI client for embeddings
Configuration
ContextNest uses default configurations that work out of the box, but can be customized:
Configuration Files
Configuration is handled through the MCP protocol, with default settings in the code. The database schema and search parameters are configured in the micro_search module.
Usage
Running the MCP Server
To start the ContextNest MCP server:
```bash
python -m contextnest.mcp_server
```
The server will start and wait for MCP client connections. It provides the following tools and resources:
Available Tools
1. Web Scraping (web_scrape)
Scrapes a URL and converts it to Markdown format, automatically saving the result to the default output directory.
Input Parameters:
- url (required): The URL to scrape
- save_path (optional): Custom path to save the Markdown locally
Example Usage:
When prompted by the MCP client:

```json
{
  "url": "https://example.com/article",
  "save_path": "/custom/path/article.md"
}
```
2. Search (search)
Performs a hybrid search (Vector + BM25) on the knowledge base.
Input Parameters:
- query (required): The search query
- limit (optional): Maximum number of results to return (default: 5)
- k (optional): Smoothing constant for RRF (default: 60)
- vector_weight (optional): Weight for vector search results (default: 1.0)
- fts_weight (optional): Weight for full-text search results (default: 1.0)
Example Usage:
When prompted by the MCP client:

```json
{
  "query": "machine learning algorithms",
  "limit": 10,
  "k": 60,
  "vector_weight": 1.0,
  "fts_weight": 1.0
}
```
3. Insert Knowledge (insert_knowledge)
Inserts a document into the knowledge base. This process chunks, embeds, and stores the content in DuckDB. This tool runs as a background task since it can take seconds to minutes to complete.
Input Parameters:
- url (required): The source URL of the content
- title (optional): The title of the content (extracted from the content if not provided)
- content (optional): The actual text content (scraped from the URL if not provided)
Example Usage:
When prompted by the MCP client:

```json
{
  "url": "https://example.com/important-document",
  "title": "Important Document Title",
  "content": "Full text content here..."
}
```
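Since `insert_knowledge` chunks content before embedding it, a fixed-size window with overlap is one common strategy. The sketch below is hypothetical; the real chunking lives in `micro_search/insertion.py` and may use different sizes or boundaries:

```python
# Hypothetical fixed-size character chunker with overlap, so that text
# spanning a chunk boundary still appears whole in at least one chunk.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # 4 [200, 200, 200, 50]
```

Each chunk is then embedded and stored alongside the source URL, which is why the tool runs as a background task: embedding many chunks can take seconds to minutes.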
4. Read Metadata (read_metadata)
Reads the application's metadata file, exposing the database's logical links and configuration.
Input Parameters: None
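An illustrative version of such a metadata reader is shown below. The path and JSON schema here are assumptions for the example; the real tool reads ContextNest's own metadata file, whose location and format are internal to the package:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata reader: load a JSON file, tolerating absence.

def read_metadata(path: Path) -> dict:
    """Load a JSON metadata file, returning an empty dict if it is missing."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "metadata.json"
    meta_path.write_text(json.dumps({"documents": 3}), encoding="utf-8")
    print(read_metadata(meta_path))            # {'documents': 3}
    print(read_metadata(Path(tmp) / "nope"))   # {}
```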
Available Resources
1. Output File Resource (contextnest://output/{filename})
Reads a markdown file from the ContextNest output directory.
Usage:
contextnest://output/filename.md
2. Metadata Resource (contextnest://metadata)
Reads the ContextNest metadata file.
Usage:
contextnest://metadata
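For a sense of how these URIs decompose, the standard library's `urlparse` handles custom schemes directly. This sketch is illustrative; the actual routing is done by the MCP server's resource handlers, not by client-side parsing:

```python
from urllib.parse import urlparse

# Sketch: split a contextnest:// URI into its resource kind and remainder.

def parse_resource_uri(uri: str) -> tuple[str, str]:
    """Return (resource kind, remainder) for a contextnest:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "contextnest":
        raise ValueError(f"not a contextnest URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

print(parse_resource_uri("contextnest://output/filename.md"))  # ('output', 'filename.md')
print(parse_resource_uri("contextnest://metadata"))            # ('metadata', '')
```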
MCP Prompt
ContextNest also provides a system prompt to guide LLMs on how to use the available tools:
```
You are an intelligent assistant with access to the ContextNest knowledge base. Use the available tools to answer requests:
- Use 'search' for finding relevant documents using hybrid search (Vector + BM25).
- Use 'web_scrape' to ingest new content from URLs if the search yields insufficient results.
- Use 'insert_knowledge' to explicitly save important information.
- Always cite your sources when providing answers from the knowledge base.
```
API Documentation
MCPLogger
Specialized logger for MCP operations with structured, configurable logging.
```python
from contextnest.mcp_logger import MCPLogger, log_request, log_response, log_error

# Create a logger instance
logger = MCPLogger(level="INFO")

# Log MCP operations
logger.log_request("web_scrape", {"url": "https://example.com"})
logger.log_response("search", {"results": 5})
logger.log_error("insert_knowledge", Exception("Database error"))

# Use convenience functions that use the global logger
from contextnest.mcp_logger import info_mcp, debug_mcp, warning_mcp

info_mcp("Processing search query")
debug_mcp("Detailed debug information", query_time=0.123)
warning_mcp("Potential issue detected")
```
Web Scraper API
The web scraper provides both class-based and function-based interfaces:
```python
from contextnest.web_scraper import WebScraper, scrape_url

# Class-based approach with full control
async with WebScraper(headless=True) as scraper:
    markdown = await scraper.scrape("https://example.com")

# Function-based approach for simple scraping
markdown = await scrape_url("https://example.com", headless=True)
```
Micro Search API
The search functionality includes vector search, full-text search, and hybrid search:
```python
from contextnest.micro_search import HybridSearch, hybrid_search, insert_document

# Insert a document
insert_document(
    url="https://example.com/doc",
    title="Document Title",
    content="Document content..."
)

# Perform hybrid search using the convenience function
query_embedding = [0.1, 0.2, 0.3, ...]  # 768-dimensional vector
results = hybrid_search(
    query="search query",
    query_embedding=query_embedding,
    limit=5
)

# Or use the class directly for more control
searcher = HybridSearch()
results = searcher.search(
    query="search query",
    query_embedding=query_embedding,
    limit=5,
    k=60,
    vector_weight=1.0,
    fts_weight=1.0
)
```
Server Tools API
The server tools module contains the core logic for all MCP operations:
```python
from contextnest.server_tools import (
    web_scrape_logic,
    search_logic,
    insert_knowledge_logic,
    read_metadata_logic
)

# Each logic function can be called independently
result = await web_scrape_logic(input_data, ctx)
result = await search_logic(input_data, ctx)
result = await insert_knowledge_logic(input_data, ctx)
result = read_metadata_logic(input_data)
```
Examples
Full Cycle Example
The repository includes a full cycle example that demonstrates:
- Web scraping from a URL
- Content statistics analysis
- Document insertion into the database
- Hybrid search with vector + BM25 ranking
```python
from examples.full_cycle_example import run_full_cycle_example

run_full_cycle_example()
```
The example uses the GitHub repository page for the AI Dev Tools Zoomcamp as a source, and performs a search for "what's the first question" to demonstrate the complete workflow. The example includes URL caching to skip scraping if the URL already exists in the database.
Custom Usage
```python
import asyncio
from contextnest.web_scraper import scrape_url
from contextnest.micro_search import insert_document, hybrid_search
from contextnest.mcp_logger import info_mcp

async def custom_workflow():
    # Scrape content from a URL
    content = await scrape_url("https://example.com/article")

    # Insert into the knowledge base
    insert_document(
        url="https://example.com/article",
        title="Example Article",
        content=content
    )

    # Search for relevant content
    query_embedding = [0.1, 0.2, 0.3]  # Generated embedding
    results = hybrid_search(
        query="relevant topic",
        query_embedding=query_embedding,
        limit=3
    )
    info_mcp(f"Found {len(results)} results")

# Run the workflow
asyncio.run(custom_workflow())
```
Detailed Module Documentation
For more detailed information about specific modules, see the documentation in the docs/ directory:
- Web Scraper Module: Documentation for the web scraping functionality
- Micro Search Module: Documentation for the hybrid search system
- Main Documentation Index: Complete documentation index
Contributing
We welcome contributions to ContextNest! Here's how you can help:
Development Setup
- Fork the repository
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -e .
  ```

- Install Playwright dependencies:

  ```bash
  playwright install
  ```
Code Standards
- Follow PEP 8 style guidelines
- Write type hints for all public functions
- Include docstrings for all classes and functions
- Add tests for new functionality
- Keep dependencies minimal and well-justified
Pull Request Process
- Create a feature branch from the main branch
- Make your changes with clear, descriptive commit messages
- Add tests if applicable
- Update documentation as needed
- Submit a pull request with a clear description of your changes
Reporting Issues
When reporting issues, please include:
- Python version
- Operating system
- Steps to reproduce the issue
- Expected vs. actual behavior
- Any relevant error messages
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For support, please open an issue in the GitHub repository. For questions about the Model Context Protocol, refer to the official MCP documentation.
Made with ❤️ for the AI development community.
File details

Details for the file contextnest-0.1.0.tar.gz.

File metadata

- Download URL: contextnest-0.1.0.tar.gz
- Upload date:
- Size: 145.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d7a04186de7837b1e55c8ac6e5c07e9ccb8925c39af10e90d0eec0f43b0fe52 |
| MD5 | 1687803d4ff5d5d8cb2812545937a984 |
| BLAKE2b-256 | 13b9b17d9b19f0caeebd2f93e90bf1ee5693c98ee037cc83dbc90086708f50d7 |
File details

Details for the file contextnest-0.1.0-py3-none-any.whl.

File metadata

- Download URL: contextnest-0.1.0-py3-none-any.whl
- Upload date:
- Size: 38.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b40cfe1dc4d16c78379ab0ff4b09e95f6530b4fcd93bf0b56407eede7f6e466 |
| MD5 | a7499b1a65bcb7e840b8876e2364611a |
| BLAKE2b-256 | ad518047287722a07770dbac211d69446a964527ad3f714174a259f411df02a4 |