Unified Python package for RAG documentation workflows - Crawl, embed, store, and retrieve technical documentation for AI agents
Context Bridge
Unified Python package for RAG-powered documentation management - Crawl, store, chunk, and retrieve technical documentation with vector + BM25 hybrid search.
Table of Contents
- Overview
- Features
- Architecture
- Installation
- Quick Start
- Configuration
- Usage
- Database Schema
- Development
- Technical Documentation
- License
Overview
Context Bridge is a standalone Python package designed to help AI agents, LLMs, and developers manage technical documentation with RAG (Retrieval-Augmented Generation) capabilities. It bridges the gap between scattered online documentation and AI-ready, searchable knowledge bases.
What It Does
- Crawls technical documentation from URLs using Crawl4AI
- Organizes crawled pages into logical groups with size management
- Chunks Markdown content intelligently while preserving structure
- Embeds chunks using vector embeddings (Ollama/Gemini)
- Stores everything in PostgreSQL using the vector and vchord_bm25 extensions
- Searches with a weighted combination of vector similarity and BM25 ranking
- Serves via MCP (Model Context Protocol) for AI agent integration
- Manages through a Streamlit UI for human oversight
Features
Core Capabilities
- Smart Crawling: Automatically detect and crawl documentation sites, sitemaps, and text files
- Intelligent Chunking: Smart Markdown chunking that respects code blocks, paragraphs, and sentences
- Hybrid Search: Dual vector + BM25 search for superior retrieval accuracy
- Version Management: Track multiple versions of the same documentation
- Document Organization: Manual page grouping with size constraints before chunking
- High Performance: PSQLPy for fast async PostgreSQL operations
- AI-Ready: MCP server for seamless AI agent integration
- User-Friendly: Streamlit UI for documentation management
Technical Features
- Vector Search: Powered by the PostgreSQL vector extension
- BM25 Full-Text Search: Uses the vchord_bm25 extension
- Async/Await: Fully asynchronous operations for scalability
- Configurable Embeddings: Support for Ollama (local) and Google Gemini (cloud)
- Type-Safe: Pydantic models for configuration and data validation
- Modular Design: Clean separation of concerns (repositories, services, managers)
Architecture
Context Bridge Architecture

Entry points:
  Streamlit UI  |  MCP Server (AI Agent)  |  Python API (Direct)
        |
        v
Service Layer
  - CrawlingService
  - ChunkingService
  - EmbeddingService
  - SearchService
        |
        v
Repository Layer
  - DocumentRepository
  - PageRepository
  - GroupRepository
  - ChunkRepository
        |
        v
PostgreSQL Manager
  - Connection Pooling
  - Transaction Management
        |
        v
PostgreSQL Database
  Extensions:
  - vector (vector search)
  - vchord_bm25 (BM25 search)
  - pg_tokenizer (text tokenization)

External Dependencies:
  - Crawl4AI (Crawling)
  - Ollama or Gemini (Embeddings)
Workflow
1. Crawl Documentation
2. Store Raw Pages
3. Manual Organization (Group Pages)
4. Smart Chunking
5. Generate Embeddings
6. Store with Vector + BM25 Indexes
7. Hybrid Search (Vector + BM25)
Installation
Prerequisites
- Python 3.11+
- PostgreSQL 14+ with extensions:
  - vector
  - vchord
  - pg_tokenizer
  - vchord_bm25
- Ollama (for local embeddings) or Google API Key (for Gemini)
Install from PyPI
pip install context-bridge
Install with Optional Dependencies
# With Gemini support
pip install context-bridge[gemini]
# With MCP server
pip install context-bridge[mcp]
# With Streamlit UI
pip install context-bridge[ui]
# All features
pip install context-bridge[all]
Running the Applications
MCP Server:
# Using the installed script
context-bridge-mcp
# Or run directly
python -m context_bridge_mcp
Streamlit UI:
# Using streamlit directly
streamlit run streamlit_app/app.py
# Or with uv
uv run streamlit run streamlit_app/app.py
Install from Source
git clone https://github.com/yourusername/context-bridge.git
cd context-bridge
pip install -e .
Quick Start
1. Initialize Database
python -m context_bridge.database.init_databases
This will:
- Create required PostgreSQL extensions
- Create all necessary tables
- Set up vector and BM25 indexes
2. Basic Usage (Three Ways)
Option A: Direct Python (Recommended for PyPI users)
import asyncio
from context_bridge import ContextBridge, Config

async def main():
    # Create config with your settings
    config = Config(
        postgres_host="localhost",
        postgres_password="your_secure_password",
        embedding_model="nomic-embed-text:latest"
    )
    # Use with context manager
    async with ContextBridge(config=config) as bridge:
        # Crawl documentation
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )
        # Search documentation
        search_results = await bridge.search(
            query="async await tutorial",
            document_id=result.document_id
        )
        for hit in search_results[:3]:
            print(f"Score: {hit.score}, Content: {hit.content[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
Option B: Environment Variables (Recommended for Docker/K8s)
# Set environment variables
export POSTGRES_HOST=postgres
export POSTGRES_PASSWORD=secure_password
export OLLAMA_BASE_URL=http://ollama:11434
export EMBEDDING_MODEL=nomic-embed-text:latest
# Or in docker-compose.yml
environment:
  - POSTGRES_HOST=postgres
  - POSTGRES_PASSWORD=secure_password
  - EMBEDDING_MODEL=nomic-embed-text:latest
import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from environment variables
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )

if __name__ == "__main__":
    asyncio.run(main())
Option C: .env File (Convenient for local development)
Create .env file (git-ignored):
# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=context_bridge
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text:latest
VECTOR_DIMENSION=768
# Search Configuration
SIMILARITY_THRESHOLD=0.7
BM25_WEIGHT=0.3
VECTOR_WEIGHT=0.7
# Chunking Configuration
CHUNK_SIZE=2000
MIN_COMBINED_CONTENT_SIZE=100
MAX_COMBINED_CONTENT_SIZE=3500000
# Crawling Configuration
CRAWL_MAX_DEPTH=3
CRAWL_MAX_CONCURRENT=5
Then in your code:
import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from .env file (if python-dotenv is available)
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(...)
To use .env files in development, install with dev dependencies:
pip install context-bridge[dev]
Configuration
The package uses Pydantic models for type-safe configuration and supports three configuration methods:
Configuration Methods (Priority Order)
- Direct Python instantiation (recommended for packaged installs)
- Environment variables (recommended for containers/CI)
- .env file (convenient for local development only)
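The priority order above can be illustrated with a small lookup function. This is a minimal sketch, not the package's actual loader: it assumes the list is ordered from highest to lowest priority, and `resolve_setting`, `DEFAULTS`, and the sample values are hypothetical names introduced for illustration.

```python
import os
from typing import Mapping, Optional

# Hypothetical defaults for this sketch, copied from the settings tables in this README.
DEFAULTS = {"POSTGRES_HOST": "localhost", "POSTGRES_PORT": "5432"}


def resolve_setting(
    name: str,
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    dotenv: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first value found: explicit argument, environment, .env, default."""
    environ = os.environ if environ is None else environ
    dotenv = {} if dotenv is None else dotenv
    if explicit is not None:
        return explicit
    if name in environ:
        return environ[name]
    if name in dotenv:
        return dotenv[name]
    return DEFAULTS.get(name)


# Sample inputs: one value set in the environment, two in a parsed .env file.
env = {"POSTGRES_HOST": "postgres"}
dotenv_values = {"POSTGRES_HOST": "127.0.0.1", "POSTGRES_PORT": "5433"}

print(resolve_setting("POSTGRES_HOST", explicit="db.internal", environ=env, dotenv=dotenv_values))
print(resolve_setting("POSTGRES_HOST", environ=env, dotenv=dotenv_values))
print(resolve_setting("POSTGRES_PORT", environ=env, dotenv=dotenv_values))
```

Under these assumptions, a value passed directly in Python wins over the environment, and the environment wins over the .env file.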
Core Settings
| Setting | Default | Description | Python API |
|---|---|---|---|
| POSTGRES_HOST | localhost | PostgreSQL host | postgres_host |
| POSTGRES_PORT | 5432 | PostgreSQL port | postgres_port |
| POSTGRES_USER | postgres | PostgreSQL user | postgres_user |
| POSTGRES_PASSWORD | (empty) | PostgreSQL password (min 8 chars for prod) | postgres_password |
| POSTGRES_DB | context_bridge | Database name | postgres_db |
| DB_POOL_MAX | 10 | Connection pool size | postgres_max_pool_size |
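To show how the core settings fit together, here is a minimal sketch that assembles them into a standard PostgreSQL connection URI. The `PostgresSettings` class and its `dsn` method are hypothetical helpers for illustration; the package may build its connection differently.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class PostgresSettings:
    """Hypothetical container mirroring the defaults in the Core Settings table."""

    host: str = "localhost"
    port: int = 5432
    user: str = "postgres"
    password: str = ""
    db: str = "context_bridge"

    def dsn(self) -> str:
        # Percent-encode credentials so characters like "@" or "/" stay valid in a URI.
        user = quote(self.user, safe="")
        password = quote(self.password, safe="")
        auth = f"{user}:{password}" if password else user
        return f"postgresql://{auth}@{self.host}:{self.port}/{quote(self.db, safe='')}"


print(PostgresSettings().dsn())
print(PostgresSettings(password="p@ss/word").dsn())
```

Encoding the password matters in practice: an unescaped "@" or "/" would otherwise be parsed as part of the host or path.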
Embedding Settings
| Setting | Default | Description | Python API |
|---|---|---|---|
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API URL | ollama_base_url |
| EMBEDDING_MODEL | nomic-embed-text:latest | Ollama model name | embedding_model |
| VECTOR_DIMENSION | 768 | Embedding vector dimension | vector_dimension |
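Because the chunks table stores embeddings in a fixed-width VECTOR(768) column, the configured VECTOR_DIMENSION has to match what the embedding model returns. The following sketch shows one way to fail early on a mismatch; `check_embedding_dimension` is a hypothetical helper, not part of the package API.

```python
from typing import Sequence


def check_embedding_dimension(embedding: Sequence[float], expected: int = 768) -> None:
    """Raise ValueError when an embedding does not match the configured dimension."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, expected {expected}"
        )


# A vector of the configured size passes silently.
check_embedding_dimension([0.0] * 768)

# A vector from a model with a different output size is rejected.
try:
    check_embedding_dimension([0.0] * 1024)
except ValueError as error:
    print(error)
```

Running a check like this before inserting avoids a database error after the embedding call has already been made.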
Search Settings
| Setting | Default | Description | Python API |
|---|---|---|---|
| SIMILARITY_THRESHOLD | 0.7 | Minimum similarity score | similarity_threshold |
| BM25_WEIGHT | 0.3 | BM25 weight in hybrid search | bm25_weight |
| VECTOR_WEIGHT | 0.7 | Vector weight in hybrid search | vector_weight |
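One common way to apply the two weights is to normalize each score list to the range 0 to 1 and take a weighted sum. The sketch below illustrates that idea only; this README does not state how Context Bridge combines scores internally, and `min_max`, `fuse_scores`, and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to [0, 1] so vector and BM25 values are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def fuse_scores(
    vector_scores: Dict[int, float],
    bm25_scores: Dict[int, float],
    vector_weight: float = 0.7,
    bm25_weight: float = 0.3,
) -> List[Tuple[int, float]]:
    """Return (chunk_id, combined_score) pairs sorted from best to worst."""
    vector_norm = min_max(vector_scores)
    bm25_norm = min_max(bm25_scores)
    chunk_ids = set(vector_norm) | set(bm25_norm)
    combined = {
        chunk_id: vector_weight * vector_norm.get(chunk_id, 0.0)
        + bm25_weight * bm25_norm.get(chunk_id, 0.0)
        for chunk_id in chunk_ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Chunk 2 is found by both searches, so it ranks first.
ranked = fuse_scores({1: 0.9, 2: 0.8, 3: 0.5}, {2: 12.0, 3: 4.0, 4: 8.0})
for chunk_id, score in ranked:
    print(chunk_id, round(score, 3))
```

A chunk missing from one result list simply contributes zero for that component, so results found by both searches tend to rise to the top.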
Chunking Settings
| Setting | Default | Description | Python API |
|---|---|---|---|
| CHUNK_SIZE | 2000 | Default chunk size (bytes) | chunk_size |
| MIN_COMBINED_CONTENT_SIZE | 100 | Minimum combined page size (bytes) | min_combined_content_size |
| MAX_COMBINED_CONTENT_SIZE | 3500000 | Maximum combined page size (bytes) | max_combined_content_size |
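The two combined-size limits can be read as an eligibility check on a page group before chunking. This is a rough sketch under that reading; `check_group_size` is a hypothetical helper, and measuring size as UTF-8 bytes is an assumption based on the "(bytes)" unit in the table.

```python
from typing import Iterable, Tuple

MIN_COMBINED_CONTENT_SIZE = 100        # default from the table above
MAX_COMBINED_CONTENT_SIZE = 3_500_000  # default from the table above


def check_group_size(
    pages: Iterable[str],
    min_size: int = MIN_COMBINED_CONTENT_SIZE,
    max_size: int = MAX_COMBINED_CONTENT_SIZE,
) -> Tuple[bool, int]:
    """Return (is_eligible, total_bytes) for the combined content of a page group."""
    total = sum(len(page.encode("utf-8")) for page in pages)
    return (min_size <= total <= max_size, total)


# A group of two tiny pages falls below the minimum.
print(check_group_size(["# Title\n", "Short page."]))

# A group whose combined size lies within the limits is eligible.
print(check_group_size(["x" * 60, "y" * 60]))
```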
Crawling Settings
| Setting | Default | Description | Python API |
|---|---|---|---|
| CRAWL_MAX_DEPTH | 3 | Maximum crawl depth | crawl_max_depth |
| CRAWL_MAX_CONCURRENT | 10 | Maximum concurrent crawl operations | crawl_max_concurrent |
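To illustrate what a depth limit and a concurrency limit do together, the sketch below runs a breadth-first crawl over a small in-memory link graph. It does not use Crawl4AI; `SITE`, `fetch_links`, and `crawl` are hypothetical, and treating the start URL as depth 0 is an assumption.

```python
import asyncio
from typing import Dict, List, Set

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/a", "/b/1"],
    "/a/1": ["/a/1/x"],
    "/b/1": [],
    "/a/1/x": [],
}


async def fetch_links(url: str, semaphore: asyncio.Semaphore) -> List[str]:
    """Pretend to fetch a page; the semaphore caps concurrent fetches."""
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for network I/O
        return SITE.get(url, [])


async def crawl(start: str, max_depth: int = 3, max_concurrent: int = 10) -> Set[str]:
    """Breadth-first crawl that discovers pages up to max_depth links from start."""
    semaphore = asyncio.Semaphore(max_concurrent)
    seen: Set[str] = {start}
    frontier: List[str] = [start]
    for _ in range(max_depth):
        results = await asyncio.gather(*(fetch_links(url, semaphore) for url in frontier))
        next_frontier: List[str] = []
        for links in results:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        if not next_frontier:
            break
        frontier = next_frontier
    return seen


print(sorted(asyncio.run(crawl("/", max_depth=2))))
```

With max_depth=2 the page three links away from the start is never discovered, while the semaphore keeps at most max_concurrent fetches in flight at any time.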
Usage
Crawling Documentation
from context_bridge.service.crawling_service import CrawlingService, CrawlConfig
from crawl4ai import AsyncWebCrawler

# Configure crawler
config = CrawlConfig(
    max_depth=3,            # How deep to follow links
    max_concurrent=10,      # Concurrent requests
    memory_threshold=70.0   # Memory usage threshold
)
service = CrawlingService(config)

# Crawl a documentation site
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await service.crawl_webpage(
        crawler,
        "https://docs.example.com"
    )

# Access results
for crawl_result in result.results:
    print(f"URL: {crawl_result.url}")
    print(f"Content length: {len(crawl_result.markdown)}")
Storing Documents
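import hashlib

from context_bridge.repositories.document_repository import DocumentRepository
from context_bridge.repositories.page_repository import PageRepository

# Assumes db_manager (the PostgreSQL manager) and crawled_pages (results of a
# previous crawl) already exist, and that this runs inside an async function.
async with db_manager.connection() as conn:
    doc_repo = DocumentRepository(conn)
    page_repo = PageRepository(conn)

    # Create a new document
    doc_id = await doc_repo.create(
        name="Python Documentation",
        version="3.11",
        source_url="https://docs.python.org/3/",
        description="Official Python 3.11 documentation"
    )

    # Store crawled pages
    for page in crawled_pages:
        await page_repo.create(
            document_id=doc_id,
            url=page.url,
            content=page.markdown,
            # Use a stable text digest; the built-in hash() returns an int
            # that changes between interpreter runs for strings.
            content_hash=hashlib.sha256(page.markdown.encode("utf-8")).hexdigest()
        )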
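The content_hash column is useful for detecting whether a page changed between crawls, which only works if the hash is stable across processes. The sketch below shows one such approach with SHA-256; `content_hash` and `has_changed` are hypothetical helpers, and the package may compute its hashes differently.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Return a stable hex digest suitable for a TEXT column."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def has_changed(stored_hash: str, new_markdown: str) -> bool:
    """Compare the stored digest with the digest of freshly crawled content."""
    return stored_hash != content_hash(new_markdown)


stored = content_hash("# Install\n\npip install context-bridge\n")

# Re-crawling identical content produces the same digest, so nothing to update.
print(has_changed(stored, "# Install\n\npip install context-bridge\n"))

# Any edit to the page produces a different digest.
print(has_changed(stored, "# Install\n\npip install context-bridge[all]\n"))
```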
Organizing Pages into Groups
from context_bridge.repositories.group_repository import GroupRepository

# User manually selects pages to group
page_ids = [1, 2, 3, 4, 5]

# Create a group
async with db_manager.connection() as conn:
    group_repo = GroupRepository(conn)
    group_id = await group_repo.create_group(
        document_id=doc_id,
        page_ids=page_ids,
        min_size=1000,   # Minimum total content size
        max_size=50000   # Maximum total content size
    )
Chunking and Embedding
from context_bridge.service.chunking_service import ChunkingService
from context_bridge.service.embedding import EmbeddingService

chunking_service = ChunkingService()
embedding_service = EmbeddingService(config)

# Get groups ready for chunking
eligible_groups = await group_repo.get_eligible_groups(doc_id)
for group in eligible_groups:
    # Get combined content
    content = await group_repo.get_group_content(group.id)

    # Smart chunking
    chunks = chunking_service.smart_chunk_markdown(
        content,
        chunk_size=2000
    )

    # Generate embeddings and store
    for i, chunk_text in enumerate(chunks):
        embedding = await embedding_service.get_embedding(chunk_text)
        await chunk_repo.create(
            document_id=doc_id,
            group_id=group.id,
            chunk_index=i,
            content=chunk_text,
            embedding=embedding
        )
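To give a feel for chunking that respects code blocks, paragraphs, and sentences, here is a simplified, self-contained sketch. It is not the package's smart_chunk_markdown implementation (see the Smart Chunking Algorithm guide for that); `split_blocks`, `pack`, and `chunk_markdown` are hypothetical, and sizes are measured in characters for simplicity.

```python
import re
from typing import List

FENCE = "`" * 3  # Markdown code fence marker


def split_blocks(markdown: str) -> List[str]:
    """Split text into paragraphs, keeping each fenced code block as one block."""
    blocks: List[str] = []
    current: List[str] = []
    in_code = False

    def flush() -> None:
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in markdown.splitlines():
        is_fence = line.lstrip().startswith(FENCE)
        if is_fence and not in_code:
            flush()  # close the paragraph before the code block
            current.append(line)
            in_code = True
        elif is_fence and in_code:
            current.append(line)
            flush()  # the code block is complete
            in_code = False
        elif not in_code and not line.strip():
            flush()  # a blank line ends a paragraph
        else:
            current.append(line)
    flush()
    return blocks


def pack(pieces: List[str], chunk_size: int, sep: str) -> List[str]:
    """Greedily join pieces into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def chunk_markdown(markdown: str, chunk_size: int = 2000) -> List[str]:
    """Chunk at block boundaries; split oversized paragraphs at sentence ends."""
    pieces: List[str] = []
    for block in split_blocks(markdown):
        if len(block) > chunk_size and not block.lstrip().startswith(FENCE):
            sentences = re.split(r"(?<=[.!?])\s+", block)
            pieces.extend(pack(sentences, chunk_size, sep=" "))
        else:
            pieces.append(block)  # code blocks stay whole, even if oversized
    return pack(pieces, chunk_size, sep="\n\n")


doc = (
    "Intro sentence one. Intro sentence two.\n\n"
    + FENCE + "python\nprint('a')\n\nprint('b')\n" + FENCE
    + "\n\nClosing paragraph."
)
for chunk in chunk_markdown(doc, chunk_size=40):
    print(repr(chunk))
```

In this sketch a fenced code block is never split, so a single chunk can exceed chunk_size when a code block is larger than the limit.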
Searching Documents
Find Documents by Query
from context_bridge.repositories.document_repository import DocumentRepository

# Find relevant documents
documents = await doc_repo.find_by_query(
    query="python asyncio tutorial",
    limit=5
)
for doc in documents:
    print(f"{doc.name} (v{doc.version})")
Search Document Content (Hybrid Search)
from context_bridge.repositories.chunk_repository import ChunkRepository

# Search within a specific document
chunks = await chunk_repo.hybrid_search(
    document_id=doc_id,
    version="3.11",
    query="async await examples",
    query_embedding=await embedding_service.get_embedding("async await examples"),
    limit=10,
    vector_weight=0.7,
    bm25_weight=0.3
)
for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content[:200]}...")
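The vector half of a hybrid search compares the query embedding with stored chunk embeddings, and the schema in this README indexes embeddings with cosine operators. As a minimal sketch of that comparison in plain Python, the following computes cosine similarity and applies a minimum score; `cosine_similarity`, `filter_by_similarity`, and the tiny two-dimensional vectors are hypothetical, and how the package applies SIMILARITY_THRESHOLD is an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    query: Sequence[float],
    chunks: Dict[int, Sequence[float]],
    threshold: float = 0.7,
) -> List[Tuple[int, float]]:
    """Keep chunks whose similarity to the query meets the threshold, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks.items()]
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Chunk 3 points in an unrelated direction and is filtered out.
stored = {1: [3.0, 4.0], 2: [4.0, 3.0], 3: [-4.0, 3.0]}
print(filter_by_similarity([3.0, 4.0], stored))
```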
Using the Streamlit UI
Context Bridge includes a web interface for managing documentation:
# Install with UI support
pip install context-bridge[ui]
# Run the Streamlit application
uv run streamlit run streamlit_app/app.py
# Or use the installed script
context-bridge-ui
Features:
- Document Management: View, search, and delete documents
- Page Organization: Select and group crawled pages for processing
- Chunk Processing: Convert page groups into searchable chunks
- Hybrid Search: Search across all documentation with advanced filtering
Using the MCP Server
The Model Context Protocol server allows AI agents to interact with Context Bridge:
# Install with MCP support
pip install context-bridge[mcp]
# Run the MCP server
uv run python -m context_bridge_mcp
# Or use the installed script
context-bridge-mcp
Available Tools:
- find_documents: Search for documents by query
- search_content: Perform hybrid vector + BM25 search within specific documents
Integration with AI Clients: The MCP server can be integrated with AI assistants like Claude Desktop for seamless documentation access.
For detailed usage instructions, see the MCP Server Usage Guide.
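As a rough sketch of what a Claude Desktop integration might look like, the snippet below generates an `mcpServers` entry of the kind Claude Desktop reads from its claude_desktop_config.json file. The server label "context-bridge" and the helper `claude_desktop_config` are assumptions made for illustration; the MCP Server Usage Guide remains the authoritative reference for the exact setup.

```python
import json
from typing import Dict, Optional


def claude_desktop_config(env: Optional[Dict[str, str]] = None) -> dict:
    """Build an mcpServers entry that launches the context-bridge-mcp script."""
    return {
        "mcpServers": {
            # "context-bridge" is an arbitrary label chosen for this sketch.
            "context-bridge": {
                "command": "context-bridge-mcp",
                "args": [],
                "env": env or {},
            }
        }
    }


config = claude_desktop_config(
    {
        "POSTGRES_HOST": "localhost",
        "POSTGRES_PASSWORD": "your_secure_password",
        "EMBEDDING_MODEL": "nomic-embed-text:latest",
    }
)
print(json.dumps(config, indent=2))
```

The environment variables shown are the ones documented in the Configuration section; pass whichever ones your deployment needs.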
Database Schema
Core Tables
-- Documents (versioned documentation)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    source_url TEXT,
    description TEXT,
    metadata JSONB DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(name, version)
);

-- Pages (raw crawled content)
CREATE TABLE pages (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    url TEXT NOT NULL UNIQUE,
    content TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    content_length INTEGER GENERATED ALWAYS AS (length(content)) STORED,
    crawled_at TIMESTAMPTZ DEFAULT NOW(),
    status TEXT DEFAULT 'pending' CHECK (status IN ('pending', 'processing', 'chunked', 'deleted')),
    group_id UUID, -- For future grouping feature
    metadata JSONB DEFAULT '{}'::jsonb
);

-- Chunks (embedded content)
CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    group_id UUID, -- For future grouping feature
    embedding VECTOR(768), -- Dimension must match config
    bm25_vector bm25vector, -- Auto-generated by trigger
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(document_id, group_id, chunk_index)
);

-- Indexes
CREATE INDEX idx_pages_document ON pages(document_id);
CREATE INDEX idx_pages_status ON pages(status);
CREATE INDEX idx_pages_hash ON pages(content_hash);
CREATE INDEX idx_pages_group ON pages(group_id);
CREATE INDEX idx_chunks_document ON chunks(document_id);
CREATE INDEX idx_chunks_group ON chunks(group_id);
CREATE INDEX idx_chunks_vector ON chunks USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_chunks_bm25 ON chunks USING bm25(bm25_vector bm25_ops);
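The IVFFlat index above uses `lists = 100`. The pgvector documentation suggests a starting point of rows / 1000 for tables up to one million rows and the square root of the row count beyond that. The helper below is a small sketch of that rule of thumb; `suggested_ivfflat_lists` is a hypothetical name, and whether Context Bridge tunes this value is not stated in this README.

```python
import math


def suggested_ivfflat_lists(row_count: int) -> int:
    """Starting point for the IVFFlat lists parameter, per the pgvector guidance."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


# Under this rule of thumb, lists = 100 corresponds to roughly 100,000 chunks.
for rows in (500, 100_000, 4_000_000):
    print(rows, suggested_ivfflat_lists(rows))
```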
Development
Project Structure
context_bridge/  # Core package
├── __init__.py
├── config.py  # Configuration management
├── core.py  # Main ContextBridge API
├── database/
│   ├── init_databases.py  # Database initialization
│   └── postgres_manager.py  # Connection pool manager
├── schema/
│   └── extensions.sql  # PostgreSQL extensions & schema
├── repositories/  # Data access layer
│   ├── document_repository.py
│   ├── page_repository.py
│   ├── group_repository.py
│   └── chunk_repository.py
└── service/  # Business logic layer
    ├── crawling_service.py
    ├── chunking_service.py
    ├── embedding.py
    ├── search_service.py
    └── url_service.py

context_bridge_mcp/  # MCP Server (Model Context Protocol)
├── __init__.py
├── server.py  # MCP server implementation
├── schemas.py  # Tool input/output schemas
└── __main__.py  # CLI entry point

streamlit_app/  # Streamlit Web UI
├── __init__.py
├── app.py  # Main application
├── pages/  # Multi-page navigation
│   ├── documents.py  # Document management
│   ├── crawled_pages.py  # Page management
│   └── search.py  # Search interface
├── components/  # Reusable UI components
├── utils/  # UI utilities and helpers
└── README.md  # UI-specific documentation

docs/  # Documentation
├── guide/
│   └── MCP_SERVER_USAGE.md  # MCP server usage guide
├── plan/  # Development plans
│   └── ui_and_mcp_implementation_plan.md
├── technical/  # Technical guides
│   ├── crawl4ai_complete_guide.md
│   ├── embedding_service.md
│   ├── psqlpy-complete-guide.md
│   ├── python_mcp_server_guide.md
│   ├── python-testing-guide.md
│   └── smart_chunk_markdown_algorithm.md
└── memory_templates.yaml  # Memory usage templates

tests/  # Test suite
├── conftest.py
├── integration/
├── unit/
└── e2e/  # End-to-end tests
    ├── conftest.py
    └── test_streamlit_ui.py
Running Tests
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests
pytest
# Run with coverage
pytest --cov=context_bridge --cov-report=html
# Run specific test file
pytest tests/test_chunking_service.py -v
Code Quality
# Format code
black context_bridge tests
# Type checking
mypy context_bridge
# Linting
ruff check context_bridge
Technical Documentation
Comprehensive technical guides are available in docs/:
Testing & Quality Assurance
- UI Testing Report - Comprehensive Playwright testing results and bug fixes
- MCP Server Usage Guide - How to use the MCP server with AI clients
Technical Guides (docs/technical/)
- Crawl4AI Guide - Complete crawling documentation
- Embedding Service - Ollama and Gemini embedding setup
- PSQLPy Guide - PostgreSQL driver usage
- MCP Server Guide - MCP server implementation
- Testing Guide - Testing best practices
- Smart Chunking Algorithm - Chunking implementation
Implementation Plans (docs/plan/)
- UI & MCP Implementation Plan - Development roadmap and progress
Contributing
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Crawl4AI - High-performance web crawler
- PSQLPy - Async PostgreSQL driver
- pgvector - Vector similarity search
- MCP - Model Context Protocol
Support
For questions, issues, or feature requests:
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your.email@example.com
Built with ❤️ for AI agents and developers
File details
Details for the file context_bridge-0.1.1.tar.gz.
File metadata
- Download URL: context_bridge-0.1.1.tar.gz
- Upload date:
- Size: 630.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b455eab8911cf49ee8e1d44326ff42324b0b9661fd411cf7d7f927e072aa9bb7 |
| MD5 | 8ab13a0619c0a76a7f012a90c3d50905 |
| BLAKE2b-256 | 5599385ac52fe03a6c8ff1bf0fda818ad1ea1c2088c564f9b0d8926a895c01be |
File details
Details for the file context_bridge-0.1.1-py3-none-any.whl.
File metadata
- Download URL: context_bridge-0.1.1-py3-none-any.whl
- Upload date:
- Size: 58.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f061a5a2c00a26470926e7c20aca99378426d66a1e7cb9c9025dd378ada412c2 |
| MD5 | f6a89693e41c67c3dc096f7cfadf6c62 |
| BLAKE2b-256 | 022ded1de5f81ad032b5849e0fac6683d261e305f99ae62e1ed85e8a14c1d721 |