Unified Python package for RAG documentation workflows - Crawl, embed, store, and retrieve technical documentation for AI agents

These details have not been verified by PyPI

Project description

Context Bridge 🌉

Unified Python package for RAG-powered documentation management - Crawl, store, chunk, and retrieve technical documentation with vector + BM25 hybrid search.

🎯 Overview

Context Bridge is a standalone Python package designed to help AI agents, LLMs, and developers manage technical documentation with RAG (Retrieval-Augmented Generation) capabilities. It bridges the gap between scattered online documentation and AI-ready, searchable knowledge bases.

What It Does

Crawls technical documentation from URLs using Crawl4AI
Organizes crawled pages into logical groups with size management
Chunks Markdown content intelligently while preserving structure
Embeds chunks using vector embeddings (Ollama/Gemini)
Stores everything in PostgreSQL with vector and vchord_bm25
Searches with hybrid vector + BM25 search for best results
Serves via MCP (Model Context Protocol) for AI agent integration
Manages through a Streamlit UI for human oversight

✨ Features

Core Capabilities

🕷️ Smart Crawling: Automatically detect and crawl documentation sites, sitemaps, and text files
📦 Intelligent Chunking: Smart Markdown chunking that respects code blocks, paragraphs, and sentences
🔍 Hybrid Search: Dual vector + BM25 search for superior retrieval accuracy
📚 Version Management: Track multiple versions of the same documentation
🎯 Document Organization: Manual page grouping with size constraints before chunking
⚡ High Performance: PSQLPy for fast async PostgreSQL operations
🤖 AI-Ready: MCP server for seamless AI agent integration
🎨 User-Friendly: Streamlit UI for documentation management
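
To make "chunking that respects code blocks, paragraphs, and sentences" more concrete, here is a minimal sketch of a boundary-preferring splitter. It is an illustration of the general technique, not the package's `ChunkingService` code: the name `sketch_chunk_markdown` and the 30% minimum-cut heuristic are assumptions made for this example.

```python
def sketch_chunk_markdown(text: str, chunk_size: int = 2000) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical sketch: prefers to cut just before a code-fence marker,
    then at a blank line, then after a sentence end.
    """
    chunks: list[str] = []
    start = 0
    # Ignore boundaries that would produce a very small chunk.
    min_cut = int(chunk_size * 0.3)
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # (separator, characters of the separator kept in this chunk)
            for separator, keep in (("```", 0), ("\n\n", 0), (". ", 1)):
                cut = window.rfind(separator)
                if cut > min_cut:
                    end = start + cut + keep
                    break
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    return chunks


if __name__ == "__main__":
    sample = "\n\n".join(f"Paragraph {i} " + "word " * 20 for i in range(10))
    for chunk in sketch_chunk_markdown(sample, chunk_size=200):
        print(len(chunk), repr(chunk[:30]))
```

When no suitable boundary is found in the window, the sketch falls back to a hard cut at `chunk_size`, so every chunk stays within the limit.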

Technical Features

Vector Search: Powered by the `vector` extension (pgvector)
BM25 Full-Text Search: Using the `vchord_bm25` extension
Async/Await: Fully asynchronous operations for scalability
Configurable Embeddings: Support for Ollama (local) and Google Gemini (cloud)
Type-Safe: Pydantic models for configuration and data validation
Modular Design: Clean separation of concerns (repositories, services, managers)

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Context Bridge Architecture               │
└─────────────────────────────────────────────────────────────┘

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Streamlit   │    │ MCP Server   │    │  Python API  │
│      UI      │    │  (AI Agent)  │    │   (Direct)   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           ▼
       ┌───────────────────────────────────────┐
       │         Service Layer                 │
       │  - CrawlingService                    │
       │  - ChunkingService                    │
       │  - EmbeddingService                   │
       │  - SearchService                      │
       └───────────────┬───────────────────────┘
                       │
                       ▼
       ┌───────────────────────────────────────┐
       │       Repository Layer                │
       │  - DocumentRepository                 │
       │  - PageRepository                     │
       │  - GroupRepository                    │
       │  - ChunkRepository                    │
       └───────────────┬───────────────────────┘
                       │
                       ▼
       ┌───────────────────────────────────────┐
       │      PostgreSQL Manager               │
       │  - Connection Pooling                 │
       │  - Transaction Management             │
       └───────────────┬───────────────────────┘
                       │
                       ▼
       ┌───────────────────────────────────────┐
       │     PostgreSQL Database               │
       │  Extensions:                          │
       │  - vector (vector search)             │
       │  - vchord_bm25 (BM25 search)          │
       │  - pg_tokenizer (text tokenization)   │
       └───────────────────────────────────────┘

External Dependencies:
┌──────────────┐    ┌──────────────┐
│   Crawl4AI   │    │    Ollama    │
│  (Crawling)  │    │  or Gemini   │
│              │    │ (Embeddings) │
└──────────────┘    └──────────────┘

Workflow

1. Crawl Documentation
   ↓
2. Store Raw Pages
   ↓
3. Manual Organization (Group Pages)
   ↓
4. Smart Chunking
   ↓
5. Generate Embeddings
   ↓
6. Store with Vector + BM25 Indexes
   ↓
7. Hybrid Search (Vector + BM25)

📦 Installation

Prerequisites

Python 3.11+
PostgreSQL 14+ with extensions:
- vector
- vchord
- pg_tokenizer
- vchord_bm25
Ollama (for local embeddings) or Google API Key (for Gemini)

Install from PyPI

pip install context-bridge

Install with Optional Dependencies

# Quote the extras so shells such as zsh do not expand the brackets

# With Gemini support
pip install "context-bridge[gemini]"

# With MCP server
pip install "context-bridge[mcp]"

# With Streamlit UI
pip install "context-bridge[ui]"

# All features
pip install "context-bridge[all]"

Running the Applications

MCP Server:

# Using the installed script
context-bridge-mcp

# Or run directly
python -m context_bridge_mcp

Streamlit UI:

# Using streamlit directly
streamlit run streamlit_app/app.py

# Or with uv
uv run streamlit run streamlit_app/app.py

Install from Source

git clone https://github.com/yourusername/context-bridge.git
cd context-bridge
pip install -e .

🚀 Quick Start

1. Initialize Database

python -m context_bridge.database.init_databases

This will:

Create required PostgreSQL extensions
Create all necessary tables
Set up vector and BM25 indexes

2. Basic Usage (Three Ways)

Option A: Direct Python (Recommended for PyPI users)

import asyncio
from context_bridge import ContextBridge, Config

async def main():
    # Create config with your settings
    config = Config(
        postgres_host="localhost",
        postgres_password="your_secure_password",
        embedding_model="nomic-embed-text:latest"
    )
    
    # Use with context manager
    async with ContextBridge(config=config) as bridge:
        # Crawl documentation
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )
        
        # Search documentation
        search_results = await bridge.search(
            query="async await tutorial",
            document_id=result.document_id
        )
        
        for hit in search_results[:3]:
            print(f"Score: {hit.score}, Content: {hit.content[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())

Option B: Environment Variables (Recommended for Docker/K8s)

# Set environment variables
export POSTGRES_HOST=postgres
export POSTGRES_PASSWORD=secure_password
export OLLAMA_BASE_URL=http://ollama:11434
export EMBEDDING_MODEL=nomic-embed-text:latest

# Or in docker-compose.yml
environment:
  - POSTGRES_HOST=postgres
  - POSTGRES_PASSWORD=secure_password
  - EMBEDDING_MODEL=nomic-embed-text:latest

import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from environment variables
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )

if __name__ == "__main__":
    asyncio.run(main())

Option C: .env File (Convenient for local development)

Create .env file (git-ignored):

# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=context_bridge

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text:latest
VECTOR_DIMENSION=768

# Search Configuration
SIMILARITY_THRESHOLD=0.7
BM25_WEIGHT=0.3
VECTOR_WEIGHT=0.7

# Chunking Configuration
CHUNK_SIZE=2000
MIN_COMBINED_CONTENT_SIZE=100
MAX_COMBINED_CONTENT_SIZE=3500000

# Crawling Configuration
CRAWL_MAX_DEPTH=3
CRAWL_MAX_CONCURRENT=5

Then in your code:

import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from .env file (if python-dotenv is available)
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(...)

if __name__ == "__main__":
    asyncio.run(main())

To use .env files in development, install with dev dependencies:

pip install "context-bridge[dev]"

⚙️ Configuration

The package uses Pydantic for type-safe, type-hinted configuration. Context Bridge supports three configuration methods:

Configuration Methods (Priority Order)

1. Direct Python instantiation (recommended for packaged installs)
2. Environment variables (recommended for containers/CI)
3. .env file (convenient for local development only)
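
The lookup order can be pictured with a small resolver. This is a sketch under the assumption that "priority order" means precedence (an explicit value wins over an environment variable, which wins over a .env entry); `resolve_setting` is a hypothetical helper, not part of the package, which relies on Pydantic for the real work.

```python
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    explicit: Optional[str],
    environ: Mapping[str, str],
    dotenv_values: Mapping[str, str],
    default: str,
) -> str:
    """Return the first value found, checking sources in priority order."""
    if explicit is not None:      # 1. value passed directly in Python
        return explicit
    if name in environ:           # 2. environment variable
        return environ[name]
    if name in dotenv_values:     # 3. entry parsed from a .env file
        return dotenv_values[name]
    return default                # fall back to the documented default


if __name__ == "__main__":
    host = resolve_setting(
        "POSTGRES_HOST",
        explicit=None,
        environ={"POSTGRES_HOST": "postgres"},
        dotenv_values={"POSTGRES_HOST": "localhost"},
        default="localhost",
    )
    print(host)
```

Passing the sources in as plain mappings keeps the sketch easy to test without touching the real process environment.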

Core Settings

Setting	Default	Description	Python API
`POSTGRES_HOST`	`localhost`	PostgreSQL host	`postgres_host`
`POSTGRES_PORT`	`5432`	PostgreSQL port	`postgres_port`
`POSTGRES_USER`	`postgres`	PostgreSQL user	`postgres_user`
`POSTGRES_PASSWORD`	(empty)	PostgreSQL password (minimum 8 characters in production)	`postgres_password`
`POSTGRES_DB`	`context_bridge`	Database name	`postgres_db`
`DB_POOL_MAX`	`10`	Connection pool size	`postgres_max_pool_size`

Embedding Settings

Setting	Default	Description	Python API
`OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama API URL	`ollama_base_url`
`EMBEDDING_MODEL`	`nomic-embed-text:latest`	Ollama model name	`embedding_model`
`VECTOR_DIMENSION`	`768`	Embedding vector dimension	`vector_dimension`
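
Because the configured dimension has to match what the embedding model returns, a quick check before storing a vector can catch mismatches early. The helper below is a hypothetical sketch (`check_embedding` is not a package function) that assumes embeddings arrive as plain sequences of numbers.

```python
import math
from typing import Sequence


def check_embedding(
    vector: Sequence[float], expected_dimension: int = 768
) -> list[float]:
    """Validate an embedding's length and values before it is stored."""
    if len(vector) != expected_dimension:
        raise ValueError(
            f"expected {expected_dimension} dimensions, got {len(vector)}"
        )
    values = [float(v) for v in vector]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return values


if __name__ == "__main__":
    ok = check_embedding([0.0] * 768)
    print(len(ok))
```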

Search Settings

Setting	Default	Description	Python API
`SIMILARITY_THRESHOLD`	`0.7`	Minimum similarity score	`similarity_threshold`
`BM25_WEIGHT`	`0.3`	BM25 weight in hybrid search	`bm25_weight`
`VECTOR_WEIGHT`	`0.7`	Vector weight in hybrid search	`vector_weight`
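
One common way to combine the two weights is a weighted sum of a vector similarity and a normalized BM25 score. The source does not state the package's exact fusion formula, so the sketch below is an assumption: `hybrid_scores` and `min_max_normalize` are illustrative names, and min-max scaling is just one plausible way to bring BM25 scores into the 0 to 1 range.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Scale scores to the range [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if high == low:
        return [1.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def hybrid_scores(
    vector_scores: list[float],
    bm25_scores: list[float],
    vector_weight: float = 0.7,
    bm25_weight: float = 0.3,
) -> list[float]:
    """Weighted sum of vector similarity and normalized BM25, per candidate."""
    if len(vector_scores) != len(bm25_scores):
        raise ValueError("score lists must have the same length")
    bm25_norm = min_max_normalize(bm25_scores)
    return [
        vector_weight * v + bm25_weight * b
        for v, b in zip(vector_scores, bm25_norm)
    ]


if __name__ == "__main__":
    # Two candidates: the first is closer in vector space,
    # the second has the stronger keyword match.
    print(hybrid_scores([0.9, 0.5], [2.0, 8.0]))
```

With the default weights, a strong keyword match can outrank a candidate that is somewhat closer in vector space, which is the behaviour hybrid search is meant to provide.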

Chunking Settings

Setting	Default	Description	Python API
`CHUNK_SIZE`	`2000`	Default chunk size (bytes)	`chunk_size`
`MIN_COMBINED_CONTENT_SIZE`	`100`	Minimum combined page size (bytes)	`min_combined_content_size`
`MAX_COMBINED_CONTENT_SIZE`	`3500000`	Maximum combined page size (bytes)	`max_combined_content_size`
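
The two combined-size bounds suggest a simple eligibility check before a group of pages is chunked. The sketch below assumes the bounds are inclusive and that each page's size is already known (for example from the `content_length` column); `is_group_eligible` is a hypothetical name.

```python
from typing import Iterable


def is_group_eligible(
    page_sizes: Iterable[int],
    min_size: int = 100,
    max_size: int = 3_500_000,
) -> bool:
    """Return True if the combined page size falls within the bounds."""
    total = sum(page_sizes)
    return min_size <= total <= max_size


if __name__ == "__main__":
    print(is_group_eligible([40, 80]))         # combined size 120
    print(is_group_eligible([10, 20]))         # combined size 30
    print(is_group_eligible([2_000_000] * 2))  # combined size 4,000,000
```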

Crawling Settings

Setting	Default	Description	Python API
`CRAWL_MAX_DEPTH`	`3`	Maximum crawl depth	`crawl_max_depth`
`CRAWL_MAX_CONCURRENT`	`10`	Maximum concurrent crawl operations	`crawl_max_concurrent`
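
A concurrency cap like `CRAWL_MAX_CONCURRENT` is typically enforced with a semaphore. The sketch below uses only `asyncio` and a stand-in `fetch` coroutine; it illustrates the idea and makes no claim about how Crawl4AI or `CrawlingService` schedule requests internally. `gather_bounded` is a hypothetical helper.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


async def gather_bounded(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[T]],
    max_concurrent: int = 10,
) -> list[T]:
    """Run fetch(url) for every URL, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(url: str) -> T:
        async with semaphore:  # wait here once the cap is reached
            return await fetch(url)

    # gather preserves the input order of the results
    return list(await asyncio.gather(*(run_one(u) for u in urls)))


async def _demo_fetch(url: str) -> str:
    """Stand-in for a real page fetch."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    pages = asyncio.run(
        gather_bounded([f"https://docs.example.com/{i}" for i in range(5)],
                       _demo_fetch, max_concurrent=2)
    )
    print(pages)
```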

📚 Usage

Crawling Documentation

import asyncio

from context_bridge.service.crawling_service import CrawlingService, CrawlConfig
from crawl4ai import AsyncWebCrawler

async def main():
    # Configure crawler
    config = CrawlConfig(
        max_depth=3,          # How deep to follow links
        max_concurrent=10,    # Concurrent requests
        memory_threshold=70.0 # Memory usage threshold
    )

    service = CrawlingService(config)

    # Crawl a documentation site
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await service.crawl_webpage(
            crawler,
            "https://docs.example.com"
        )

    # Access results
    for crawl_result in result.results:
        print(f"URL: {crawl_result.url}")
        print(f"Content length: {len(crawl_result.markdown)}")

if __name__ == "__main__":
    asyncio.run(main())

Storing Documents

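import hashlib

from context_bridge.repositories.document_repository import DocumentRepository
from context_bridge.repositories.page_repository import PageRepository

# Run inside an async function. Assumes `db_manager` (the PostgreSQL manager)
# and `crawled_pages` (results of the crawling step) already exist.
async with db_manager.connection() as conn:
    doc_repo = DocumentRepository(conn)
    # Assumption: PageRepository takes the connection like DocumentRepository
    page_repo = PageRepository(conn)
    
    # Create a new document
    doc_id = await doc_repo.create(
        name="Python Documentation",
        version="3.11",
        source_url="https://docs.python.org/3/",
        description="Official Python 3.11 documentation"
    )
    
    # Store crawled pages
    for page in crawled_pages:
        await page_repo.create(
            document_id=doc_id,
            url=page.url,
            content=page.markdown,
            # Hex digest fits the TEXT content_hash column; built-in hash() returns an int
            content_hash=hashlib.sha256(page.markdown.encode("utf-8")).hexdigest()
        )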
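
A content hash is mainly useful for spotting pages that changed between crawls. The sketch below assumes that is how `content_hash` is meant to be used (the source does not say so explicitly) and that the hash is a SHA-256 hex digest; `content_hash` and `needs_update` are hypothetical helper names. Python's built-in `hash()` is not suitable here because string hashes are randomized per process by default.

```python
import hashlib
from typing import Optional


def content_hash(markdown: str) -> str:
    """Return a stable SHA-256 hex digest of the page content."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def needs_update(stored_hash: Optional[str], new_markdown: str) -> bool:
    """True if the page is new or its content differs from the stored hash."""
    return stored_hash != content_hash(new_markdown)


if __name__ == "__main__":
    old = content_hash("# Title\n\nOriginal text")
    print(needs_update(old, "# Title\n\nOriginal text"))
    print(needs_update(old, "# Title\n\nEdited text"))
    print(needs_update(None, "# Brand new page"))
```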

Organizing Pages into Groups (Phase 2)

Context Bridge v2 introduces explicit group management for better organization and future AI context generation.

from context_bridge import ContextBridge

async def organize_documentation(doc_id):
    # Use the context manager, as in Quick Start, so connections are released
    async with ContextBridge() as bridge:
        # Process specific pages as a group
        result = await bridge.process_pages(
            document_id=doc_id,
            page_ids=[1, 2, 3, 4, 5],
            chunk_size=2000,
            group_name="API Reference",
            group_description="REST API endpoints and examples"
        )
        
        print(f"Group created: {result.group_id}")
        print(f"Pages processed: {result.pages_processed}")
        print(f"Chunks created: {result.total_chunks}")
        
        # List all groups for a document
        groups = await bridge.list_groups(document_id=doc_id)
        for group in groups:
            print(f"\n{group['name']} ({group['processing_status']})")
            print(f"  Pages: {group['total_pages']}")
            print(f"  Chunks: {group['total_chunks']}")

Key Features:

✅ Name and describe page groups
✅ Track processing status (pending → processing → completed)
✅ View group statistics (pages, chunks, content size)
✅ Re-process groups with new settings
✅ Foundation for AI context generation (Phase 3)

See the Groups System Guide (./docs/guides/groups_system_guide.md) for comprehensive documentation.
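
The status progression (pending → processing → completed) can be modelled as a small state machine. This is a hypothetical sketch: `GroupStatus` and `advance` are not package names, and the transition from completed back to pending is an assumption about how re-processing a group might reset its status.

```python
from enum import Enum


class GroupStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"


# Allowed transitions between statuses.
ALLOWED_TRANSITIONS = {
    GroupStatus.PENDING: {GroupStatus.PROCESSING},
    GroupStatus.PROCESSING: {GroupStatus.COMPLETED},
    # Assumption: re-processing a finished group puts it back in the queue.
    GroupStatus.COMPLETED: {GroupStatus.PENDING},
}


def advance(current: GroupStatus, target: GroupStatus) -> GroupStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    status = GroupStatus.PENDING
    status = advance(status, GroupStatus.PROCESSING)
    status = advance(status, GroupStatus.COMPLETED)
    print(status.value)
```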

Chunking and Embedding

from context_bridge.service.chunking_service import ChunkingService
from context_bridge.service.embedding import EmbeddingService

chunking_service = ChunkingService()
embedding_service = EmbeddingService(config)

# Get groups ready for chunking
eligible_groups = await group_repo.get_eligible_groups(doc_id)

for group in eligible_groups:
    # Get combined content
    content = await group_repo.get_group_content(group.id)
    
    # Smart chunking
    chunks = chunking_service.smart_chunk_markdown(
        content,
        chunk_size=2000
    )
    
    # Generate embeddings and store
    for i, chunk_text in enumerate(chunks):
        embedding = await embedding_service.get_embedding(chunk_text)
        
        await chunk_repo.create(
            document_id=doc_id,
            group_id=group.id,
            chunk_index=i,
            content=chunk_text,
            embedding=embedding
        )

Searching Documents

Find Documents by Query

from context_bridge.repositories.document_repository import DocumentRepository

# Find relevant documents
documents = await doc_repo.find_by_query(
    query="python asyncio tutorial",
    limit=5
)

for doc in documents:
    print(f"{doc.name} (v{doc.version})")

Search Document Content (Hybrid Search)

from context_bridge.repositories.chunk_repository import ChunkRepository

# Search within a specific document
chunks = await chunk_repo.hybrid_search(
    document_id=doc_id,
    version="3.11",
    query="async await examples",
    query_embedding=await embedding_service.get_embedding("async await examples"),
    limit=10,
    vector_weight=0.7,
    bm25_weight=0.3
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content[:200]}...")

Using the Streamlit UI

Context Bridge includes a full-featured web interface for managing documentation:

# Install with UI support
pip install "context-bridge[ui]"

# Run the Streamlit application
uv run streamlit run streamlit_app/app.py

# Or use the installed script
context-bridge-ui

Features:

Document Management: View, search, and delete documents
Page Organization: Select and group crawled pages for processing
Chunk Processing: Convert page groups into searchable chunks
Hybrid Search: Search across all documentation with advanced filtering

Using the MCP Server

The Model Context Protocol server allows AI agents to interact with Context Bridge:

# Install with MCP support
pip install "context-bridge[mcp]"

# Run the MCP server
uv run python -m context_bridge_mcp

# Or use the installed script
context-bridge-mcp

Available Tools:

find_documents: Search for documents by query
search_content: Perform hybrid vector + BM25 search within specific documents
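
MCP clients invoke tools through JSON-RPC 2.0 `tools/call` requests. The sketch below builds such a request for `search_content` using only the standard library. The argument names (`query`, `document_id`) are assumptions for illustration; the real input schema is defined by the server and is not shown in this README. In practice an MCP client library sends these messages for you.

```python
import json
from typing import Any


def build_tool_call(
    request_id: int, tool_name: str, arguments: dict[str, Any]
) -> str:
    """Serialize a JSON-RPC 2.0 request that calls an MCP tool."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Hypothetical arguments; check the server's tool schema for real names.
    message = build_tool_call(
        1, "search_content", {"query": "async await examples", "document_id": 1}
    )
    print(message)
```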

Integration with AI Clients: The MCP server can be integrated with AI assistants like Claude Desktop for seamless documentation access.

For detailed usage instructions, see the MCP Server Usage Guide.

🗄️ Database Schema

Core Tables

-- Documents (versioned documentation)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    source_url TEXT,
    description TEXT,
    metadata JSONB DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(name, version)
);

-- Pages (raw crawled content)
CREATE TABLE pages (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    url TEXT NOT NULL UNIQUE,
    content TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    content_length INTEGER GENERATED ALWAYS AS (length(content)) STORED,
    crawled_at TIMESTAMPTZ DEFAULT NOW(),
    status TEXT DEFAULT 'pending' CHECK (status IN ('pending', 'processing', 'chunked', 'deleted')),
    group_id UUID, -- For future grouping feature
    metadata JSONB DEFAULT '{}'::jsonb
);

-- Chunks (embedded content)
CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    group_id UUID, -- For future grouping feature
    embedding VECTOR(768), -- Dimension must match config
    bm25_vector bm25vector, -- Auto-generated by trigger
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(document_id, group_id, chunk_index)
);

-- Indexes
CREATE INDEX idx_pages_document ON pages(document_id);
CREATE INDEX idx_pages_status ON pages(status);
CREATE INDEX idx_pages_hash ON pages(content_hash);
CREATE INDEX idx_pages_group ON pages(group_id);
CREATE INDEX idx_chunks_document ON chunks(document_id);
CREATE INDEX idx_chunks_group ON chunks(group_id);
CREATE INDEX idx_chunks_vector ON chunks USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_chunks_bm25 ON chunks USING bm25(bm25_vector bm25_ops);
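
The `lists = 100` value on the IVFFlat index is a tuning parameter that depends on table size. pgvector's documentation suggests, as a starting point, rows / 1000 for tables of up to one million rows and the square root of the row count above that. The helper below is a sketch of that rule of thumb; `suggest_ivfflat_lists` is a hypothetical name and the result should be treated as a first guess to benchmark, not a definitive setting.

```python
import math


def suggest_ivfflat_lists(row_count: int) -> int:
    """Starting value for the IVFFlat `lists` parameter for a given table size."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, math.isqrt(row_count))


if __name__ == "__main__":
    for rows in (5_000, 100_000, 4_000_000):
        print(rows, suggest_ivfflat_lists(rows))
```

Under this rule, `lists = 100` corresponds to a table of roughly 100,000 chunks.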

🛠️ Development

Project Structure

context_bridge/               # Core package
├── __init__.py
├── config.py                 # Configuration management
├── core.py                   # Main ContextBridge API
├── database/
│   ├── init_databases.py     # Database initialization
│   └── postgres_manager.py   # Connection pool manager
├── schema/
│   └── extensions.sql        # PostgreSQL extensions & schema
├── repositories/             # Data access layer
│   ├── document_repository.py
│   ├── page_repository.py
│   ├── group_repository.py
│   └── chunk_repository.py
└── service/                  # Business logic layer
    ├── crawling_service.py
    ├── chunking_service.py
    ├── embedding.py
    ├── search_service.py
    └── url_service.py

context_bridge_mcp/          # MCP Server (Model Context Protocol)
├── __init__.py
├── server.py                 # MCP server implementation
├── schemas.py                # Tool input/output schemas
└── __main__.py               # CLI entry point

streamlit_app/               # Streamlit Web UI
├── __init__.py
├── app.py                    # Main application
├── pages/                    # Multi-page navigation
│   ├── documents.py          # Document management
│   ├── crawled_pages.py      # Page management
│   └── search.py             # Search interface
├── components/               # Reusable UI components
├── utils/                    # UI utilities and helpers
└── README.md                 # UI-specific documentation

docs/                        # Documentation
├── guide/
│   └── MCP_SERVER_USAGE.md   # MCP server usage guide
├── plan/                    # Development plans
│   └── ui_and_mcp_implementation_plan.md
├── technical/               # Technical guides
│   ├── crawl4ai_complete_guide.md
│   ├── embedding_service.md
│   ├── psqlpy-complete-guide.md
│   ├── python_mcp_server_guide.md
│   ├── python-testing-guide.md
│   └── smart_chunk_markdown_algorithm.md
└── memory_templates.yaml    # Memory usage templates

tests/                       # Test suite
├── conftest.py
├── integration/
├── unit/
└── e2e/                     # End-to-end tests
    ├── conftest.py
    └── test_streamlit_ui.py


Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=context_bridge --cov-report=html

# Run specific test file
pytest tests/test_chunking_service.py -v

Code Quality

# Format code
black context_bridge tests

# Type checking
mypy context_bridge

# Linting
ruff check context_bridge

📖 Technical Documentation

Comprehensive technical guides are available in docs/:

Testing & Quality Assurance

UI Testing Report - Comprehensive Playwright testing results and bug fixes
MCP Server Usage Guide - How to use the MCP server with AI clients

Technical Guides (`docs/technical/`)

API Reference - Complete Python API documentation
Crawl4AI Guide - Complete crawling documentation
Embedding Service - Ollama and Gemini embedding setup
PSQLPy Guide - PostgreSQL driver usage
MCP Server Guide - MCP server implementation
Testing Guide - Testing best practices
Smart Chunking Algorithm - Chunking implementation

Implementation Plans (`docs/plan/`)

UI & MCP Implementation Plan - Development roadmap and progress

🤝 Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests.

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Crawl4AI - High-performance web crawler
PSQLPy - Async PostgreSQL driver
pgvector - Vector similarity search
MCP - Model Context Protocol

📧 Support

For questions, issues, or feature requests:

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: your.email@example.com

Built with ❤️ for AI agents and developers

Release history

0.2.1 (this version): Dec 1, 2025
0.1.1: Oct 23, 2025
0.1.0: Oct 23, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

context_bridge-0.2.1-py3-none-any.whl (92.8 kB)

Uploaded Dec 1, 2025 Python 3

File details

Details for the file context_bridge-0.2.1-py3-none-any.whl.

File metadata

Download URL: context_bridge-0.2.1-py3-none-any.whl
Upload date: Dec 1, 2025
Size: 92.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for context_bridge-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`44ecd1c86be611916328f95c212867c2c202da82d7d2894f0de7235be1c5f281`
MD5	`5786d7bd9fda6f135f774217759a9c7d`
BLAKE2b-256	`8f9796c5e087ef3a7f5cb435f674e7627f7aba81ff05f37fe4193c9d00ffc8f8`
