A tool for crawling, indexing, and semantically searching web content with RAG capabilities

RAG Retriever

A semantic search system that crawls websites, indexes content, and provides AI-powered search through an MCP server. Built on a modular architecture using OpenAI embeddings and a ChromaDB vector store.

🚀 Quick Start with AI Assistant

Let your AI coding assistant help you set up and use RAG Retriever:

Setup: Direct your AI assistant to SETUP_ASSISTANT_PROMPT.md
Usage: Direct your AI assistant to USAGE_ASSISTANT_PROMPT.md
CLI Operations: Direct your AI assistant to CLI_ASSISTANT_PROMPT.md
Administration: Direct your AI assistant to ADMIN_ASSISTANT_PROMPT.md
Advanced Content: Direct your AI assistant to ADVANCED_CONTENT_INGESTION_PROMPT.md
Troubleshooting: Direct your AI assistant to TROUBLESHOOTING_ASSISTANT_PROMPT.md

Quick Commands: See QUICKSTART.md for copy-paste installation commands.

These prompts give your AI assistant comprehensive instructions for walking you through setup, usage, and troubleshooting, so you don't have to read through the documentation yourself.

What RAG Retriever Does

RAG Retriever enhances your AI coding workflows by providing:

  • Website Crawling: Index documentation sites, blogs, and knowledge bases
  • Semantic Search: Find relevant information using natural language queries
  • Collection Management: Organize content into themed collections
  • MCP Integration: Direct access from Claude Code and other AI assistants
  • Fast Processing: Up to 20x faster crawling with the Crawl4AI option
  • Rich Metadata: Extract titles, descriptions, and source attribution

Key Features

🌐 Advanced Web Crawling

  • Playwright: Reliable JavaScript-enabled crawling
  • Crawl4AI: High-performance crawling with content filtering
  • Configurable depth: Control how deep to crawl linked pages
  • Same-domain focus: Automatically stays within target sites

🔍 Semantic Search

  • OpenAI Embeddings: Uses text-embedding-3-large for high-quality vectors
  • Relevance Scoring: Configurable similarity thresholds
  • Cross-Collection Search: Search across all collections simultaneously (see the sketch after this list)
  • Source Attribution: Track where information comes from
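
A minimal sketch of how cross-collection search can work, assuming ChromaDB as the vector store; the data path, collection names, and cosine-distance scoring here are illustrative assumptions, not the project's exact internals:

import chromadb

client = chromadb.PersistentClient(path="./rag_data")  # illustrative path

def search_collections(query_embedding, names, limit=8, threshold=0.3):
    # Query each named collection, convert distances to similarity scores
    # (assuming a cosine distance space), filter by threshold, and merge
    # everything into one ranked result list with source attribution.
    hits = []
    for name in names:  # e.g. ["python_docs", "default"] -- hypothetical names
        result = client.get_collection(name).query(
            query_embeddings=[query_embedding], n_results=limit
        )
        for doc, meta, dist in zip(
            result["documents"][0], result["metadatas"][0], result["distances"][0]
        ):
            score = 1.0 - dist
            if score >= threshold:
                hits.append({"collection": name, "score": score,
                             "source": (meta or {}).get("source"), "text": doc})
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:limit]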

📚 Collection Management

  • Named Collections: Organize content by topic, project, or source
  • Metadata Tracking: Creation dates, document counts, descriptions
  • Health Monitoring: Audit collections for quality and freshness
  • Easy Cleanup: Remove or rebuild collections as needed

🎯 Quality Management

  • Content Quality Assessment: Systematic evaluation of indexed content
  • AI-Powered Quality Review: Use AI to assess accuracy and completeness
  • Contradiction Detection: Find conflicting information across collections
  • Relevance Monitoring: Track search quality metrics over time
  • Best Practice Guidance: Comprehensive collection organization strategies

🤖 AI Integration

  • MCP Server: Direct integration with Claude Code
  • Custom Commands: Pre-built workflows for common tasks
  • Tool Descriptions: Clear interfaces for AI assistants
  • Permission Management: Secure access controls

MCP vs CLI Capabilities

MCP Server (Claude Code Integration)

The MCP server provides secure, AI-friendly access to core functionality:

  • Web Crawling: Index websites and documentation
  • Semantic Search: Search across collections with relevance scoring
  • Collection Discovery: List and explore available collections
  • Quality Assessment: Audit content quality and system health
  • Intentionally Limited: Administrative operations are deliberately excluded for security

CLI (Full Administrative Control)

The command-line interface provides complete system control:

  • All MCP Capabilities: Everything available through MCP server
  • Collection Management: Delete collections, clean entire vector store
  • Advanced Content Ingestion: Images, PDFs, GitHub repos, Confluence
  • Local File Processing: Directory scanning, bulk operations
  • System Administration: Configuration, maintenance, troubleshooting
  • Rich Output Options: JSON, verbose logging, custom formatting

Web UI (Visual Management)

The Streamlit-based web interface provides intuitive visual control:

  • Interactive Search: Visual search interface with adjustable parameters
  • Collection Management: View, delete, edit descriptions, compare collections
  • Content Discovery: Web search and direct content indexing workflow
  • Visual Analytics: Statistics, charts, and collection comparisons
  • User-Friendly: No command-line knowledge required
  • Real-time Feedback: Immediate visual confirmation of operations

When to Use Each Interface

Task                           MCP Server  CLI  Web UI  Recommendation
Search content                 ✓           ✓    ✓       MCP for AI workflows, UI for interactive exploration
Index websites                 ✓           ✓    ✓       UI for discovery workflow, MCP for AI integration
Delete collections             -           ✓    ✓       UI for visual confirmation, CLI for scripting
Edit collection metadata       -           -    ✓       UI only option
Visual analytics               -           -    ✓       UI only option
Content discovery              -           -    ✓       UI provides search → select → index workflow
Process local files            -           ✓    -       CLI only option
Analyze images                 -           ✓    -       CLI only option
GitHub integration             -           ✓    -       CLI only option
System maintenance             -           ✓    -       CLI only option
AI assistant integration       ✓           -    -       MCP designed for AI workflows
Visual collection comparison   -           -    ✓       UI provides interactive charts

Available Claude Code Commands

Once configured as an MCP server, you can use:

/rag-list-collections

Discover all available vector store collections with document counts and metadata.

/rag-search-knowledge "query [collection] [limit] [threshold]"

Search indexed content using semantic similarity:

  • "python documentation" - searches default collection
  • "python documentation python_docs" - searches specific collection
  • "python documentation all" - searches ALL collections
  • "error handling all 10 0.4" - custom parameters

/rag-index-website "url [max_depth] [collection]"

Crawl and index website content:

  • "https://docs.python.org" - index with defaults
  • "https://docs.python.org 3" - custom crawl depth
  • "https://docs.python.org python_docs 2" - custom collection

/rag-audit-collections

Review collection health, identify issues, and get maintenance recommendations.

/rag-assess-quality

Systematically evaluate content quality, accuracy, and reliability to ensure high-quality search results.

/rag-manage-collections

Administrative collection operations including deletion and cleanup (provides CLI commands).

/rag-ingest-content

Guide through advanced content ingestion for local files, images, and enterprise systems.

/rag-cli-help

Interactive CLI command builder and comprehensive help system.

Web UI

Launch the visual interface with: rag-retriever --ui

Collection Management

Screenshot: comprehensive collection overview with statistics, metadata, and management actions.

Collection Actions and Deletion

Screenshot: collection management interface showing edit-description and delete-collection options with visual confirmation.

Interactive Knowledge Search

Screenshot: searching indexed content with adjustable parameters (max results, score threshold) and exploring results with metadata and expandable content.

Collection Analytics and Comparison

Screenshot: side-by-side collection comparison with interactive charts showing document counts, chunks, and performance metrics.

Content Discovery and Indexing Workflow

Screenshot: searching the web, selecting relevant content, adjusting crawl depth, and indexing directly into collections, a complete discovery-to-indexing workflow.

The Web UI excels at:

  • Content Discovery Workflow: Search → Select → Adjust → Index new content in one seamless interface
  • Visual Collection Management: View statistics, edit descriptions, delete collections with confirmation
  • Interactive Search: Real-time parameter adjustment and visual exploration of indexed content
  • Collection Analytics: Compare collections with interactive charts and performance metrics
  • Administrative Tasks: User-friendly collection deletion and management operations

How It Works

  1. Content Ingestion: Web pages are crawled and processed into clean text
  2. Embedding Generation: Text is converted to vectors using OpenAI's embedding models
  3. Vector Storage: Embeddings are stored in ChromaDB with metadata
  4. Semantic Search: Queries are embedded and matched against stored vectors
  5. Result Ranking: Results are ranked by similarity and returned with sources
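
A minimal end-to-end sketch of steps 2-5, assuming the OpenAI Python SDK and ChromaDB; the path, collection name, and sample data are illustrative, and the project's internal modules are organized differently:

import chromadb
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
store = chromadb.PersistentClient(path="./rag_data")  # illustrative path
collection = store.get_or_create_collection("demo_docs")

def embed(texts):
    response = openai_client.embeddings.create(
        model="text-embedding-3-large", input=texts
    )
    return [item.embedding for item in response.data]

# Steps 2-3: embed cleaned chunks and store them with source metadata.
chunks = ["First chunk of crawled page text...", "Second chunk..."]
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embed(chunks),
    metadatas=[{"source": "https://example.com/page"}] * len(chunks),
)

# Steps 4-5: embed the query, match it against stored vectors, and print
# ranked results with their sources.
result = collection.query(query_embeddings=embed(["how do I crawl a page?"]), n_results=5)
for doc, meta in zip(result["documents"][0], result["metadatas"][0]):
    print(meta["source"], "->", doc[:60])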

Architecture

Layered Content Ingestion Architecture

flowchart TD
    subgraph CS ["CONTENT SOURCES"]
        subgraph WC ["Web Content"]
            WC1["Playwright"]
            WC2["Crawl4AI"] 
            WC3["Web Search"]
            WC4["Discovery UI"]
        end
        subgraph LF ["Local Files"]
            LF1["PDF Files"]
            LF2["Markdown"]
            LF3["Text Files"]
            LF4["Directories"]
        end
        subgraph RM ["Rich Media"]
            RM1["Images"]
            RM2["Screenshots"]
            RM3["Diagrams"]
            RM4["OpenAI Vision"]
        end
        subgraph ES ["Enterprise Systems"]
            ES1["GitHub Repos"]
            ES2["Confluence Spaces"]
            ES3["Private Repos"]
            ES4["Branch Selection"]
        end
    end
    
    subgraph PP ["PROCESSING PIPELINE"]
        subgraph CC ["Content Cleaning"]
            CC1["HTML Parsing"]
            CC2["Text Extract"]
            CC3["Format Normal"]
        end
        subgraph TC ["Text Chunking"]
            TC1["Smart Splits"]
            TC2["Overlap Mgmt"]
            TC3["Size Control"]
        end
        subgraph EB ["Embedding"]
            EB1["OpenAI API"]
            EB2["Vector Gen"]
            EB3["Batch Process"]
        end
        subgraph QA ["Quality Assessment"]
            QA1["Relevance Scoring"]
            QA2["Search Quality"]
            QA3["Collection Auditing"]
        end
    end
    
    subgraph SSE ["STORAGE & SEARCH ENGINE"]
        subgraph CD ["ChromaDB"]
            CD1["Vector Store"]
            CD2["Persistence"]
            CD3["Performance"]
        end
        subgraph COL ["Collections"]
            COL1["Topic-based"]
            COL2["Named Groups"]
            COL3["Metadata"]
        end
        subgraph SS ["Semantic Search"]
            SS1["Similarity"]
            SS2["Thresholds"]
            SS3["Cross-search"]
        end
        subgraph MS ["Metadata Store"]
            MS1["Source Attribution"]
            MS2["Timestamps"]
            MS3["Descriptions"]
        end
    end
    
    subgraph UI ["USER INTERFACES"]
        subgraph WUI ["Web UI"]
            WUI1["Discovery"]
            WUI2["Visual Mgmt"]
            WUI3["Interactive"]
        end
        subgraph CLI ["CLI"]
            CLI1["Full Admin"]
            CLI2["All Features"]
            CLI3["Maintenance"]
        end
        subgraph MCP ["MCP Server"]
            MCP1["Tool Provider"]
            MCP2["Secure Ops"]
            MCP3["FastMCP"]
        end
        subgraph AI ["AI Assistant Integ"]
            AI1["Claude Code Cmds"]
            AI2["AI Workflows"]
            AI3["Assistant Commands"]
        end
    end
    
    CS --> PP
    PP --> SSE
    SSE --> UI

Technical Component Architecture

graph TB
    subgraph RAG ["RAG RETRIEVER SYSTEM"]
        subgraph INTERFACES ["USER INTERFACES"]
            WEB["Streamlit Web UI<br/>(ui/app.py)<br/>• Discovery<br/>• Collections<br/>• Search"]
            CLI_MOD["CLI Module<br/>(cli.py)<br/>• Full Control<br/>• Admin Ops<br/>• All Features<br/>• Maintenance"]
            MCP_SRV["MCP Server<br/>(mcp/server.py)<br/>• FastMCP Framework<br/>• Tool Definitions<br/>• AI Integration<br/>• Claude Code Support"]
        end
        
        subgraph CORE ["CORE ENGINE"]
            PROC["Content Processing<br/>(main.py)<br/>• URL Processing<br/>• Search Coordination<br/>• Orchestration"]
            LOADERS["Document Loaders<br/>• LocalLoader<br/>• ImageLoader<br/>• GitHubLoader<br/>• ConfluenceLoader"]
            SEARCH["Search Engine<br/>(searcher.py)<br/>• Semantic Search<br/>• Cross-collection<br/>• Score Ranking"]
        end
        
        subgraph DATA ["DATA LAYER"]
            VECTOR["Vector Store<br/>(store.py)<br/>• ChromaDB<br/>• Collections<br/>• Metadata<br/>• Persistence"]
            CRAWLERS["Web Crawlers<br/>(crawling/)<br/>• Playwright<br/>• Crawl4AI<br/>• ContentClean<br/>• URL Handling"]
            CONFIG["Config System<br/>(config.py)<br/>• YAML Config<br/>• User Settings<br/>• API Keys<br/>• Validation"]
        end
        
        subgraph EXTERNAL ["EXTERNAL APIS"]
            OPENAI["OpenAI API<br/>• Embeddings<br/>• Vision Model<br/>• Batch Process"]
            SEARCH_API["Search APIs<br/>• Google Search<br/>• DuckDuckGo<br/>• Web Discovery"]
            EXT_SYS["External Systems<br/>• GitHub API<br/>• Confluence<br/>• Git Repos"]
        end
    end
    
    WEB --> PROC
    CLI_MOD --> PROC
    MCP_SRV --> PROC
    
    PROC <--> LOADERS
    PROC <--> SEARCH
    LOADERS <--> SEARCH
    
    CORE --> VECTOR
    CORE --> CRAWLERS
    CORE --> CONFIG
    
    DATA --> OPENAI
    DATA --> SEARCH_API
    DATA --> EXT_SYS

Use Cases

Documentation Management

  • Index official documentation sites
  • Search for APIs, functions, and usage examples
  • Maintain up-to-date development references

Knowledge Bases

  • Index company wikis and internal documentation
  • Search for policies, procedures, and best practices
  • Centralize organizational knowledge

Research and Learning

  • Index technical blogs and tutorials
  • Search for specific topics and technologies
  • Build personal knowledge repositories

Project Documentation

  • Index project-specific documentation
  • Search for implementation patterns
  • Maintain project knowledge bases

Configuration

RAG Retriever is highly configurable through config.yaml:

# Crawler selection
crawler:
  type: "crawl4ai"  # or "playwright"

# Search settings
search:
  default_limit: 8
  default_score_threshold: 0.3

# Content processing
content:
  chunk_size: 2000
  chunk_overlap: 400

# API configuration
api:
  openai_api_key: sk-your-key-here
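
The chunk_size and chunk_overlap settings control how documents are split before embedding. A minimal character-based sketch of overlapping chunking using the default values above (the project's actual splitter may be smarter about sentence and section boundaries):

def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 400):
    # Each chunk shares its first chunk_overlap characters with the tail
    # of the previous chunk, so context is not lost at boundaries.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]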

Requirements

  • Python 3.10+
  • OpenAI API key
  • Git (required for some features, such as repository ingestion)
  • ~500MB disk space for dependencies

Installation

See QUICKSTART.md for exact installation commands, or use the AI assistant prompts for guided setup.

Data Storage

Your content is stored locally in:

  • macOS/Linux: ~/.local/share/rag-retriever/
  • Windows: %LOCALAPPDATA%\rag-retriever\

Collections persist between sessions and are automatically managed.
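
Because the vector store is plain ChromaDB on disk, you can inspect it directly with a few lines of Python; the chromadb subdirectory shown here is an assumed layout, so adjust the path to what you actually find under the data directory:

import chromadb
from pathlib import Path

data_dir = Path.home() / ".local/share/rag-retriever/chromadb"  # assumed layout
client = chromadb.PersistentClient(path=str(data_dir))
for name in ["python_docs"]:  # replace with your own collection names
    print(name, client.get_collection(name).count(), "chunks")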

Performance

  • Crawl4AI: Up to 20x faster than traditional crawling
  • Embedding Caching: Efficient vector storage and retrieval
  • Parallel Processing: Concurrent indexing and search
  • Optimized Chunking: Configurable content processing
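
Much of the embedding throughput comes from batching and concurrency. A sketch of parallel batched embedding with the OpenAI SDK; the batch size and worker count are illustrative, not the project's actual settings:

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def embed_batch(batch):
    # One API call embeds a whole batch of chunks, saving round-trips.
    response = client.embeddings.create(
        model="text-embedding-3-large", input=batch
    )
    return [item.embedding for item in response.data]

def embed_parallel(chunks, batch_size=64, workers=4):
    # Split the chunk list into batches and embed them concurrently.
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [vec for result in pool.map(embed_batch, batches) for vec in result]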

Security

  • Local Storage: All indexed content and vectors stay on your machine; the only external calls are to the OpenAI API for embeddings
  • API Key Protection: Secure configuration management
  • Permission Controls: MCP server permission management
  • Source Tracking: Complete audit trail of indexed content

Contributing

RAG Retriever is open source and welcomes contributions. See the repository for guidelines.

License

MIT License - see LICENSE file for details.

Support

  • Documentation: Use the AI assistant prompts for guidance
  • Issues: Report bugs and request features via GitHub issues
  • Community: Join discussions and share usage patterns

Remember: Use the AI assistant prompts above rather than reading through documentation. Your AI assistant can guide you through setup, usage, and troubleshooting much more effectively than traditional documentation!
