A tool for crawling, indexing, and semantically searching web content with RAG capabilities
RAG Retriever
A semantic search system that crawls websites, indexes content, and provides AI-powered search through an MCP server. Built on a modular architecture using OpenAI embeddings and a ChromaDB vector store.
🚀 Quick Start with AI Assistant
Let your AI coding assistant help you set up and use RAG Retriever:
- Setup: Direct your AI assistant to SETUP_ASSISTANT_PROMPT.md
- Usage: Direct your AI assistant to USAGE_ASSISTANT_PROMPT.md
- CLI Operations: Direct your AI assistant to CLI_ASSISTANT_PROMPT.md
- Administration: Direct your AI assistant to ADMIN_ASSISTANT_PROMPT.md
- Advanced Content: Direct your AI assistant to ADVANCED_CONTENT_INGESTION_PROMPT.md
- Troubleshooting: Direct your AI assistant to TROUBLESHOOTING_ASSISTANT_PROMPT.md
- Quick Commands: See QUICKSTART.md for copy-paste installation commands.
These prompts give your AI assistant comprehensive instructions for walking you through setup, usage, and troubleshooting, so you don't have to read through the documentation yourself.
What RAG Retriever Does
RAG Retriever enhances your AI coding workflows by providing:
- Website Crawling: Index documentation sites, blogs, and knowledge bases
- Semantic Search: Find relevant information using natural language queries
- Collection Management: Organize content into themed collections
- MCP Integration: Direct access from Claude Code and other AI assistants
- Fast Processing: Up to 20x faster crawling with the optional Crawl4AI backend
- Rich Metadata: Extract titles, descriptions, and source attribution
Key Features
🌐 Advanced Web Crawling
- Playwright: Reliable JavaScript-enabled crawling
- Crawl4AI: High-performance crawling with content filtering
- Configurable depth: Control how deep to crawl linked pages
- Same-domain focus: Automatically stays within target sites (see the crawl-loop sketch below)
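To make the crawl depth and same-domain behavior concrete, here is a minimal sketch of such a crawl loop using Playwright's sync API. It is purely illustrative (the function name and structure are invented for this example), not the project's actual crawler:

```python
# Minimal illustration of depth-limited, same-domain crawling.
from collections import deque
from urllib.parse import urlparse

from playwright.sync_api import sync_playwright

def crawl_same_domain(start_url: str, max_depth: int = 2) -> dict[str, str]:
    """Breadth-first crawl that never leaves start_url's domain."""
    domain = urlparse(start_url).netloc
    seen, pages = {start_url}, {}
    queue = deque([(start_url, 0)])
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        while queue:
            url, depth = queue.popleft()
            page.goto(url)               # JavaScript runs before we read the DOM
            pages[url] = page.content()  # rendered HTML, ready for cleaning
            if depth < max_depth:
                links = page.eval_on_selector_all(
                    "a[href]", "els => els.map(e => e.href)")
                for href in links:
                    href = href.split("#")[0]  # drop in-page anchors
                    if urlparse(href).netloc == domain and href not in seen:
                        seen.add(href)
                        queue.append((href, depth + 1))
        browser.close()
    return pages
```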
🔍 Semantic Search
- OpenAI Embeddings: Uses text-embedding-3-large for high-quality vectors
- Relevance Scoring: Configurable similarity thresholds
- Cross-Collection Search: Search across all collections simultaneously (sketched below)
- Source Attribution: Track where information comes from
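As a rough sketch of how threshold-filtered, cross-collection search can work over ChromaDB: embed the query once, query each collection, then merge by score. The storage path, collection names, and the cosine-similarity conversion are illustrative assumptions, not the project's actual implementation:

```python
# Illustrative cross-collection search with a relevance threshold.
from pathlib import Path

import chromadb
from openai import OpenAI

def search_all_collections(query: str, limit: int = 8, threshold: float = 0.3):
    oai = OpenAI()  # reads OPENAI_API_KEY from the environment
    qvec = oai.embeddings.create(
        model="text-embedding-3-large", input=[query]).data[0].embedding

    store = chromadb.PersistentClient(path=str(Path.home() / "rag-demo-store"))
    hits = []
    for name in ["python_docs", "team_wiki"]:  # hypothetical collection names
        col = store.get_or_create_collection(
            name, metadata={"hnsw:space": "cosine"})
        if col.count() == 0:
            continue
        res = col.query(query_embeddings=[qvec], n_results=limit)
        for doc, meta, dist in zip(res["documents"][0],
                                   res["metadatas"][0],
                                   res["distances"][0]):
            score = 1.0 - dist  # cosine distance -> similarity score
            if score >= threshold:
                hits.append({"score": score, "collection": name,
                             "source": (meta or {}).get("source"), "text": doc})
    # merge results across collections and keep the best matches overall
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:limit]
```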
📚 Collection Management
- Named Collections: Organize content by topic, project, or source
- Metadata Tracking: Creation dates, document counts, descriptions
- Health Monitoring: Audit collections for quality and freshness
- Easy Cleanup: Remove or rebuild collections as needed
🎯 Quality Management
- Content Quality Assessment: Systematic evaluation of indexed content
- AI-Powered Quality Review: Use AI to assess accuracy and completeness
- Contradiction Detection: Find conflicting information across collections
- Relevance Monitoring: Track search quality metrics over time
- Best Practice Guidance: Comprehensive collection organization strategies
🤖 AI Integration
- MCP Server: Direct integration with Claude Code
- Custom Commands: Pre-built workflows for common tasks
- Tool Descriptions: Clear interfaces for AI assistants
- Permission Management: Secure access controls
MCP vs CLI Capabilities
MCP Server (Claude Code Integration)
The MCP server provides secure, AI-friendly access to core functionality (a minimal tool sketch follows this list):
- Web Crawling: Index websites and documentation
- Semantic Search: Search across collections with relevance scoring
- Collection Discovery: List and explore available collections
- Quality Assessment: Audit content quality and system health
- Intentionally Limited: No administrative operations for security
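The architecture notes below mention that the server is built on the FastMCP framework. As a hedged sketch of what exposing a search tool that way can look like (the helper and its canned result are stand-ins, not the project's actual mcp/server.py):

```python
# Minimal sketch of an MCP search tool, assuming the official MCP Python SDK.
from dataclasses import dataclass

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("rag-retriever")

@dataclass
class Hit:
    score: float
    source: str
    text: str

def run_semantic_search(query, collection, limit, threshold):
    """Stand-in for the real vector-store search; returns a canned hit."""
    return [Hit(0.82, "https://docs.python.org", f"...text matching {query!r}...")]

@mcp.tool()
def search_knowledge(query: str, collection: str = "default",
                     limit: int = 8, score_threshold: float = 0.3) -> str:
    """Search indexed content using semantic similarity."""
    hits = run_semantic_search(query, collection, limit, score_threshold)
    return "\n\n".join(f"[{h.score:.2f}] {h.source}\n{h.text}" for h in hits)

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for clients like Claude Code
```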
CLI (Full Administrative Control)
The command-line interface provides complete system control:
- All MCP Capabilities: Everything available through the MCP server
- Collection Management: Delete collections, clean entire vector store
- Advanced Content Ingestion: Images, PDFs, GitHub repos, Confluence
- Local File Processing: Directory scanning, bulk operations
- System Administration: Configuration, maintenance, troubleshooting
- Rich Output Options: JSON, verbose logging, custom formatting
Web UI (Visual Management)
The Streamlit-based web interface provides intuitive visual control:
- Interactive Search: Visual search interface with adjustable parameters
- Collection Management: View, delete, edit descriptions, compare collections
- Content Discovery: Web search and direct content indexing workflow
- Visual Analytics: Statistics, charts, and collection comparisons
- User-Friendly: No command-line knowledge required
- Real-time Feedback: Immediate visual confirmation of operations
When to Use Each Interface
| Task | MCP Server | CLI | Web UI | Recommendation |
|---|---|---|---|---|
| Search content | ✅ | ✅ | ✅ | MCP for AI workflows, UI for interactive exploration |
| Index websites | ✅ | ✅ | ✅ | UI for discovery workflow, MCP for AI integration |
| Delete collections | ❌ | ✅ | ✅ | UI for visual confirmation, CLI for scripting |
| Edit collection metadata | ❌ | ❌ | ✅ | UI-only option |
| Visual analytics | ❌ | ❌ | ✅ | UI-only option |
| Content discovery | ❌ | ❌ | ✅ | UI provides search → select → index workflow |
| Process local files | ❌ | ✅ | ❌ | CLI-only option |
| Analyze images | ❌ | ✅ | ❌ | CLI-only option |
| GitHub integration | ❌ | ✅ | ❌ | CLI-only option |
| System maintenance | ❌ | ✅ | ❌ | CLI-only option |
| AI assistant integration | ✅ | ❌ | ❌ | MCP designed for AI workflows |
| Visual collection comparison | ❌ | ❌ | ✅ | UI provides interactive charts |
Available Claude Code Commands
Once configured as an MCP server, you can use:
`/rag-list-collections`
Discover all available vector store collections with document counts and metadata.
`/rag-search-knowledge "query [collection] [limit] [threshold]"`
Search indexed content using semantic similarity:
- `"python documentation"` - searches the default collection
- `"python documentation python_docs"` - searches a specific collection
- `"python documentation all"` - searches ALL collections
- `"error handling all 10 0.4"` - custom limit and threshold parameters
`/rag-index-website "url [max_depth] [collection]"`
Crawl and index website content:
- `"https://docs.python.org"` - index with defaults
- `"https://docs.python.org 3"` - custom crawl depth
- `"https://docs.python.org python_docs 2"` - custom collection and depth
`/rag-audit-collections`
Review collection health, identify issues, and get maintenance recommendations.
`/rag-assess-quality`
Systematically evaluate content quality, accuracy, and reliability to ensure high-quality search results.
`/rag-manage-collections`
Administrative collection operations including deletion and cleanup (provides CLI commands).
`/rag-ingest-content`
Guide through advanced content ingestion for local files, images, and enterprise systems.
`/rag-cli-help`
Interactive CLI command builder and comprehensive help system.
Web UI Interface
Launch the visual interface with: `rag-retriever --ui`
- Collection Management: Comprehensive collection overview with statistics, metadata, and management actions
- Collection Actions and Deletion: Edit-description and delete-collection options with visual confirmation
- Interactive Knowledge Search: Search indexed content with adjustable parameters (max results, score threshold) and explore results with metadata and expandable content
- Collection Analytics and Comparison: Side-by-side collection comparison with interactive charts showing document counts, chunks, and performance metrics
- Content Discovery and Indexing Workflow: Search the web, select relevant content, adjust crawl depth, and index directly into collections - a complete discovery-to-indexing workflow
The Web UI excels at:
- Content Discovery Workflow: Search → Select → Adjust → Index new content in one seamless interface
- Visual Collection Management: View statistics, edit descriptions, delete collections with confirmation
- Interactive Search: Real-time parameter adjustment and visual exploration of indexed content
- Collection Analytics: Compare collections with interactive charts and performance metrics
- Administrative Tasks: User-friendly collection deletion and management operations
How It Works
1. Content Ingestion: Web pages are crawled and processed into clean text
2. Embedding Generation: Text is converted to vectors using OpenAI's embedding models
3. Vector Storage: Embeddings are stored in ChromaDB with metadata
4. Semantic Search: Queries are embedded and matched against stored vectors
5. Result Ranking: Results are ranked by similarity and returned with their sources (see the sketch below)
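To make the pipeline concrete, here is a compact sketch of the five steps using the OpenAI and ChromaDB client libraries directly. The storage path, example page, and chunking helper are illustrative assumptions; the real pipeline adds content cleaning, batching, and error handling:

```python
# End-to-end sketch: chunk -> embed -> store -> search -> rank.
import chromadb
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk(text: str, size: int = 2000, overlap: int = 400) -> list[str]:
    """Step 1: split cleaned page text into overlapping chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts: list[str]) -> list[list[float]]:
    """Step 2: convert text to vectors with OpenAI embeddings."""
    resp = openai_client.embeddings.create(
        model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]

# Step 3: store vectors plus metadata in ChromaDB.
store = chromadb.PersistentClient(path="./rag-demo-store")  # illustrative path
col = store.get_or_create_collection("docs", metadata={"hnsw:space": "cosine"})

page_url = "https://example.com/docs"  # stand-in for a crawled page
chunks = chunk("...cleaned page text...")
col.add(
    ids=[f"{page_url}#{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embed(chunks),
    metadatas=[{"source": page_url}] * len(chunks),
)

# Steps 4-5: embed the query, match against stored vectors, rank by similarity.
res = col.query(query_embeddings=embed(["how do I configure logging?"]), n_results=5)
for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
    print(f"{1 - dist:.2f}  {meta['source']}  {doc[:60]}")
```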
Architecture
Layered Content Ingestion Architecture
```mermaid
flowchart TD
subgraph CS ["CONTENT SOURCES"]
subgraph WC ["Web Content"]
WC1["Playwright"]
WC2["Crawl4AI"]
WC3["Web Search"]
WC4["Discovery UI"]
end
subgraph LF ["Local Files"]
LF1["PDF Files"]
LF2["Markdown"]
LF3["Text Files"]
LF4["Directories"]
end
subgraph RM ["Rich Media"]
RM1["Images"]
RM2["Screenshots"]
RM3["Diagrams"]
RM4["OpenAI Vision"]
end
subgraph ES ["Enterprise Systems"]
ES1["GitHub Repos"]
ES2["Confluence Spaces"]
ES3["Private Repos"]
ES4["Branch Selection"]
end
end
subgraph PP ["PROCESSING PIPELINE"]
subgraph CC ["Content Cleaning"]
CC1["HTML Parsing"]
CC2["Text Extract"]
CC3["Format Normal"]
end
subgraph TC ["Text Chunking"]
TC1["Smart Splits"]
TC2["Overlap Mgmt"]
TC3["Size Control"]
end
subgraph EB ["Embedding"]
EB1["OpenAI API"]
EB2["Vector Gen"]
EB3["Batch Process"]
end
subgraph QA ["Quality Assessment"]
QA1["Relevance Scoring"]
QA2["Search Quality"]
QA3["Collection Auditing"]
end
end
subgraph SSE ["STORAGE & SEARCH ENGINE"]
subgraph CD ["ChromaDB"]
CD1["Vector Store"]
CD2["Persistence"]
CD3["Performance"]
end
subgraph COL ["Collections"]
COL1["Topic-based"]
COL2["Named Groups"]
COL3["Metadata"]
end
subgraph SS ["Semantic Search"]
SS1["Similarity"]
SS2["Thresholds"]
SS3["Cross-search"]
end
subgraph MS ["Metadata Store"]
MS1["Source Attribution"]
MS2["Timestamps"]
MS3["Descriptions"]
end
end
subgraph UI ["USER INTERFACES"]
subgraph WUI ["Web UI"]
WUI1["Discovery"]
WUI2["Visual Mgmt"]
WUI3["Interactive"]
end
subgraph CLI ["CLI"]
CLI1["Full Admin"]
CLI2["All Features"]
CLI3["Maintenance"]
end
subgraph MCP ["MCP Server"]
MCP1["Tool Provider"]
MCP2["Secure Ops"]
MCP3["FastMCP"]
end
subgraph AI ["AI Assistant Integ"]
AI1["Claude Code Cmds"]
AI2["AI Workflows"]
AI3["Assistant Commands"]
end
end
CS --> PP
PP --> SSE
SSE --> UI
```
Technical Component Architecture
```mermaid
graph TB
subgraph RAG ["RAG RETRIEVER SYSTEM"]
subgraph INTERFACES ["USER INTERFACES"]
WEB["Streamlit Web UI<br/>(ui/app.py)<br/>• Discovery<br/>• Collections<br/>• Search"]
CLI_MOD["CLI Module<br/>(cli.py)<br/>• Full Control<br/>• Admin Ops<br/>• All Features<br/>• Maintenance"]
MCP_SRV["MCP Server<br/>(mcp/server.py)<br/>• FastMCP Framework<br/>• Tool Definitions<br/>• AI Integration<br/>• Claude Code Support"]
end
subgraph CORE ["CORE ENGINE"]
PROC["Content Processing<br/>(main.py)<br/>• URL Processing<br/>• Search Coordination<br/>• Orchestration"]
LOADERS["Document Loaders<br/>• LocalLoader<br/>• ImageLoader<br/>• GitHubLoader<br/>• ConfluenceLoader"]
SEARCH["Search Engine<br/>(searcher.py)<br/>• Semantic Search<br/>• Cross-collection<br/>• Score Ranking"]
end
subgraph DATA ["DATA LAYER"]
VECTOR["Vector Store<br/>(store.py)<br/>• ChromaDB<br/>• Collections<br/>• Metadata<br/>• Persistence"]
CRAWLERS["Web Crawlers<br/>(crawling/)<br/>• Playwright<br/>• Crawl4AI<br/>• ContentClean<br/>• URL Handling"]
CONFIG["Config System<br/>(config.py)<br/>• YAML Config<br/>• User Settings<br/>• API Keys<br/>• Validation"]
end
subgraph EXTERNAL ["EXTERNAL APIS"]
OPENAI["OpenAI API<br/>• Embeddings<br/>• Vision Model<br/>• Batch Process"]
SEARCH_API["Search APIs<br/>• Google Search<br/>• DuckDuckGo<br/>• Web Discovery"]
EXT_SYS["External Systems<br/>• GitHub API<br/>• Confluence<br/>• Git Repos"]
end
end
WEB --> PROC
CLI_MOD --> PROC
MCP_SRV --> PROC
PROC <--> LOADERS
PROC <--> SEARCH
LOADERS <--> SEARCH
CORE --> VECTOR
CORE --> CRAWLERS
CORE --> CONFIG
DATA --> OPENAI
DATA --> SEARCH_API
DATA --> EXT_SYS
```
Use Cases
Documentation Management
- Index official documentation sites
- Search for APIs, functions, and usage examples
- Maintain up-to-date development references
Knowledge Bases
- Index company wikis and internal documentation
- Search for policies, procedures, and best practices
- Centralize organizational knowledge
Research and Learning
- Index technical blogs and tutorials
- Search for specific topics and technologies
- Build personal knowledge repositories
Project Documentation
- Index project-specific documentation
- Search for implementation patterns
- Maintain project knowledge bases
Configuration
RAG Retriever is highly configurable through `config.yaml`:
```yaml
# Crawler selection
crawler:
  type: "crawl4ai"  # or "playwright"

# Search settings
search:
  default_limit: 8
  default_score_threshold: 0.3

# Content processing
content:
  chunk_size: 2000
  chunk_overlap: 400

# API configuration
api:
  openai_api_key: sk-your-key-here
```
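Assuming the file is standard YAML, these settings can be read with PyYAML as below; the project's own loader (config.py) may resolve defaults and user paths differently:

```python
# Sketch of reading the settings above; names mirror the sample config.yaml.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

crawler_type = cfg["crawler"]["type"]                 # "crawl4ai" or "playwright"
limit = cfg["search"]["default_limit"]                # 8 results by default
threshold = cfg["search"]["default_score_threshold"]  # drop matches scoring < 0.3
chunk_size = cfg["content"]["chunk_size"]             # characters per chunk
chunk_overlap = cfg["content"]["chunk_overlap"]       # characters shared between chunks
```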
Requirements
- Python 3.10+
- OpenAI API key
- Git (required for some features, such as GitHub repository ingestion)
- ~500MB disk space for dependencies
Installation
See QUICKSTART.md for exact installation commands, or use the AI assistant prompts for guided setup.
Data Storage
Your content is stored locally in:
- macOS/Linux: `~/.local/share/rag-retriever/`
- Windows: `%LOCALAPPDATA%\rag-retriever\`
Collections persist between sessions and are automatically managed.
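If you need to locate the data directory programmatically, here is a small helper that simply mirrors the documented locations (purely illustrative, not part of the project's API):

```python
# Resolve the documented per-OS data directory.
import os
import platform
from pathlib import Path

def rag_retriever_data_dir() -> Path:
    if platform.system() == "Windows":
        return Path(os.environ["LOCALAPPDATA"]) / "rag-retriever"
    return Path.home() / ".local" / "share" / "rag-retriever"  # macOS/Linux

print(rag_retriever_data_dir())
```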
Performance
- Crawl4AI: Up to 20x faster than traditional crawling
- Embedding Caching: Efficient vector storage and retrieval
- Parallel Processing: Concurrent indexing and search (see the batching sketch below)
- Optimized Chunking: Configurable content processing
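As one hedged illustration of how batched, concurrent embedding can speed up indexing (the project's actual batching strategy may differ):

```python
# Batch chunks and embed several batches concurrently.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batch(batch: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-large", input=batch)
    return [d.embedding for d in resp.data]

def embed_all(chunks: list[str], batch_size: int = 64) -> list[list[float]]:
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(embed_batch, batches)  # preserves input order
    return [vec for batch in results for vec in batch]
```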
Security
- Local Storage: All indexed content and vectors are stored locally (text is sent to the OpenAI API only to generate embeddings)
- API Key Protection: Secure configuration management
- Permission Controls: MCP server permission management
- Source Tracking: Complete audit trail of indexed content
Contributing
RAG Retriever is open source and welcomes contributions. See the repository for guidelines.
License
MIT License - see LICENSE file for details.
Support
- Documentation: Use the AI assistant prompts for guidance
- Issues: Report bugs and request features via GitHub issues
- Community: Join discussions and share usage patterns
Remember: Use the AI assistant prompts above rather than reading through documentation. Your AI assistant can guide you through setup, usage, and troubleshooting much more effectively than traditional documentation!
Download files
Source Distribution
File details
Details for the file rag_retriever-0.4.1.tar.gz:
- Size: 82.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.4

File hashes:
| Algorithm | Hash digest |
|---|---|
| SHA256 | 32c72d9af7f26eb82654264045559d7e702f98ad7e581e97c7acb29bee7464b2 |
| MD5 | 976dcad7fdf3b0127a8a023d85f57eef |
| BLAKE2b-256 | 3d2d42821402c6dc7b7540b38ffd3bcc0786f0c2748e758b704fd0cee537279f |
Built Distribution
File details
Details for the file rag_retriever-0.4.1-py3-none-any.whl:
- Size: 86.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.4

File hashes:
| Algorithm | Hash digest |
|---|---|
| SHA256 | a7a346504a95e49155b855f2898e164731412d49f27074ed0a96527f03cdd9ad |
| MD5 | 10f60781424f3a35032ce929a1b88fcc |
| BLAKE2b-256 | 69a103753f6d8df84f4ab37369d48e7f1c10bb1db3b153af243aaaa752a33e69 |