A tool for crawling, indexing, and semantically searching web content
RAG Retriever
A Python application that loads and processes web pages and local documents, indexes their content using embeddings, and enables semantic search queries. Built with a modular architecture using OpenAI embeddings and a Chroma vector store.
Prerequisites
- Python 3.10-3.12 (Download from python.org)
- pipx (Install with one of these commands):
# On macOS
brew install pipx
# On Windows/Linux
python -m pip install --user pipx
System Requirements
The application uses Playwright with Chromium for web crawling:
- Chromium browser is automatically installed during package installation
- Sufficient disk space for Chromium (~200MB)
- Internet connection for initial setup and crawling
Note: The application automatically downloads and manages its Chromium installation.
Installation
Install RAG Retriever as a standalone application:
pipx install rag-retriever
This will:
- Create an isolated environment for the application
- Install all required dependencies
- Install Chromium browser automatically
- Make the rag-retriever command available in your PATH
After installation, initialize the configuration:
# Initialize configuration files
rag-retriever --init
This creates:
- A configuration file at ~/.config/rag-retriever/config.yaml (Unix/Mac) or %APPDATA%\rag-retriever\config.yaml (Windows)
- A .env file in the same directory for your OpenAI API key
Setting up your API Key
Add your OpenAI API key to the .env file:
OPENAI_API_KEY=your-api-key-here
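The `.env` file is a plain list of `KEY=value` lines. As a rough illustration of that format, here is a minimal sketch of a reader for it. The function name `parse_env_text` is hypothetical and is not part of RAG Retriever; the project lists python-dotenv as a dependency, which presumably does this job in practice.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


env = parse_env_text("# OpenAI credentials\nOPENAI_API_KEY=your-api-key-here\n")
print("OPENAI_API_KEY" in env)
```

The sketch only covers the simple case shown above; it does not handle multi-line values or variable expansion.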
Customizing Configuration
All settings are in config.yaml. Key configuration sections include:
# Vector store settings
vector_store:
  embedding_model: "text-embedding-3-large"
  embedding_dimensions: 3072
  chunk_size: 1000
  chunk_overlap: 200
# Local document processing
document_processing:
  supported_extensions:
    - ".md"
    - ".txt"
    - ".pdf"
  pdf_settings:
    max_file_size_mb: 50
    extract_images: false
    ocr_enabled: false
    languages: ["eng"]
    strategy: "fast"
    mode: "elements"
# Search settings
search:
  default_limit: 8
  default_score_threshold: 0.3
Data Storage
The vector store database is stored at:
- Unix/Mac: ~/.local/share/rag-retriever/chromadb/
- Windows: %LOCALAPPDATA%\rag-retriever\chromadb\
This location is automatically managed by the application and should not be modified directly.
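To locate the database from a script (for example, to check its size before a backup), the two paths above can be resolved as in the following sketch. The helper name `chroma_dir` and its parameters are hypothetical, and the logic simply mirrors the paths listed in this section rather than the application's own code.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def chroma_dir(
    platform: str = sys.platform,
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Path:
    """Return the documented vector store location for the given platform."""
    if platform.startswith("win"):
        # Windows: %LOCALAPPDATA%\rag-retriever\chromadb
        base = Path(env["LOCALAPPDATA"])
    else:
        # Unix/Mac: ~/.local/share/rag-retriever/chromadb
        base = (home or Path.home()) / ".local" / "share"
    return base / "rag-retriever" / "chromadb"


print(chroma_dir("linux", {}, Path("/home/user")))
```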
Uninstallation
To completely remove RAG Retriever:
# Remove the application and its isolated environment
pipx uninstall rag-retriever
# Remove Playwright browsers
python -m playwright uninstall chromium
# Optional: Remove configuration and data files
# Unix/Mac:
rm -rf ~/.config/rag-retriever ~/.local/share/rag-retriever
# Windows (run in PowerShell):
Remove-Item -Recurse -Force "$env:APPDATA\rag-retriever"
Remove-Item -Recurse -Force "$env:LOCALAPPDATA\rag-retriever"
Development Setup
If you want to contribute to RAG Retriever or modify the code:
# Clone the repository
git clone https://github.com/codingthefuturewithai/rag-retriever.git
cd rag-retriever
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Unix/Mac
venv\Scripts\activate # Windows
# Install in editable mode
pip install -e .
# Initialize user configuration
./scripts/run-rag.sh --init # Unix/Mac
scripts\run-rag.bat --init # Windows
Usage Examples
Local Document Processing
# Process a single file
rag-retriever --ingest-file path/to/document.pdf
# Process all supported files in a directory
rag-retriever --ingest-directory path/to/docs/
# Enable OCR for scanned documents (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.ocr_enabled: true
rag-retriever --ingest-file scanned-document.pdf
# Enable image extraction from PDFs (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.extract_images: true
rag-retriever --ingest-file document-with-images.pdf
Web Content Fetching
# Basic fetch
rag-retriever --fetch https://example.com
# With depth control
rag-retriever --fetch https://example.com --max-depth 2
# Minimal output mode
rag-retriever --fetch https://example.com --verbose false
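The effect of `--max-depth` can be pictured as a breadth-first walk that stops following links once a page sits at the depth limit. The sketch below runs over a hypothetical in-memory link graph instead of real pages. It assumes the start URL counts as depth 0, which this README does not state, and `crawl_order` is an illustrative name rather than a function from the package.

```python
from collections import deque


def crawl_order(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Return the URLs visited by a breadth-first crawl limited to max_depth."""
    seen = {start}
    order: list[str] = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links found on pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical site structure used only for this example
site = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/deep"],
    "https://example.com/a/deep": ["https://example.com/a/deep/er"],
}
print(crawl_order("https://example.com", site, max_depth=2))
```

With `max_depth=2` the walk reaches `https://example.com/a/deep` but not the page linked from it.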
Searching Content
# Basic search
rag-retriever --query "How do I get started?"
# With truncated content
rag-retriever --query "How do I get started?" --truncate
# With custom result limit
rag-retriever --query "deployment options" --limit 5
# With minimum relevance score
rag-retriever --query "advanced configuration" --score-threshold 0.5
# JSON output format
rag-retriever --query "API reference" --json
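When calling the search from another program, it helps to assemble the command line as a list of arguments. The following sketch uses only the flags shown above; the helper name `build_query_command` is hypothetical, and combining several flags in one call is an assumption, since the examples above show each flag on its own.

```python
import shlex
from typing import Optional


def build_query_command(
    query: str,
    limit: Optional[int] = None,
    score_threshold: Optional[float] = None,
    truncate: bool = False,
    as_json: bool = False,
) -> list[str]:
    """Build an argument list for a rag-retriever search."""
    cmd = ["rag-retriever", "--query", query]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if score_threshold is not None:
        cmd += ["--score-threshold", str(score_threshold)]
    if truncate:
        cmd.append("--truncate")
    if as_json:
        cmd.append("--json")
    return cmd


cmd = build_query_command("deployment options", limit=5, as_json=True)
print(shlex.join(cmd))  # shell-safe rendering of the command
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to execute the search. The structure of the `--json` output is not documented here, so inspect it before parsing.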
Configuration Options
The configuration file (config.yaml) is organized into several sections:
Vector Store Settings
vector_store:
  persist_directory: null # Set automatically to OS-specific path
  embedding_model: "text-embedding-3-large"
  embedding_dimensions: 3072
  chunk_size: 1000 # Size of text chunks for indexing
  chunk_overlap: 200 # Overlap between chunks
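To see how `chunk_size` and `chunk_overlap` interact, consider a plain sliding window: each chunk starts `chunk_size - chunk_overlap` units after the previous one. This is a simplified sketch that measures in characters, which is an assumption; the README does not say whether the settings count characters or tokens, and `sliding_chunks` is an illustrative name.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 2600 characters with the default settings give windows starting at 0, 800, and 1600
print([len(chunk) for chunk in sliding_chunks("x" * 2600)])
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more text.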
Document Processing Settings
document_processing:
  # Supported file extensions
  supported_extensions:
    - ".md"
    - ".txt"
    - ".pdf"
  # Patterns to exclude from processing
  excluded_patterns:
    - ".*"
    - "node_modules/**"
    - "__pycache__/**"
    - "*.pyc"
    - ".git/**"
  # Fallback encodings for text files
  encoding_fallbacks:
    - "utf-8"
    - "latin-1"
    - "cp1252"
  # PDF processing settings
  pdf_settings:
    max_file_size_mb: 50
    extract_images: false
    ocr_enabled: false
    languages: ["eng"]
    password: null
    strategy: "fast" # Options: fast, accurate
    mode: "elements" # Options: single_page, paged, elements
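The `encoding_fallbacks` list suggests that text files are decoded by trying each encoding in turn. A minimal sketch of that idea follows; `decode_with_fallbacks` is a hypothetical helper, and the try-in-order behavior is an assumption about how the setting is used.

```python
ENCODING_FALLBACKS = ["utf-8", "latin-1", "cp1252"]


def decode_with_fallbacks(data: bytes, encodings=ENCODING_FALLBACKS) -> tuple[str, str]:
    """Return (text, encoding) using the first encoding that decodes the bytes."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next configured encoding
    raise ValueError("none of the configured encodings could decode the data")


# b"caf\xe9" is not valid UTF-8, so the second entry is used
text, used = decode_with_fallbacks(b"caf\xe9")
print(text, used)
```

Note that latin-1 maps every possible byte to a character, so in this order a later entry such as cp1252 would never be reached by a loop like this one.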
Content Processing Settings
content:
  chunk_size: 2000
  chunk_overlap: 400
  # Text splitting separators (in order of preference)
  separators:
    - "\n## " # h2 headers (strongest break)
    - "\n### " # h3 headers
    - "\n#### " # h4 headers
    - "\n- " # bullet points
    - "\n• " # alternative bullet points
    - "\n\n" # paragraphs
    - ". " # sentences (weakest break)
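The ordered separator list implies a splitter that prefers strong breaks (headers) and falls back to weaker ones (sentences) only when a piece is still too large. The sketch below illustrates that preference order. It is not the application's splitter: `split_text` is a hypothetical function, sizes are measured in characters, and overlap is left out to keep the example short.

```python
SEPARATORS = ["\n## ", "\n### ", "\n#### ", "\n- ", "\n• ", "\n\n", ". "]


def split_text(text: str, chunk_size: int, separators=SEPARATORS) -> list[str]:
    """Split text at the strongest available separator, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        parts = text.split(sep)
        # Keep each separator attached to the start of the piece it introduces
        pieces = [parts[0]] + [sep + part for part in parts[1:]]
        chunks: list[str] = []
        current = ""
        for piece in pieces:
            if len(current) + len(piece) <= chunk_size:
                current += piece  # merge small neighbours into one chunk
                continue
            if current.strip():
                chunks.append(current)
            current = ""
            if len(piece) <= chunk_size:
                current = piece
            else:
                # Piece is still too large: retry with the weaker separators
                chunks.extend(split_text(piece, chunk_size, separators[i + 1:]))
        if current.strip():
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard cut
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


sample = "Intro\n## A\nalpha. beta\n## B\ngamma"
for chunk in split_text(sample, chunk_size=20):
    print(repr(chunk))
```

In the sample, the text breaks at the two h2 headers, so each section stays in its own chunk.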
Search Settings
search:
  default_limit: 8 # Default number of results
  default_score_threshold: 0.3 # Minimum relevance score
Browser Settings (Web Crawling)
browser:
  wait_time: 2 # Base wait time in seconds
  viewport:
    width: 1920
    height: 1080
  delays:
    before_request: [1, 3] # Min and max seconds
    after_load: [2, 4]
    after_dynamic: [1, 2]
  launch_options:
    headless: true
    channel: "chrome"
  context_options:
    bypass_csp: true
    java_script_enabled: true
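Each entry under `delays` is a `[min, max]` pair in seconds. One plausible reading is that the crawler waits a random duration within that range at each stage. The sketch below assumes uniform sampling, which the README does not confirm, and `pick_delay` and `polite_pause` are hypothetical names.

```python
import random
import time

# Values copied from the browser.delays section above
DELAYS = {
    "before_request": (1, 3),
    "after_load": (2, 4),
    "after_dynamic": (1, 2),
}


def pick_delay(name: str, delays=DELAYS, rng=random) -> float:
    """Pick a random wait time in seconds within the configured [min, max] range."""
    low, high = delays[name]
    return rng.uniform(low, high)


def polite_pause(name: str) -> None:
    """Sleep for a randomly chosen duration for the given crawl stage."""
    time.sleep(pick_delay(name))


print(round(pick_delay("before_request"), 2))
```

Randomized waits of this kind spread requests out so that a crawl looks less like a burst of automated traffic.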
Understanding Search Results
Search results include relevance scores based on cosine similarity:
- Scores range from 0 to 1, where 1 indicates perfect similarity
- Default threshold is 0.3 (configurable via search.default_score_threshold)
- Typical interpretation:
  - 0.7+: Very high relevance (nearly exact matches)
  - 0.6 - 0.7: High relevance
  - 0.5 - 0.6: Good relevance
  - 0.3 - 0.5: Moderate relevance
  - Below 0.3: Lower relevance
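For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes it in plain Python and maps a score onto the bands listed above. Both function names are hypothetical, and assigning each boundary value (for example 0.6) to the higher band is an assumption, since the listed ranges share their endpoints. How the application derives its score from the vector store is not described here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)


def relevance_label(score: float) -> str:
    """Map a score onto the interpretation bands from this section."""
    if score >= 0.7:
        return "very high"
    if score >= 0.6:
        return "high"
    if score >= 0.5:
        return "good"
    if score >= 0.3:
        return "moderate"
    return "lower"


score = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.5])
print(relevance_label(score))
```

In general cosine similarity can be negative for vectors that point in opposite directions; the 0 to 1 range above describes the scores this tool reports.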
Features
- Local Document Loading: Load markdown, text, and PDF files from local directories.
  - Single File and Directory Loading: Easily load individual files or entire directories with support for multithreading and progress indication.
  - PDF Processing: Extract text and images from PDFs using multiple loaders, with optional OCR for scanned documents.
  - Configurable Settings: Customize supported file types, PDF processing options, and more through configuration files.
- Error Handling: Robust error handling for unsupported file types and missing files, with detailed logging for troubleshooting.
- Configuration: Flexible configuration options for document processing, including supported extensions and PDF settings.
For more detailed usage instructions and examples, please refer to the local-document-loading.md documentation.
Project Structure
rag-retriever/
├── rag_retriever/ # Main package directory
│ ├── config/ # Configuration settings
│ ├── crawling/ # Web crawling functionality
│ ├── vectorstore/ # Vector storage operations
│ ├── search/ # Search functionality
│ └── utils/ # Utility functions
Dependencies
Key dependencies include:
- openai: For embeddings generation (text-embedding-3-large model)
- chromadb: Vector store implementation with cosine similarity
- playwright: JavaScript content rendering (Chromium)
- beautifulsoup4: HTML parsing
- python-dotenv: Environment management
Notes
- Uses OpenAI's text-embedding-3-large model for generating embeddings by default
- Content is automatically cleaned and structured during indexing
- Implements URL depth-based crawling control
- Vector store persists between runs unless explicitly deleted
- Uses cosine similarity for more intuitive relevance scoring
- Minimal output by default with --verbose flag for troubleshooting
- Full content display by default with --truncate option for brevity
Known Current Limitations
The following limitations are currently being tracked, with possible future enhancements under consideration:
- Does not check for existing URLs or content in the vector store during fetch operations
  - Possible enhancement: Detect and skip already indexed content by default
  - Possible enhancement: Add --re-fetch option to update existing content
  - Possible enhancement: Provide status information about existing content age
- Limited document management capabilities
  - Possible enhancement: Support for deleting specific documents from the vector store
  - Possible enhancement: Support for bulk deletion of documents by base URL
  - Possible enhancement: Document listing and filtering tools
- No direct access to vector store data for analysis
  - Possible enhancement: Tools to examine and analyze stored embeddings and metadata
  - Possible enhancement: Support for export/import of vector store data for backup or transfer
- Command-line interface only
  - Possible enhancement: Web UI for easier interaction with all features
  - Possible enhancement: Real-time progress monitoring and result visualization
Contributing
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.