RAG Retriever

A Python application that loads and processes web pages and local documents, indexes their content using embeddings, and enables semantic search queries. Built on a modular architecture using OpenAI embeddings and the Chroma vector store.

Prerequisites

  • Python 3.10-3.12 (Download from python.org)

  • pipx (Install with one of these commands):

    # On macOS
    brew install pipx
    
    # On Windows/Linux
    python -m pip install --user pipx
    python -m pipx ensurepath
    

System Requirements

The application uses Playwright with Chromium for web crawling:

  • Chromium browser is automatically installed during package installation
  • Sufficient disk space for Chromium (~200MB)
  • Internet connection for initial setup and crawling

Note: The application will automatically download and manage Chromium installation.
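
To verify the browser setup yourself, you can run a minimal sanity check against Playwright's Python API (an optional step, not part of the documented workflow):

# Optional sanity check: confirm Playwright can launch its bundled Chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    print(browser.version)  # prints the Chromium version on success
    browser.close()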

Installation

Install RAG Retriever as a standalone application:

pipx install rag-retriever

This will:

  • Create an isolated environment for the application
  • Install all required dependencies
  • Install Chromium browser automatically
  • Make the rag-retriever command available in your PATH
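
To confirm the installation worked, list pipx's installed apps and invoke the CLI. (The --help flag is an assumption based on common CLI conventions; it is not documented here.)

pipx list
rag-retriever --help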

After installation, initialize the configuration:

# Initialize configuration files
rag-retriever --init

This creates:

  • A configuration file at ~/.config/rag-retriever/config.yaml (Unix/Mac) or %APPDATA%\rag-retriever\config.yaml (Windows)
  • A .env file in the same directory for your OpenAI API key

Setting up your API Key

Add your OpenAI API key to the .env file:

OPENAI_API_KEY=your-api-key-here
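
Alternatively, because the key is read from the environment (the project uses python-dotenv), you can export it for the current shell session instead of editing the .env file:

# Unix/Mac; on Windows PowerShell use: $env:OPENAI_API_KEY = "your-api-key-here"
export OPENAI_API_KEY=your-api-key-here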

Customizing Configuration

All settings are in config.yaml. Key configuration sections include:

# Vector store settings
vector_store:
  embedding_model: "text-embedding-3-large"
  embedding_dimensions: 3072
  chunk_size: 1000
  chunk_overlap: 200

# Local document processing
document_processing:
  supported_extensions:
    - ".md"
    - ".txt"
    - ".pdf"
  pdf_settings:
    max_file_size_mb: 50
    extract_images: false
    ocr_enabled: false
    languages: ["eng"]
    strategy: "fast"
    mode: "elements"

# Search settings
search:
  default_limit: 8
  default_score_threshold: 0.3
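
With these settings, consecutive chunks advance by chunk_size minus chunk_overlap characters, so you can estimate how many chunks a document will produce. A quick sketch (illustrative arithmetic only, not the application's exact splitting logic, which also respects separators):

# Rough chunk-count estimate for the settings above (illustrative only)
chunk_size, chunk_overlap = 1000, 200
doc_len = 10_000                     # hypothetical document length in characters
stride = chunk_size - chunk_overlap  # each new chunk advances 800 characters
n_chunks = -(-(doc_len - chunk_overlap) // stride)  # ceiling division
print(n_chunks)  # 13 chunks for a 10,000-character document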

Data Storage

The vector store database is stored at:

  • Unix/Mac: ~/.local/share/rag-retriever/chromadb/
  • Windows: %LOCALAPPDATA%\rag-retriever\chromadb\

This location is automatically managed by the application and should not be modified directly.

Uninstallation

To completely remove RAG Retriever:

# Remove the application and its isolated environment
pipx uninstall rag-retriever

# Remove Playwright browsers
python -m playwright uninstall chromium

# Optional: Remove configuration and data files
# Unix/Mac:
rm -rf ~/.config/rag-retriever ~/.local/share/rag-retriever
# Windows (run in PowerShell):
Remove-Item -Recurse -Force "$env:APPDATA\rag-retriever"
Remove-Item -Recurse -Force "$env:LOCALAPPDATA\rag-retriever"

Development Setup

If you want to contribute to RAG Retriever or modify the code:

# Clone the repository
git clone https://github.com/codingthefuturewithai/rag-retriever.git
cd rag-retriever

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Unix/Mac
venv\Scripts\activate     # Windows

# Install in editable mode
pip install -e .

# Initialize user configuration
./scripts/run-rag.sh --init  # Unix/Mac
scripts\run-rag.bat --init   # Windows

Usage Examples

Local Document Processing

# Process a single file
rag-retriever --ingest-file path/to/document.pdf

# Process all supported files in a directory
rag-retriever --ingest-directory path/to/docs/

# Enable OCR for scanned documents (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.ocr_enabled: true
rag-retriever --ingest-file scanned-document.pdf

# Enable image extraction from PDFs (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.extract_images: true
rag-retriever --ingest-file document-with-images.pdf
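
Rather than editing config.yaml by hand, a short script can toggle these flags. This is a sketch that assumes PyYAML is installed and the default Unix config path; it is not a documented interface of RAG Retriever:

# Toggle OCR in the user config (sketch; assumes PyYAML and the default Unix path)
from pathlib import Path
import yaml

config_path = Path.home() / ".config" / "rag-retriever" / "config.yaml"
config = yaml.safe_load(config_path.read_text())
config["document_processing"]["pdf_settings"]["ocr_enabled"] = True
config_path.write_text(yaml.safe_dump(config))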

Web Content Fetching

# Basic fetch
rag-retriever --fetch https://example.com

# With depth control
rag-retriever --fetch https://example.com --max-depth 2

# Minimal output mode
rag-retriever --fetch https://example.com --verbose false
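
To fetch several sites in one go, you can loop the CLI from a short script. A sketch using only the flags documented above (the URLs are placeholders):

# Batch-fetch sketch: run the documented CLI once per URL
import subprocess

urls = ["https://docs.example.com", "https://blog.example.com"]  # placeholders
for url in urls:
    subprocess.run(["rag-retriever", "--fetch", url, "--max-depth", "1"], check=True)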

Searching Content

# Basic search
rag-retriever --query "How do I get started?"

# With truncated content
rag-retriever --query "How do I get started?" --truncate

# With custom result limit
rag-retriever --query "deployment options" --limit 5

# With minimum relevance score
rag-retriever --query "advanced configuration" --score-threshold 0.5

# JSON output format
rag-retriever --query "API reference" --json

Configuration Options

The configuration file (config.yaml) is organized into several sections:

Vector Store Settings

vector_store:
  persist_directory: null # Set automatically to OS-specific path
  embedding_model: "text-embedding-3-large"
  embedding_dimensions: 3072
  chunk_size: 1000 # Size of text chunks for indexing
  chunk_overlap: 200 # Overlap between chunks

Document Processing Settings

document_processing:
  # Supported file extensions
  supported_extensions:
    - ".md"
    - ".txt"
    - ".pdf"

  # Patterns to exclude from processing
  excluded_patterns:
    - ".*"
    - "node_modules/**"
    - "__pycache__/**"
    - "*.pyc"
    - ".git/**"

  # Fallback encodings for text files
  encoding_fallbacks:
    - "utf-8"
    - "latin-1"
    - "cp1252"

  # PDF processing settings
  pdf_settings:
    max_file_size_mb: 50
    extract_images: false
    ocr_enabled: false
    languages: ["eng"]
    password: null
    strategy: "fast" # Options: fast, accurate
    mode: "elements" # Options: single_page, paged, elements

Content Processing Settings

content:
  chunk_size: 2000
  chunk_overlap: 400
  # Text splitting separators (in order of preference)
  separators:
    - "\n## " # h2 headers (strongest break)
    - "\n### " # h3 headers
    - "\n#### " # h4 headers
    - "\n- " # bullet points
    - "\n• " # alternative bullet points
    - "\n\n" # paragraphs
    - ". " # sentences (weakest break)

Search Settings

search:
  default_limit: 8 # Default number of results
  default_score_threshold: 0.3 # Minimum relevance score

Browser Settings (Web Crawling)

browser:
  wait_time: 2 # Base wait time in seconds
  viewport:
    width: 1920
    height: 1080
  delays:
    before_request: [1, 3] # Min and max seconds
    after_load: [2, 4]
    after_dynamic: [1, 2]
  launch_options:
    headless: true
    channel: "chrome"
  context_options:
    bypass_csp: true
    java_script_enabled: true
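
For reference, settings like these map naturally onto Playwright's Python API; the exact wiring inside RAG Retriever is an assumption:

# Assumed mapping of the browser settings onto Playwright's API
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, channel="chrome")
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        bypass_csp=True,
        java_script_enabled=True,
    )
    page = context.new_page()  # pages created here inherit the context options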

Understanding Search Results

Search results include relevance scores based on cosine similarity (see the sketch after this list):

  • Scores range from 0 to 1, where 1 indicates perfect similarity
  • Default threshold is 0.3 (configurable via search.default_score_threshold)
  • Typical interpretation:
    • 0.7+: Very high relevance (nearly exact matches)
    • 0.6 - 0.7: High relevance
    • 0.5 - 0.6: Good relevance
    • 0.3 - 0.5: Moderate relevance
    • Below 0.3: Lower relevance
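
The score is the standard cosine similarity between the query embedding and each chunk embedding:

# Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [0.7, 0.7]))  # ~0.707, "very high relevance"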

Features

  • Local Document Loading: Load markdown, text, and PDF files from local directories.
    • Single File and Directory Loading: Load individual files or entire directories, with multithreading and progress indication.
    • PDF Processing: Extract text and images from PDFs using multiple loaders, with optional OCR for scanned documents.
    • Configurable Settings: Customize supported file types, PDF processing options, and more through configuration files.
  • Error Handling: Robust handling of unsupported file types and missing files, with detailed logging for troubleshooting.
  • Configuration: Flexible options for document processing, including supported extensions and PDF settings.
For more detailed usage instructions and examples, please refer to the local-document-loading.md documentation.

Project Structure

rag-retriever/
├── rag_retriever/    # Main package directory
│   ├── config/       # Configuration settings
│   ├── crawling/     # Web crawling functionality
│   ├── vectorstore/  # Vector storage operations
│   ├── search/       # Search functionality
│   └── utils/        # Utility functions

Dependencies

Key dependencies include:

  • openai: For embeddings generation (text-embedding-3-large model)
  • chromadb: Vector store implementation with cosine similarity
  • playwright: Browser automation and JavaScript content rendering
  • beautifulsoup4: HTML parsing
  • python-dotenv: Environment management

Notes

  • Uses OpenAI's text-embedding-3-large model for generating embeddings by default
  • Content is automatically cleaned and structured during indexing
  • Implements URL depth-based crawling control
  • Vector store persists between runs unless explicitly deleted
  • Uses cosine similarity for more intuitive relevance scoring
  • Minimal output by default with --verbose flag for troubleshooting
  • Full content display by default with --truncate option for brevity

Known Current Limitations

The following limitations are currently being tracked, with possible future enhancements under consideration:

  • Does not check for existing URLs or content in the vector store during fetch operations
    • Possible enhancement: Detect and skip already indexed content by default
    • Possible enhancement: Add --re-fetch option to update existing content
    • Possible enhancement: Provide status information about existing content age
  • Limited document management capabilities
    • Possible enhancement: Support for deleting specific documents from the vector store
    • Possible enhancement: Support for bulk deletion of documents by base URL
    • Possible enhancement: Document listing and filtering tools
  • No direct access to vector store data for analysis
    • Possible enhancement: Tools to examine and analyze stored embeddings and metadata
    • Possible enhancement: Support for export/import of vector store data for backup or transfer
  • Command-line interface only
    • Possible enhancement: Web UI for easier interaction with all features
    • Possible enhancement: Real-time progress monitoring and result visualization

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
