PDF Question-Answering System with MCP Integration
DocsRay
A PDF question-answering system that combines embedding models and multimodal LLMs with a Coarse-to-Fine search approach to retrieval-augmented generation (RAG). It integrates with Claude Desktop through MCP (Model Context Protocol) and provides directory management, visual content analysis, and a hybrid OCR system.
Try It Online: https://docsray.com
Quick Start
# 1. Install DocsRay
pip install docsray
# 1-1. Tesseract OCR (optional)
# For faster OCR, install Tesseract with the appropriate language pack.
#pip install pytesseract
#sudo apt-get install tesseract-ocr # Debian/Ubuntu
#sudo apt-get install tesseract-ocr-kor
#brew install tesseract # macOS
#brew install tesseract-lang # macOS language packs (includes Korean)
# 2. Download required models (approximately 8GB)
docsray download-models
# 3. Configure Claude Desktop integration (optional)
docsray configure-claude
# 4. Start using DocsRay
docsray web # Launch Web UI
Features
- Advanced RAG System: Coarse-to-Fine search for accurate document retrieval
- Multimodal AI: Visual content analysis using Gemma-3-4B's image recognition capabilities
- Hybrid OCR System: Intelligent selection between AI-powered OCR and traditional Pytesseract
- Adaptive Performance: Automatically optimizes based on available system resources
- Multi-Model Support: Uses BGE-M3, E5-Large, Gemma-3-1B, and Gemma-3-4B models
- MCP Integration: Seamless integration with Claude Desktop
- Multiple Interfaces: Web UI, API server, CLI, and MCP server
- Directory Management: Advanced PDF directory handling and caching
- Multi-Language: Supports multiple languages including Korean and English
- Smart Resource Management: FAST_MODE, Standard, and FULL_FEATURE_MODE based on system specs
- Universal Document Support: Automatically converts 30+ file formats to PDF for processing
- Smart File Conversion: Handles Office documents, images, HTML, Markdown, and more
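The Coarse-to-Fine idea named in the feature list can be illustrated with a small, generic sketch: first rank sections by similarity to the query, then rank only the chunks inside the best sections. This is not DocsRay's implementation; the function names, the toy 2-D vectors, and the `top_sections`/`top_chunks` parameters are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_to_fine(query: Vector,
                   section_vecs: Dict[str, Vector],
                   chunks: List[Tuple[str, str, Vector]],
                   top_sections: int = 2,
                   top_chunks: int = 3) -> List[str]:
    """Coarse step: keep the best sections. Fine step: rank chunks inside them."""
    ranked = sorted(section_vecs,
                    key=lambda s: cosine(query, section_vecs[s]),
                    reverse=True)
    keep = set(ranked[:top_sections])
    candidates = [(cosine(query, vec), text)
                  for sec, text, vec in chunks if sec in keep]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:top_chunks]]


# Toy 2-D vectors stand in for real embeddings (hypothetical data).
sections = {"intro": (1.0, 0.0), "methods": (0.0, 1.0)}
chunks = [
    ("intro", "overview of the report", (0.9, 0.1)),
    ("methods", "sampling procedure", (0.1, 0.9)),
    ("methods", "statistical tests", (0.2, 0.8)),
]
top = coarse_to_fine((0.0, 1.0), sections, chunks, top_sections=1, top_chunks=2)
```

Restricting the fine step to a few sections keeps the number of chunk comparisons small, which is the usual motivation for a two-stage search.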
What's New in v1.4.0
Universal Document Support
DocsRay now automatically converts various document formats to PDF for processing:
Supported File Formats
Office Documents
- Microsoft Word (.docx, .doc)
- Microsoft Excel (.xlsx, .xls)
- Microsoft PowerPoint (.pptx, .ppt)
Text Formats
- Plain Text (.txt)
Image Formats
- JPEG (.jpg, .jpeg)
- PNG (.png)
- GIF (.gif)
- BMP (.bmp)
- TIFF (.tiff, .tif)
- WebP (.webp)
Automatic Conversion
Simply load any supported file type, and DocsRay will:
- Automatically detect the file format
- Convert it to PDF in the background
- Process it with all the same features as native PDFs
- Clean up temporary files automatically
# Works with any supported format!
docsray process /path/to/document.docx
docsray process /path/to/spreadsheet.xlsx
docsray process /path/to/image.png
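As a rough illustration of the format-detection step, the sketch below groups a file by its extension using the lists above. The `classify_document` helper and the group names are hypothetical and not part of the DocsRay API; real detection may inspect file contents as well.

```python
from pathlib import Path

# Extension groups copied from the supported-format lists above.
FORMAT_GROUPS = {
    "office": {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"},
    "text": {".txt"},
    "image": {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif", ".webp"},
}


def classify_document(path: str) -> str:
    """Return 'pdf', a format group name, or 'unsupported' for a file path."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "pdf"  # native PDFs need no conversion
    for group, extensions in FORMAT_GROUPS.items():
        if ext in extensions:
            return group
    return "unsupported"
```

A caller could use the result to decide whether a conversion step is needed before processing.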
Hybrid OCR System
DocsRay now includes AI-powered OCR based on Gemma-3-4B. You can also use Tesseract OCR by installing it with the language pack you need:
sudo apt-get install tesseract-ocr # Debian/Ubuntu
sudo apt-get install tesseract-ocr-kor
brew install tesseract # macOS
brew install tesseract-lang # macOS language packs (includes Korean)
Adaptive Performance Optimization
Automatically detects system resources and optimizes performance:
| System Memory | Mode | OCR | Visual Analysis | Max Tokens |
|---|---|---|---|---|
| CPU | FAST (Q4) | โ | โ | 8K |
| < 16GB | FAST (Q4) | โ | โ | 8K |
| 16-24GB | STANDARD (Q8) | โ | โ | 16K |
| > 24GB | FULL_FEATURE (F16) | โ | โ | 32K |
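The memory thresholds in the table can be read as a simple decision rule. The sketch below is an interpretation of the table, not DocsRay's detection code: `select_mode` is a hypothetical helper, the "CPU" row is assumed to mean "no GPU available", boundaries are assumed inclusive at 16 GB and 24 GB, and "8K" is assumed to mean 8 × 1024 tokens.

```python
from typing import Tuple


def select_mode(memory_gb: float, has_gpu: bool = True) -> Tuple[str, str, int]:
    """Return (mode, quantization, max_tokens) following the table above."""
    if not has_gpu or memory_gb < 16:
        return "FAST", "Q4", 8 * 1024          # CPU-only or under 16 GB
    if memory_gb <= 24:
        return "STANDARD", "Q8", 16 * 1024     # 16-24 GB
    return "FULL_FEATURE", "F16", 32 * 1024    # more than 24 GB
```

For the values DocsRay actually selected on your machine, use the `FAST_MODE`, `FULL_FEATURE_MODE`, and `MAX_TOKENS` names exported by the package.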
Enhanced MCP Commands
- Cache Management: clear_all_cache, get_cache_info
- Improved Summarization: Batch processing with section-by-section caching
- Detail Levels: Adjustable summary detail (brief/standard/detailed)
Project Structure
DocsRay/
├── docsray/                      # Main package directory
│   ├── __init__.py               # Package init with FAST_MODE detection
│   ├── chatbot.py                # Core chatbot functionality
│   ├── mcp_server.py             # MCP server with directory management
│   ├── app.py                    # FastAPI server
│   ├── web_demo.py               # Gradio web interface
│   ├── download_models.py        # Model download utility
│   ├── cli.py                    # Command-line interface
│   ├── inference/
│   │   ├── embedding_model.py    # Embedding model implementations
│   │   ├── gemma3_handler.py     # Handler for Gemma3 vision input
│   │   └── llm_model.py          # LLM implementations (including multimodal)
│   ├── scripts/
│   │   ├── pdf_extractor.py      # Enhanced PDF extraction with visual analysis
│   │   ├── chunker.py            # Text chunking logic
│   │   ├── build_index.py        # Search index builder
│   │   └── section_rep_builder.py
│   ├── search/
│   │   ├── section_coarse_search.py
│   │   ├── fine_search.py
│   │   └── vector_search.py
│   └── utils/
│       └── text_cleaning.py
├── setup.py                      # Package configuration
├── pyproject.toml                # Modern Python packaging
├── requirements.txt              # Dependencies
├── LICENSE
└── README.md
Installation
Basic Installation
pip install docsray
Development Installation
git clone https://github.com/MIMICLab/DocsRay.git
cd DocsRay
pip install -e .
Usage
Command Line Interface
# Download models (required for first-time setup)
docsray download-models
# Check model status
docsray download-models --check
# Process a PDF with visual analysis
docsray process /path/to/document.pdf
# Ask questions about a processed PDF
docsray ask "What is the main topic?" --pdf document.pdf
# Start web interface
docsray web
# Start API server
docsray api --pdf /path/to/document.pdf --port 8000
# Start MCP server
docsray mcp
Web Interface
docsray web
Access the web interface at http://localhost:44665. Default credentials:
- Username: admin
- Password: password
Features:
- Upload and process PDFs with visual content analysis
- Ask questions about document content including images and charts
- Manage multiple PDFs with caching
- Customize system prompts
API Server
docsray api --pdf /path/to/document.pdf
Example API usage:
# Ask a question
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"question": "What does the chart on page 5 show?"}'
# Get PDF info
curl http://localhost:8000/info
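The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl example: the `/ask` endpoint, port, and `question` field come from that example, while the helper names and the 60-second timeout are assumptions. The response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request


def build_ask_request(question: str,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST /ask request as the curl example, without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str,
        base_url: str = "http://localhost:8000",
        timeout: float = 60.0) -> str:
    """Send the question to a running API server and return the raw response body."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the API server running, `print(ask("What does the chart on page 5 show?"))` would print whatever the server returns.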
Python API
from docsray import PDFChatBot
from docsray.scripts import pdf_extractor, chunker, build_index, section_rep_builder
# Process any document type - auto-conversion handled internally
extracted = pdf_extractor.extract_content(
"report.docx", # Can be DOCX, XLSX, PNG, HTML, etc.
analyze_visuals=True,
visual_analysis_interval=1
)
# Create chunks and build index
chunks = chunker.process_extracted_file(extracted)
chunk_index = build_index.build_chunk_index(chunks)
sections = section_rep_builder.build_section_reps(extracted["sections"], chunk_index)
# Initialize chatbot
chatbot = PDFChatBot(sections, chunk_index)
# Ask questions
answer, references = chatbot.answer("What are the key trends shown in the graphs?")
MCP (Model Context Protocol) Integration
Setup
- Configure Claude Desktop: docsray configure-claude
- Restart Claude Desktop
- Start using DocsRay in Claude
MCP Commands in Claude
Directory Management
- What's my current PDF directory? - Show current working directory
- Set my PDF directory to /path/to/documents - Change working directory
- Show me information about /path/to/pdfs - Get directory details
- Get recommended search paths - Show common document locations for your OS
Document Operations
- List all documents in my current directory - List all supported files (not just PDFs)
- Load the document named "report.docx" - Load any supported file type
- What file types are supported? - Show list of supported formats
- Process all documents in current directory - Batch process with summaries
Search and Retrieval
- Search for documents about machine learning - Content-based semantic search
- Find and load the quarterly report - Search and auto-load best match
- Search for PDF files in my home directory - File system search
- Find all Excel files modified this month - Advanced file search with filters
Visual Content
- What charts or figures are in this document? - List visual elements
- Describe the diagram on page 10 - Get specific visual descriptions
- What data is shown in the graphs? - Analyze data visualizations
- Enable/disable visual analysis - Toggle visual content processing
Q&A and Summarization
- What is the main topic of this document? - Ask questions about loaded document
- Summarize this document briefly - Generate brief summary with embeddings
- Create a detailed summary - Comprehensive section-by-section summary
- Show all document summaries - View all generated summaries
Cache Management
- Clear all cache - Remove all cached files
- Show cache info - Display cache statistics and details
- How much cache space is being used? - Check cache storage
Enhanced MCP Features (v1.3.0)
Batch Processing
Process all documents in /path/to/folder with brief summaries
- Processes multiple documents at once
- Generates summaries with embeddings for semantic search
- Supports brief/standard/detailed summary levels
- Caches results for faster access
Dual Search Modes
- File System Search (search_files)
  - Recursively search directories
  - Filter by file type, size, date
  - Exclude system directories
  - Returns file paths and metadata
- Content Search (search_by_content)
  - Semantic search using summary embeddings
  - GPU-accelerated similarity computation
  - Returns relevance scores
  - Works only on processed documents
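To make the file system search mode concrete, here is a minimal standard-library sketch of a recursive search with type, size, and date filters and a directory exclusion list. It is not the `search_files` tool itself: `find_documents`, its parameters, and the `SKIP_DIRS` set are hypothetical names chosen for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Optional

SKIP_DIRS = {".git", "node_modules", "__pycache__"}  # illustrative exclusions


def find_documents(root: str,
                   extensions: Iterable[str] = (".pdf",),
                   min_size: int = 0,
                   modified_after: Optional[float] = None) -> List[Path]:
    """Recursively collect files under `root` that pass every filter."""
    wanted = {ext.lower() for ext in extensions}
    matches = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        if SKIP_DIRS & set(path.relative_to(root).parts[:-1]):
            continue  # file sits inside an excluded directory
        info = path.stat()
        if info.st_size < min_size:
            continue  # smaller than the requested minimum size in bytes
        if modified_after is not None and info.st_mtime < modified_after:
            continue  # last modified before the cutoff (seconds since epoch)
        matches.append(path)
    return sorted(matches)
```

For example, `find_documents("/path/to/documents", extensions=(".pdf", ".docx"), min_size=10 * 1024 * 1024)` would list PDF and DOCX files of at least 10 MB.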
Smart Directory Analysis
Analyze the path /Users/john/Documents for search complexity
- Estimates document count
- Predicts search time
- Provides complexity assessment
- Recommends search strategies
Example Workflows
Quick Document Discovery
1. "Get recommended search paths"
2. "Search for all PDF files in Documents folder"
3. "Process all documents with brief summaries"
4. "Search by content for budget analysis"
5. "Load the best match"
Research Assistant
1. "Set directory to my research papers"
2. "Process all documents"
3. "Search for papers about neural networks"
4. "Generate detailed summary of current document"
5. "What methodology was used in this paper?"
Visual Content Analysis
1. "Enable visual analysis"
2. "Load presentation.pptx"
3. "What charts are in this presentation?"
4. "Describe the diagram on slide 5"
Advanced MCP Commands
Filtering and Options
- Process only PDF and DOCX files
- Search documents modified after 2024-01-01
- Find files larger than 10MB
- Generate standard summaries for all documents
Performance Control
- Process documents without visual analysis
- Use coarse search for faster results
- Limit processing to 50 files
Tips for Claude Desktop Integration
- First Time Setup: Claude will automatically find your Documents folder
- Batch Processing: Process entire directories before starting research
- Smart Search: Use content search for processed docs, file search for discovery
- Cache Management: Clear cache periodically to free space
- Visual Analysis: Disable for faster processing of text-only documents
Configuration
Environment Variables
# Custom data directory (default: ~/.docsray)
export DOCSRAY_HOME=/path/to/custom/directory
# Force specific mode
export DOCSRAY_FAST_MODE=1 # Force FAST_MODE
# Model paths (optional)
export DOCSRAY_MODEL_DIR=/path/to/models
Programmatic Mode Detection
from docsray import FAST_MODE, FULL_FEATURE_MODE, MAX_TOKENS
print(f"Fast Mode: {FAST_MODE}")
print(f"Full Feature Mode: {FULL_FEATURE_MODE}")
print(f"Max Tokens: {MAX_TOKENS}")
Data Storage
DocsRay stores data in the following locations:
- Models: ~/.docsray/models/
- Cache: ~/.docsray/cache/
- User Data: ~/.docsray/data/
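The sketch below shows one way these locations could be resolved from the environment variables described earlier. It is an assumption-laden illustration: `resolve_docsray_paths` is a hypothetical helper, and it assumes that the cache and data directories sit under DOCSRAY_HOME and that DOCSRAY_MODEL_DIR overrides only the models directory. DocsRay's own resolution logic may differ.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def resolve_docsray_paths(env: Optional[Mapping[str, str]] = None) -> Dict[str, Path]:
    """Resolve the data directories from environment variables with defaults."""
    env = os.environ if env is None else env
    # Default home is ~/.docsray unless DOCSRAY_HOME is set.
    home = Path(env.get("DOCSRAY_HOME") or Path.home() / ".docsray")
    # Assumption: DOCSRAY_MODEL_DIR replaces only the models directory.
    models = Path(env.get("DOCSRAY_MODEL_DIR") or home / "models")
    return {
        "home": home,
        "models": models,
        "cache": home / "cache",
        "data": home / "data",
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to check without changing the real environment.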
Models
DocsRay uses the following models (automatically downloaded):
| Model | Size | Purpose |
|---|---|---|
| bge-m3 | 1.7GB | Multilingual embedding model |
| multilingual-e5-Large | 1.2GB | Multilingual embedding model |
| Gemma-3-1B | 1.1GB | Query enhancement and light tasks |
| Gemma-3-4B | 4.1GB | Main answer generation & visual analysis |
Total storage requirement: ~8GB
Usage Recommendations by Scenario
1. Bulk PDF Processing (Server Environment)
- Recommended: FULL_FEATURE_MODE (ensure sufficient RAM)
- GPU acceleration essential
- Adjust visual_analysis_interval for batch processing
2. Personal Laptop Environment
- Recommended: Standard mode
- Switch to FAST_MODE when needed
- Analyze visuals only on important pages
3. Resource-Constrained Environment
- Use FAST_MODE
- Process text-based PDFs only
- Leverage caching aggressively
Visual Content Analysis Examples
Chart Analysis
[Figure 1 on page 3]: This is a bar chart showing quarterly revenue growth
from Q1 2023 to Q4 2023. The y-axis represents revenue in millions of dollars
ranging from 0 to 50. Each quarter shows progressive growth with Q1 at $12M,
Q2 at $18M, Q3 at $28M, and Q4 at $42M. The trend indicates strong
year-over-year growth of approximately 250%.
Diagram Recognition
[Figure 2 on page 5]: A flowchart diagram illustrating the data processing
pipeline. The flow starts with "Data Input" at the top, branches into three
parallel processes: "Validation", "Transformation", and "Enrichment", which
then converge at "Data Integration" before ending at "Output Database".
Table Extraction
[Table 1 on page 7]: A comparison table with 4 columns (Product, Q1 Sales,
Q2 Sales, Growth %) and 5 rows of data. Product A shows the highest growth
at 45%, while Product C has the highest absolute sales in Q2 at $2.3M.
Troubleshooting
Model Download Issues
# Check model status
docsray download-models --check
# Manual download (if automatic download fails)
# Download models from HuggingFace and place in ~/.docsray/models/
Memory Issues
If you encounter out-of-memory errors:
- Check current mode:
from docsray import FAST_MODE, MAX_TOKENS
print(f"FAST_MODE: {FAST_MODE}")
print(f"MAX_TOKENS: {MAX_TOKENS}")
- Force FAST_MODE:
export DOCSRAY_FAST_MODE=1
- Reduce visual analysis frequency:
extracted = pdf_extractor.extract_pdf_content(
    pdf_path,
    analyze_visuals=True,
    visual_analysis_interval=5  # Analyze every 5th page
)
GPU Support Issues
# Reinstall with GPU support
pip uninstall llama-cpp-python
# For CUDA
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
# For Metal (recent llama-cpp-python releases use GGML_METAL; older releases used LLAMA_METAL)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-cache-dir
MCP Connection Issues
- Ensure all models are downloaded: docsray download-models
- Reconfigure Claude Desktop: docsray configure-claude
- Check MCP server logs: docsray mcp
OCR Language Errors
sudo apt-get install tesseract-ocr # Debian/Ubuntu
sudo apt-get install tesseract-ocr-kor
brew install tesseract # macOS
brew install tesseract-lang # macOS language packs (includes Korean)
File Conversion Issues
Office Documents Not Converting
# Install LibreOffice for best results
sudo apt-get install libreoffice # Ubuntu/Debian
brew install libreoffice # macOS
HTML/Web Files Not Converting
# Install wkhtmltopdf
sudo apt-get install wkhtmltopdf # Ubuntu/Debian
brew install wkhtmltopdf # macOS
# Or use weasyprint (Python-only alternative)
pip install weasyprint
Missing Converter Warning
If you see "No suitable converter found":
- Check system dependencies are installed
- Verify Python packages: pip install docsray[conversion]
- Try alternative converters (LibreOffice > docx2pdf > pandoc)
Auto-Restart Feature (v1.3.0+)
DocsRay includes an automatic restart feature that helps maintain service stability by automatically recovering from errors, memory issues, or crashes.
When Auto-Restart Triggers
The service will automatically restart in the following situations:
- Memory Usage Exceeds 85% - Prevents out-of-memory crashes
- PDF Processing Timeout - Default 5 minutes per document
- Error Threshold Reached - When errors occur within the time window
- Process Crashes - Unexpected termination or unhandled exceptions
Basic Usage
# Start web interface with auto-restart
docsray web --auto-restart
# Start MCP server with auto-restart
docsray mcp --auto-restart
Advanced Options
# Custom retry settings
docsray web --auto-restart --max-retries 10 --retry-delay 10
# With other options
docsray web --auto-restart --port 8080 --timeout 600 --max-retries 20
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| --auto-restart | False | Enable automatic restart on errors |
| --max-retries | 5 | Maximum restart attempts for crashes |
| --retry-delay | 5 | Seconds to wait between restarts |
How It Works
- Intentional Restarts (exit code 42)
  - Triggered by memory limits, timeouts, or error thresholds
  - Retry counter resets to 0
  - Can restart indefinitely
- Crashes (other exit codes)
  - Triggered by unexpected errors
  - Retry counter increases
  - Stops after reaching max-retries
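The restart rules above can be sketched as a small supervisor loop. This is an illustration of the described behavior, not DocsRay's wrapper: `supervise` and its parameters are hypothetical, treating exit code 0 as a clean stop is an assumption, and "stops after reaching max-retries" is interpreted as allowing up to `max_retries` consecutive crash restarts.

```python
import time
from typing import Callable, Tuple

RESTART_EXIT_CODE = 42  # intentional-restart code named in the text above


def supervise(run_once: Callable[[], int],
              max_retries: int = 5,
              retry_delay: float = 5.0,
              sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Run `run_once` repeatedly; return (launch_count, stop_reason)."""
    retries = 0
    launches = 0
    while True:
        launches += 1
        code = run_once()
        if code == 0:
            return launches, "clean exit"  # assumption: exit 0 ends supervision
        if code == RESTART_EXIT_CODE:
            retries = 0  # intentional restart: counter resets, no limit applies
        else:
            retries += 1  # crash: counter increases
            if retries > max_retries:
                return launches, "max retries reached"
        sleep(retry_delay)  # wait before the next launch
```

In a real wrapper, `run_once` would start the service with `subprocess.run(...)` and return its `returncode`; injecting `sleep` keeps the loop testable without waiting.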
Monitoring
Check restart logs:
# View recovery log
cat ~/.docsray/logs/recovery_log.txt
# Monitor service logs
tail -f ~/.docsray/logs/DocsRay_Web_wrapper_*.log
Example Scenarios
Production Server
# High reliability settings
docsray web --auto-restart \
--max-retries 100 \
--retry-delay 30 \
--timeout 900
Development Environment
# Quick restart for testing
docsray web --auto-restart \
--max-retries 5 \
--retry-delay 2
System Service Alternative (Linux)
For production deployments, consider using systemd:
# /etc/systemd/system/docsray.service
[Unit]
Description=DocsRay Web Service
After=network.target
[Service]
Type=simple
User=your-user
WorkingDirectory=/home/your-user
ExecStart=/usr/bin/python -m docsray web --port 80
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Then:
sudo systemctl enable docsray
sudo systemctl start docsray
Troubleshooting
- Service keeps restarting
  - Check memory usage: might need to increase system RAM
  - Reduce visual analysis or page limits
  - Increase timeout values
- Service won't restart
  - Check if max-retries reached
  - Look for "Max retries reached" in logs
  - Restart manually or increase max-retries
Advanced Usage
Custom Visual Analysis
from docsray.scripts.pdf_extractor import extract_pdf_content
# Fine-tune visual analysis
extracted = extract_pdf_content(
"technical_report.pdf",
analyze_visuals=True,
visual_analysis_interval=1 # Every page
)
# Access visual descriptions
for i, page_text in enumerate(extracted["pages_text"]):
if "[Figure" in page_text or "[Table" in page_text:
print(f"Visual content found on page {i+1}")
Batch Processing with Visual Analysis
#!/bin/bash
for pdf in *.pdf; do
echo "Processing $pdf with visual analysis..."
docsray process "$pdf" --analyze-visuals
done
Custom System Prompts for Visual Content
from docsray import PDFChatBot
visual_prompt = """
You are a document assistant specialized in analyzing visual content.
When answering questions:
1. Reference specific figures, charts, and tables by their descriptions
2. Integrate visual information with text content
3. Highlight data trends and patterns shown in visualizations
"""
chatbot = PDFChatBot(sections, chunk_index, system_prompt=visual_prompt)
Batch Document Processing (Mixed Formats)
#!/bin/bash
# Process all supported documents in a directory
for file in *.{pdf,docx,xlsx,pptx,txt,md,html,png,jpg}; do
if [[ -f "$file" ]]; then
echo "Processing $file..."
docsray process "$file"
fi
done
Programmatic Format Detection
from docsray.scripts.file_converter import FileConverter
converter = FileConverter()
# Check if file is supported
if converter.is_supported("presentation.pptx"):
print("File is supported!")
# Get all supported formats
formats = converter.get_supported_formats()
for ext, description in formats.items():
print(f"{ext}: {description}")
Development
Setting Up Development Environment
# Clone repository
git clone https://github.com/MIMICLab/DocsRay.git
cd DocsRay
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e .[dev]
# Run tests
pytest tests/
Contributing
Contributions are welcome! Areas of interest:
- Additional multimodal model support
- Enhanced table extraction algorithms
- Support for more document formats
- Performance optimizations
- UI/UX improvements
License
This project is licensed under the MIT License. See the LICENSE file for details.
Note: Individual model licenses may have different requirements:
- BAAI/bge-m3: MIT License
- intfloat/multilingual-e5-large: MIT License
- gemma-3-1B-it: Gemma Terms of Use
- gemma-3-4B-it: Gemma Terms of Use
Support
- Web Demo: https://docsray.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions