MCP-RAG is a retrieval-augmented generation (RAG) system built with the Model Context Protocol (MCP). It handles large files (up to 200MB) using intelligent chunking strategies and offers multi-format document support and enterprise-grade reliability.

Project description

📚 MCP-RAG

Python 3.11+ License: MIT MCP

🌟 Features

📄 Multi-Format Document Support

  • PDF: Intelligent page-by-page processing with table detection
  • DOCX: Paragraph and table extraction with formatting preservation
  • Excel: Sheet-aware processing with column context (.xlsx/.xls)
  • CSV: Smart row batching with header preservation
  • PPTX: PowerPoint presentation support
  • Images: Support for JPEG, PNG, WebP, GIF, and other formats, with OCR
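
As an illustration of "row batching with header preservation" for CSV files, here is a minimal sketch using only the standard library. The function name `batch_csv_rows` and the default batch size are assumptions made for this example and are not taken from the project's code.

```python
import csv
import io
from typing import Iterator, List, TextIO


def _render(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a header plus a batch of rows back into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def batch_csv_rows(handle: TextIO, batch_size: int = 100) -> Iterator[str]:
    """Yield CSV chunks of up to batch_size rows, each starting with the header.

    Hypothetical helper: repeating the header keeps every chunk
    self-describing when it is embedded on its own.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    batch: List[List[str]] = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield _render(header, batch)
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield _render(header, batch)
```

Because the function reads from a file handle row by row, only one batch is held in memory at a time.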

🚀 Large File Processing

  • Adaptive chunking: Different strategies based on file size
  • Memory management: Streaming processing for 50MB+ files
  • Progress tracking: Real-time progress indicators
  • Timeout handling: Graceful handling of long-running operations
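
The size-based strategy selection described above could look roughly like the following sketch. The 50MB and 200MB figures come from this README. The strategy names, the `choose_strategy` and `stream_chunks` functions, and the 1MB read size are assumptions for illustration only.

```python
import io
from typing import BinaryIO, Iterator

MB = 1024 * 1024
STREAMING_THRESHOLD = 50 * MB  # "Streaming processing for 50MB+ files"
MAX_FILE_SIZE = 200 * MB       # "large files (up to 200MB)"


def choose_strategy(size_bytes: int) -> str:
    """Pick a processing strategy from the file size (hypothetical names)."""
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError("file exceeds the 200MB limit")
    if size_bytes >= STREAMING_THRESHOLD:
        return "streaming"
    return "in_memory"


def stream_chunks(handle: BinaryIO, chunk_bytes: int = MB) -> Iterator[bytes]:
    """Read a binary stream in fixed-size blocks so memory use stays bounded."""
    while True:
        block = handle.read(chunk_bytes)
        if not block:
            break
        yield block
```

For example, `choose_strategy(os.path.getsize(path))` could decide whether a file is loaded whole or passed through `stream_chunks`.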

🧠 Advanced RAG Capabilities

  • Semantic search: Vector similarity with confidence scores
  • Cross-document queries: Search across multiple documents simultaneously
  • Source attribution: Citations with similarity scores
  • Hybrid retrieval: Combine semantic and keyword search
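
One common way to combine semantic and keyword search is a weighted sum of a vector similarity score and a term-overlap score. The sketch below shows that idea in plain Python. It is not the project's retrieval code: the `alpha` weight, the whitespace tokenizer, and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that also appear in the text."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)


def hybrid_rank(
    query: str,
    query_vec: Sequence[float],
    docs: Sequence[Tuple[str, str, Sequence[float]]],
    alpha: float = 0.7,
) -> List[Tuple[str, float]]:
    """Rank (source, text, vector) triples by a blended score, best first.

    alpha weights the semantic score; (1 - alpha) weights the keyword score.
    Returning the source name with each score supports source attribution.
    """
    scored = [
        (source, alpha * cosine_similarity(query_vec, vec)
         + (1 - alpha) * keyword_score(query, text))
        for source, text, vec in docs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In a real deployment the vectors would come from an embedding model and the similarity search would run inside the vector database; the blending step is the part this sketch illustrates.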

🔌 Model Context Protocol (MCP) Integration

  • Universal tool interface: Standardized AI-to-tool communication
  • Auto-discovery: LangChain agents automatically find and use tools
  • Secure communication: Built-in permission controls
  • Extensible architecture: Easy to add new document processors
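
An extensible processor design is often built around a registry that maps file extensions to handler functions, so a new format needs only one new registered function. The following is a sketch of that pattern, not the project's actual plug-in mechanism. The `register` decorator, `process_document`, and the `.txt` example processor are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, Union

Processor = Callable[[Path], str]

# Hypothetical registry: lowercase file extension -> processor function.
_PROCESSORS: Dict[str, Processor] = {}


def register(*extensions: str) -> Callable[[Processor], Processor]:
    """Decorator that registers a processor for one or more extensions."""
    def decorator(func: Processor) -> Processor:
        for ext in extensions:
            _PROCESSORS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def process_text(path: Path) -> str:
    """Example processor (an assumption): return the file's text content."""
    return path.read_text(encoding="utf-8")


def process_document(path: Union[str, Path]) -> str:
    """Dispatch a file to the processor registered for its extension."""
    path = Path(path)
    processor = _PROCESSORS.get(path.suffix.lower())
    if processor is None:
        raise ValueError(f"no processor registered for {path.suffix!r}")
    return processor(path)
```

Adding support for another format then means writing one function and decorating it, for example `@register(".pptx")`, without touching the dispatch code.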

๐Ÿข Enterprise Ready

  • Custom LLM endpoints: Support for any OpenAI-compatible API
  • Vector database options: ChromaDB (local) + Milvus (production)
  • Batch processing: Handles API rate limits and batch size constraints
  • Error recovery: Retry logic and graceful degradation
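
Batch processing and retry logic of the kind listed above are commonly implemented as a batching helper plus retry with exponential backoff. The sketch below illustrates this with the standard library only. The names `batched`, `with_retry`, and `embed_all`, the batch size of 64, and the delay values are assumptions, and `embed_batch` stands in for whatever API call the application makes.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split a sequence into consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retry(
    call: Callable[[], R],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Run `call`, retrying on any exception with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises
    the last exception once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts batch by batch, retrying each batch independently."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(with_retry(lambda b=batch: embed_batch(b)))
    return vectors
```

Retrying per batch means a transient rate-limit error only repeats the failed batch rather than the whole document. Production code would usually catch only the specific rate-limit or timeout exceptions of the client library in use.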

๐Ÿ—๏ธ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    Streamlit    │    │    LangChain     │    │   MCP Server    │
│    Frontend     │◄──►│      Agent       │◄──►│     (Tools)     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
        ┌───────────────────────┼────────────────────┐
        │                       ▼                    │
┌───────▼────────┐     ┌─────────────────┐    ┌──────▼──────┐
│    Document    │     │ Vector Database │    │   LLM API   │
│   Processors   │     │   (ChromaDB)    │    │  Endpoint   │
└────────────────┘     └─────────────────┘    └─────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • OpenAI API key or compatible LLM endpoint
  • 8GB+ RAM (for large file processing)

Installation

Clone the repository

git clone https://github.com/yourusername/rag-large-file-processor.git
cd rag-large-file-processor

Create and activate a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Configure environment variables

# Create .env file
cat > .env << EOF
OPENAI_API_KEY=your_openai_api_key_here
BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o
VECTOR_DB_TYPE=chromadb
EOF

Run the application

streamlit run streamlit_app.py
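
To show how the four variables in the .env file might be consumed, here is a small standard-library sketch that parses KEY=VALUE lines and validates that all four are present. How the application actually loads its configuration is not documented here, so `parse_env`, `Settings`, and `load_settings` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "BASE_URL", "MODEL_NAME", "VECTOR_DB_TYPE")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in the .env file."""
    api_key: str
    base_url: str
    model_name: str
    vector_db_type: str


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from a mapping, failing fast on missing keys."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing configuration: " + ", ".join(missing))
    return Settings(
        api_key=env["OPENAI_API_KEY"],
        base_url=env["BASE_URL"],
        model_name=env["MODEL_NAME"],
        vector_db_type=env["VECTOR_DB_TYPE"],
    )
```

A typical call would be `load_settings(parse_env(open(".env", encoding="utf-8").read()))`. Pointing `BASE_URL` at a different OpenAI-compatible endpoint is how a custom LLM endpoint would be selected under this sketch.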
