An AWS Labs Model Context Protocol (MCP) server for document parsing

Document Loader MCP Server

Model Context Protocol (MCP) server for document parsing and content extraction

This MCP server provides tools to parse and extract content from various document formats including PDF, Word documents, Excel spreadsheets, PowerPoint presentations, and images.

Features

  • PDF Text Extraction: Extract text content from PDF files using pdfplumber
  • Word Document Processing: Convert DOCX/DOC files to markdown using markitdown
  • Excel Spreadsheet Reading: Parse XLSX/XLS files and convert to markdown
  • PowerPoint Presentation Processing: Extract content from PPTX/PPT files
  • Image Loading: Load and display various image formats (PNG, JPG, GIF, BMP, TIFF, WEBP)
  • Slide Image Extraction: Extract individual slides/pages as PNG images from PPTX, PPT, or PDF files using LibreOffice and poppler

Prerequisites

Installation Requirements

  1. Install uv from Astral or the GitHub README
  2. Install Python 3.10 or newer using uv python install 3.10 (or a more recent version)

Optional: Slide Image Extraction

The extract_slides_as_images tool requires external system packages:

  • LibreOffice (for PPTX/PPT → PDF conversion)
  • poppler-utils (for PDF → image rendering):
    • Ubuntu/Debian: sudo apt install poppler-utils
    • macOS: brew install poppler
    • Windows: Download from GitHub and add to PATH
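
Before calling extract_slides_as_images, it can help to confirm that both system packages are reachable. The following is a minimal sketch, not part of the server itself. The executable names are assumptions: LibreOffice commonly installs soffice (and often libreoffice), and poppler-utils provides pdftoppm. The helper name find_slide_tools is hypothetical.

```python
import shutil

# Candidate executable names are assumptions, not taken from the server code:
# LibreOffice usually installs `soffice` (and often `libreoffice`), and
# poppler-utils provides `pdftoppm`.
CANDIDATES = {
    "LibreOffice": ("soffice", "libreoffice"),
    "poppler-utils": ("pdftoppm",),
}


def find_slide_tools(which=shutil.which):
    """Map each dependency to the first executable found on PATH, or None."""
    found = {}
    for name, executables in CANDIDATES.items():
        # Take the first candidate that resolves to a path, if any.
        found[name] = next((p for p in map(which, executables) if p), None)
    return found


if __name__ == "__main__":
    for name, path in find_slide_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If either entry is reported as not found, install the package as described above or add its directory to PATH.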

Installation

Configure the MCP server in your MCP client configuration:

{
  "mcpServers": {
    "awslabs.document-loader-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.document-loader-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

For Kiro MCP configuration, see the Kiro IDE documentation or the Kiro CLI documentation for details.

For global configuration, edit ~/.kiro/settings/mcp.json. For project-specific configuration, edit .kiro/settings/mcp.json in your project directory.
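
If you prefer to script the change rather than edit the file by hand, the entry can be merged into an existing configuration. This is only a sketch under the assumption that the file is plain JSON with a top-level mcpServers object, as in the snippet above. The helper names with_document_loader and update_config_file are hypothetical.

```python
import json
from pathlib import Path

SERVER_NAME = "awslabs.document-loader-mcp-server"
# Same entry as the JSON configuration shown above.
SERVER_ENTRY = {
    "command": "uvx",
    "args": ["awslabs.document-loader-mcp-server@latest"],
    "env": {"FASTMCP_LOG_LEVEL": "ERROR"},
    "disabled": False,
    "autoApprove": [],
}


def with_document_loader(config):
    """Return a copy of config with the server entry added under mcpServers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[SERVER_NAME] = SERVER_ENTRY
    merged["mcpServers"] = servers
    return merged


def update_config_file(path):
    """Read path if it exists, merge the entry, and write the result back."""
    path = Path(path)
    current = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(with_document_loader(current), indent=2) + "\n")
```

For example, update_config_file(Path.home() / ".kiro/settings/mcp.json") would target the global Kiro configuration, and update_config_file(".kiro/settings/mcp.json") the project-specific one. Other servers already present in the file are left in place.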

Available Tools

  • read_document: Extract content from various document formats by specifying file_path and file_type ('pdf', 'docx', 'doc', 'xlsx', 'xls', 'pptx', 'ppt')
  • read_image: Load image files for LLM viewing and analysis
  • extract_slides_as_images: Extract slides/pages as individual PNG images from PPTX, PPT, or PDF files. Requires LibreOffice (for PPTX/PPT) and poppler-utils (for PDF-to-image rendering)
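
MCP clients normally issue tool calls for you, but it can be useful to see what a read_document call looks like on the wire. The sketch below builds a JSON-RPC 2.0 tools/call message using the file_path and file_type arguments named above. Inferring file_type from the file extension is a convenience of this example, not documented server behaviour, and the helper name read_document_request is hypothetical.

```python
import json
from pathlib import Path

# File types listed for read_document in the tool description above.
SUPPORTED_TYPES = {"pdf", "docx", "doc", "xlsx", "xls", "pptx", "ppt"}


def read_document_request(file_path, request_id=1):
    """Build a JSON-RPC 2.0 tools/call message for read_document."""
    # Assumption of this sketch: derive file_type from the extension.
    file_type = Path(file_path).suffix.lower().lstrip(".")
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported file type: {file_type!r}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "read_document",
            "arguments": {"file_path": file_path, "file_type": file_type},
        },
    }


if __name__ == "__main__":
    print(json.dumps(read_document_request("reports/q3.pdf"), indent=2))
```

The path reports/q3.pdf is a placeholder. Rejecting unknown extensions on the client side avoids a round trip for files the tool does not list as supported.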

Environment Variables

  • FASTMCP_LOG_LEVEL: Logging level for the server (for example ERROR, INFO, or DEBUG)
  • MAX_FILE_SIZE_MB: Maximum allowed file size in megabytes (default: 50). Must be a positive integer.
  • DOCUMENT_BASE_DIR: Base directory for file access security. Restricts document loading to files within this directory. Defaults to the current working directory.
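
To illustrate what the two limits above mean in practice, here is a rough sketch of equivalent checks. It is not the server's actual implementation. The function names are hypothetical, and two details are assumptions: one megabyte is treated as 1024 * 1024 bytes, and relative paths are resolved against the base directory.

```python
import os
from pathlib import Path


def max_file_size_bytes(env=os.environ):
    """Parse MAX_FILE_SIZE_MB as a positive integer (default 50), in bytes."""
    raw = env.get("MAX_FILE_SIZE_MB", "50")
    try:
        size_mb = int(raw)
    except ValueError:
        raise ValueError(f"MAX_FILE_SIZE_MB must be an integer, got {raw!r}") from None
    if size_mb <= 0:
        raise ValueError(f"MAX_FILE_SIZE_MB must be positive, got {size_mb}")
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    return size_mb * 1024 * 1024


def is_within_base_dir(file_path, env=os.environ):
    """Return True if file_path resolves to a location inside DOCUMENT_BASE_DIR."""
    base = Path(env.get("DOCUMENT_BASE_DIR", os.getcwd())).resolve()
    target = Path(file_path)
    if not target.is_absolute():
        # Assumption: relative paths are interpreted relative to the base.
        target = base / target
    # resolve() collapses ".." segments and symlinks before the comparison.
    return target.resolve().is_relative_to(base)
```

Resolving the path before comparing it is what stops inputs such as ../secret.pdf from escaping the base directory. Path.is_relative_to requires Python 3.9 or newer, which the Python 3.10 prerequisite already covers.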

Development

Setup

# Clone the repository
git clone https://github.com/awslabs/mcp.git
cd mcp/src/document-loader-mcp-server

# Install dependencies
uv sync

# Install in development mode
uv pip install -e .

Testing

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=awslabs.document_loader_mcp_server

The test suite includes:

  • Server functionality validation
  • Document parsing tests with generated sample files
  • Error handling verification

Sample Documents

The test suite automatically generates sample documents for testing:

  • PDF with multi-page content
  • DOCX with formatted text and lists
  • XLSX with multiple sheets and data
  • PPTX with slides and content
  • Various image formats

Docker

You can also run this server in a Docker container:

docker build -t document-loader-mcp-server .
docker run -p 8000:8000 document-loader-mcp-server

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Support

For issues and questions, please use the GitHub issue tracker.
