MCP File Reader

A Model Context Protocol (MCP) service that extracts text content from files using Apache Tika. Its main tool, read_file_content, reads various file formats (PDF, Word, Excel, PowerPoint, images with OCR, and more) and returns their text content; a second tool, list_allowed_directories, reports which directories the service may access.

Features

  • File Content Extraction: Reads files and extracts text using Apache Tika
  • Multiple Format Support: Supports PDF, DOCX, XLSX, PPTX, images, and many other formats
  • Automatic Tika Management: Automatically starts and manages Tika server when needed
  • Simple Deployment: Easy installation and setup using uv
  • Directory Access Control: Configurable allowed directories for secure file access
  • Path Traversal Protection: Blocks path traversal attempts to reach files outside the allowed directories
  • Environment Configuration: Configurable Tika server endpoint and allowed directories
  • Error Handling: Comprehensive error handling for missing files, network issues, etc.

Installation and usage

Since this package is designed to be used as an MCP service, it is typically installed by adding an entry to the configuration of your MCP client. The examples below are for the claude_desktop_config.json settings file of the Claude Desktop app, but the settings should look much the same in other applications.

Using uvx (Recommended)

Insert the "file-reader" stanza below into your mcpServers configuration:

{
  "mcpServers": {
    "file-reader": {
      "command": "uvx",
      "args": [
        "mcp-file-reader",
        "/Users/your_name/Desktop",
        "/Users/your_name/Downloads",
        "/Users/your_name/other_accessible_directory"
      ]
    }
  }
}
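
If you would rather script the change than edit the file by hand, the following is a minimal sketch of merging the stanza into an existing configuration file. The helper name add_file_reader is hypothetical and not part of this package, and the location of the configuration file depends on your MCP client.

```python
import json
from pathlib import Path


def add_file_reader(config_path, allowed_dirs):
    """Add a "file-reader" entry to the mcpServers section of a JSON config file.

    Hypothetical helper for illustration; it is not shipped with mcp-file-reader.
    """
    path = Path(config_path)
    # Start from the existing configuration if the file is already there.
    config = json.loads(path.read_text()) if path.exists() else {}
    # Keep any other servers that are already configured.
    servers = config.setdefault("mcpServers", {})
    servers["file-reader"] = {
        "command": "uvx",
        "args": ["mcp-file-reader", *allowed_dirs],
    }
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling add_file_reader(path_to_config, ["/Users/your_name/Desktop"]) leaves other mcpServers entries in place and overwrites only the "file-reader" entry.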

Running from a local development copy

Check out the latest source using:

cd /Users/your_name/source_path
git clone https://github.com/nickovs/mcp_file_reader.git

Then insert the "file-reader" stanza below into your mcpServers configuration:

{
  "mcpServers": {
    "file-reader": {
      "command": "uvx",
      "args": [
        "--refresh",
        "--from",
        "/Users/your_name/source_path/mcp_file_reader",
        "mcp-file-reader",
        "/Users/your_name/Desktop",
        "/Users/your_name/Downloads",
        "/Users/your_name/other_accessible_directory"
      ]
    }
  }
}

Manual Tika Configuration

If you prefer to manage Tika yourself, set the TIKA_URL environment variable:

{
  "mcpServers": {
    "file-reader": {
      "command": "uvx",
      "env": {
        "TIKA_URL": "http://some.tika.server:9998"
      },
      "args": [
        "mcp-file-reader",
        "/Users/your_name/Desktop",
        "/Users/your_name/Downloads",
        "/Users/your_name/other_accessible_directory"
      ]
    }
  }
}
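
For context, a self-managed Tika server exposes a REST endpoint that accepts a document via PUT on /tika and returns plain text when asked for text/plain. The sketch below builds such a request with the standard library so you can check that your server responds before pointing TIKA_URL at it. The function name build_tika_request is an assumption made for illustration, and this is not necessarily how mcp-file-reader itself talks to Tika.

```python
import urllib.request


def build_tika_request(tika_url, file_bytes):
    """Build (but do not send) a text-extraction request for a Tika server.

    Illustrative helper; the name is hypothetical.
    """
    return urllib.request.Request(
        url=tika_url.rstrip("/") + "/tika",  # Tika server's extraction endpoint
        data=file_bytes,                     # raw document bytes as the request body
        method="PUT",
        headers={"Accept": "text/plain"},    # ask for plain text rather than XHTML
    )


# To actually send the request to a running server:
#     with urllib.request.urlopen(request) as response:
#         text = response.read().decode("utf-8")
```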

Available Tools

read_file_content

Extracts text content from a file using Apache Tika. The file must be within an allowed directory.

Parameters:

  • file_path (string, required): Absolute path to the file to read and extract text from. Must be within allowed directories.

Example:

{
  "name": "read_file_content",
  "arguments": {
    "file_path": "/Users/yourname/Documents/document.pdf"
  }
}

Returns:

  • Success: The extracted text content
  • Error: Error message describing what went wrong (access denied, file not found, etc.)
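
MCP clients normally issue this call for you. For reference, here is a rough sketch of the JSON-RPC 2.0 message a client sends for a tool call under the MCP specification, which wraps the name and arguments shown above in a tools/call request. The helper make_tool_call is hypothetical and only illustrates the message shape.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Return an MCP tools/call request as a plain dict (illustrative helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


message = make_tool_call(
    1,
    "read_file_content",
    {"file_path": "/Users/yourname/Documents/document.pdf"},
)
# Serialized form, as it would be written to the server's transport.
wire = json.dumps(message)
```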

list_allowed_directories

Lists the directories that this service is allowed to access.

Parameters:

  • None

Example:

{
  "name": "list_allowed_directories",
  "arguments": {}
}

Returns:

  • JSON object containing the list of allowed directories and a description

Supported File Formats

Thanks to Apache Tika, this service supports:

  • Documents: PDF, DOC, DOCX, RTF, ODT
  • Spreadsheets: XLS, XLSX, ODS, CSV
  • Presentations: PPT, PPTX, ODP
  • Images: PNG, JPG, GIF, TIFF (with OCR)
  • Text: TXT, XML, HTML, JSON
  • Archives: ZIP, TAR, 7Z (extracts text from contained files)
  • And many more...

Configuration

Environment Variables

  • TIKA_URL: URL of the Apache Tika server (optional, defaults to auto-managed Tika)
  • MCP_ALLOWED_DIRECTORIES: Colon, semicolon, or comma-separated list of directories that the service is allowed to access (optional, defaults to current working directory)

Examples:

# Single directory
export MCP_ALLOWED_DIRECTORIES="/Users/yourname/Documents"

# Multiple directories, colon separated
export MCP_ALLOWED_DIRECTORIES="/Users/yourname/Documents:/Users/yourname/Downloads"
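
As an illustration of the separators named in the MCP_ALLOWED_DIRECTORIES description (colon, semicolon, or comma), here is a minimal sketch of how such a value could be split into a list, falling back to the current working directory when nothing is set. The function parse_allowed_directories is an assumption for illustration; the package's real parsing may differ.

```python
import os
import re


def parse_allowed_directories(value=None):
    """Split a colon/semicolon/comma-separated directory list (illustrative sketch)."""
    if value is None:
        value = os.environ.get("MCP_ALLOWED_DIRECTORIES", "")
    # Note: splitting on ":" would also break Windows drive letters such as "C:\\";
    # this sketch assumes POSIX-style paths.
    parts = [part.strip() for part in re.split(r"[:;,]", value)]
    directories = [part for part in parts if part]
    # Documented default: only the current working directory is accessible.
    return directories or [os.getcwd()]
```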

Security Model

The service implements directory-based access control:

  1. Allowed Directories: Files can only be accessed if they are within configured allowed directories
  2. Path Traversal Protection: Prevents access to files outside allowed directories via ../ or symlink attacks
  3. Absolute Path Requirement: All file paths must be absolute paths
  4. Default Access: If no directories are configured, only the current working directory is accessible
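
The rules above can be sketched in a few lines of Python. This is an assumption about one reasonable way to implement such a check, not the package's actual code, and the name is_path_allowed is hypothetical. The idea is to resolve the requested path first (collapsing ../ segments and following symlinks) and only then compare it against the resolved allowed directories.

```python
import os
from pathlib import Path


def is_path_allowed(file_path, allowed_dirs):
    """Return True if file_path is absolute and resolves inside an allowed directory."""
    # Rule 3: relative paths are rejected outright.
    if not os.path.isabs(file_path):
        return False
    # Resolve ".." segments and symlinks before comparing (rule 2).
    resolved = Path(file_path).resolve()
    for directory in allowed_dirs:
        base = Path(directory).resolve()
        try:
            # relative_to raises ValueError when resolved is not under base.
            resolved.relative_to(base)
            return True
        except ValueError:
            continue
    return False
```

Comparing resolved paths rather than raw strings is what prevents a request such as /allowed/../secret.txt from passing a simple prefix check.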

Requirements

  • Python: 3.8 or higher
  • Docker: Required for automatic Tika management (if not providing custom TIKA_URL)

Development

Development Setup

  1. Clone the repository:

    git clone https://github.com/nickovs/mcp_file_reader.git
    cd mcp_file_reader
    
  2. Install in development mode:

    uv pip install -e ".[dev]"
    
  3. Run the service: See the example above for running the code from a local development copy.

Testing

Run the test suite:

# Install test dependencies
uv pip install ".[dev]"

# Run tests
./run_tests.sh
