MCP Gosling - Advanced document processing server for Goose AI using IBM's Docling library

🦆 MCP Gosling - Document Processor for Goose

Advanced document processing extension for Goose AI with enterprise-grade offline fallback

MCP Gosling is a Model Context Protocol (MCP) server that adds document processing to Goose. It processes PDF, DOCX, PPTX, image, and HTML documents with high fidelity using IBM's Docling library, and falls back to offline processing in network-restricted environments.

✨ Key Features

🔧 Enterprise-Ready: SSL certificate fixes for corporate networks
📄 Multi-Format: PDF, DOCX, PPTX, images, HTML, and more
🌐 Offline Capable: Graceful fallback to PyPDF2 when Hugging Face is blocked
⚡ High Performance: Optimized for production workloads
🛡️ Robust: Comprehensive error handling and validation
🎯 AI-Optimized: Clean Markdown output suited to AI analysis

🚀 Quick Start

Installation Options

Option 1: Standard Installation (Recommended)

pip install mcp-gosling

Option 2: Using uvx (Modern Tooling)

# Run directly without installation
uvx mcp-gosling

# Or using uv tool run (identical behavior)
uv tool run mcp-gosling

Configuration for Goose

With Standard Installation:

{
  "mcpServers": {
    "gosling": {
      "command": "mcp-gosling",
      "args": []
    }
  }
}

With uvx:

{
  "mcpServers": {
    "gosling": {
      "command": "uvx",
      "args": ["mcp-gosling"]
    }
  }
}
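
The two entries differ only in `command` and `args`. As a minimal sketch, assuming you want to generate the configuration from a script, the helper below builds either variant. The function names `gosling_server_entry` and `build_config` are hypothetical and not part of MCP Gosling or Goose.

```python
import json


def gosling_server_entry(use_uvx: bool = False) -> dict:
    """Return the "gosling" server entry for either launch style shown above."""
    if use_uvx:
        # uvx fetches and runs the package without a prior pip install
        return {"command": "uvx", "args": ["mcp-gosling"]}
    # Assumes `pip install mcp-gosling` put the console script on PATH
    return {"command": "mcp-gosling", "args": []}


def build_config(use_uvx: bool = False) -> str:
    """Serialize the full mcpServers block as indented JSON."""
    config = {"mcpServers": {"gosling": gosling_server_entry(use_uvx)}}
    return json.dumps(config, indent=2)


print(build_config(use_uvx=True))
```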

Usage

# Process your AWS certification PDF
goose "Process my AWS certification document at /path/to/cert.pdf"

# Batch process multiple documents
goose "Process all PDFs in my documents folder and summarize them"

# Extract metadata only
goose "What are the metadata details of this document?"

📋 Available Tools

process_document

Process a single document and return clean Markdown content.

Parameters:

  • file_path (string): Path to the document file
  • output_format (string): "markdown", "json", or "text" (default: "markdown")
  • extract_images (boolean): Whether to extract and describe images (default: false)
  • extract_tables (boolean): Whether to extract table structure (default: true)
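
The parameter list above can be mirrored on the client side. The following is only a sketch: `process_document_args` is a hypothetical helper, and the validation shown is an assumption about what a careful client might check, not a description of the server's own checks.

```python
from typing import Any, Dict

# Output formats listed for process_document
ALLOWED_FORMATS = {"markdown", "json", "text"}


def process_document_args(
    file_path: str,
    output_format: str = "markdown",
    extract_images: bool = False,
    extract_tables: bool = True,
) -> Dict[str, Any]:
    """Build the arguments object for process_document using the documented defaults."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "file_path": file_path,
        "output_format": output_format,
        "extract_images": extract_images,
        "extract_tables": extract_tables,
    }


print(process_document_args("/path/to/cert.pdf"))
```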

batch_process_documents

Process multiple documents in batch with optional file output.

Parameters:

  • file_paths (array): List of document file paths (max 20 files)
  • output_format (string): Output format for all documents (default: "markdown")
  • output_directory (string): Directory to save files (empty = return content)
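
Because a single call accepts at most 20 files, a longer list has to be split before it is sent. A minimal sketch, assuming the caller issues one `batch_process_documents` call per chunk (`chunk_paths` and `batch_requests` are hypothetical helper names):

```python
from typing import Any, Dict, Iterator, List, Sequence

MAX_BATCH_SIZE = 20  # per-call limit stated in the parameter list above


def chunk_paths(paths: Sequence[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def batch_requests(
    paths: Sequence[str],
    output_format: str = "markdown",
    output_directory: str = "",
) -> List[Dict[str, Any]]:
    """Build one arguments object per chunk; an empty directory means 'return content'."""
    return [
        {
            "file_paths": chunk,
            "output_format": output_format,
            "output_directory": output_directory,
        }
        for chunk in chunk_paths(paths)
    ]
```

Each dictionary in the returned list can then be passed as the arguments of a separate tool call.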

extract_document_metadata

Extract detailed metadata and structure information from a document.

Parameters:

  • file_path (string): Path to the document file

🔧 Advanced Features

Corporate Network Support

  • ✅ SSL certificate fixes for enterprise environments
  • ✅ Automatic fallback when Hugging Face Hub is blocked
  • ✅ Works behind corporate firewalls and proxies

Intelligent Processing

  • Primary: IBM Docling for high-fidelity extraction with OCR and table recognition
  • Fallback: PyPDF2 for reliable offline PDF processing
  • Formats: PDF, DOCX, PPTX, PNG, JPG, HTML, TXT, MD, JSON
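
The primary/fallback arrangement can be pictured as a try-then-fall-back wrapper. This is a sketch of the general pattern, not the server's actual code: `convert_with_fallback` and the two wrapper functions are hypothetical names, and catching every exception is an assumption made for brevity. The Docling and PyPDF2 imports are done lazily so the wrapper itself runs without either package installed.

```python
from typing import Callable, Tuple


def docling_to_markdown(path: str) -> str:
    """Primary path: Docling conversion (needs the docling package and its models)."""
    from docling.document_converter import DocumentConverter  # lazy import

    return DocumentConverter().convert(path).document.export_to_markdown()


def pypdf2_to_text(path: str) -> str:
    """Fallback path: plain text extraction with PyPDF2 (PDF files only)."""
    from PyPDF2 import PdfReader  # lazy import

    reader = PdfReader(path)
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)


def convert_with_fallback(
    path: str,
    primary: Callable[[str], str] = docling_to_markdown,
    fallback: Callable[[str], str] = pypdf2_to_text,
) -> Tuple[str, str]:
    """Return (content, engine), where engine records which path produced the content."""
    try:
        return primary(path), "primary"
    except Exception:
        # For example: the model download is blocked or the import fails
        return fallback(path), "fallback"
```

Returning the engine label alongside the content lets the caller report when the lower-fidelity path was used.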

Performance & Reliability

  • File size limits (50MB for full processing, 5MB for metadata)
  • Batch processing (up to 20 files)
  • Comprehensive error handling
  • Memory-efficient processing

🎯 Use Cases

  • 📑 Document Analysis: Extract and analyze content from reports, papers, contracts
  • 🏢 Enterprise: Process documents in network-restricted corporate environments
  • 🔍 Research: Batch process academic papers and research documents
  • 📊 Data Extraction: Convert documents to structured data for AI analysis
  • 📝 Content Migration: Bulk convert document formats with preserved structure

🛠 Technical Details

Requirements:

  • Python 3.9+
  • Works on macOS, Linux, Windows
  • Optional: GPU acceleration for enhanced performance

🚀 Installation Options

For Goose Users (Recommended)

  1. Install via pip:

    pip install mcp-gosling
    
  2. Configure in Goose: Add the MCP server to your Goose configuration

  3. Start using:

    goose "Process this document for me"
    

For MCP Development

  1. Clone and install:

    git clone https://github.com/masanderso/goose-docling.git
    cd goose-docling
    pip install -e .
    
  2. Test with MCP Inspector:

    mcp dev src/mcp_docling/server.py
    

🔍 Example Outputs

Document Processing

# Document: AWS Certified Solutions Architect - Associate.pdf
**Source:** /path/to/document.pdf
**Format:** .pdf
**Pages:** 2

---

## Page 1

AWS Certified Solutions Architect - Associate
Notice of Exam Results
Candidate: Matthew Sanderson Exam Date: 12/3/2024
Candidate Score: 779 Pass/Fail: PASS
...

Metadata Extraction

{
  "file_info": {
    "name": "document.pdf",
    "size_mb": 0.03,
    "format": ".pdf"
  },
  "document_structure": {
    "page_count": 2,
    "has_tables": true,
    "has_figures": false
  }
}
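
A client can read this structure with ordinary JSON tooling. The sketch below assumes the response has exactly the shape of the example above (the real tool may return more fields), and `summarize_metadata` is a hypothetical helper.

```python
import json

# Sample copied from the metadata example above
SAMPLE = """
{
  "file_info": {"name": "document.pdf", "size_mb": 0.03, "format": ".pdf"},
  "document_structure": {"page_count": 2, "has_tables": true, "has_figures": false}
}
"""


def summarize_metadata(raw: str) -> str:
    """Turn the metadata JSON into a one-line human-readable summary."""
    data = json.loads(raw)
    info = data["file_info"]
    structure = data["document_structure"]
    # Collect the features whose flags are set to true
    features = [
        label
        for key, label in (("has_tables", "tables"), ("has_figures", "figures"))
        if structure.get(key)
    ]
    contains = ", ".join(features) or "no tables or figures"
    return f'{info["name"]}: {structure["page_count"]} pages, {info["size_mb"]} MB, contains {contains}'


print(summarize_metadata(SAMPLE))
```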

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🏷️ Tags

goose-extension document-processing mcp-server docling pdf-processing enterprise-ready offline-capable ai-tools

🚀 Features

  • Multi-format Support: PDF, DOCX, PPTX, images (PNG, JPG), HTML, and more
  • Intelligent Processing: OCR, table extraction, and structure preservation
  • Flexible Output: Markdown, JSON, or plain text formats
  • Batch Processing: Handle multiple documents efficiently
  • Metadata Extraction: Detailed document structure and properties
  • Production Ready: Robust error handling and file size limits

📋 Tools Available

This MCP server exposes three main tools:

1. process_document

Process a single document and return the converted content.

Parameters:

  • file_path (string): Path to the document file
  • output_format (string): "markdown", "json", or "text" (default: "markdown")
  • extract_images (boolean): Whether to extract and describe images (default: false)
  • extract_tables (boolean): Whether to extract table structure (default: true)

Example:

process_document("report.pdf", "markdown", true, true)

2. batch_process_documents

Process multiple documents in batch with optional file output.

Parameters:

  • file_paths (array): List of document file paths (max 20 files)
  • output_format (string): Output format for all documents (default: "markdown")
  • output_directory (string): Directory to save files (empty = return content)

Example:

batch_process_documents(["doc1.pdf", "doc2.docx"], "markdown", "/output")

3. extract_document_metadata

Extract detailed metadata and structure information from a document.

Parameters:

  • file_path (string): Path to the document file

Example:

extract_document_metadata("report.pdf")

🛠 Installation

For Goose Users

Option 1: Standard Installation

  1. Install the MCP server:
pip install mcp-gosling
  2. Add to your Goose configuration:
{
  "mcpServers": {
    "gosling": {
      "command": "mcp-gosling",
      "args": []
    }
  }
}

Option 2: Using uvx (Modern)

  1. Ensure uv is installed:
pip install uv
  2. Add to your Goose configuration:
{
  "mcpServers": {
    "gosling": {
      "command": "uvx",
      "args": ["mcp-gosling"]
    }
  }
}

For MCP Development

  1. Clone and install:
git clone https://github.com/masanderso/mcp-gosling.git
cd mcp-gosling
pip install -e .
  2. Test with MCP Inspector:
mcp dev src/mcp_docling/server.py

🔧 Configuration

The server configures Docling automatically with the following settings:

  • OCR enabled for scanned documents
  • Table structure extraction with cell matching
  • Support for all major document formats
  • 50MB file size limit for safety

🎯 Use Cases

  • Research: Extract content from academic papers and reports
  • Business: Process contracts, invoices, and presentations
  • Data Extraction: Convert documents to structured data
  • Content Migration: Bulk convert document formats
  • Analysis: Extract metadata and document structure

🏗 Architecture

This server follows the MCP (Model Context Protocol) specification:

  • Tools: Document processing functions exposed to AI assistants
  • STDIO Transport: Communication via standard input/output
  • Error Handling: Proper MCP error responses
  • Type Safety: Full type annotations and validation
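
To make the STDIO transport concrete: MCP messages are JSON-RPC 2.0 objects, and over stdio each message is written as a single line of JSON terminated by a newline. The sketch below builds and frames a `tools/call` request. The helper names are hypothetical, and a real session must complete the MCP `initialize` handshake before calling any tool.

```python
import json
from typing import Any, Dict


def tools_call_request(request_id: int, tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request that invokes an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def encode_stdio_message(message: Dict[str, Any]) -> bytes:
    """Frame one message for stdio: compact JSON followed by a single newline."""
    # json.dumps escapes newlines inside strings, so the frame stays on one line
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_stdio_message(line: bytes) -> Dict[str, Any]:
    """Parse one newline-terminated frame back into a message."""
    return json.loads(line.decode("utf-8"))


request = tools_call_request(
    1, "extract_document_metadata", {"file_path": "/path/to/document.pdf"}
)
print(encode_stdio_message(request))
```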

🤝 Integration Examples

With Goose

"Process the quarterly report in /documents/q4-report.pdf and summarize the key findings"

With other MCP clients

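# Call the process_document tool.
# `client` is assumed to be an already-initialized MCP client session
# (for example, mcp.ClientSession from the MCP Python SDK).
result = await client.call_tool("process_document", {
    "file_path": "/path/to/document.pdf",
    "output_format": "markdown"
})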

📊 Performance

  • Speed: Optimized for production workloads
  • Memory: Efficient processing of large documents
  • Reliability: Robust error handling and validation
  • Scalability: Suitable for batch processing workflows

🐛 Troubleshooting

Common issues and solutions:

  • File not found: Ensure file paths are absolute and accessible
  • Large files: Files over 50MB are rejected for safety
  • Format errors: Check that file format is supported
  • Memory issues: Process large batches in smaller chunks
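
Most of these issues can be caught before a file is sent to the server. The following is a sketch of such a pre-flight check. `preflight` is a hypothetical helper, the extension list is taken from the formats named in this README, and treating "50MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
import os
from typing import List

MAX_BYTES = 50 * 1024 * 1024  # assumes "50MB" means 50 MiB
SUPPORTED = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".html", ".txt", ".md", ".json"}


def preflight(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the file looks acceptable."""
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    if not os.path.isfile(path):
        problems.append("file not found or not accessible")
        return problems  # size and extension checks need an existing file
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds 50MB")
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED:
        problems.append(f"unsupported format: {ext or '(none)'}")
    return problems
```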
