MCP Gosling - Advanced document processing server for Goose AI using IBM's Docling library
MCP Gosling - Document Processor for Goose
Advanced document processing extension for Goose AI with enterprise-grade offline fallback
A powerful Model Context Protocol (MCP) server that provides advanced document processing capabilities for Goose. Process PDFs, DOCX, PPTX, images, and HTML documents with high fidelity using IBM's Docling library, with intelligent fallback to offline processing for network-restricted environments.
Key Features
- Enterprise-Ready: SSL certificate fixes for corporate networks
- Multi-Format: PDF, DOCX, PPTX, images, HTML, and more
- Offline Capable: Graceful fallback to PyPDF2 when Hugging Face is blocked
- High Performance: Optimized for production workloads
- Robust: Comprehensive error handling and validation
- AI-Optimized: Clean Markdown output suited to AI analysis
Quick Start
Installation Options
Option 1: Standard Installation (Recommended)
pip install mcp-gosling
Option 2: Using uvx (Modern Tooling)
# Run directly without installation
uvx mcp-gosling
# Or using uv tool run (identical behavior)
uv tool run mcp-gosling
Configuration for Goose
With Standard Installation:
{
  "mcpServers": {
    "gosling": {
      "command": "mcp-gosling",
      "args": []
    }
  }
}
With uvx:
{
  "mcpServers": {
    "gosling": {
      "command": "uvx",
      "args": ["mcp-gosling"]
    }
  }
}
Usage
# Process your AWS certification PDF
goose "Process my AWS certification document at /path/to/cert.pdf"
# Batch process multiple documents
goose "Process all PDFs in my documents folder and summarize them"
# Extract metadata only
goose "What are the metadata details of this document?"
Available Tools
process_document
Process a single document and return clean Markdown content.
Parameters:
- file_path (string): Path to the document file
- output_format (string): "markdown", "json", or "text" (default: "markdown")
- extract_images (boolean): Whether to extract and describe images (default: false)
- extract_tables (boolean): Whether to extract table structure (default: true)
batch_process_documents
Process multiple documents in batch with optional file output.
Parameters:
- file_paths (array): List of document file paths (max 20 files)
- output_format (string): Output format for all documents (default: "markdown")
- output_directory (string): Directory to save files (empty = return content)
extract_document_metadata
Extract detailed metadata and structure information from a document.
Parameters:
- file_path (string): Path to the document file
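For clients that speak MCP directly, a tool invocation is a JSON-RPC 2.0 `tools/call` request. The sketch below builds one for `process_document`, filling in the defaults listed above. The helper name `build_process_document_call` and the request id are illustrative assumptions, not part of mcp-gosling.

```python
import json

# Defaults as documented for process_document above.
DEFAULTS = {
    "output_format": "markdown",
    "extract_images": False,
    "extract_tables": True,
}
ALLOWED_FORMATS = {"markdown", "json", "text"}


def build_process_document_call(file_path, request_id=1, **overrides):
    """Return a JSON-RPC 2.0 `tools/call` request as a dict (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    arguments = {"file_path": file_path, **DEFAULTS, **overrides}
    if arguments["output_format"] not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {arguments['output_format']!r}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "process_document", "arguments": arguments},
    }


if __name__ == "__main__":
    request = build_process_document_call("/path/to/document.pdf", output_format="json")
    print(json.dumps(request, indent=2))
```

Rejecting unknown parameters and formats on the client side is a design choice of this sketch; the server performs its own validation regardless.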
Advanced Features
Corporate Network Support
- SSL certificate fixes for enterprise environments
- Automatic fallback when Hugging Face Hub is blocked
- Works behind corporate firewalls and proxies
Intelligent Processing
- Primary: IBM Docling for high-fidelity extraction with OCR and table recognition
- Fallback: PyPDF2 for reliable offline PDF processing
- Formats: PDF, DOCX, PPTX, PNG, JPG, HTML, TXT, MD, JSON
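The primary/fallback arrangement above can be sketched as a small wrapper that tries one extractor and switches to another on failure. This illustrates the pattern only and is not mcp-gosling's actual code: `extract_with_fallback` and both stub extractors are hypothetical stand-ins for the Docling and PyPDF2 paths.

```python
from typing import Callable, Tuple

Extractor = Callable[[str], str]


def extract_with_fallback(path: str, primary: Extractor, fallback: Extractor) -> Tuple[str, str]:
    """Try `primary`; if it raises, use `fallback`. Returns (engine_used, text)."""
    try:
        return "primary", primary(path)
    except Exception as exc:  # e.g. a model download blocked by the network
        print(f"primary extractor failed ({exc}); using fallback")
        return "fallback", fallback(path)


# Hypothetical stubs standing in for the real Docling / PyPDF2 calls.
def blocked_primary(path: str) -> str:
    raise ConnectionError("Hugging Face Hub unreachable")


def offline_fallback(path: str) -> str:
    return f"plain text extracted from {path}"


if __name__ == "__main__":
    engine, text = extract_with_fallback("report.pdf", blocked_primary, offline_fallback)
    print(engine, "->", text)
```

Returning the engine name alongside the text lets a caller tell when it received the lower-fidelity fallback output.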
Performance & Reliability
- File size limits (50MB for full processing, 5MB for metadata)
- Batch processing (up to 20 files)
- Comprehensive error handling
- Memory-efficient processing
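A client can mirror the documented limits before sending a request, which avoids a round trip for files the server would reject. The following is a rough sketch under stated assumptions: `preflight` is a hypothetical helper, 1MB is taken to mean 1024 * 1024 bytes (the server may count differently), and the extension set is derived from the formats list above.

```python
import os
import tempfile

MAX_FULL_BYTES = 50 * 1024 * 1024      # documented 50MB limit for full processing
MAX_METADATA_BYTES = 5 * 1024 * 1024   # documented 5MB limit for metadata extraction
# Assumed extensions for the documented formats.
SUPPORTED = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".html", ".txt", ".md", ".json"}


def preflight(path, metadata_only=False):
    """Return a list of problems found; an empty list means the file looks acceptable."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED:
        problems.append(f"unsupported format: {ext or '(none)'}")
    limit = MAX_METADATA_BYTES if metadata_only else MAX_FULL_BYTES
    if os.path.getsize(path) > limit:
        problems.append(f"file exceeds {limit // (1024 * 1024)}MB limit")
    return problems


if __name__ == "__main__":
    # Create a tiny throwaway file so the example runs anywhere.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
        handle.write(b"%PDF-1.4\n")
    print(preflight(handle.name))
    os.remove(handle.name)
```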
Use Cases
- Document Analysis: Extract and analyze content from reports, papers, contracts
- Enterprise: Process documents in network-restricted corporate environments
- Research: Batch process academic papers and research documents
- Data Extraction: Convert documents to structured data for AI analysis
- Content Migration: Bulk convert document formats with preserved structure
Technical Details
Built With:
- IBM Docling - Enterprise-grade document processing
- PyPDF2 - Reliable offline PDF processing
- MCP Python SDK - Model Context Protocol
Requirements:
- Python 3.9+
- Works on macOS, Linux, Windows
- Optional: GPU acceleration for enhanced performance
Installation Options
For Goose Users (Recommended)
- Install via pip:
  pip install mcp-gosling
- Configure in Goose: Add the MCP server to your Goose configuration
- Start using:
  goose "Process this document for me"
For MCP Development
- Clone and install:
  git clone https://github.com/masanderso/goose-docling.git
  cd goose-docling
  pip install -e .
- Test with MCP Inspector:
  mcp dev src/mcp_docling/server.py
Example Outputs
Document Processing
# Document: AWS Certified Solutions Architect - Associate.pdf
**Source:** /path/to/document.pdf
**Format:** .pdf
**Pages:** 2
---
## Page 1
AWS Certified Solutions Architect - Associate
Notice of Exam Results
Candidate: Matthew Sanderson Exam Date: 12/3/2024
Candidate Score: 779 Pass/Fail: PASS
...
Metadata Extraction
{
  "file_info": {
    "name": "document.pdf",
    "size_mb": 0.03,
    "format": ".pdf"
  },
  "document_structure": {
    "page_count": 2,
    "has_tables": true,
    "has_figures": false
  }
}
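A response shaped like the example above is straightforward to consume. The sketch below parses it and builds a one-line summary; `summarize_metadata` is a hypothetical helper, and the field names are taken only from the sample shown here, so a real response may carry additional keys.

```python
import json

# Sample response copied from the metadata example above.
SAMPLE = """
{
  "file_info": {"name": "document.pdf", "size_mb": 0.03, "format": ".pdf"},
  "document_structure": {"page_count": 2, "has_tables": true, "has_figures": false}
}
"""


def summarize_metadata(raw: str) -> str:
    """Turn the metadata JSON into a one-line summary (hypothetical helper)."""
    meta = json.loads(raw)
    info = meta["file_info"]
    structure = meta["document_structure"]
    # Collect a label for each structural feature that is present.
    features = [
        label
        for key, label in (("has_tables", "tables"), ("has_figures", "figures"))
        if structure.get(key)
    ]
    detail = ", ".join(features) if features else "no tables or figures"
    return (
        f'{info["name"]} ({info["format"]}, {info["size_mb"]} MB): '
        f'{structure["page_count"]} pages, {detail}'
    )


if __name__ == "__main__":
    print(summarize_metadata(SAMPLE))
```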
Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
MIT License - see LICENSE file for details.
Tags
goose-extension document-processing mcp-server docling pdf-processing enterprise-ready offline-capable ai-tools
Features
- Multi-format Support: PDF, DOCX, PPTX, images (PNG, JPG), HTML, and more
- Intelligent Processing: OCR, table extraction, and structure preservation
- Flexible Output: Markdown, JSON, or plain text formats
- Batch Processing: Handle multiple documents efficiently
- Metadata Extraction: Detailed document structure and properties
- Production Ready: Robust error handling and file size limits
Tools Available
This MCP server exposes three main tools:
1. process_document
Process a single document and return the converted content.
Parameters:
- file_path (string): Path to the document file
- output_format (string): "markdown", "json", or "text" (default: "markdown")
- extract_images (boolean): Whether to extract and describe images (default: false)
- extract_tables (boolean): Whether to extract table structure (default: true)
Example:
process_document("report.pdf", "markdown", true, true)
2. batch_process_documents
Process multiple documents in batch with optional file output.
Parameters:
- file_paths (array): List of document file paths (max 20 files)
- output_format (string): Output format for all documents (default: "markdown")
- output_directory (string): Directory to save files (empty = return content)
Example:
batch_process_documents(["doc1.pdf", "doc2.docx"], "markdown", "/output")
3. extract_document_metadata
Extract detailed metadata and structure information from a document.
Parameters:
- file_path (string): Path to the document file
Example:
extract_document_metadata("report.pdf")
Installation
For Goose Users
Option 1: Standard Installation
- Install the MCP server:
pip install mcp-gosling
- Add to your Goose configuration:
{
  "mcpServers": {
    "gosling": {
      "command": "mcp-gosling",
      "args": []
    }
  }
}
Option 2: Using uvx (Modern)
- Ensure uv is installed:
pip install uv
- Add to your Goose configuration:
{
  "mcpServers": {
    "gosling": {
      "command": "uvx",
      "args": ["mcp-gosling"]
    }
  }
}
For MCP Development
- Clone and install:
git clone https://github.com/masanderso/mcp-gosling.git
cd mcp-gosling
pip install -e .
- Test with MCP Inspector:
mcp dev src/mcp_docling/server.py
Configuration
The server configures Docling automatically with the following settings:
- OCR enabled for scanned documents
- Table structure extraction with cell matching
- Support for all major document formats
- 50MB file size limit for safety
Use Cases
- Research: Extract content from academic papers and reports
- Business: Process contracts, invoices, and presentations
- Data Extraction: Convert documents to structured data
- Content Migration: Bulk convert document formats
- Analysis: Extract metadata and document structure
Architecture
This server follows the MCP (Model Context Protocol) specification:
- Tools: Document processing functions exposed to AI assistants
- STDIO Transport: Communication via standard input/output
- Error Handling: Proper MCP error responses
- Type Safety: Full type annotations and validation
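On the STDIO transport, MCP exchanges JSON-RPC messages as newline-delimited JSON: one message per line, with no embedded newlines. The sketch below shows that framing in isolation, using an in-memory stream in place of the server's stdin. `write_message` and `read_message` are hypothetical helper names; in practice the MCP SDK handles this for you.

```python
import io
import json


def write_message(stream, message: dict) -> None:
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps without `indent` never emits raw newlines, so one message is one line.
    line = json.dumps(message, separators=(",", ":"))
    stream.write(line + "\n")
    stream.flush()


def read_message(stream):
    """Read one newline-delimited JSON-RPC message, or None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    return json.loads(line)


if __name__ == "__main__":
    pipe = io.StringIO()  # stands in for the server's stdin
    write_message(pipe, {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
    pipe.seek(0)
    print(read_message(pipe))
```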
Integration Examples
With Goose
"Process the quarterly report in /documents/q4-report.pdf and summarize the key findings"
With other MCP clients
# Call the process_document tool.
# `client` is assumed to be an already-initialized MCP client session.
result = await client.call_tool("process_document", {
    "file_path": "/path/to/document.pdf",
    "output_format": "markdown"
})
Performance
- Speed: Optimized for production workloads
- Memory: Efficient processing of large documents
- Reliability: Robust error handling and validation
- Scalability: Suitable for batch processing workflows
Troubleshooting
Common issues and solutions:
- File not found: Ensure file paths are absolute and accessible
- Large files: Files over 50MB are rejected for safety
- Format errors: Check that file format is supported
- Memory issues: Process large batches in smaller chunks
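Two of the fixes above, absolute paths and smaller batches, can be applied together before calling `batch_process_documents`. This is a minimal sketch: `chunk_paths` is a hypothetical helper, and the default chunk size simply reuses the documented 20-file limit; a smaller size may suit memory-constrained machines.

```python
import os
from typing import Iterator, List, Sequence

MAX_BATCH = 20  # documented per-call limit for batch_process_documents


def chunk_paths(paths: Sequence[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield absolute paths in groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    absolute = [os.path.abspath(p) for p in paths]
    for start in range(0, len(absolute), size):
        yield absolute[start:start + size]


if __name__ == "__main__":
    files = [f"docs/paper_{i}.pdf" for i in range(45)]
    for batch in chunk_paths(files):
        print(len(batch), "files, first:", batch[0])
```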
Download files
File details
Details for the file mcp_gosling-0.1.0.tar.gz.
File metadata
- Download URL: mcp_gosling-0.1.0.tar.gz
- Upload date:
- Size: 13.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e10387db6c0bf40417811dbd2de29dcdbbc947b1320eccb169201e8661dbd8b1 |
| MD5 | 038724748987a2f3f348fc432d7545f9 |
| BLAKE2b-256 | 7c9256579a8d071af21548040fb24c661f8a0fb368b62e2394b9166205b2518a |
File details
Details for the file mcp_gosling-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mcp_gosling-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 85af761fdd47d19e591cbb04dd1217ec663911157717a4cdf979d3e22afa4b69 |
| MD5 | 6c63ad9202db68d0228e2f294e3dddac |
| BLAKE2b-256 | a0a68c57b35b9a61d594fb79d8eefc350724d1ce631ed4618bb583f2814b752e |