
Vision-empowered MCP server for OpenCode text-only models. PaddleOCR (SOTA deep learning) + Google Gemini fallback for model-agnostic image analysis.


opencode-vision 👁️


Vision-empowered MCP server for OpenCode text-only models.

Give vision capabilities to any text-only model: big-pickle, DeepSeek, MiMo, MiniMax, or any other model that can't process images natively.

pip install "opencode-vision[paddle]"

The Problem

OpenCode supports many models, but most open-weight and free models are text-only. When you paste an image or try to read() one, you get:

ERROR: Cannot read image (this model does not support image input).

This is not a configuration issue; it is a fundamental limitation of the model architecture. Text-only models have no vision encoder, so they cannot accept image input at all.

The Solution

opencode-vision is an MCP server that acts as a "guide dog" for text-only models. It handles image analysis via a dual-engine architecture:

                    ┌────────────────────────────────────────┐
                    │  opencode-vision MCP Server            │
                    │                                        │
  [big-pickle] ────►│  1. PaddleOCR (PP-OCRv5, SOTA) ───────►│──► Text
  [DeepSeek]   ────►│     • 0 errors on invoice test         │
  [MiMo]       ────►│     • 100+ languages                   │
                    │     • ~15MB model footprint            │
                    │                                        │
                    │  2. Gemini Vision API (fallback) ─────►│──► Text
                    │     • Handwriting & scene text         │
                    │     • 1,500 free requests/day          │
                    │     • Zero installation                │
                    └────────────────────────────────────────┘

Why PaddleOCR (not Tesseract)?

| Metric | PaddleOCR (PP-OCRv5) | Tesseract 5 |
|---|---|---|
| Character Error Rate | 4.5% | 18.2% (about 4× worse) |
| Invoice accuracy | 100% (0 errors) | 87.5% (3 errors) |
| OmniDocBench score | 92.86 (SOTA) | N/A |
| Rotated text | ✓ Highly robust | ✗ Fails >5° |
| Scene text accuracy | 85–90% | 60–70% |
| Model size | ~15MB | ~30MB |
| License | Apache 2.0 | Apache 2.0 |
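
For readers unfamiliar with the metric, Character Error Rate (CER) is conventionally the character-level edit distance between the OCR output and the reference text, divided by the reference length. The sketch below shows that standard definition in plain Python; it is an illustration only, not the script used to produce the numbers in the table, and the function names are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One substituted character ("0" read as "O") in a 12-character reference: CER = 1/12.
cer = character_error_rate("Invoice 2024", "Invoice 2O24")
print(f"CER: {cer:.1%}")
```

Because the denominator is the reference length, a CER can exceed 100% when the OCR output contains many spurious characters.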

The community consensus in 2026 is clear: Tesseract is no longer competitive for production OCR. In the comparison above, PaddleOCR's deep learning pipeline delivers a roughly 4× lower character error rate (4.5% vs. 18.2%); it also handles rotated and degraded text and supports 100+ languages.

Gemini Fallback

PaddleOCR struggles with handwriting (14.4% accuracy). When PaddleOCR's confidence is below 70%, or when it fails with an error, the server falls back to the Google Gemini 2.5 Flash Vision API (free tier, 1,500 requests/day, no credit card required), which achieves 86%+ accuracy on handwritten text and handles scene text well.
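
The fallback rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's actual code: `OcrResult`, `ocr_with_fallback`, and the two stub engines are hypothetical names, and the only detail taken from this README is the 70% threshold and the "low confidence or error" trigger.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.70  # from the README: fall back below 70% confidence


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0
    engine: str


Engine = Callable[[str], OcrResult]  # hypothetical engine interface: path -> result


def ocr_with_fallback(
    image_path: str,
    primary: Engine,
    fallback: Engine,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> OcrResult:
    """Try the primary engine; use the fallback on an error or low confidence."""
    result: Optional[OcrResult]
    try:
        result = primary(image_path)
    except Exception:
        result = None  # treat any primary failure like a low-confidence result
    if result is not None and result.confidence >= threshold:
        return result
    return fallback(image_path)


# Stub engines standing in for PaddleOCR and Gemini (assumptions for the demo).
def fake_paddle(path: str) -> OcrResult:
    return OcrResult(text="h3llo w0rld", confidence=0.42, engine="paddleocr")


def fake_gemini(path: str) -> OcrResult:
    return OcrResult(text="hello world", confidence=0.95, engine="gemini")


print(ocr_with_fallback("note.png", fake_paddle, fake_gemini).engine)  # gemini
```

Passing the engines in as callables keeps the routing logic testable without installing PaddleOCR or holding an API key.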

Quick Start

1. Install

pip install "opencode-vision[paddle]"  # Recommended: PaddleOCR + Pillow (quotes keep zsh from globbing the brackets)
pip install opencode-vision            # Minimal: Gemini API only

2. Get a Gemini API key

Get a free key at aistudio.google.com (1,500 requests/day, no credit card required).

Set it in ~/.config/opencode/.env:

echo 'GOOGLE_API_KEY=your_key_here' >> ~/.config/opencode/.env

Or export it directly:

export GOOGLE_API_KEY=your_key_here

3. Add to OpenCode config

Add this to ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "vision": {
      "type": "local",
      "command": ["python3", "-m", "opencode_vision.server"],
      "enabled": true,
      "timeout": 30000
    }
  }
}
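
If opencode.json already contains other settings, the entry can be merged in rather than pasted over the file. The helper below is a hedged sketch, not part of opencode-vision: `add_vision_server` and `update_config_file` are hypothetical names, and it assumes the config is plain JSON without comments.

```python
import json
from pathlib import Path

# The MCP entry from the snippet above.
VISION_ENTRY = {
    "type": "local",
    "command": ["python3", "-m", "opencode_vision.server"],
    "enabled": True,
    "timeout": 30000,
}


def add_vision_server(config: dict) -> dict:
    """Return a copy of config with the "vision" MCP entry added or replaced."""
    merged = dict(config)
    merged.setdefault("$schema", "https://opencode.ai/config.json")
    mcp = dict(merged.get("mcp", {}))  # copy so the caller's dict is not mutated
    mcp["vision"] = dict(VISION_ENTRY)
    merged["mcp"] = mcp
    return merged


def update_config_file(path: Path) -> None:
    """Read, merge, and rewrite the config file, creating it if it is missing."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_vision_server(config), indent=2) + "\n", encoding="utf-8")
```

To apply it, call `update_config_file(Path.home() / ".config" / "opencode" / "opencode.json")`. Other MCP servers and top-level settings are left in place.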

4. Restart OpenCode

Start a new session. The vision_describe, vision_ocr, and vision_analyze tools will be available to all models, including text-only ones.

5. Ask about images

User: What's in this image?
Model: [calls vision_describe("/path/to/image.png")]
       "A dark gradient banner with 'Nicolás Ríos Herrera'..."

Tools

| Tool | Description | When to use |
|---|---|---|
| vision_describe(path, prompt?) | Describe an image in detail | "What does this show?" |
| vision_ocr(path) | Extract all visible text | "What text is in this screenshot?" |
| vision_analyze(path) | Metadata + description + OCR | Comprehensive understanding |
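
Under the hood, an MCP client invokes these tools with a JSON-RPC `tools/call` request. The sketch below builds such a request; `build_tool_call` is a hypothetical helper, and the argument key `path` is an assumption inferred from the signatures in the table above rather than a confirmed schema.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a single-line JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Assumed argument name: "path", taken from the vision_ocr(path) signature.
print(build_tool_call(1, "vision_ocr", {"path": "/path/to/image.png"}))
```

In normal use OpenCode sends these messages itself; building one by hand is mainly useful when debugging the server from a terminal.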

Dependencies

| Component | Required? | Notes |
|---|---|---|
| Python >= 3.10 | ✅ Required | |
| GOOGLE_API_KEY | ✅ Required | Get free at aistudio.google.com |
| pillow | 📦 Recommended | pip install pillow for metadata + auto-resize |
| paddleocr | 🚀 Recommended | pip install paddleocr for local SOTA OCR |
| tesseract-ocr | ❌ Deprecated | No longer used. PaddleOCR replaces it entirely. |

The server auto-detects the API key from (in order):

  1. GOOGLE_API_KEY environment variable
  2. GOOGLE_GENERATIVE_AI_API_KEY environment variable
  3. ~/.config/opencode/.env file
  4. ~/.env file
  5. $PWD/.env file
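
The lookup order above can be sketched as follows. This is an illustration, not the package's code: `find_api_key` and `parse_env_file` are hypothetical names, the .env parser handles only simple `KEY=value` lines, and accepting both variable names inside .env files is an assumption.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional

ENV_VARS = ("GOOGLE_API_KEY", "GOOGLE_GENERATIVE_AI_API_KEY")


def default_env_files() -> list[Path]:
    """The three .env locations from the list above, in priority order."""
    home = Path.home()
    return [home / ".config" / "opencode" / ".env", home / ".env", Path.cwd() / ".env"]


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines; skip blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")  # drop surrounding quotes
    return values


def find_api_key(
    environ: Optional[Mapping[str, str]] = None,
    env_files: Optional[Iterable[Path]] = None,
) -> Optional[str]:
    """Return the first key found: environment variables first, then .env files."""
    environ = os.environ if environ is None else environ
    for name in ENV_VARS:
        if environ.get(name):
            return environ[name]
    for path in default_env_files() if env_files is None else env_files:
        values = parse_env_file(path)
        for name in ENV_VARS:  # assumption: both names are honoured in .env files
            if values.get(name):
                return values[name]
    return None
```

Calling `find_api_key()` with no arguments checks the real environment and the default file locations; the parameters exist so the order can be tested in isolation.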

CLI Usage (without OpenCode)

# Start MCP server (for OpenCode integration)
opencode-vision

# Direct analysis
opencode-vision describe ~/screenshot.png
opencode-vision ocr ~/scanned-document.png
opencode-vision analyze ~/photo.jpg

# Custom prompt
opencode-vision describe ~/chart.png "What are the values in this chart?"

Architecture

Why Python?

All existing MCP vision servers for OpenCode are Node.js/TypeScript and require npm install or npx. opencode-vision is pure Python because:

  • Python is already installed on most developer machines
  • pillow (PIL) is the standard image processing library
  • PaddleOCR is the best open-source OCR engine available
  • The MCP protocol is simple JSON-RPC over stdio, so no framework is needed
  • Zero node_modules, zero npm, zero npx
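
To illustrate why no framework is needed, here is a minimal sketch of a newline-delimited JSON-RPC loop over stdio using only the standard library. It is not the contents of server.py or mcp.py: the tool list is a stub, the handler names are hypothetical, and a real MCP server must also handle the `initialize` handshake, which is omitted here.

```python
import json
import sys
from typing import Optional, TextIO

# Stub tool list; the schema shape is an assumption for illustration.
TOOLS = [{
    "name": "vision_ocr",
    "description": "Extract all visible text",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]


def handle_message(line: str) -> Optional[str]:
    """Handle one JSON-RPC message; return the response line, or None for notifications."""
    request = json.loads(line)
    if "id" not in request:
        return None  # notifications carry no id and get no reply
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        name = request["params"]["name"]
        result = {"content": [{"type": "text", "text": f"(stub) {name} was called"}]}
    else:
        error = {"code": -32601, "message": f"Method not found: {method}"}
        return json.dumps({"jsonrpc": "2.0", "id": request["id"], "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": request["id"], "result": result})


def serve(stdin: Optional[TextIO] = None, stdout: Optional[TextIO] = None) -> None:
    """Read one JSON message per line and write one response per line."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    for line in stdin:
        if not line.strip():
            continue
        response = handle_message(line)
        if response is not None:
            stdout.write(response + "\n")
            stdout.flush()  # the client blocks until it sees the full line
```

Running `serve()` from a script and piping a `tools/list` request into it prints the stub tool list, which is enough to see the transport at work.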

Modular Design (v2.0)

opencode-vision/
├── opencode_vision/
│   ├── __init__.py    # Package metadata
│   ├── __main__.py    # CLI entry point
│   ├── server.py      # MCP server (thin router)
│   ├── mcp.py         # MCP transport protocol
│   ├── ocr.py         # OCR engine (PaddleOCR + Gemini fallback)
│   ├── gemini.py      # Gemini Vision API client
│   └── image.py       # Image processing utilities
├── pyproject.toml
└── README.md

OCR Strategy

                    ┌────────────────────────────┐
                    │   PaddleOCR (PP-OCRv5)     │
                    │   • Deep learning OCR      │
  User image ──────►│   • 0 errors, invoice test │───► conf ≥ 70% ──► Return text
                    │   • 100+ languages         │
                    └─────────┬──────────────────┘
                              │ conf < 70% / error
                              ▼
                    ┌────────────────────────────┐
                    │   Gemini 2.5 Flash Vision  │
                    │   • Handwriting / scene    │───► Return text
                    │   • 1,500 free req/day     │
                    └────────────────────────────┘

Cost: $0

  • Gemini 2.5 Flash: 1,500 free requests/day via Google AI Studio API key
  • PaddleOCR: free and open-source (Apache 2.0)
  • Pillow: free and local for metadata
  • No OpenCode Go credits consumed: the API call happens in the vision server, not through OpenCode's model proxy

Comparison with Alternatives

| Feature | opencode-vision v2 | opencode-vision v1 | opencode-minimax-easy-vision | qwen-vision-mcp |
|---|---|---|---|---|
| Runtime | Python (stdlib) | Python (stdlib) | Node.js + npm | Node.js + npm |
| OCR engine | PaddleOCR (SOTA) | Tesseract (legacy) | None (API only) | None (API only) |
| OCR accuracy | 4.5% CER (0 errors on invoice test) | ~18% CER | N/A | N/A |
| Handwriting | Gemini Vision API | ❌ Not supported | ❌ | ❌ |
| Dependencies | pip install opencode-vision[paddle] | pip install opencode-vision | npm install | npx |
| API cost | $0 (Gemini free tier) | $0 | MiniMax pricing | $0 (local) |
| Auto .env | ✓ Reads ~/.config/opencode/.env | ✓ | ✗ Manual env vars | ✗ |
| Image resize | ✓ Pillow auto-resize | ✓ Pillow | ✗ | ✗ |
| Install size | ~200 KB + optional 15MB model | ~200 KB | ~30 MB | ~30 MB |

Why "Model-Agnostic"?

The key architectural insight: the model never needs to see pixels. The MCP server does all the visual processing externally and returns text. This means:

  • Works with any text-only model (big-pickle, DeepSeek, MiMo, MiniMax, etc.)
  • Works with any multimodal model too (it doesn't interfere)
  • No model-specific configuration
  • No provider-specific setup
  • The model can be changed at any time without reconfiguring vision

License

MIT


Built with ❤️ by Nicolás Ríos Herrera for the OpenCode community.
