Vision-empowered MCP server for OpenCode text-only models. PaddleOCR (SOTA deep learning) + Google Gemini fallback for model-agnostic image analysis.
opencode-vision 👁️
Give vision capabilities to any text-only model: big-pickle, DeepSeek, MiMo, MiniMax, or any other model that can't process images natively.
pip install opencode-vision[paddle]
The Problem
OpenCode supports many models, but most open-weight and free models are
text-only. When you paste an image or try to read() one, you get:
ERROR: Cannot read image (this model does not support image input).
This is not a configuration issue; it's a fundamental limitation of the model architecture. Text-only models have no image encoder, so they cannot take pixels as input.
The Solution
opencode-vision is an MCP server that acts as a "guide dog" for text-only
models. It handles image analysis via a dual-engine architecture:
                  ┌──────────────────────────────────────┐
                  │ opencode-vision MCP Server           │
                  │                                      │
[big-pickle] ────►│ 1. PaddleOCR (PP-OCRv5, SOTA)        │────► Text
[DeepSeek]   ────►│    • 0 errors on invoice benchmark   │
[MiMo]       ────►│    • 100+ languages                  │
                  │    • ~15MB model footprint           │
                  │                                      │
                  │ 2. Gemini Vision API (fallback)      │────► Text
                  │    • Handwriting & scene text        │
                  │    • 1,500 free requests/day         │
                  │    • Zero installation               │
                  └──────────────────────────────────────┘
Why PaddleOCR (not Tesseract)?
| Metric | PaddleOCR (PP-OCRv5) | Tesseract 5 |
|---|---|---|
| Character Error Rate | 4.5% | 18.2% (4× worse) |
| Invoice accuracy | 100% (0 errors) | 87.5% (3 errors) |
| OmniDocBench score | 92.86 (SOTA) | N/A |
| Rotated text | Highly robust | Fails >5° |
| Scene text accuracy | 85–90% | 60–70% |
| Model size | ~15MB | ~30MB |
| License | Apache 2.0 | Apache 2.0 |
The community consensus in 2026 is clear: Tesseract is no longer competitive for production OCR. PaddleOCR's deep learning pipeline delivers 4× lower error rates, handles rotated and degraded text, and supports 100+ languages.
Gemini Fallback
PaddleOCR struggles with handwriting (14.4% accuracy). When its confidence falls below 70%, or it fails with an error, the server falls back to the Google Gemini 2.5 Flash Vision API (free tier, 1,500 requests/day, no credit card required), which achieves 86%+ accuracy on handwritten text and also copes well with scene text.
Quick Start
1. Install
pip install opencode-vision[paddle] # Recommended: PaddleOCR + Pillow
pip install opencode-vision # Minimal: Gemini API only
2. Get a Gemini API key
Get a free key at aistudio.google.com (1,500 requests/day, no credit card required).
Set it in ~/.config/opencode/.env:
echo 'GOOGLE_API_KEY=your_key_here' >> ~/.config/opencode/.env
Or export it directly:
export GOOGLE_API_KEY=your_key_here
3. Add to OpenCode config
Add this to ~/.config/opencode/opencode.json:
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "vision": {
      "type": "local",
      "command": ["python3", "-m", "opencode_vision.server"],
      "enabled": true,
      "timeout": 30000
    }
  }
}
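If opencode.json already exists, the entry can be merged in rather than pasted by hand. The following is a minimal sketch, not part of opencode-vision: the helper names `add_vision_server` and `update_config_file` are hypothetical, and it assumes the file is plain JSON without comments.

```python
import json
from pathlib import Path

# The "vision" entry, copied from the config shown above.
VISION_ENTRY = {
    "type": "local",
    "command": ["python3", "-m", "opencode_vision.server"],
    "enabled": True,
    "timeout": 30000,
}


def add_vision_server(config: dict) -> dict:
    """Return a copy of config with the vision MCP entry added.

    Existing top-level keys and other MCP servers are left untouched.
    """
    merged = dict(config)
    merged.setdefault("$schema", "https://opencode.ai/config.json")
    mcp = dict(merged.get("mcp", {}))
    mcp["vision"] = dict(VISION_ENTRY)
    merged["mcp"] = mcp
    return merged


def update_config_file(path: Path) -> None:
    """Read path if it exists, merge the entry, and write the result back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_vision_server(config), indent=2) + "\n")
```

Calling `update_config_file(Path.home() / ".config" / "opencode" / "opencode.json")` would apply it to the path used in this step.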
4. Restart OpenCode
Start a new session. The vision_describe, vision_ocr, and vision_analyze
tools will be available to all models, even text-only ones.
5. Ask about images
User: What's in this image?
Model: [calls vision_describe("/path/to/image.png")]
"A dark gradient banner with 'Nicolás Ríos Herrera'..."
Tools
| Tool | Description | When to use |
|---|---|---|
| `vision_describe(path, prompt?)` | Describe an image in detail | "What does this show?" |
| `vision_ocr(path)` | Extract all visible text | "What text is in this screenshot?" |
| `vision_analyze(path)` | Metadata + description + OCR | Comprehensive understanding |
Dependencies
| Component | Required? | Notes |
|---|---|---|
| Python >= 3.10 | Required | |
| `GOOGLE_API_KEY` | Required | Get free at aistudio.google.com |
| `pillow` | Recommended | pip install pillow for metadata + auto-resize |
| `paddleocr` | Recommended | pip install paddleocr for local SOTA OCR |
| `tesseract-ocr` | Deprecated | No longer used. PaddleOCR replaces it entirely. |
The server auto-detects the API key from (in order):
1. GOOGLE_API_KEY environment variable
2. GOOGLE_GENERATIVE_AI_API_KEY environment variable
3. ~/.config/opencode/.env file
4. ~/.env file
5. $PWD/.env file
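The lookup order above can be sketched as follows. This is an illustration, not the server's actual code: `find_api_key` and `read_key_from_env_file` are hypothetical names, and the assumption that a .env file may use either variable name (optionally prefixed with `export`) is mine.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Environment variable names, in the priority order listed above.
ENV_VARS = ("GOOGLE_API_KEY", "GOOGLE_GENERATIVE_AI_API_KEY")


def read_key_from_env_file(path: Path) -> Optional[str]:
    """Return the first matching KEY=value entry in a .env file, if any."""
    if not path.is_file():
        return None
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip().removeprefix("export ").strip()
        value = value.strip().strip("'\"")
        if name in ENV_VARS and value:
            return value
    return None


def find_api_key(
    environ: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
    cwd: Optional[Path] = None,
) -> Optional[str]:
    """Check environment variables first, then the three .env locations."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    cwd = Path.cwd() if cwd is None else Path(cwd)
    for name in ENV_VARS:
        if environ.get(name):
            return environ[name]
    candidates = (home / ".config" / "opencode" / ".env", home / ".env", cwd / ".env")
    for candidate in candidates:
        key = read_key_from_env_file(candidate)
        if key:
            return key
    return None
```

The `environ`, `home`, and `cwd` parameters exist only so the order can be exercised without touching the real environment.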
CLI Usage (without OpenCode)
# Start MCP server (for OpenCode integration)
opencode-vision
# Direct analysis
opencode-vision describe ~/screenshot.png
opencode-vision ocr ~/scanned-document.png
opencode-vision analyze ~/photo.jpg
# Custom prompt
opencode-vision describe ~/chart.png "What are the values in this chart?"
Architecture
Why Python?
All existing MCP vision servers for OpenCode are Node.js/TypeScript and
require npm install or npx. opencode-vision is pure Python because:
- Python is already installed on most developer machines
- pillow (PIL) is the standard image processing library
- PaddleOCR is the best open-source OCR engine available
- The MCP protocol is simple JSON-RPC over stdio, so no framework is needed
- Zero node_modules, zero npm, zero npx
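To illustrate the "JSON-RPC over stdio" point, here is a minimal sketch of a line-delimited request loop using only the standard library. It is not the code in mcp.py or server.py: `run_ocr` is a hypothetical stand-in for the OCR engine, only one tool is registered, and the `initialize` handshake that a complete MCP server must answer is omitted.

```python
import json
import sys
from typing import Optional

# One tool, described in the shape a tools/list result uses.
TOOLS = [{
    "name": "vision_ocr",
    "description": "Extract all visible text",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]


def run_ocr(path: str) -> str:
    """Hypothetical stand-in for the real OCR engine."""
    return f"(text extracted from {path})"


def error(request_id, code: int, message: str) -> dict:
    """Build a JSON-RPC error response."""
    return {"jsonrpc": "2.0", "id": request_id, "error": {"code": code, "message": message}}


def handle(request: dict) -> Optional[dict]:
    """Map one request to a response; notifications (no id) get no reply."""
    if "id" not in request:
        return None
    method = request.get("method")
    params = request.get("params") or {}
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        if params.get("name") != "vision_ocr":
            return error(request["id"], -32602, "Unknown tool")
        path = (params.get("arguments") or {}).get("path", "")
        result = {"content": [{"type": "text", "text": run_ocr(path)}]}
    else:
        return error(request["id"], -32601, f"Method not found: {method}")
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


def serve(stdin=None, stdout=None) -> None:
    """Read one JSON message per line and write one JSON response per line."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    for line in stdin:
        if not line.strip():
            continue
        response = handle(json.loads(line))
        if response is not None:
            stdout.write(json.dumps(response) + "\n")
            stdout.flush()
```

Calling `serve()` with no arguments would attach the loop to the process's own stdin and stdout, which is how a client such as OpenCode talks to a local MCP server.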
Modular Design (v2.0)
opencode-vision/
├── opencode_vision/
│   ├── __init__.py     # Package metadata
│   ├── __main__.py     # CLI entry point
│   ├── server.py       # MCP server (thin router)
│   ├── mcp.py          # MCP transport protocol
│   ├── ocr.py          # OCR engine (PaddleOCR + Gemini fallback)
│   ├── gemini.py       # Gemini Vision API client
│   └── image.py        # Image processing utilities
├── pyproject.toml
└── README.md
OCR Strategy
                  ┌──────────────────────────────────┐
                  │ PaddleOCR (PP-OCRv5)             │
                  │ • Deep learning OCR              │
User image ──────►│ • 0 errors on invoice benchmark  │───► conf ≥ 70% ──► Return text
                  │ • 100+ languages                 │
                  └─────────┬────────────────────────┘
                            │ conf < 70% / error
                            ▼
                  ┌──────────────────────────────────┐
                  │ Gemini 2.5 Flash Vision          │
                  │ • Handwriting / scene            │───► Return text
                  │ • 1,500 free req/day             │
                  └──────────────────────────────────┘
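The routing in this diagram can be sketched as a small function. This is an illustration under stated assumptions, not the contents of ocr.py: `ocr_with_fallback` and `OcrResult` are hypothetical names, the two engines are passed in as plain callables, and taking the mean of per-line scores as the overall confidence is my assumption about how a single confidence value might be derived.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.70  # the 70% cut-off from the diagram


@dataclass
class OcrResult:
    text: str
    confidence: Optional[float]  # None when the fallback engine was used
    engine: str


def ocr_with_fallback(
    path: str,
    local_ocr: Callable[[str], List[Tuple[str, float]]],
    remote_ocr: Callable[[str], str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> OcrResult:
    """Try the local engine first; fall back on low confidence or any error.

    local_ocr returns (line_text, score) pairs with scores in [0, 1].
    remote_ocr returns the recognized text as one string.
    """
    try:
        lines = local_ocr(path)
        if lines:
            confidence = sum(score for _, score in lines) / len(lines)
            if confidence >= threshold:
                text = "\n".join(line for line, _ in lines)
                return OcrResult(text, confidence, "paddleocr")
    except Exception:
        pass  # a local failure is treated the same as low confidence
    return OcrResult(remote_ocr(path), None, "gemini")
```

Passing the engines as arguments keeps the routing testable without PaddleOCR installed or a network connection.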
Cost: $0
- Gemini 2.5 Flash: 1,500 free requests/day via Google AI Studio API key
- PaddleOCR: free and open-source (Apache 2.0)
- Pillow: free and local for metadata
- No OpenCode Go credits consumed: the API call happens in the vision server, not through OpenCode's model proxy
Comparison with Alternatives
| Feature | opencode-vision v2 | opencode-vision v1 | opencode-minimax-easy-vision | qwen-vision-mcp |
|---|---|---|---|---|
| Runtime | Python (stdlib) | Python (stdlib) | Node.js + npm | Node.js + npm |
| OCR engine | PaddleOCR (SOTA) | Tesseract (legacy) | None (API only) | None (API only) |
| OCR accuracy | 4.5% CER | ~18% CER | N/A | N/A |
| Handwriting | Gemini Vision API | Not supported | | |
| Dependencies | `pip install opencode-vision[paddle]` | `pip install opencode-vision` | `npm install` | `npx` |
| API cost | $0 (Gemini free tier) | $0 | MiniMax pricing | $0 (local) |
| Auto .env | Reads ~/.config/opencode/.env | | Manual env vars | |
| Image resize | Pillow auto-resize | Pillow | | |
| Install size | ~200 KB + optional 15MB model | ~200 KB | ~30 MB | ~30 MB |
Why "Model-Agnostic"?
The key architectural insight: the model never needs to see pixels. The MCP server does all the visual processing externally and returns text. This means:
- Works with any text-only model (big-pickle, DeepSeek, MiMo, MiniMax, etc.)
- Works with any multimodal model too (it doesn't interfere)
- No model-specific configuration
- No provider-specific setup
- The model can be changed at any time without reconfiguring vision
License
MIT
Built with ❤️ by Nicolás Ríos Herrera for the OpenCode community.