Model-agnostic image analysis via MCP for OpenCode text-only models. Vision proxy server using Google Gemini + tesseract.


opencode-vision 👁️

Model-agnostic image analysis via MCP for OpenCode.

Give vision capabilities to any text-only model: big-pickle, DeepSeek, MiMo, MiniMax, or any other model that can't process images natively.

pip install opencode-vision

The Problem

OpenCode supports many models, but most open-weight and free models are text-only. When you paste an image or try to read() one, you get:

ERROR: Cannot read image (this model does not support image input).

This is not a configuration issue; it's a fundamental limitation of the model architecture. A text-only model simply has no pathway for visual input.

The Solution

opencode-vision is an MCP server that acts as a "guide dog" for text-only models. It runs as an independent process and handles image analysis via:

  1. Google Gemini Vision API (FREE tier, 1,500 requests/day)
  2. Local tesseract OCR (fast, private, works offline)

It returns plain text descriptions that any model can understand; no vision capabilities are needed on the model's side.

                    ┌──────────────────────┐
                    │  opencode-vision     │
  [big-pickle] ────►│  MCP Server          │────► Google Gemini API
  [DeepSeek]   ────►│  (Python process)    │────► tesseract OCR (local)
  [MiMo]       ────►│  Returns TEXT only   │
                    └──────────────────────┘

Quick Start

1. Install

pip install opencode-vision

2. Get a Gemini API key

Get a free key at aistudio.google.com (1,500 requests/day, no credit card required).

Set it in ~/.config/opencode/.env:

echo 'GOOGLE_API_KEY=your_key_here' >> ~/.config/opencode/.env

Or export it directly:

export GOOGLE_API_KEY=your_key_here

3. Add to OpenCode config

Add this to ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "vision": {
      "type": "local",
      "command": ["python3", "-m", "opencode_vision.server"],
      "enabled": true,
      "timeout": 30000
    }
  }
}

4. Restart OpenCode

Start a new session. The vision_describe, vision_ocr, and vision_analyze tools will be available to all models โ€” even text-only ones.

5. Ask about images

User: What's in this image?
Model: [calls vision_describe("/path/to/image.png")]
       "A dark gradient banner with 'Nicolรกs Rรญos Herrera'..."

Tools

Tool                            Description                    When to use
vision_describe(path, prompt?)  Describe an image in detail    "What does this show?"
vision_ocr(path)                Extract all visible text       "What text is in this screenshot?"
vision_analyze(path)            Metadata + description + OCR   Comprehensive understanding

Requirements

Component        Required?       Notes
Python >= 3.10   ✅ Required
GOOGLE_API_KEY   ✅ Required     Get a free key at aistudio.google.com
pillow           📦 Recommended  pip install pillow (metadata + auto-resize)
tesseract-ocr    🔧 Recommended  Local OCR: apt install tesseract-ocr or brew install tesseract
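The auto-resize that pillow enables can be sketched in a few lines. This is an illustration of the general technique, not the package's actual code; the function name and size limit are assumptions:

```python
from io import BytesIO

from PIL import Image

MAX_DIM = 2048  # illustrative cap; oversized images waste API bandwidth


def resize_for_api(path: str) -> bytes:
    """Downscale so the longest side fits MAX_DIM; return PNG bytes."""
    with Image.open(path) as img:
        img.thumbnail((MAX_DIM, MAX_DIM))  # in-place, preserves aspect ratio
        buf = BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()
```

`Image.thumbnail` never upscales, so small screenshots pass through untouched.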

The server auto-detects the API key from (in order):

  1. GOOGLE_API_KEY environment variable
  2. GOOGLE_GENERATIVE_AI_API_KEY environment variable
  3. ~/.config/opencode/.env file
  4. ~/.env file
  5. $PWD/.env file

CLI Usage (without OpenCode)

# Start MCP server (for OpenCode integration)
opencode-vision

# Direct analysis
opencode-vision describe ~/screenshot.png
opencode-vision ocr ~/scanned-document.png
opencode-vision analyze ~/photo.jpg

# Custom prompt
opencode-vision describe ~/chart.png "What are the values in this chart?"

Architecture

Why Python?

The existing MCP vision servers for OpenCode are Node.js/TypeScript and require npm install or npx. opencode-vision is pure Python because:

  • Python is already installed on virtually every developer machine
  • pillow (PIL) is the standard Python image-processing library
  • tesseract has first-class Python bindings
  • Zero node_modules, zero npm, zero npx
  • The MCP protocol is simple JSON-RPC over stdio; no framework needed
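That last point is easy to demonstrate. A toy newline-delimited JSON-RPC loop over stdio looks like this; it is a minimal sketch of the transport, not the server's actual implementation, and it omits MCP's initialization handshake and tool schema:

```python
import json
import sys


def handle(method: str, params: dict) -> dict:
    """Dispatch a request; a real server would route to vision tools here."""
    if method == "ping":
        return {"ok": True}
    raise ValueError(f"unknown method: {method}")


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read one JSON-RPC request per line, write one response per line."""
    for line in stdin:
        req = json.loads(line)
        try:
            resp = {"jsonrpc": "2.0", "id": req["id"],
                    "result": handle(req["method"], req.get("params", {}))}
        except Exception as exc:
            resp = {"jsonrpc": "2.0", "id": req.get("id"),
                    "error": {"code": -32601, "message": str(exc)}}
        stdout.write(json.dumps(resp) + "\n")
        stdout.flush()
```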

Hybrid OCR Strategy

                    ┌──────────────┐
  User image ──────►│ tesseract    │───► Text found? ──► Return
                    │ (local,      │
                    │  private,    │
                    │  offline)    │
                    └──────┬───────┘
                           │ No text / failed
                           ▼
                    ┌──────────────┐
                    │ Gemini       │───► Return
                    │ Vision API   │
                    │ (FREE tier)  │
                    └──────────────┘
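A hedged sketch of that routing in Python; `run_tesseract` and `ask_gemini` are illustrative names standing in for the server's real helpers:

```python
import shutil
import subprocess


def run_tesseract(path: str) -> str:
    """Run local OCR; return extracted text, or "" if unavailable or empty."""
    if shutil.which("tesseract") is None:
        return ""
    try:
        result = subprocess.run(
            ["tesseract", path, "stdout"],
            capture_output=True, text=True, timeout=30, check=True,
        )
        return result.stdout.strip()
    except (subprocess.SubprocessError, OSError):
        return ""


def ocr(path: str, ask_gemini) -> str:
    """Try local tesseract first; fall back to the Gemini API otherwise."""
    text = run_tesseract(path)
    return text if text else ask_gemini(path)
```

The fallback fires both when tesseract is missing and when it finds no text, so the private, offline path is always tried first.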

Cost: $0

  • Gemini 2.5 Flash: 1,500 free requests/day via Google AI Studio API key
  • tesseract: free and local for OCR
  • Pillow: free and local for metadata
  • No OpenCode Go credits consumed: the API call happens in the vision server, not through OpenCode's model proxy

Comparison with Alternatives

Feature        opencode-vision                    opencode-minimax-easy-vision   qwen-vision-mcp              opencode-image-proxy
Runtime        Python (stdlib)                    Node.js + npm                  Node.js + npm                Node.js
Dependencies   pip install opencode-vision        npm install + MiniMax API key  npx + Ollama (6 GB RAM)      npx
OCR            Local tesseract + Gemini fallback  None (API only)                None (API only)              None (API only)
API cost       $0 (Gemini FREE tier)              MiniMax pricing                $0 (local) or Ollama Cloud   OpenCode credits
Auto .env      ✓ Reads ~/.config/opencode/.env    ✗ Manual env vars              ✗ Manual env vars            ✗ Manual config
Image resize   ✓ Pillow auto-resize               ✗                              ✗                            ✗
Install size   ~200 KB (pure Python)              ~30 MB (node_modules)          ~30 MB + ~6 GB Ollama model  ~30 MB

Why "Model-Agnostic"?

The key architectural insight: the model never needs to see pixels. The MCP server does all the visual processing externally and returns text. This means:

  • Works with any text-only model (big-pickle, DeepSeek, MiMo, MiniMax, GLM, etc.)
  • Works with any multimodal model too (it doesn't interfere)
  • No model-specific configuration
  • No provider-specific setup
  • The model can be swapped at any time without reconfiguring vision

License

MIT


Built with ❤️ by Nicolás Ríos Herrera for the OpenCode community.
