MCP server that gives vision and image generation capabilities to text-only LLMs, running 100% locally
blind-vision-mcp
Give vision to any text-only LLM — 100% local, no API costs, your privacy intact.
Why this exists
I use DeepSeek v4 Flash — an incredible text model. But it's blind. It can't see screenshots, images, or UI layouts.
I was tired of:
- Paying $20-200/month for vision-capable APIs (GPT-4 Vision, Claude)
- Sending sensitive screenshots to the cloud
- Context switching between coding and describing images manually
So I built blind-vision-mcp: an MCP server that sits between your text LLM and your desktop, letting it "see" through an on-device vision model — Google's Gemma 4 E2B running via LiteRT.
My specific use case
I control an Android emulator that takes screenshots of the device. DeepSeek v4 Flash reads those screenshots via blind-vision-mcp and tells the emulator what to do next. It works like this:
Emulator takes screenshot → blind-vision-mcp analyzes it with Gemma 4 →
DeepSeek reads the description → decides next action → ADB command
All of this happens locally, privately, and without paying per-token API fees.
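The loop above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `screencap_cmd`, `tap_cmd`, `capture_screenshot`, and `run_step` are hypothetical helper names, and the `describe` and `decide` callables stand in for the `vision_describe` tool and the text LLM. Only the `adb exec-out screencap -p` and `adb shell input tap` invocations are standard adb usage.

```python
import subprocess
from typing import Callable, List, Optional


def _adb(serial: Optional[str]) -> List[str]:
    """Base adb invocation, optionally pinned to one device serial."""
    return ["adb"] + (["-s", serial] if serial else [])


def screencap_cmd(serial: Optional[str] = None) -> List[str]:
    """adb command that writes the current screen as PNG bytes to stdout."""
    return _adb(serial) + ["exec-out", "screencap", "-p"]


def tap_cmd(x: int, y: int, serial: Optional[str] = None) -> List[str]:
    """adb command that taps the screen at pixel coordinates (x, y)."""
    return _adb(serial) + ["shell", "input", "tap", str(x), str(y)]


def capture_screenshot(path: str, serial: Optional[str] = None) -> str:
    """Run the screencap command and save the PNG to `path` (needs adb on PATH)."""
    png = subprocess.run(screencap_cmd(serial), check=True, capture_output=True).stdout
    with open(path, "wb") as fh:
        fh.write(png)
    return path


def run_step(
    capture: Callable[[], str],
    describe: Callable[[str], str],
    decide: Callable[[str], List[str]],
    act: Callable[[List[str]], None],
) -> List[str]:
    """One iteration: screenshot -> description -> decision -> adb command."""
    screenshot_path = capture()
    description = describe(screenshot_path)  # stand-in for vision_describe
    command = decide(description)            # stand-in for the text LLM
    act(command)                             # e.g. subprocess.run(command, check=True)
    return command
```

Passing the four stages in as callables keeps the loop itself free of I/O, so it can be exercised with fakes before pointing it at a real emulator.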
What it does
| Capability | Status | Model |
|---|---|---|
| 👁️ Image analysis | ✅ Stable | Gemma 4 E2B via LiteRT (~2.6 GB VRAM) |
| 🔄 Image comparison | ✅ Stable | Gemma 4 E2B via LiteRT |
| 🎨 Image generation | ✅ Stable | SDXL-Turbo (fp16, ~7 GB VRAM, no HF token needed) |
| ✏️ Image editing | 🧪 In development | Coming soon |
Key features
- No API keys needed for vision — runs 100% on your GPU
- ~2.6 GB VRAM for vision (not 10+ GB like other solutions)
- GPU-first — falls back to CPU if GPU fails
- Google LiteRT — same stack powering Gemini Nano on Android
- Macro-friendly — perfect for automating emulators, browsers, UIs
- Works with any MCP client — OpenCode, Claude Desktop, Cursor, Cline
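The GPU-first behaviour from the list above can be illustrated with a generic fallback pattern. This is only a sketch of the idea: how blind-vision-mcp actually selects its backend is not shown in this README, and `load_with_fallback` and the loader callables are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(loaders: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (name, loader) pair in order and return the first that succeeds.

    Raises RuntimeError listing every failure if no loader works.
    """
    errors: List[str] = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # any backend failure triggers the next candidate
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no backend could be initialised: " + "; ".join(errors))
```

A caller would list the GPU loader first and the CPU loader second, then log which backend name came back.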
Quick Start
# 1. Prerequisites
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and install
git clone https://github.com/alexjm19/blind-vision-mcp.git
cd blind-vision-mcp
uv sync
# 3. Import the vision model (one-time, downloads ~2.6 GB)
litert-lm import \
--from-huggingface-repo litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
gemma4-vision
# 4. Start the server
uv run blind-vision-mcp
For image generation/editing: Create a `.env` file with your HF token: `HF_TOKEN=hf_your_token_here`. Then accept the terms at https://huggingface.co/black-forest-labs/FLUX.1-schnell
Configuration for OpenCode
Add to your opencode.json:
{
  "mcpServers": {
    "blind-vision-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "/path/to/blind-vision-mcp",
        "blind-vision-mcp"
      ]
    }
  }
}
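If you prefer to script this step, the entry above can be merged into an existing config without clobbering other servers. A minimal sketch, assuming the config is plain JSON with the `mcpServers` key shown above; `server_entry` and `add_server` are illustrative helpers, not part of the package, and other MCP clients may expect a different key, so check your client's documentation.

```python
import json


def server_entry(project_dir: str) -> dict:
    """Launch entry for the server, mirroring the snippet in this README."""
    return {
        "command": "uv",
        "args": ["run", "--directory", project_dir, "blind-vision-mcp"],
    }


def add_server(config: dict, project_dir: str, name: str = "blind-vision-mcp") -> dict:
    """Return a copy of `config` with the server registered under `mcpServers`."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep any existing servers
    servers[name] = server_entry(project_dir)
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Prints the same structure as the hand-written config above.
    print(json.dumps(add_server({}, "/path/to/blind-vision-mcp"), indent=2))
```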
Usage Examples
# Analyze a screenshot (perfect for emulator control)
vision_describe(image="/path/to/screenshot.png")
# Compare before/after
vision_compare(image_a="/path/to/before.png", image_b="/path/to/after.png")
# Generate an image
image_generate(description="a beautiful landscape")
# Check server status
get_status()
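The calls above are what the LLM issues through its MCP client. On the wire, MCP frames a tool invocation as a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for illustration: the tool and argument names come from the examples above, `tool_call` is a hypothetical helper, and a real client also performs the `initialize` handshake before calling tools.

```python
import json


def tool_call(request_id: int, name: str, **arguments) -> str:
    """Serialize an MCP `tools/call` request as a single-line JSON-RPC message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, which suits the
    # newline-delimited framing used by the stdio transport.
    return json.dumps(request)


if __name__ == "__main__":
    print(tool_call(1, "vision_describe", image="/path/to/screenshot.png"))
```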
How vision works (the cool part)
┌─────────────────────────────────────────────────────────┐
│  DeepSeek v4 Flash (text-only)                          │
│  "What's on the screen? → vision_describe(screenshot)"  │
└────────────────────────┬────────────────────────────────┘
                         │ MCP protocol (stdin/stdout)
┌────────────────────────▼────────────────────────────────┐
│  blind-vision-mcp server                                │
│ ┌─────────────┐ ┌───────────────┐ ┌───────────────────┐ │
│ │ tools.py    │→│ LiteRT server │→│ Gemma 4 E2B       │ │
│ │ (MCP tools) │ │ (port 9380)   │ │ (2.6 GB VRAM)     │ │
│ └─────────────┘ └───────────────┘ └───────────────────┘ │
└─────────────────────────────────────────────────────────┘
The vision model (Gemma 4 E2B) runs entirely on your GPU via Google's LiteRT runtime. No data ever leaves your machine. The model is pre-quantized (mixed 2/4/8-bit) and loads directly at ~2.6 GB — no "load BF16 first then quantize" memory spike.
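A quick way to check that the local vision backend is up is to probe its port. A small sketch, assuming the LiteRT server listens on 127.0.0.1:9380 (the port comes from the diagram; the bind address is an assumption, and `is_listening` is a hypothetical helper):

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 9380, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("vision backend reachable:", is_listening())
```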
Why not just use a vision LLM?
| Solution | Cost | Privacy | VRAM | Quality |
|---|---|---|---|---|
| GPT-4 Vision | $10-20/mo | ❌ Cloud | N/A | Excellent |
| Claude Vision | $20/mo | ❌ Cloud | N/A | Excellent |
| Qwen2-VL-7B (local) | Free | ✅ Local | ~10 GB | Good |
| blind-vision-mcp | Free | ✅ Local | ~2.6 GB | Great |
Requirements
| Component | Minimum |
|---|---|
| GPU | NVIDIA ≥8 GB VRAM |
| RAM | 16 GB |
| Storage | 5 GB free for the vision model, plus 7 GB for the generation model |
| CUDA | 12.x |
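As a rough budgeting aid, the VRAM figures quoted in this README can be checked against a given card. The sketch below only does the arithmetic: the 2.6 GB and 7 GB values are taken from the tables above, `fits` is a hypothetical helper, and whether the server ever keeps both models resident at the same time is not stated here.

```python
VISION_GB = 2.6      # Gemma 4 E2B via LiteRT, as quoted above
GENERATION_GB = 7.0  # SDXL-Turbo fp16, as quoted above


def fits(vram_gb: float, *model_gb: float, headroom_gb: float = 0.0) -> bool:
    """True if the listed models plus optional headroom fit in `vram_gb`."""
    return sum(model_gb) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for card in (8, 12):
        print(
            f"{card} GB card:",
            "vision" if fits(card, VISION_GB) else "-",
            "generation" if fits(card, GENERATION_GB) else "-",
            "both at once" if fits(card, VISION_GB, GENERATION_GB) else "not both at once",
        )
```

By these numbers, an 8 GB card has room for either model on its own, while holding both simultaneously would need roughly 9.6 GB.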
Project Status
- Vision: ✅ Stable and tested
- Image generation: ✅ Stable (SDXL-Turbo, pure GPU, no offload)
- Image editing: 🧪 In development
- Version: 0.2.0 — API may change
License
MIT — see LICENSE.
Support
If this saves you from another API bill, ⭐ star the repo. It helps others find local-first AI tools.
Built with ❤️ by alexjm19