MCP server that gives vision and image generation capabilities to text-only LLMs, running 100% locally

blind-vision-mcp

Give vision to any text-only LLM — 100% local, no API costs, your privacy intact.


Why this exists

I use DeepSeek v4 Flash — an incredible text model. But it's blind. It can't see screenshots, images, or UI layouts.

I was tired of:

  • Paying $20-200/month for vision-capable APIs (GPT-4 Vision, Claude)
  • Sending sensitive screenshots to the cloud
  • Context switching between coding and describing images manually

So I built blind-vision-mcp: an MCP server that sits between your text LLM and your desktop, letting it "see" through an on-device vision model — Google's Gemma 4 E2B running via LiteRT.

My specific use case

I drive an Android emulator from screenshots. DeepSeek v4 Flash cannot see them directly, so blind-vision-mcp turns each screenshot into a text description, and DeepSeek uses that description to choose the emulator's next action. It works like this:

Emulator takes screenshot → blind-vision-mcp analyzes it with Gemma 4 → 
DeepSeek reads the description → decides next action → ADB command

All of this happens locally, privately, and without paying per-token API fees.
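
The loop above can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's actual script: `describe` and `decide` are hypothetical callables standing in for the vision_describe tool call and for the text LLM, and the next action is assumed to be a simple tap. The only external commands used are `adb exec-out screencap -p` and `adb shell input tap`.

```python
import subprocess
from typing import Callable, Optional, Tuple


def screencap_cmd(serial: Optional[str] = None) -> list[str]:
    """Build the adb command that writes a PNG screenshot to stdout."""
    base = ["adb"] + (["-s", serial] if serial else [])
    return base + ["exec-out", "screencap", "-p"]


def tap_cmd(x: int, y: int, serial: Optional[str] = None) -> list[str]:
    """Build the adb command that taps the screen at (x, y)."""
    base = ["adb"] + (["-s", serial] if serial else [])
    return base + ["shell", "input", "tap", str(x), str(y)]


def step(
    describe: Callable[[str], str],            # hypothetical: wraps vision_describe
    decide: Callable[[str], Tuple[int, int]],  # hypothetical: wraps the text LLM
    path: str = "screen.png",
    run=subprocess.run,                        # injectable so the loop can be tested
) -> list[str]:
    """One iteration: screenshot -> description -> decision -> tap."""
    shot = run(screencap_cmd(), capture_output=True, check=True)
    with open(path, "wb") as fh:
        fh.write(shot.stdout)
    description = describe(path)
    x, y = decide(description)
    cmd = tap_cmd(x, y)
    run(cmd, check=True)
    return cmd
```

Passing `run` as a parameter keeps the sketch testable without a connected device; in real use the default `subprocess.run` is enough.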


What it does

Capability | Status | Model
👁️ Image analysis | Stable | Gemma 4 E2B via LiteRT (~2.6 GB VRAM)
🔄 Image comparison | Stable | Gemma 4 E2B via LiteRT
🎨 Image generation | Stable | SDXL-Turbo (fp16, ~7 GB VRAM, no HF token needed)
✏️ Image editing | 🧪 In development | Coming soon

Key features

  • No API keys needed for vision: everything runs on your own machine
  • ~2.6 GB VRAM for vision (versus ~10 GB for a local Qwen2-VL-7B)
  • GPU-first, with a fallback to CPU if the GPU fails
  • Google LiteRT — same stack powering Gemini Nano on Android
  • Macro-friendly — perfect for automating emulators, browsers, UIs
  • Works with any MCP client — OpenCode, Claude Desktop, Cursor, Cline
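
The "GPU-first, CPU fallback" behaviour in the list above follows a common pattern, which might look roughly like this. It is a generic sketch, not the project's code: `load` is a hypothetical loader callable, and the backend names "gpu" and "cpu" are placeholders.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    load: Callable[[str], T],                 # hypothetical: loads the model on a backend
    backends: Sequence[str] = ("gpu", "cpu"),  # tried in order, GPU first
) -> Tuple[str, T]:
    """Return (backend, model) for the first backend that loads successfully."""
    errors = []
    for backend in backends:
        try:
            return backend, load(backend)
        except Exception as exc:  # e.g. missing driver or out-of-memory
            errors.append(f"{backend}: {exc}")
    raise RuntimeError("no backend could load the model: " + "; ".join(errors))
```

Collecting every failure before raising means the final error says why the GPU attempt failed, not only that the CPU attempt did.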

Quick Start

# 1. Prerequisites
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and install
git clone https://github.com/alexjm19/blind-vision-mcp.git
cd blind-vision-mcp
uv sync

# 3. Import the vision model (one-time, downloads ~2.6 GB)
litert-lm import \
  --from-huggingface-repo litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  gemma4-vision

# 4. Start the server
uv run blind-vision-mcp

For image generation/editing: Create a .env file with your HF token:

HF_TOKEN=hf_your_token_here

Then accept terms at https://huggingface.co/black-forest-labs/FLUX.1-schnell


Configuration for OpenCode

Add to your opencode.json:

{
  "mcpServers": {
    "blind-vision-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "/path/to/blind-vision-mcp",
        "blind-vision-mcp"
      ]
    }
  }
}

Usage Examples

# Analyze a screenshot (perfect for emulator control)
vision_describe(image="/path/to/screenshot.png")

# Compare before/after
vision_compare(image_a="/path/to/before.png", image_b="/path/to/after.png")

# Generate an image
image_generate(description="a beautiful landscape")

# Check server status
get_status()
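
The calls above are normally issued by your MCP client. To try a tool from a script instead, something like the following should work with the official `mcp` Python SDK (`pip install mcp`). Treat it as a sketch: the tool name and the `image` argument come from the examples above, but the assumption that the result arrives as text content blocks has not been checked against this server.

```python
def server_args(project_dir: str) -> list[str]:
    """Arguments for `uv`, mirroring the OpenCode configuration above."""
    return ["run", "--directory", project_dir, "blind-vision-mcp"]


async def describe(project_dir: str, image_path: str) -> str:
    """Start the server over stdio, call vision_describe, return its text."""
    # Imported lazily so server_args() is usable without the `mcp` package.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command="uv", args=server_args(project_dir))
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("vision_describe", {"image": image_path})
            # Assumption: the tool answers with one or more text content blocks.
            return "\n".join(b.text for b in result.content if b.type == "text")


# Usage (needs the server installed and the vision model imported):
#   import asyncio
#   print(asyncio.run(describe("/path/to/blind-vision-mcp", "/path/to/screenshot.png")))
```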

How vision works (the cool part)

┌─────────────────────────────────────────────────────────┐
│  DeepSeek v4 Flash (text-only)                          │
│  "What's on the screen? → vision_describe(screenshot)"  │
└────────────────────────┬────────────────────────────────┘
                         │ MCP protocol (stdin/stdout)
┌────────────────────────▼────────────────────────────────┐
│  blind-vision-mcp server                                 │
│  ┌────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ tools.py    │→│ LiteRT server │→│ Gemma 4 E2B     │  │
│  │ (MCP tools) │  │ (port 9380)  │  │ (2.6 GB VRAM)   │  │
│  └────────────┘  └──────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────┘

The vision model (Gemma 4 E2B) runs entirely on your GPU via Google's LiteRT runtime. No data ever leaves your machine. The model is pre-quantized (mixed 2/4/8-bit) and loads directly at ~2.6 GB — no "load BF16 first then quantize" memory spike.
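
When a vision call hangs or fails, one quick check is whether the LiteRT server from the diagram is accepting connections. The sketch below assumes the server listens on localhost at port 9380, as the diagram suggests; the host is an assumption, and the check says nothing about whether the model itself has loaded.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 9380, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, unreachable host, ...
        return False


if __name__ == "__main__":
    print("LiteRT server reachable:", is_listening())
```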


Why not just use a vision LLM?

Solution | Cost | Privacy | VRAM | Quality
GPT-4 Vision | $10-20/mo | ❌ Cloud | N/A | Excellent
Claude Vision | $20/mo | ❌ Cloud | N/A | Excellent
Qwen2-VL-7B (local) | Free | ✅ Local | ~10 GB | Good
blind-vision-mcp | Free | ✅ Local | ~2.6 GB | Great

Requirements

Component | Minimum
GPU | NVIDIA, ≥8 GB VRAM
RAM | 16 GB
Storage | 5 GB free for vision model + 7 GB for gen model
CUDA | 12.x
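
A quick way to check a machine against the GPU and storage minimums above is to ask `nvidia-smi` for total VRAM and the filesystem for free space. This is a sketch, not part of the project. The thresholds are assumptions derived from the table: 12 GB of free disk is the sum of the two storage figures, and 8000 MiB is used for VRAM because cards sold as 8 GB can report slightly under 8192 MiB.

```python
import shutil
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_mib(out)


def meets_minimum(
    vram_mib: list[int],
    free_bytes: int,
    min_vram_mib: int = 8000,           # assumption: nominal 8 GB card
    min_free_bytes: int = 12 * 10**9,   # assumption: 5 GB vision + 7 GB generation
) -> bool:
    """True if at least one GPU and the free disk space meet the minimums."""
    return bool(vram_mib) and max(vram_mib) >= min_vram_mib and free_bytes >= min_free_bytes


if __name__ == "__main__":
    free = shutil.disk_usage(".").free
    print("Meets minimum:", meets_minimum(query_vram_mib(), free))
```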

Project Status

  • Vision: ✅ Stable and tested
  • Image generation: ✅ Stable (SDXL-Turbo, pure GPU, no offload)
  • Image editing: 🧪 In development
  • Version: 0.3.0 (API may change)

License

MIT — see LICENSE.


Support

If this saves you from another API bill, ⭐ star the repo. It helps others find local-first AI tools.


Built with ❤️ by alexjm19
