MCP server that gives vision and image generation capabilities to text-only LLMs, running 100% locally

blind-vision-mcp

Give vision to any text-only LLM — 100% local, no API costs, your privacy intact.


Why this exists

I use DeepSeek v4 Flash — an incredible text model. But it's blind. It can't see screenshots, images, or UI layouts.

I was tired of:

  • Paying $20-200/month for vision-capable APIs (GPT-4 Vision, Claude)
  • Sending sensitive screenshots to the cloud
  • Context switching between coding and describing images manually

So I built blind-vision-mcp: an MCP server that sits between your text LLM and your desktop, letting it "see" through an on-device vision model — Google's Gemma 4 E2B running via LiteRT.

My specific use case

I control an Android emulator by taking screenshots of its screen. DeepSeek v4 Flash reads those screenshots through blind-vision-mcp and decides what the emulator should do next. It works like this:

Emulator takes screenshot → blind-vision-mcp analyzes it with Gemma 4 → 
DeepSeek reads the description → decides next action → ADB command

All of this happens locally, privately, and without paying per-token API fees.
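
The loop above can be sketched in a few lines of Python. This is a minimal illustration and not the project's own code: `describe` stands in for a call to the `vision_describe` tool, `decide` stands in for the text LLM choosing the next step, and both are hypothetical callables you would supply. The only real commands used are `adb exec-out screencap -p` and whatever ADB action `decide` returns.

```python
import subprocess
from typing import Callable, Optional


def capture_screenshot(path: str = "screen.png") -> str:
    """Grab a PNG screenshot from the connected emulator via ADB."""
    png = subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        check=True,
        capture_output=True,
    ).stdout
    with open(path, "wb") as f:
        f.write(png)
    return path


def run_adb(action: list) -> None:
    """Run one ADB command, e.g. ["shell", "input", "tap", "540", "1200"]."""
    subprocess.run(["adb", *action], check=True)


def control_loop(
    describe: Callable[[str], str],            # hypothetical: wraps vision_describe
    decide: Callable[[str], Optional[list]],   # hypothetical: wraps the text LLM
    capture: Callable[[], str] = capture_screenshot,
    act: Callable[[list], None] = run_adb,
    max_steps: int = 10,
) -> int:
    """Screenshot -> description -> decision -> ADB command, repeated.

    Stops when `decide` returns None or after `max_steps` actions.
    Returns the number of actions executed.
    """
    steps = 0
    for _ in range(max_steps):
        description = describe(capture())
        action = decide(description)
        if action is None:
            break
        act(action)
        steps += 1
    return steps
```

Passing `capture` and `act` as parameters keeps the loop testable without a running emulator.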


What it does

| Capability | Status | Model |
|---|---|---|
| 👁️ Image analysis | Stable | Gemma 4 E2B via LiteRT (~2.6 GB VRAM) |
| 🔄 Image comparison | Stable | Gemma 4 E2B via LiteRT |
| 🎨 Image generation | 🧪 Beta | FLUX.1-schnell (needs HF token) |
| ✏️ Image editing | 🧪 Beta | FLUX Kontext Dev (needs HF token) |

Key features

  • No API keys needed for vision: it runs 100% on your own machine
  • ~2.6 GB VRAM for vision (not 10+ GB like other solutions)
  • GPU-first — falls back to CPU if GPU fails
  • Google LiteRT — same stack powering Gemini Nano on Android
  • Macro-friendly — perfect for automating emulators, browsers, UIs
  • Works with any MCP client — OpenCode, Claude Desktop, Cursor, Cline

Quick Start

# 1. Prerequisites: install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and install
git clone https://github.com/alexjm19/blind-vision-mcp.git
cd blind-vision-mcp
uv sync

# 3. Import the vision model (one-time, downloads ~2.6 GB)
litert-lm import \
  --from-huggingface-repo litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  gemma4-vision

# 4. Start the server
uv run blind-vision-mcp

For image generation/editing, create a .env file containing your Hugging Face (HF) token:

HF_TOKEN=hf_your_token_here

Then accept the model terms at https://huggingface.co/black-forest-labs/FLUX.1-schnell
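
If image generation fails at startup, a quick sanity check of the .env file can help. The sketch below is an assumption-laden helper, not part of blind-vision-mcp: `parse_env` and `has_hf_token` are hypothetical names, and the `hf_` prefix check reflects how Hugging Face user tokens currently look rather than anything the server enforces.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_hf_token(env: dict) -> bool:
    """True if HF_TOKEN looks like a real token, not the README placeholder."""
    token = env.get("HF_TOKEN", "")
    return token.startswith("hf_") and token != "hf_your_token_here"


# Usage (assumes .env sits in the current directory):
#   env = parse_env(Path(".env").read_text())
#   print("HF token found" if has_hf_token(env) else "HF_TOKEN missing or placeholder")
```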


Configuration for OpenCode

Add to your opencode.json:

{
  "mcpServers": {
    "blind-vision-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "/path/to/blind-vision-mcp",
        "blind-vision-mcp"
      ]
    }
  }
}

Usage Examples

# Analyze a screenshot (perfect for emulator control)
vision_describe(image="/path/to/screenshot.png")

# Compare before/after
vision_compare(image_a="/path/to/before.png", image_b="/path/to/after.png")

# Generate an image (beta)
image_generate(description="a beautiful landscape")

# Check server status
get_status()
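
The calls above are normally issued by your MCP client on the LLM's behalf. To try a tool by hand, something like the following should work, assuming the official `mcp` Python SDK is installed (`pip install mcp`) and that `vision_describe` returns text content. The helper names `server_args` and `describe` are made up for this sketch; the tool name and its `image` argument come from the examples above.

```python
import asyncio


def server_args(project_dir: str) -> list:
    """Arguments for `uv` that start the server from a checkout."""
    return ["run", "--directory", project_dir, "blind-vision-mcp"]


async def describe(project_dir: str, image_path: str) -> str:
    """Start the server over stdio, call vision_describe, return its text."""
    # Imported here so the module still loads when the SDK is not installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command="uv", args=server_args(project_dir))
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "vision_describe", arguments={"image": image_path}
            )
            # Assumption: the description comes back as text content blocks.
            return "\n".join(
                block.text
                for block in result.content
                if getattr(block, "type", "") == "text"
            )


# Usage:
#   text = asyncio.run(
#       describe("/path/to/blind-vision-mcp", "/path/to/screenshot.png")
#   )
#   print(text)
```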

How vision works (the cool part)

┌───────────────────────────────────────────────────────────┐
│  DeepSeek v4 Flash (text-only)                            │
│  "What's on the screen? → vision_describe(screenshot)"    │
└─────────────────────────────┬─────────────────────────────┘
                              │ MCP protocol (stdin/stdout)
┌─────────────────────────────▼─────────────────────────────┐
│  blind-vision-mcp server                                  │
│  ┌─────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │ tools.py    │ → │ LiteRT server │ → │ Gemma 4 E2B   │  │
│  │ (MCP tools) │   │ (port 9380)   │   │ (2.6 GB VRAM) │  │
│  └─────────────┘   └───────────────┘   └───────────────┘  │
└───────────────────────────────────────────────────────────┘

The vision model (Gemma 4 E2B) runs entirely on your own machine through Google's LiteRT runtime, on the GPU by default with a CPU fallback, so no data ever leaves your machine. The model is pre-quantized (mixed 2/4/8-bit) and loads directly at ~2.6 GB, which avoids the memory spike of loading BF16 weights first and quantizing them afterwards.
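
For the curious, the "MCP protocol (stdin/stdout)" arrow in the diagram carries newline-delimited JSON-RPC 2.0 messages. The sketch below shows roughly what a `vision_describe` call and its reply look like on the wire. It follows the general MCP `tools/call` shape rather than anything specific to this server, and the helper names are invented for illustration. A real session also starts with an `initialize` handshake, which is omitted here.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a single stdio line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines, so the message stays on one line.
    return json.dumps(message) + "\n"


def extract_text(response_line: str) -> str:
    """Join the text blocks of a tools/call result."""
    response = json.loads(response_line)
    blocks = response.get("result", {}).get("content", [])
    return "\n".join(b["text"] for b in blocks if b.get("type") == "text")


# What the client writes to the server's stdin:
request = tool_call_request(
    1, "vision_describe", {"image": "/path/to/screenshot.png"}
)
```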


Why not just use a vision LLM?

| Solution | Cost | Privacy | VRAM | Quality |
|---|---|---|---|---|
| GPT-4 Vision | $10-20/mo | ❌ Cloud | N/A | Excellent |
| Claude Vision | $20/mo | ❌ Cloud | N/A | Excellent |
| Qwen2-VL-7B (local) | Free | ✅ Local | ~10 GB | Good |
| blind-vision-mcp | Free | ✅ Local | ~2.6 GB | Great |

Requirements

| Component | Minimum |
|---|---|
| GPU | NVIDIA, ≥8 GB VRAM (vision) / ≥12 GB (vision + gen) |
| RAM | 16 GB |
| Storage | 5 GB free for vision model |
| CUDA | 12.x |
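
To check a machine against the GPU row before installing, a small script around `nvidia-smi` is enough. This is a rough sketch, not something shipped with the project: the function names are hypothetical, and VRAM is rounded to the nearest GiB because cards sold as "12 GB" often report slightly less (for example 12282 MiB).

```python
import subprocess


def parse_vram_mib(output: str) -> list:
    """Parse nvidia-smi output: one total-memory value (MiB) per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def query_vram_mib() -> list:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_vram_mib(out)


def supported_modes(vram_mib: int) -> list:
    """Map VRAM to the README's minimums: 8 GB vision, 12 GB vision + gen."""
    gib = round(vram_mib / 1024)  # tolerate cards that report just under N GiB
    modes = []
    if gib >= 8:
        modes.append("vision")
    if gib >= 12:
        modes.append("vision + generation")
    return modes


# Usage (requires an NVIDIA driver with nvidia-smi on PATH):
#   for i, mib in enumerate(query_vram_mib()):
#       print(f"GPU {i}: {mib} MiB -> {supported_modes(mib) or 'below minimum'}")
```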

Project Status

  • Vision: ✅ Stable and tested
  • Image generation: 🧪 In beta (needs HF token, FLUX model)
  • Image editing: 🧪 In beta
  • Version: 0.2.0 (API may change)

License

MIT — see LICENSE.


Support

If this saves you from another API bill, ⭐ star the repo. It helps others find local-first AI tools.


Built with ❤️ by alexjm19
