MCP server that gives vision and image generation capabilities to text-only LLMs, running 100% locally

Maintainers

alexjm19

blind-vision-mcp

Give vision to any text-only LLM — 100% local, no API costs, your privacy intact.

Why this exists

I use DeepSeek v4 Flash — an incredible text model. But it's blind. It can't see screenshots, images, or UI layouts.

I was tired of:

Paying $20-200/month for vision-capable APIs (GPT-4 Vision, Claude)
Sending sensitive screenshots to the cloud
Context switching between coding and describing images manually

So I built blind-vision-mcp: an MCP server that sits between your text LLM and your desktop, letting it "see" through an on-device vision model — Google's Gemma 4 E2B running via LiteRT.

My specific use case

I control an Android emulator that takes screenshots of the device. DeepSeek v4 Flash reads those screenshots via blind-vision-mcp and tells the emulator what to do next. It works like this:

Emulator takes screenshot → blind-vision-mcp analyzes it with Gemma 4 → 
DeepSeek reads the description → decides next action → ADB command

All of this happens locally, privately, and without paying per-token API fees.
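The loop above can be sketched in Python as one function per arrow. This is a minimal illustration, not the project's actual code: `run_step` and its four callables are hypothetical names, and `capture_screenshot` assumes `adb` is on your PATH with a single device or emulator attached.

```python
import subprocess
from typing import Callable


def capture_screenshot(path: str) -> str:
    """Save a PNG screenshot of the attached device or emulator via ADB."""
    # `adb exec-out screencap -p` writes the PNG bytes to stdout.
    png = subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        check=True,
        capture_output=True,
    ).stdout
    with open(path, "wb") as f:
        f.write(png)
    return path


def run_step(
    capture: Callable[[], str],
    describe: Callable[[str], str],
    decide: Callable[[str], list[str]],
    execute: Callable[[list[str]], None],
) -> list[str]:
    """One iteration: screenshot -> description -> decision -> command."""
    image_path = capture()               # emulator takes a screenshot
    description = describe(image_path)   # e.g. a vision_describe tool call
    command = decide(description)        # e.g. the text LLM picks an action
    execute(command)                     # e.g. run the ADB command
    return command
```

Passing the four stages in as callables keeps the loop testable without a device: swap in stubs for `describe` and `decide` and check which command would have been executed.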

What it does

Capability	Status	Model
👁️ Image analysis	✅ Stable	Gemma 4 E2B via LiteRT (~2.6 GB VRAM)
🔄 Image comparison	✅ Stable	Gemma 4 E2B via LiteRT
🎨 Image generation	✅ Stable	SDXL-Turbo (fp16, ~7 GB VRAM, no HF token needed)
✏️ Image editing	🧪 In development	Coming soon

Key features

No API keys needed for vision — runs 100% on your GPU
~2.6 GB VRAM for vision (not 10+ GB like other solutions)
GPU-first — falls back to CPU if GPU fails
Google LiteRT — same stack powering Gemini Nano on Android
Macro-friendly — perfect for automating emulators, browsers, UIs
Works with any MCP client — OpenCode, Claude Desktop, Cursor, Cline
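The "GPU-first, fall back to CPU" behavior can be pictured with a generic try-in-order pattern. The sketch below is an assumption about the shape of that logic, not the server's real implementation; `load_with_fallback` and the backend names `"gpu"` and `"cpu"` are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def load_with_fallback(
    loader: Callable[[str], T],
    backends: Sequence[str] = ("gpu", "cpu"),
) -> tuple[str, T]:
    """Try each backend in order and return the first one that loads."""
    errors = []
    for backend in backends:
        try:
            return backend, loader(backend)
        except Exception as exc:  # any load failure triggers the fallback
            errors.append(f"{backend}: {exc}")
    raise RuntimeError("no backend could load the model: " + "; ".join(errors))
```

Returning the backend name alongside the loaded object lets a status tool report which device is actually in use.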

Quick Start

# 1. Prerequisites
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and install
git clone https://github.com/alexjm19/blind-vision-mcp.git
cd blind-vision-mcp
uv sync

# 3. Import the vision model (one-time, downloads ~2.6 GB)
litert-lm import \
  --from-huggingface-repo litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  gemma4-vision

# 4. Start the server
uv run blind-vision-mcp

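For image generation/editing with a gated Hugging Face model: the SDXL-Turbo default listed under "What it does" needs no HF token, but this step applies if your setup uses FLUX.1-schnell. Create a .env file with your HF token:
HF_TOKEN=hf_your_token_here
Then accept the terms at https://huggingface.co/black-forest-labs/FLUX.1-schnell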
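If you want to confirm the token is picked up before starting the server, a small stdlib-only check like the one below may help. It assumes the .env file uses plain KEY=VALUE lines; `read_env_file` and `get_hf_token` are hypothetical helpers, and the server itself may load the file differently.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    file = Path(path)
    if not file.exists():
        return values
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path: str = ".env") -> Optional[str]:
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get("HF_TOKEN") or read_env_file(path).get("HF_TOKEN")
```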

Configuration for OpenCode

Add to your opencode.json:

{
  "mcpServers": {
    "blind-vision-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "/path/to/blind-vision-mcp",
        "blind-vision-mcp"
      ]
    }
  }
}

Usage Examples

# Analyze a screenshot (perfect for emulator control)
vision_describe(image="/path/to/screenshot.png")

# Compare before/after
vision_compare(image_a="/path/to/before.png", image_b="/path/to/after.png")

# Generate an image
image_generate(description="a beautiful landscape")

# Check server status
get_status()
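Your MCP client normally issues these calls for you. For the curious, each one travels as a JSON-RPC 2.0 `tools/call` request, one JSON object per line on the server's stdin. The sketch below only builds that message; `tool_call_message` is a hypothetical helper, and a real session also needs the `initialize` handshake first, which is omitted here.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per session


def tool_call_message(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as one newline-terminated line."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so one message is one line.
    return json.dumps(request) + "\n"
```

For example, `tool_call_message("vision_describe", {"image": "/path/to/screenshot.png"})` produces the request behind the first call above.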

How vision works (the cool part)

┌─────────────────────────────────────────────────────────┐
│  DeepSeek v4 Flash (text-only)                          │
│  "What's on the screen? → vision_describe(screenshot)"  │
└────────────────────────┬────────────────────────────────┘
                         │ MCP protocol (stdin/stdout)
┌────────────────────────▼────────────────────────────────┐
│  blind-vision-mcp server                                 │
│  ┌────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ tools.py    │→│ LiteRT server │→│ Gemma 4 E2B     │  │
│  │ (MCP tools) │  │ (port 9380)  │  │ (2.6 GB VRAM)   │  │
│  └────────────┘  └──────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────┘

The vision model (Gemma 4 E2B) runs entirely on your GPU via Google's LiteRT runtime. No data ever leaves your machine. The model is pre-quantized (mixed 2/4/8-bit) and loads directly at ~2.6 GB — no "load BF16 first then quantize" memory spike.
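When a vision call hangs, a quick first check is whether anything is listening on the LiteRT port shown in the diagram. This sketch assumes the server binds to 127.0.0.1:9380 and tests only TCP reachability; `is_port_open` is a hypothetical helper and says nothing about the protocol spoken on that port.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 9380,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False
```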

Why not just use a vision LLM?

Solution	Cost	Privacy	VRAM	Quality
GPT-4 Vision	$10-20/mo	❌ Cloud	N/A	Excellent
Claude Vision	$20/mo	❌ Cloud	N/A	Excellent
Qwen2-VL-7B (local)	Free	✅ Local	~10 GB VRAM	Good
blind-vision-mcp	Free	✅ Local	~2.6 GB VRAM	Great

Requirements

Component	Minimum
GPU	NVIDIA ≥8 GB VRAM
RAM	16 GB
Storage	5 GB free for vision model + 7 GB for gen model
CUDA	12.x
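To check the GPU row before installing, you can ask `nvidia-smi` for total memory per GPU. This is a rough sketch: the function names are hypothetical, and the 5% tolerance is an assumption to allow for cards whose reported total is slightly below the nominal size.

```python
import subprocess


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer MiB value per non-empty line."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Total memory of each NVIDIA GPU in MiB, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_vram_mib(out)


def meets_minimum(vram_mib: list[int], minimum_gb: float = 8.0,
                  tolerance: float = 0.05) -> bool:
    """True if any GPU has at least minimum_gb, within the tolerance."""
    threshold = minimum_gb * 1024 * (1 - tolerance)
    return any(mib >= threshold for mib in vram_mib)
```

On a machine with the driver installed, `meets_minimum(query_vram_mib())` gives a quick yes/no answer.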

Project Status

Vision: ✅ Stable and tested
Image generation: ✅ Stable (SDXL-Turbo, pure GPU, no offload)
Image editing: 🧪 In development
Version: 0.2.0 — API may change

License

MIT — see LICENSE.

Support

Star History

If this saves you from another API bill, ⭐ star the repo. It helps others find local-first AI tools.

Built with ❤️ by alexjm19

Release history

0.3.0 (this version): Jun 15, 2026
0.2.0: Jun 15, 2026

Download files

Source Distribution

blind_vision_mcp-0.3.0.tar.gz (20.8 kB), uploaded Jun 15, 2026, Source

Built Distribution

blind_vision_mcp-0.3.0-py3-none-any.whl (17.7 kB), uploaded Jun 15, 2026, Python 3

File details

Details for the file blind_vision_mcp-0.3.0.tar.gz.

File metadata

Download URL: blind_vision_mcp-0.3.0.tar.gz
Upload date: Jun 15, 2026
Size: 20.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for blind_vision_mcp-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`bac3bdb15c2bdcdc125af19ab8537d397337458ab297ea71ed01dd94127f7487`
MD5	`32d955b35695c0e1743ad9b5a8112acf`
BLAKE2b-256	`a72ebd65d26fff15d3d98488bc99ddaa1549675c99039f534874cde93395a698`

File details

Details for the file blind_vision_mcp-0.3.0-py3-none-any.whl.

File metadata

Download URL: blind_vision_mcp-0.3.0-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 17.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for blind_vision_mcp-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bb31b5f393da2b22037aaa88415519403c351e8efad3247fa769383b52985ee2`
MD5	`bcd63e42d110df6d5f74e3288c9f39a1`
BLAKE2b-256	`0fc885a07ab24b0f5cdd9d6f148f31ec5accfccb982ae55f553f41e3e4ee486a`
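To compare a downloaded file against the SHA256 digests listed above, a short `hashlib` script is enough. The helper names below are hypothetical; the file is read in chunks so memory use stays small.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, that would be `matches_sha256("blind_vision_mcp-0.3.0.tar.gz", "bac3bdb15c2bdcdc125af19ab8537d397337458ab297ea71ed01dd94127f7487")`, assuming the file is in the current directory.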
