Universal vision tools for AI agents via Model Context Protocol
agent-vision-mcp
Give MCP-compatible AI agents image analysis, metadata inspection, cropping, OCR, and image comparison through any OpenAI-compatible vision model.
Features
- Analyze screenshots, charts, documents, UI, objects, and general images.
- Inspect image dimensions and metadata without calling a model.
- Crop and zoom into regions using normalized coordinates.
- Extract visible text with a VLM or an optional dedicated OCR model.
- Compare two to four images.
- Accept public URLs, local files, data URLs, and Base64 images.
- Run locally over the standard MCP stdio transport.
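To make the normalized-coordinate cropping concrete, here is a minimal sketch of how a normalized region could map to a pixel box. It assumes that "normalized" means fractions of the image size in the range 0 to 1 with the origin at the top-left corner; the README does not spell out the server's convention, and `PixelBox` and `normalized_to_pixels` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """Pixel-space crop box (hypothetical helper type, not part of the package)."""
    left: int
    top: int
    right: int
    bottom: int


def normalized_to_pixels(x: float, y: float, width: float, height: float,
                         image_width: int, image_height: int) -> PixelBox:
    """Map a normalized region to pixel coordinates.

    Assumption: all four values are fractions of the image size in [0, 1],
    measured from the top-left corner.
    """
    for name, value in (("x", x), ("y", y), ("width", width), ("height", height)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    if width == 0 or height == 0:
        raise ValueError("width and height must be greater than zero")

    left = round(x * image_width)
    top = round(y * image_height)
    # Clamp the far edges so the box never extends past the image.
    right = min(image_width, round((x + width) * image_width))
    bottom = min(image_height, round((y + height) * image_height))
    return PixelBox(left, top, right, bottom)
```

For a 1920x1080 image, `normalized_to_pixels(0.25, 0.5, 0.5, 0.25, 1920, 1080)` gives the box `(480, 540, 1440, 810)` under these assumptions.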
Claude Code
Requirements
- Python 3.10 or newer
- uv
- An OpenAI-compatible vision API endpoint and API key
uvx downloads the published package from PyPI into an isolated environment
and runs it. It does not use the source code in your current directory and
does not permanently install the package into your system Python.
Add To Claude Code
The command below configures Claude Code to start agent-vision-mcp from PyPI:
claude mcp add --scope user agent-vision \
--env UV_DEFAULT_INDEX=https://pypi.org/simple \
VISION_API_KEY="your-api-key" \
VISION_BASE_URL="https://your-provider.example/v1" \
VISION_MODEL_ID="your-vision-model" \
-- uvx agent-vision-mcp
Use UV_DEFAULT_INDEX=https://pypi.org/simple when your local PyPI mirror has
not synchronized the latest release.
Verify the connection:
claude mcp get agent-vision
claude mcp list
Then start Claude Code and ask:
Use vision_capabilities to show the available vision tools.
Analyze a local image:
Use vision_inspect on /data/example.png, then use vision_analyze to describe it.
By default, local image access is limited to /data and /tmp. Add another
directory with:
claude mcp remove --scope user agent-vision
claude mcp add --scope user agent-vision \
--env UV_DEFAULT_INDEX=https://pypi.org/simple \
VISION_API_KEY="your-api-key" \
VISION_BASE_URL="https://your-provider.example/v1" \
VISION_MODEL_ID="your-vision-model" \
VISION_ALLOWED_PATHS="/data,/tmp,/home/your-user/Pictures" \
-- uvx agent-vision-mcp
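The allowlist behind VISION_ALLOWED_PATHS can be pictured with the sketch below. It is not the server's implementation: it only illustrates one common way to enforce a comma-separated directory allowlist, and `parse_allowed_paths` and `is_path_allowed` are hypothetical names.

```python
from pathlib import Path


def parse_allowed_paths(value: str) -> list[Path]:
    """Split a comma-separated list of directories and resolve each one."""
    return [Path(part.strip()).resolve() for part in value.split(",") if part.strip()]


def is_path_allowed(candidate: str, allowed_roots: list[Path]) -> bool:
    """Return True when the resolved candidate lies inside an allowed root.

    Resolving first collapses ".." segments and symlinks, so a path such as
    /data/../etc/passwd is checked as /etc/passwd.
    """
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(root) for root in allowed_roots)
```

Comparing resolved paths component by component, rather than with a string prefix check, keeps a directory such as /database from being accepted just because its name starts with /data.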
Dedicated OCR Model
Without dedicated OCR configuration, vision_extract_text uses the configured
vision model. To use a separate OCR model:
claude mcp add --scope user agent-vision \
--env UV_DEFAULT_INDEX=https://pypi.org/simple \
VISION_API_KEY="your-vision-api-key" \
VISION_BASE_URL="https://your-provider.example/v1" \
VISION_MODEL_ID="your-vision-model" \
OCR_ENABLED=true \
OCR_API_KEY="your-ocr-api-key" \
OCR_BASE_URL="https://your-provider.example/v1" \
OCR_MODEL_ID="your-ocr-model" \
-- uvx agent-vision-mcp
Never commit real API keys to Git.
Other MCP Clients
Use this stdio configuration with MCP clients that accept JSON configuration:
{
  "mcpServers": {
    "agent-vision": {
      "command": "uvx",
      "args": ["agent-vision-mcp"],
      "env": {
        "UV_DEFAULT_INDEX": "https://pypi.org/simple",
        "VISION_API_KEY": "your-api-key",
        "VISION_BASE_URL": "https://your-provider.example/v1",
        "VISION_MODEL_ID": "your-vision-model"
      }
    }
  }
}
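One way to keep real keys out of a committed configuration file is to generate the JSON from environment variables at setup time. The sketch below builds the same structure as the configuration above; `build_mcp_config` and `render_mcp_config` are hypothetical helpers, not part of agent-vision-mcp.

```python
import json
from collections.abc import Mapping

# Variables that the configuration above always sets.
REQUIRED_VARS = ("VISION_API_KEY", "VISION_BASE_URL", "VISION_MODEL_ID")


def build_mcp_config(env: Mapping[str, str]) -> dict:
    """Build the mcpServers entry from a mapping such as os.environ."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing required variables: {', '.join(missing)}")

    server_env = {"UV_DEFAULT_INDEX": "https://pypi.org/simple"}
    server_env.update({name: env[name] for name in REQUIRED_VARS})
    return {
        "mcpServers": {
            "agent-vision": {
                "command": "uvx",
                "args": ["agent-vision-mcp"],
                "env": server_env,
            }
        }
    }


def render_mcp_config(env: Mapping[str, str]) -> str:
    """Serialize the configuration as indented JSON."""
    return json.dumps(build_mcp_config(env), indent=2)
```

Calling `render_mcp_config(os.environ)` and writing the result to the location your MCP client reads would produce a file equivalent to the one above, with the values taken from your shell.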
Tools
| Tool | Purpose |
|---|---|
| vision_analyze | Analyze an image with task-specific prompts |
| vision_inspect | Read image dimensions, format, size, and mode |
| vision_crop_analyze | Crop and analyze a normalized image region |
| vision_extract_text | Extract visible text using OCR or the VLM |
| vision_compare | Compare two to four images |
| vision_capabilities | Show server configuration and limits |
Response format
Every tool returns a JSON string. Clients must parse it (for example with
json.loads) before reading any field. All top-level keys are always present,
even when empty, so consumers can iterate the envelope without
dict.get(...) guards.
Success envelope
{
  "schema_version": "1.0",
  "ok": true,
  "tool": "vision_analyze",
  "task": "general",
  "model": "...",
  "source": null,
  "sources": [],
  "result": {},
  "warnings": [],
  "raw_model_output": null,
  "error": null
}
| Field | Type | When set |
|---|---|---|
| schema_version | string | Always. Currently "1.0". |
| ok | bool | Always. true on success, false on failure. |
| tool | string | Always. The tool name (e.g. vision_analyze). |
| task | string or null | The task argument when the tool takes one; null for vision_capabilities and vision_extract_text. |
| model | string or null | The configured model identifier (e.g. glm-4v-flash). Set even on failure when the tool knew it. |
| source | SourceMeta or null | Single-image tools. null for vision_compare and vision_capabilities. |
| sources | SourceMeta[] | vision_compare only: one entry per input image. Empty for all other tools. |
| result | object | Tool-specific (see below). null on failure. |
| warnings | string[] | Always a list (empty on success). Soft-failure notes (e.g. vision_extract_text falling back from OCR to VLM). |
| raw_model_output | object or null | Sanitized provider response when include_raw=true; null otherwise. |
| error | ErrorPayload or null | null on success. Populated on failure. |
SourceMeta fields: type (url / file / data_url / base64),
mime_type, width, height, size_bytes, source_ref (only when
include_source_ref=true; redacted to host/path for URLs or basename
for files; null for data URLs and base64).
Failure envelope
{
  "schema_version": "1.0",
  "ok": false,
  "tool": "vision_analyze",
  "task": "general",
  "model": "...",
  "source": null,
  "sources": [],
  "result": null,
  "warnings": [],
  "raw_model_output": null,
  "error": {
    "code": "INVALID_INPUT",
    "message": "Input is not a valid supported image",
    "retryable": false,
    "details": {}
  }
}
error.code values: INVALID_INPUT, IMAGE_TOO_LARGE, UNSUPPORTED_FORMAT,
SECURITY_ERROR, PROVIDER_ERROR, TIMEOUT, INTERNAL_ERROR.
retryable=true means the caller may try the same call again.
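A client can fold the two envelopes into a single helper that returns `result` on success and raises on failure. The sketch below relies only on the envelope fields documented above; `VisionToolError` and `unwrap` are hypothetical names, not part of the package.

```python
import json
from typing import Any


class VisionToolError(Exception):
    """Raised when an envelope reports ok=false (hypothetical client-side type)."""

    def __init__(self, code: str, message: str, retryable: bool, details: dict[str, Any]):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.retryable = retryable
        self.details = details


def unwrap(raw: str) -> dict[str, Any]:
    """Parse a tool response and return its result, or raise VisionToolError."""
    envelope = json.loads(raw)
    if envelope["ok"]:
        return envelope["result"]
    error = envelope["error"]
    raise VisionToolError(
        error["code"], error["message"], error["retryable"], error["details"]
    )
```

A caller can then catch `VisionToolError` and repeat the same call only when `retryable` is true, for example after a TIMEOUT.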
Per-tool result shape
| Tool | result keys |
|---|---|
| vision_analyze | summary, observations[], inferences[], uncertainties[], suggested_followups[] |
| vision_extract_text | text, blocks[], layout_preserved, unclear_segments[] |
| vision_compare | summary, differences[], same_elements[] |
| vision_crop_analyze | crop: {x, y, width, height}, summary, observations[] |
| vision_inspect | width, height, format, mime_type, mode, size_bytes, has_transparency, source_type |
| vision_capabilities | server, version, vlm_provider, ocr_provider, ocr_enabled, tools, supports, limits, task_types |
Arrays that are not yet parsed from model output are returned as empty
arrays (no fabricated structure). observations, inferences, and
differences are empty in the current release; only summary carries
the model's free-form text.
Multi-image input
vision_compare accepts 2–4 images. The envelope reports them in
sources: [SourceMeta, ...] (one entry per input, in input order).
source is null for multi-image tools. All other image tools accept a
single image and use source; sources is [].
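Because both keys are always present, a consumer can read source metadata the same way for every tool. A minimal sketch, in which `collect_sources` is a hypothetical helper operating on an already parsed envelope:

```python
def collect_sources(envelope: dict) -> list[dict]:
    """Return SourceMeta entries as a list, whichever key the tool filled in.

    Multi-image tools populate "sources"; single-image tools populate
    "source"; tools without image input leave both empty.
    """
    if envelope["sources"]:
        return list(envelope["sources"])
    if envelope["source"] is not None:
        return [envelope["source"]]
    return []
```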
Opt-in flags
- include_raw: bool = False: when true, raw_model_output contains a sanitized subset of the provider response: {model, response_metadata: {model_name, finish_reason, system_fingerprint}, usage_metadata: {input_tokens, output_tokens, total_tokens}}. HTTP headers, request IDs, signed URLs, and raw exception text are dropped before reaching the envelope. Off by default to keep responses small and to avoid leaking auth material.
- include_source_ref: bool = False: when true, source.source_ref is populated with a redacted reference: host/path for URLs (query string stripped, including signed tokens) or basename for local files. data_url and base64 inputs always return null for source_ref. Off by default to avoid leaking paths and signed URLs.
URL Handling
VISION_URL_MODE controls remote-image handling:
- auto: passes URLs through for analysis and comparison, but downloads them when inspection, cropping, or OCR requires image bytes.
- passthrough: prefers URL passthrough, except for tools that require bytes.
- download: always downloads and verifies remote images before model calls.
Downloads are streamed with byte limits, redirects are security checked, and downloaded or encoded inputs are verified as supported images.
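The mode table can be read as a small decision function. The sketch below makes two assumptions that the text above does not confirm: that the tools needing image bytes are vision_inspect, vision_crop_analyze, and vision_extract_text, and that auto and passthrough reach the same decision for a given tool, since the description gives no further detail on how they differ. `should_download` is a hypothetical name.

```python
# Assumption: tools that must have image bytes in hand before they can run.
BYTES_REQUIRED_TOOLS = frozenset(
    {"vision_inspect", "vision_crop_analyze", "vision_extract_text"}
)
VALID_MODES = ("auto", "passthrough", "download")


def should_download(mode: str, tool: str) -> bool:
    """Decide whether a remote image is downloaded before the tool runs."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown VISION_URL_MODE: {mode!r}")
    if mode == "download":
        return True
    # auto and passthrough: hand the URL to the model unless the tool
    # needs the bytes locally.
    return tool in BYTES_REQUIRED_TOOLS
```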
Troubleshooting
If Claude Code cannot find the PyPI package:
UV_DEFAULT_INDEX=https://pypi.org/simple uvx --refresh agent-vision-mcp
If the MCP server does not connect:
claude mcp get agent-vision
uvx agent-vision-mcp
If you change the Claude Code configuration:
claude mcp remove --scope user agent-vision
Then add it again with the updated values.
Development
git clone https://github.com/idealizing/agent-vision-mcp.git
cd agent-vision-mcp
python -m venv .venv
.venv/bin/pip install -e ".[dev]"
cp .env.example .env
.venv/bin/python -m unittest discover -s tests -v
License
MIT