
Moondream 3 vision processor plugin for Vision Agents


Moondream Plugin

This plugin provides Moondream 3 vision capabilities for vision-agents, including:

  • Object Detection: Real-time zero-shot object detection on video streams
  • Visual Question Answering (VQA): Answer questions about video frames
  • Image Captioning: Generate descriptions of video frames

Choose between cloud-hosted and local processing depending on your needs. For local processing, we recommend CUDA-enabled devices.

Installation

uv add "vision-agents[moondream]"
# or directly
uv add vision-agents-plugins-moondream

Choosing the Right Component

Detection Processors

CloudDetectionProcessor (Recommended for Most Users)

  • Use when: You want a simple setup with no infrastructure management
  • Pros: No model download, no GPU required, automatic updates
  • Cons: Requires API key, 2 RPS rate limit by default (can be increased)
  • Best for: Development, testing, low-to-medium volume applications

LocalDetectionProcessor (For Advanced Users)

  • Use when: You need higher throughput, have your own GPU infrastructure, or want to avoid rate limits
  • Pros: No rate limits, no API costs, full control over hardware
  • Cons: Requires GPU for best performance, model download on first use, infrastructure management
  • Best for: Production deployments, high-volume applications, DigitalOcean Gradient AI GPUs, or custom infrastructure

Vision Language Models (VLM)

CloudVLM (Recommended for Most Users)

  • Use when: You want visual question answering or captioning without managing infrastructure
  • Pros: No model download, no GPU required, automatic updates
  • Cons: Requires API key, rate limits apply
  • Best for: Development, testing, applications requiring VQA or captioning

LocalVLM (For Advanced Users)

  • Use when: You need VQA or captioning with higher throughput or want to avoid rate limits
  • Pros: No rate limits, no API costs, full control over hardware
  • Cons: Requires GPU for best performance, model download on first use, infrastructure management
  • Best for: Production deployments, high-volume applications, or custom infrastructure

Quick Start

Using CloudDetectionProcessor (Hosted)

The CloudDetectionProcessor uses Moondream's hosted API. It requires an API key and is limited to 2 RPS (requests per second) by default. To raise the limit, contact the Moondream team.

from vision_agents.plugins import moondream
from vision_agents.core import Agent

# Create a cloud processor with detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
    detect_objects="person",  # or ["person", "car", "dog"] for multiple
    fps=30
)

# Use in an agent
agent = Agent(
    processors=[processor],
    llm=your_llm,
    # ... other components
)
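The example above sets fps=30 while the hosted API defaults to 2 RPS, so most frames cannot each become their own request. The following is a minimal sketch of one way to space requests, assuming a simple minimum-interval policy. MinIntervalGate and count_allowed_frames are hypothetical helpers for illustration, not part of the plugin, and the plugin's own throttling may work differently.

```python
import time
from typing import Callable, Optional


class MinIntervalGate:
    """Hypothetical helper: lets through at most `max_rps` calls per second."""

    def __init__(self, max_rps: float, clock: Callable[[], float] = time.monotonic):
        if max_rps <= 0:
            raise ValueError("max_rps must be positive")
        self._min_interval = 1.0 / max_rps
        self._clock = clock
        self._last_allowed: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request may be sent now, and record the time."""
        now = self._clock()
        if self._last_allowed is None or now - self._last_allowed >= self._min_interval:
            self._last_allowed = now
            return True
        return False


def count_allowed_frames(fps: int, max_rps: float, seconds: int) -> int:
    """Simulate `seconds` of video at `fps` and count frames that pass the gate."""
    fake_now = [0.0]  # mutable cell so the fake clock can be advanced
    gate = MinIntervalGate(max_rps, clock=lambda: fake_now[0])
    allowed = 0
    for frame_index in range(fps * seconds):
        fake_now[0] = frame_index / fps
        if gate.allow():
            allowed += 1
    return allowed
```

With fps=30 and max_rps=2, the simulation lets through one frame out of every fifteen, which is a useful figure to keep in mind when choosing fps for the cloud processor.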

Using LocalDetectionProcessor (On-Device)

If you are running on your own infrastructure or on a service such as DigitalOcean's Gradient AI GPUs, you can use the LocalDetectionProcessor, which downloads the model from HuggingFace and runs it on the device. It uses CUDA by default for best performance. Actual performance depends on your hardware configuration.

Note: The moondream3-preview model is gated and requires HuggingFace authentication.

from vision_agents.plugins import moondream
from vision_agents.core import Agent

# Create a local processor (no API key needed)
processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car", "dog"],
    conf_threshold=0.3,
    force_cpu=False,  # Auto-detects CUDA, MPS, or CPU
    fps=30
)

# Use in an agent
agent = Agent(
    processors=[processor],
    llm=your_llm,
    # ... other components
)
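Because the model is gated, a HuggingFace token has to be available before the first download. Below is a minimal pre-flight check, assuming the token is supplied through the HF_TOKEN environment variable (the variable read by the huggingface_hub library). require_hf_token is a hypothetical helper, not part of the plugin.

```python
import os
from typing import Mapping, Optional


def require_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HuggingFace token from the environment or fail early.

    Assumption: the token is provided via HF_TOKEN. Other login methods
    (for example a cached CLI login) are not checked by this sketch.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; the gated moondream3-preview model "
            "cannot be downloaded without HuggingFace authentication."
        )
    return token
```

Calling this before constructing a local component turns a late download failure into an immediate, readable error.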

Detect Multiple Objects

# Detect multiple object types with zero-shot detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car", "dog", "basketball"],
    conf_threshold=0.3
)
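To make the effect of conf_threshold concrete, here is a small sketch of threshold filtering. The detection shape (a dict with "label" and "confidence" keys) and the inclusive comparison are assumptions for illustration; the plugin's internal representation may differ.

```python
from typing import Dict, List, Union

# Assumed shape for illustration only: {"label": "person", "confidence": 0.82}
Detection = Dict[str, Union[str, float]]


def filter_detections(
    detections: List[Detection], conf_threshold: float = 0.3
) -> List[Detection]:
    """Keep detections whose confidence is at or above the threshold."""
    return [d for d in detections if float(d["confidence"]) >= conf_threshold]


sample = [
    {"label": "person", "confidence": 0.82},
    {"label": "dog", "confidence": 0.12},
    {"label": "car", "confidence": 0.30},
]
kept = filter_detections(sample, conf_threshold=0.3)
```

Raising the threshold trades recall for fewer false positives, which matters more as the list of object types grows.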

Vision Language Model (VLM) Quick Start

Using CloudVLM (Hosted)

The CloudVLM uses Moondream's hosted API for visual question answering and captioning. It automatically processes video frames and responds to questions asked via STT (Speech-to-Text).

import asyncio
import os
from dotenv import load_dotenv
from vision_agents.core import User, Agent, Runner
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, getstream, elevenlabs, moondream
from vision_agents.plugins.getstream import CallSessionParticipantJoinedEvent

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    # Create a cloud VLM for visual question answering
    llm = moondream.CloudVLM(
        api_key=os.getenv("MOONDREAM_API_KEY"),  # read from the MOONDREAM_API_KEY env var
        mode="vqa",  # or "caption" for image captioning
    )

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),
        llm=llm,
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)

    @agent.events.subscribe
    async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
        if event.participant.user.id != "agent":
            await asyncio.sleep(2)
            # Ask the agent to describe what it sees
            await agent.simple_response("Describe what you currently see")

    async with agent.join(call):
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Using LocalVLM (On-Device)

The LocalVLM downloads the model from HuggingFace and runs on device. It supports both VQA and captioning modes.

Note: The moondream3-preview model is gated and requires HuggingFace authentication.

from vision_agents.plugins import moondream
from vision_agents.core import Agent

# Create a local VLM (no API key needed)
llm = moondream.LocalVLM(
    mode="vqa",  # or "caption" for image captioning
    force_cpu=False,  # Auto-detects CUDA, MPS, or CPU
)

# Use in an agent
agent = Agent(
    llm=llm,
    tts=your_tts,
    stt=your_stt,
    # ... other components
)

VLM Modes

The VLM supports two modes:

  • "vqa" (Visual Question Answering): Answers questions about video frames. Questions come from STT transcripts.
  • "caption" (Image Captioning): Generates descriptions of video frames automatically.

# VQA mode - answers questions about frames
llm = moondream.CloudVLM(
    api_key="your-api-key",
    mode="vqa"
)

# Caption mode - generates automatic descriptions
llm = moondream.CloudVLM(
    api_key="your-api-key",
    mode="caption"
)
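When the mode comes from configuration rather than a literal in code, it can help to validate it before constructing the VLM. This is a minimal sketch; VLMMode and parse_mode are hypothetical names, and the whitespace and case normalisation is an assumption of this example rather than documented plugin behaviour.

```python
from typing import Literal, get_args

VLMMode = Literal["vqa", "caption"]


def parse_mode(raw: str) -> VLMMode:
    """Normalise a user-supplied mode string and reject unknown values."""
    mode = raw.strip().lower()
    if mode not in get_args(VLMMode):
        raise ValueError(
            f"Unsupported mode {raw!r}; expected one of {get_args(VLMMode)}"
        )
    return mode  # type: ignore[return-value]
```

The returned value can then be passed as the mode argument of CloudVLM or LocalVLM.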

Configuration

CloudDetectionProcessor Parameters

  • api_key: str - API key for the Moondream Cloud API. If not provided, it is read from the MOONDREAM_API_KEY environment variable.
  • detect_objects: str | List[str] - Object(s) to detect using zero-shot detection. Can be any object name, such as "person", "car", or "basketball". Default: "person"
  • conf_threshold: float - Confidence threshold for detections (default: 0.3)
  • fps: int - Frame processing rate (default: 30)
  • interval: int - Processing interval in seconds (default: 0)
  • max_workers: int - Thread pool size for CPU-intensive operations (default: 10)

Rate Limits: By default, the Moondream Cloud API has a 2 RPS (requests per second) rate limit. Contact the Moondream team to request a higher limit.
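The api_key fallback and the str | List[str] form of detect_objects described above can be sketched as two small normalisation steps. Both functions are hypothetical illustrations of the documented behaviour, not the plugin's actual code.

```python
import os
from typing import List, Mapping, Optional, Union


def resolve_api_key(
    api_key: Optional[str] = None, env: Optional[Mapping[str, str]] = None
) -> str:
    """Prefer an explicit key, else fall back to MOONDREAM_API_KEY."""
    env = os.environ if env is None else env
    key = api_key or env.get("MOONDREAM_API_KEY")
    if not key:
        raise ValueError("Pass api_key or set the MOONDREAM_API_KEY environment variable")
    return key


def normalize_labels(detect_objects: Union[str, List[str]] = "person") -> List[str]:
    """Accept a single label or a list and return a clean list of labels."""
    labels = [detect_objects] if isinstance(detect_objects, str) else list(detect_objects)
    cleaned = [label.strip() for label in labels if label.strip()]
    if not cleaned:
        raise ValueError("detect_objects must contain at least one label")
    return cleaned
```

Stripping whitespace here is an assumption of the sketch; it guards against labels such as " person" being treated as a different prompt from "person".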

LocalDetectionProcessor Parameters

  • detect_objects: str | List[str] - Object(s) to detect using zero-shot detection. Can be any object name, such as "person", "car", or "basketball". Default: "person"
  • conf_threshold: float - Confidence threshold for detections (default: 0.3)
  • fps: int - Frame processing rate (default: 30)
  • interval: int - Processing interval in seconds (default: 0)
  • max_workers: int - Thread pool size for CPU-intensive operations (default: 10)
  • force_cpu: bool - If True, forces CPU usage even when CUDA or MPS is available. Otherwise the device is auto-detected in the order CUDA, MPS (Apple Silicon), CPU. We recommend running on CUDA for best performance. (default: False)
  • model_name: str - Hugging Face model identifier (default: "moondream/moondream3-preview")
  • options: AgentOptions - Model directory configuration. If not provided, the default options are used, and the model directory defaults to tempfile.gettempdir()

Performance: Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.

CloudVLM Parameters

  • api_key: str - API key for the Moondream Cloud API. If not provided, it is read from the MOONDREAM_API_KEY environment variable.
  • mode: Literal["vqa", "caption"] - "vqa" for visual question answering or "caption" for image captioning (default: "vqa")
  • max_workers: int - Thread pool size for CPU-intensive operations (default: 10)

Rate Limits: By default, the Moondream Cloud API has rate limits. Contact the Moondream team to request higher limits.
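The max_workers parameter sizes a thread pool, which suggests blocking work is moved off the event loop. The sketch below shows that general pattern with the standard library. blocking_inference is a stand-in function, and nothing here is taken from the plugin's implementation.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List


def blocking_inference(frame_id: int) -> str:
    """Stand-in for a blocking call such as an HTTP request or a model forward pass."""
    return f"caption for frame {frame_id}"


async def run_inference(frame_ids: List[int], max_workers: int = 10) -> List[str]:
    """Run blocking calls in a thread pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [
            loop.run_in_executor(pool, blocking_inference, frame_id)
            for frame_id in frame_ids
        ]
        # gather returns results in the same order as the submitted tasks
        return list(await asyncio.gather(*tasks))
```

A call such as asyncio.run(run_inference([1, 2, 3], max_workers=2)) processes at most two frames concurrently. A larger pool only helps while the work is I/O-bound or releases the GIL.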

LocalVLM Parameters

  • mode: Literal["vqa", "caption"] - "vqa" for visual question answering or "caption" for image captioning (default: "vqa")
  • max_workers: int - Thread pool size for async operations (default: 10)
  • force_cpu: bool - If True, forces CPU usage even when CUDA or MPS is available. Otherwise the device is auto-detected in the order CUDA, MPS (Apple Silicon), CPU. Note: MPS is automatically converted to CPU due to model compatibility. We recommend running on CUDA for best performance. (default: False)
  • model_name: str - Hugging Face model identifier (default: "moondream/moondream3-preview")
  • options: AgentOptions - Model directory configuration. If not provided, uses default_agent_options()

Performance: Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
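The device selection order described for the local components can be written as a small pure function. This is a sketch under stated assumptions: select_device and its allow_mps flag are hypothetical, and in a real setup the two availability flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def select_device(
    cuda_available: bool,
    mps_available: bool,
    force_cpu: bool = False,
    allow_mps: bool = True,
) -> str:
    """Pick a device name in the order CUDA, then MPS, then CPU.

    allow_mps=False models the LocalVLM note that MPS is converted
    to CPU for model compatibility.
    """
    if force_cpu:
        return "cpu"
    if cuda_available:
        return "cuda"
    if mps_available and allow_mps:
        return "mps"
    return "cpu"
```

Keeping the hardware checks outside the function makes the ordering easy to test on machines without a GPU.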

Video Publishing

The detection processors publish annotated video frames with bounding boxes drawn around detected objects:

processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car"]
)

# The track will show:
# - Green bounding boxes around detected objects
# - Labels with confidence scores
# - Real-time annotation overlay
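Drawing a box requires pixel coordinates and a label string. The sketch below shows that conversion, assuming each detection carries normalized x_min, y_min, x_max, y_max values in the range 0 to 1. That shape, and the helper names to_pixel_box and format_label, are assumptions for illustration and are not confirmed by this README.

```python
from typing import Dict, Tuple


def to_pixel_box(box: Dict[str, float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert an assumed normalized box to integer pixel corners (x1, y1, x2, y2)."""

    def clamp(value: float) -> float:
        # Keep coordinates inside the frame even if the input slightly overshoots
        return min(max(value, 0.0), 1.0)

    x1 = int(round(clamp(box["x_min"]) * width))
    y1 = int(round(clamp(box["y_min"]) * height))
    x2 = int(round(clamp(box["x_max"]) * width))
    y2 = int(round(clamp(box["y_max"]) * height))
    return x1, y1, x2, y2


def format_label(label: str, confidence: float) -> str:
    """Build the text shown next to a box, e.g. a label plus a two-decimal score."""
    return f"{label} {confidence:.2f}"
```

The resulting corners can be passed to a drawing routine such as OpenCV's cv2.rectangle, with the formatted label placed just above the top-left corner.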

Testing

The plugin includes comprehensive tests:

# Run all tests
pytest plugins/moondream/tests/ -v

# Run specific test categories
pytest plugins/moondream/tests/ -k "inference" -v
pytest plugins/moondream/tests/ -k "annotation" -v

Dependencies

Required

  • vision-agents - Core framework
  • moondream - Moondream SDK for cloud API (CloudDetectionProcessor and CloudVLM)
  • numpy>=2.0.0 - Array operations
  • pillow>=10.0.0 - Image processing
  • opencv-python>=4.8.0 - Video annotation
  • aiortc - WebRTC support

Local Components Additional Dependencies

  • torch - PyTorch for model inference
  • transformers - HuggingFace transformers library for model loading

Note: LocalDetectionProcessor and LocalVLM both require these dependencies. We recommend running the model locally only on CUDA devices.


