vision-agents-plugins-gemini

Google Gemini LLM integration for Vision Agents

Project description

Gemini Live Speech-to-Speech Plugin

Google Gemini Live Speech-to-Speech (STS) plugin for GetStream. It connects a realtime Gemini Live session to a Stream video call so your assistant can speak and listen in the same call.

Installation

uv add "vision-agents[gemini]"
# or directly
uv add vision-agents-plugins-gemini

Requirements

Python: 3.10+
Dependencies: getstream[webrtc], getstream-plugins-common, google-genai>=1.51.0
API key: GOOGLE_API_KEY or GEMINI_API_KEY set in your environment
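
A startup check along the following lines can catch a missing key before the agent joins a call. This is a minimal sketch, not part of the plugin: the helper name `find_gemini_api_key` is hypothetical, and checking GOOGLE_API_KEY before GEMINI_API_KEY is an assumption about precedence that the README does not state.

```python
import os

# Variable names come from the Requirements list; the lookup order is an assumption.
API_KEY_VARS = ("GOOGLE_API_KEY", "GEMINI_API_KEY")


def find_gemini_api_key(environ=None):
    """Return the first non-empty key found, or None if neither variable is set."""
    env = os.environ if environ is None else environ
    for name in API_KEY_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Calling `find_gemini_api_key()` with no arguments reads the real process environment; passing a dict makes the check easy to exercise in tests.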

Quick Start

Below is a minimal example that attaches the Gemini Live output audio track to a Stream call and streams microphone audio into Gemini. The assistant will speak back into the call, and you can also send text messages to the assistant.

from dotenv import load_dotenv
from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import gemini, getstream

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI coach"),
        instructions="Read @coaching.md",
        llm=gemini.Realtime(model="gemini-3.1-flash-live-preview"),
        processors=[],
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)

    async with agent.join(call):
        await agent.llm.simple_response(
            text="Say hi. After the user joins ask them about their day"
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Video frames from remote participants are forwarded to Gemini automatically when fps is set and the model supports it:

llm=gemini.Realtime(fps=3)  # forward video at 3 frames per second

The Agent subscribes to track events internally, so no manual wiring is needed. For a full runnable example, see examples/02_golf_coach_example/golf_coach_example.py.
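
To give a feel for what a setting like fps=3 implies, the sketch below thins a stream of frame timestamps down to a target rate. It illustrates the general throttling idea only; the function name `sample_frames` is hypothetical, and the plugin's actual frame selection may work differently.

```python
def sample_frames(timestamps, fps):
    """Keep frames so that kept frames are at least 1/fps seconds apart.

    `timestamps` are capture times in seconds, in increasing order.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    interval = 1.0 / fps
    kept = []
    next_due = None  # earliest time at which the next frame may be kept
    for ts in timestamps:
        if next_due is None or ts >= next_due:
            kept.append(ts)
            next_due = ts + interval
    return kept


# One second of a 25 fps camera, thinned to roughly 3 frames per second.
camera = [i * 0.04 for i in range(25)]
forwarded = sample_frames(camera, fps=3)
```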

Gemini Vision (VLM)

Use Gemini 3 vision models with the Agent API (video frames are forwarded automatically when the call has active video).

from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream


async def create_agent(**kwargs) -> Agent:
    vlm = gemini.VLM(model="gemini-3-flash-preview")
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Gemini Vision Agent", id="gemini-vision-agent"),
        instructions="Describe what you see in one sentence.",
        llm=vlm,
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Key configuration knobs for GeminiVLM: fps, frame_buffer_seconds, thinking_level, media_resolution. For a full example, see plugins/gemini/example/gemini_vlm_agent_example.py.
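
The interaction between fps and frame_buffer_seconds can be pictured as a rolling window of recent frames. The sketch below assumes the window holds fps * frame_buffer_seconds frames, which is an assumption rather than documented behavior; the `FrameBuffer` class is hypothetical and not part of the plugin.

```python
from collections import deque


class FrameBuffer:
    """Keep the newest fps * frame_buffer_seconds frames; older ones fall off."""

    def __init__(self, fps: int = 1, frame_buffer_seconds: int = 10):
        if fps <= 0 or frame_buffer_seconds <= 0:
            raise ValueError("fps and frame_buffer_seconds must be positive")
        self.capacity = fps * frame_buffer_seconds
        self._frames = deque(maxlen=self.capacity)

    def add(self, frame) -> None:
        # deque(maxlen=...) silently discards the oldest frame when full.
        self._frames.append(frame)

    def snapshot(self) -> list:
        """Return buffered frames, oldest first."""
        return list(self._frames)


buffer = FrameBuffer(fps=1, frame_buffer_seconds=10)
for i in range(25):
    buffer.add(f"frame-{i}")
recent = buffer.snapshot()
```

Under this assumption, raising fps or frame_buffer_seconds increases how many frames accompany each prompt, which is worth keeping in mind for token usage.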

Features

Bidirectional audio: Streams microphone PCM to Gemini, and plays Gemini speech into the call using output_track.
Video frame forwarding: Sends remote participant video frames to Gemini Live for multimodal understanding. Use start_video_sender with a remote MediaStreamTrack.
Text messages: Use send_text to add text turns directly to the conversation.
Barge-in (interruptions): When the user starts speaking, current playback is interrupted so Gemini can focus on the new input. Playback automatically resumes after a brief silence.
Auto resampling: send_audio_pcm will resample input frames to the target rate when needed.
Events: Subscribe to "audio" for synthesized audio chunks and "text" for assistant text.
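
For readers unfamiliar with resampling, the sketch below shows the basic idea behind converting PCM samples from one rate to another using linear interpolation. It is an illustration only: the function `resample_linear` is hypothetical, it applies no anti-aliasing filter, and the plugin's own resampler is not described in this README.

```python
def resample_linear(samples: list[float], source_rate: int, target_rate: int) -> list[float]:
    """Resample mono PCM samples by linear interpolation (no anti-alias filter)."""
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sample rates must be positive")
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)
    out_len = max(1, round(len(samples) * target_rate / source_rate))
    step = source_rate / target_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = min(i * step, last)  # clamp so we never read past the end
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Upsample four samples from 16 kHz to 48 kHz (three times as many samples).
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], 16000, 48000)
```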

API Overview

GeminiLive(api_key: str | None = None, model: str = "gemini-live-2.5-flash-preview", config: LiveConnectConfigDict | None = None): Create a new Gemini Live session. If api_key is not provided, the plugin reads GOOGLE_API_KEY or GEMINI_API_KEY from the environment.
GeminiVLM(model: str = "gemini-3-flash-preview", fps: int = 1, frame_buffer_seconds: int = 10, ...): Vision-language model that buffers video frames and sends them with prompts.
output_track: An AudioStreamTrack you can publish in your call via add_tracks(audio=...).
await send_text(text: str): Send a user text message to the current turn.
await send_audio_pcm(pcm: PcmData, target_rate: int = 48000): Stream PCM frames to Gemini. Frames are converted to the required format and resampled if necessary.
await wait_until_ready(timeout: float | None = None) -> bool: Wait until the underlying live session is connected.
await interrupt_playback() / resume_playback(): Manually stop or resume synthesized audio playback. Useful if you want to manage barge-in behavior yourself.
await start_video_sender(track: MediaStreamTrack, fps: int = 1): Start forwarding video frames from a remote MediaStreamTrack to Gemini Live at the given frame rate.
await stop_video_sender(): Stop the background video sender task, if running.
await close(): Close the session and background tasks.

Environment Variables

GOOGLE_API_KEY / GEMINI_API_KEY: Gemini API key. One must be set.
GEMINI_LIVE_MODEL: Optional override for the model name if you need a different variant.
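
One way to reason about the model override is as a fallback chain. The sketch below assumes an explicit argument wins over GEMINI_LIVE_MODEL, which in turn wins over the default shown in the API Overview; that ordering is an assumption, and the helper `resolve_live_model` is hypothetical.

```python
import os

# Default taken from the GeminiLive signature in the API Overview.
DEFAULT_LIVE_MODEL = "gemini-live-2.5-flash-preview"


def resolve_live_model(explicit=None, environ=None):
    """Pick a model name: explicit argument, then GEMINI_LIVE_MODEL, then default."""
    env = os.environ if environ is None else environ
    if explicit:
        return explicit
    return env.get("GEMINI_LIVE_MODEL") or DEFAULT_LIVE_MODEL
```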

Troubleshooting

No audio playback: Ensure you publish output_track to your call and the call is subscribed to the assistant's audio.
No responses: Verify GOOGLE_API_KEY/GEMINI_API_KEY is set and has access to the chosen model. Try a different model via model=.
Sample-rate issues: Use send_audio_pcm(..., target_rate=48000) to normalize input frames.

Migration from Gemini 2.5

When migrating to Gemini 3:

Thinking: If you were using complex prompt engineering (such as chain-of-thought) with Gemini 2.5, try Gemini 3 with thinking_level="high" and simplified prompts.
Temperature: If your code explicitly sets temperature to low values, consider removing it and using the Gemini 3 default (1.0) to avoid potential looping issues.
PDF & Document Understanding: Default OCR resolution for PDFs has changed. Test with media_resolution="high" if you need dense document parsing.
Token Consumption: Gemini 3 defaults may increase token usage for PDFs but decrease for video. If requests exceed context limits, explicitly reduce media_resolution.
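
The first two migration notes can be expressed as a small settings transform. The sketch below works on a plain dict of settings and is purely illustrative: the helper `migrate_settings` is hypothetical, the threshold of 1.0 for a "low" temperature is an assumption, and where these settings are passed depends on your own code.

```python
def migrate_settings(old: dict) -> dict:
    """Return a copy of Gemini 2.5-era settings adjusted per the notes above."""
    new = dict(old)
    # Drop explicitly low temperatures so the Gemini 3 default (1.0) applies.
    temperature = new.get("temperature")
    if temperature is not None and temperature < 1.0:
        del new["temperature"]
    # Prefer built-in thinking over prompt-level chain-of-thought, unless already set.
    new.setdefault("thinking_level", "high")
    return new


migrated = migrate_settings({"model": "gemini-3-flash-preview", "temperature": 0.2})
```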

Project details

Release history

Version	Release date
0.5.9	May 15, 2026
0.5.8	May 13, 2026
0.5.7	May 7, 2026
0.5.6	May 5, 2026
0.5.5	Apr 27, 2026
0.5.4 (this version)	Apr 15, 2026
0.5.3	Apr 14, 2026
0.5.2	Apr 13, 2026
0.5.1	Apr 7, 2026
0.5.0	Apr 1, 2026
0.4.7	Mar 27, 2026
0.4.6	Mar 26, 2026
0.4.5	Mar 25, 2026
0.4.4	Mar 23, 2026
0.4.3	Mar 11, 2026
0.4.2	Mar 10, 2026
0.4.1	Mar 4, 2026
0.4.0	Mar 3, 2026
0.3.8	Feb 24, 2026
0.3.7	Feb 23, 2026
0.3.6	Feb 13, 2026
0.3.5	Feb 10, 2026
0.3.4	Feb 6, 2026
0.3.3	Feb 4, 2026
0.3.2	Jan 27, 2026
0.3.1	Jan 21, 2026
0.3.0	Jan 20, 2026
0.2.10	Jan 14, 2026
0.2.9	Jan 9, 2026
0.2.8	Jan 8, 2026
0.2.7	Jan 6, 2026
0.2.6	Dec 16, 2025
0.2.5	Dec 12, 2025
0.2.4	Dec 12, 2025
0.2.3	Dec 7, 2025
0.2.2	Nov 29, 2025
0.2.1	Nov 21, 2025
0.2.0	Nov 14, 2025
0.1.14	Nov 11, 2025
0.1.13	Nov 3, 2025
0.1.12	Oct 31, 2025
0.1.11	Oct 28, 2025
0.1.9	Oct 22, 2025
0.1.8	Oct 22, 2025
0.1.7	Oct 21, 2025
0.1.6	Oct 16, 2025
0.1.5	Oct 9, 2025
0.1.3	Oct 9, 2025
0.1.0	Oct 9, 2025
0.0.18	Oct 8, 2025
0.0.17	Oct 8, 2025

Download files

Download the file for your platform.

Source Distribution

vision_agents_plugins_gemini-0.5.4.tar.gz (23.5 kB)

Uploaded Apr 15, 2026 Source

Built Distribution

vision_agents_plugins_gemini-0.5.4-py3-none-any.whl (27.5 kB)

Uploaded Apr 15, 2026 Python 3

File details

Details for the file vision_agents_plugins_gemini-0.5.4.tar.gz.

File metadata

Download URL: vision_agents_plugins_gemini-0.5.4.tar.gz
Upload date: Apr 15, 2026
Size: 23.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.10 {"installer":{"name":"uv","version":"0.10.10","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for vision_agents_plugins_gemini-0.5.4.tar.gz
Algorithm	Hash digest
SHA256	`1b5102edae65933ecd0a65597b5f7d41391ecd08374e86c4c2dda0c6d1611228`
MD5	`f8d6ebf3d325bc624ea993a0489f5db0`
BLAKE2b-256	`8a14cac1a75e4ae8e50d81ee771f23d2ddbe6c2dee6dc12435c45d45cb533356`
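
A downloaded file can be checked against a published SHA256 digest with the standard library. The sketch below is a generic verification routine, not something shipped with the package; the helper names `sha256_of` and `matches_published` are hypothetical, and the self-check uses a throwaway file so it runs without the real download.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Self-check on a temporary file; replace with the real download in practice.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    ok = matches_published(sample, expected)
    tampered = matches_published(sample, "0" * 64)
```

For the source distribution, pass the path of the downloaded vision_agents_plugins_gemini-0.5.4.tar.gz and the SHA256 value from the table above.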

File details

Details for the file vision_agents_plugins_gemini-0.5.4-py3-none-any.whl.

File metadata

Download URL: vision_agents_plugins_gemini-0.5.4-py3-none-any.whl
Upload date: Apr 15, 2026
Size: 27.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.10 {"installer":{"name":"uv","version":"0.10.10","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for vision_agents_plugins_gemini-0.5.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4a58d3a2a2ecd5666b6905909204ce5b1d9ecf270ba8fae96558f05026293d7b`
MD5	`a420cacc90e4b4f148952a3d59170e8b`
BLAKE2b-256	`403fcc87206f7ca54fe31e269ed3916a9b1aca1d009ea265f27c9730833d448a`
