Google Gemini LLM integration for Vision Agents
Gemini Live Speech-to-Speech Plugin
Google Gemini Live Speech-to-Speech (STS) plugin for GetStream. It connects a realtime Gemini Live session to a Stream video call so your assistant can speak and listen in the same call.
Installation
uv add "vision-agents[gemini]"
Requirements
- Python: 3.10+
- Dependencies: `getstream[webrtc]`, `getstream-plugins-common`, `google-genai>=1.51.0`
- API key: `GOOGLE_API_KEY` or `GEMINI_API_KEY` set in your environment
Quick Start
Below is a minimal example that attaches the Gemini Live output audio track to a Stream call and streams microphone audio into Gemini. The assistant will speak back into the call, and you can also send text messages to the assistant.
    import asyncio
    import os

    from getstream import Stream
    from getstream.plugins.gemini.live import GeminiLive
    from getstream.video import rtc
    from getstream.video.rtc.track_util import PcmData


    async def main():
        # Ensure your key is set: export GOOGLE_API_KEY=... (or GEMINI_API_KEY)
        gemini = GeminiLive(
            api_key=os.getenv("GOOGLE_API_KEY"),
            model="gemini-live-2.5-flash-preview",
        )

        client = Stream.from_env()
        call = client.video.call("default", "your-call-id")

        async with await rtc.join(call, user_id="assistant-bot") as connection:
            # Route Gemini's synthesized speech back into the call
            await connection.add_tracks(audio=gemini.output_track)

            # Forward microphone PCM frames to Gemini in realtime
            @connection.on("audio")
            async def on_audio(pcm: PcmData):
                await gemini.send_audio_pcm(pcm, target_rate=48000)

            # Optionally send a kick-off text message
            await gemini.send_text("Give a short greeting to the participants.")

            # Keep the session running
            while True:
                await asyncio.sleep(1)


    if __name__ == "__main__":
        asyncio.run(main())
Optional: forward remote participant video frames to Gemini for multimodal context:
    # Forward remote video frames to Gemini (optional)
    @connection.on("track_added")
    async def _on_track_added(track_id, kind, user):
        if kind == "video" and connection.subscriber_pc:
            track = connection.subscriber_pc.add_track_subscriber(track_id)
            if track:
                await gemini.watch_video_track(track)
For a full runnable example, see examples/gemini_live/main.py.
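Forwarding every video frame is rarely necessary, and the API overview lists a default of one frame per second. The sketch below illustrates that kind of rate limiting with a small standalone helper. `FrameThrottle` and `should_send` are hypothetical names used for illustration only; they are not part of the plugin, whose actual throttling logic may differ.

```python
from typing import Optional


class FrameThrottle:
    """Allow at most `fps` frames per second (hypothetical helper, not plugin code)."""

    def __init__(self, fps: float = 1.0) -> None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        self._interval = 1.0 / fps
        self._next_due: Optional[float] = None  # time at which the next frame may pass

    def should_send(self, now: float) -> bool:
        """Return True if a frame arriving at time `now` (seconds) should be forwarded."""
        if self._next_due is None or now >= self._next_due:
            self._next_due = now + self._interval
            return True
        return False
```

In a frame loop, you would call `should_send(time.monotonic())` for each decoded frame and drop the frame when it returns `False`.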
Features
- Bidirectional audio: Streams microphone PCM to Gemini, and plays Gemini speech into the call using `output_track`.
- Video frame forwarding: Sends remote participant video frames to Gemini Live for multimodal understanding. Use `start_video_sender` with a remote `MediaStreamTrack`.
- Text messages: Use `send_text` to add text turns directly to the conversation.
- Barge-in (interruptions): When the user starts speaking, current playback is interrupted so Gemini can focus on the new input. Playback automatically resumes after brief silence.
- Auto resampling: `send_audio_pcm` will resample input frames to the target rate when needed.
- Events: Subscribe to `"audio"` for synthesized audio chunks and `"text"` for assistant text.
API Overview
- `GeminiLive(api_key: str | None = None, model: str = "gemini-live-2.5-flash-preview", config: LiveConnectConfigDict | None = None)`: Create a new Gemini Live session. If `api_key` is not provided, the plugin reads `GOOGLE_API_KEY` or `GEMINI_API_KEY` from the environment.
- `output_track`: An `AudioStreamTrack` you can publish in your call via `add_tracks(audio=...)`.
- `await send_text(text: str)`: Send a user text message to the current turn.
- `await send_audio_pcm(pcm: PcmData, target_rate: int = 48000)`: Stream PCM frames to Gemini. Frames are converted to the required format and resampled if necessary.
- `await wait_until_ready(timeout: float | None = None) -> bool`: Wait until the underlying live session is connected.
- `await interrupt_playback()` / `resume_playback()`: Manually stop or resume synthesized audio playback. Useful if you want to manage barge-in behavior yourself.
- `await start_video_sender(track: MediaStreamTrack, fps: int = 1)`: Start forwarding video frames from a remote `MediaStreamTrack` to Gemini Live at the given frame rate.
- `await stop_video_sender()`: Stop the background video sender task, if running.
- `await close()`: Close the session and background tasks.
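The `wait_until_ready(timeout) -> bool` signature suggests a "wait for a flag, report success or timeout" contract. The sketch below shows one common way to get that behavior with `asyncio.Event` and `asyncio.wait_for`. `ReadyGate` and `mark_ready` are hypothetical stand-ins written for illustration; this is not the plugin's implementation, and the plugin may behave differently on timeout.

```python
import asyncio
from typing import Optional


class ReadyGate:
    """Hypothetical stand-in for a session readiness flag (not plugin code)."""

    def __init__(self) -> None:
        self._ready = asyncio.Event()

    def mark_ready(self) -> None:
        """Signal that the underlying connection has been established."""
        self._ready.set()

    async def wait_until_ready(self, timeout: Optional[float] = None) -> bool:
        """Return True once ready, or False if `timeout` seconds pass first."""
        try:
            await asyncio.wait_for(self._ready.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False
```

Under this contract, a caller can check the return value before sending the first text or audio frame instead of handling an exception.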
Environment Variables
- `GOOGLE_API_KEY` / `GEMINI_API_KEY`: Gemini API key. One must be set.
- `GEMINI_LIVE_MODEL`: Optional override for the model name if you need a different variant.
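If you prefer to resolve these variables in your own code and fail early when the key is missing, a minimal sketch might look like the following. `resolve_gemini_settings` is a hypothetical helper, and the precedence it applies (`GOOGLE_API_KEY` before `GEMINI_API_KEY`) is an assumption; the plugin's own lookup order is not documented here.

```python
import os
from typing import Mapping, Optional, Tuple

# Default model name as listed in the API overview.
DEFAULT_MODEL = "gemini-live-2.5-flash-preview"


def resolve_gemini_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (api_key, model) from environment-style variables.

    Hypothetical helper: checking GOOGLE_API_KEY first is an assumption.
    """
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("Set GOOGLE_API_KEY or GEMINI_API_KEY")
    model = env.get("GEMINI_LIVE_MODEL") or DEFAULT_MODEL
    return api_key, model
```

The returned values can then be passed explicitly as `api_key=` and `model=` when constructing `GeminiLive`.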
Notes on Interruptions
- How it works: The plugin detects user speech activity in incoming PCM and interrupts any ongoing playback. After a short period of silence, playback is enabled again so the assistant can speak.
- Why it matters: This enables natural barge-in experiences, where users can cut off the assistant mid-sentence and ask follow-up questions.
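To make the interrupt-then-resume behavior concrete, here is a minimal sketch of an energy-based gate: loud input frames block playback, and playback is allowed again once enough quiet time has passed. This illustrates the general idea only. `BargeInGate`, the RMS threshold, and the 0.5 s resume delay are assumptions chosen for the example; the plugin's actual speech detection and timing are not specified in this document.

```python
import array
import math
from typing import Optional


def rms(pcm16: bytes) -> float:
    """Root-mean-square amplitude of 16-bit signed PCM in native byte order."""
    samples = array.array("h", pcm16)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class BargeInGate:
    """Hypothetical barge-in gate (not plugin code)."""

    def __init__(self, threshold: float = 500.0, resume_after_s: float = 0.5) -> None:
        self.threshold = threshold            # assumed RMS level that counts as speech
        self.resume_after_s = resume_after_s  # assumed quiet time before resuming
        self._last_speech_at: Optional[float] = None

    def playback_allowed(self, pcm16: bytes, now: float) -> bool:
        """Update state from one input frame and report whether playback may run."""
        if rms(pcm16) >= self.threshold:
            self._last_speech_at = now
            return False
        if self._last_speech_at is None:
            return True
        return (now - self._last_speech_at) >= self.resume_after_s
```

If you manage barge-in yourself, a gate like this could decide when to call `interrupt_playback()` and `resume_playback()`.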
Troubleshooting
- No audio playback: Ensure you publish `output_track` to your call and the call is subscribed to the assistant's audio.
- No responses: Verify `GOOGLE_API_KEY` / `GEMINI_API_KEY` is set and has access to the chosen model. Try a different model via `model=`.
- Sample-rate issues: Use `send_audio_pcm(..., target_rate=48000)` to normalize input frames.
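When debugging sample-rate problems, it can help to see what resampling does to a buffer. The sketch below resamples mono 16-bit PCM by linear interpolation using only the standard library. It is an illustration of the concept, not the resampler used by `send_audio_pcm`; `resample_linear` is a hypothetical name, and production code would normally use a filtered resampler to avoid aliasing.

```python
import array


def resample_linear(pcm16: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit PCM (native byte order) by linear interpolation."""
    src = array.array("h", pcm16)
    if src_rate == dst_rate or len(src) < 2:
        return src.tobytes()

    out_len = max(1, round(len(src) * dst_rate / src_rate))
    last = len(src) - 1
    out = array.array("h")
    for i in range(out_len):
        pos = i * src_rate / dst_rate      # fractional position in the source
        left = min(int(pos), last)         # sample at or before that position
        right = min(left + 1, last)        # following sample, clamped at the end
        frac = min(pos - left, 1.0)        # blend weight between the two samples
        value = src[left] * (1.0 - frac) + src[right] * frac
        out.append(int(round(value)))
    return out.tobytes()
```

For example, converting a 16 kHz buffer to 48 kHz triples the number of samples, with the new samples interpolated between neighboring source samples.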
Migration from Gemini 2.5
When migrating to Gemini 3:
- Thinking: If you were using complex prompt engineering (like chain-of-thought) with Gemini 2.5, try Gemini 3 with `thinking_level="high"` and simplified prompts.
- Temperature: If your code explicitly sets temperature to low values, consider removing it and using the Gemini 3 default (1.0) to avoid potential looping issues.
- PDF & Document Understanding: Default OCR resolution for PDFs has changed. Test with `media_resolution="high"` if you need dense document parsing.
- Token Consumption: Gemini 3 defaults may increase token usage for PDFs but decrease for video. If requests exceed context limits, explicitly reduce `media_resolution`.