# pipecat-smolvlm

SmolVLM vision service integration for the Pipecat voice agent framework.
A community integration that brings SmolVLM vision-language models into Pipecat voice agent pipelines.
SmolVLM is a family of compact, instruction-tuned vision-language models from HuggingFace that run entirely on your own hardware — no external API calls, no per-image billing, no data leaving your infrastructure.
## Models
| Key | HuggingFace ID | Memory (fp32) | Notes |
|---|---|---|---|
| 256M | `HuggingFaceTB/SmolVLM-256M-Instruct` | ~1 GB | Default. Fastest, lowest memory. |
| 500M | `HuggingFaceTB/SmolVLM-500M-Instruct` | ~2 GB | Better scene understanding. |
| 2B | `HuggingFaceTB/SmolVLM-2B-Instruct` | ~8 GB | Highest quality. Needs a GPU. |
All three checkpoints are interchangeable via the `model` setting.
## Installation
```bash
# Core install
pip install pipecat-smolvlm

# With Flash Attention 2 (CUDA only — significant speedup)
pip install "pipecat-smolvlm[flash-attn]"

# Intel XPU support
pip install "pipecat-smolvlm[xpu]"
```
Model weights are downloaded from HuggingFace on first use and cached in `~/.cache/huggingface/`.
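If you'd rather not block the first call on a download (the larger checkpoints run to several GB), you can pre-fetch the weights into the same cache with `huggingface_hub`, which ships with `transformers`. A minimal sketch:

```python
from huggingface_hub import snapshot_download

# Pre-fetch the default checkpoint into the standard HF cache so the
# first SmolVlmService call starts inference immediately.
snapshot_download("HuggingFaceTB/SmolVLM-256M-Instruct")
```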
## Quick start
```python
from pipecat_smolvlm import SmolVlmService

# Default: SmolVLM-256M-Instruct, auto device detection
service = SmolVlmService()

# Custom model and generation settings
service = SmolVlmService(
    settings=SmolVlmService.Settings(
        model="HuggingFaceTB/SmolVLM-500M-Instruct",
        max_new_tokens=256,
        default_prompt="What objects can you see?",
        temperature=0.3,
        do_sample=True,
    )
)

# Force CPU (no GPU required)
service = SmolVlmService(use_cpu=True)
```
Drop `SmolVlmService` into any Pipecat `Pipeline` wherever a `VisionService` is expected:
```python
pipeline = Pipeline([
    transport.input(),
    image_capture_processor,  # converts camera frames → UserImageRawFrame
    SmolVlmService(),
    vision_to_speech,         # VisionTextFrame → TextFrame (see sketch below)
    tts_service,
    transport.output(),
])
```
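The `vision_to_speech` processor is not shipped by this package. A minimal sketch of one, assuming Pipecat's standard `FrameProcessor` API; the `VisionTextFrame` import path is a guess, so adjust it to wherever your version exposes that frame:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

from pipecat_smolvlm import VisionTextFrame  # import path assumed


class VisionToSpeech(FrameProcessor):
    """Re-emit SmolVLM's VisionTextFrame output as a plain TextFrame for TTS."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, VisionTextFrame):
            # Forward the image description as speakable text.
            await self.push_frame(TextFrame(text=frame.text), direction)
        else:
            await self.push_frame(frame, direction)


vision_to_speech = VisionToSpeech()
```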
## Settings reference
| Setting | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `HuggingFaceTB/SmolVLM-256M-Instruct` | HuggingFace model ID or local path |
| `max_new_tokens` | `int` | `500` | Maximum tokens to generate per image |
| `default_prompt` | `str` | `"Describe the given image."` | Fallback prompt when the frame carries no text |
| `temperature` | `float \| None` | `None` | Sampling temperature. `None` = greedy decoding |
| `do_sample` | `bool` | `False` | Enable sampling without a fixed temperature |
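The last two rows interact: `temperature` only matters when sampling is enabled, and `do_sample=True` with `temperature=None` samples at the model's default temperature. For illustration:

```python
# Greedy decoding (the defaults): deterministic output for a given image
greedy = SmolVlmService.Settings()

# Sampling at a fixed temperature: lower values are more conservative
sampled = SmolVlmService.Settings(do_sample=True, temperature=0.7)

# Sampling at the model's default temperature
loose = SmolVlmService.Settings(do_sample=True)
```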
## Per-image prompts
Pass a custom prompt in the `text` field of `UserImageRawFrame`:
```python
frame = UserImageRawFrame(
    image=raw_bytes,
    size=(width, height),
    format="RGB",
    text="Is there a person in this image? Answer yes or no.",
)
```
If `frame.text` is empty or `None`, the service falls back to `settings.default_prompt`.
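If your source is an image file rather than raw camera bytes, Pillow supplies the `image`, `size`, and `format` values. A sketch (the file name is just an example):

```python
from PIL import Image

img = Image.open("snapshot.jpg").convert("RGB")
frame = UserImageRawFrame(
    image=img.tobytes(),  # raw RGB bytes
    size=img.size,        # (width, height)
    format="RGB",
    text="Is there a person in this image? Answer yes or no.",
)
```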
## Device selection
The service automatically picks the best available device:
| Priority | Device | dtype |
|---|---|---|
| 1 | Intel XPU (if `intel-extension-for-pytorch` is installed) | `float32` |
| 2 | CUDA GPU | `bfloat16` |
| 3 | Apple MPS (M-series) | `bfloat16` |
| 4 | CPU (fallback) | `float32` |
Pass `use_cpu=True` to the constructor to override this and force CPU.
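The selection logic is roughly equivalent to the following sketch. This is illustrative, not the library's actual code, and `torch.xpu.is_available()` assumes a recent PyTorch with the IPEX extension loaded:

```python
import torch


def pick_device(use_cpu: bool = False) -> tuple[str, torch.dtype]:
    """Mirror the priority table above."""
    if use_cpu:
        return "cpu", torch.float32
    try:
        import intel_extension_for_pytorch  # noqa: F401

        if torch.xpu.is_available():
            return "xpu", torch.float32
    except ImportError:
        pass
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    if torch.backends.mps.is_available():
        return "mps", torch.bfloat16
    return "cpu", torch.float32
```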
## Examples

### Foundational — camera description over voice
Describes whatever a WebRTC participant's camera sees each time they speak.
```bash
export DAILY_API_KEY=...
export CARTESIA_API_KEY=...

python examples/foundational/smolvlm_basic.py \
    --room-url https://yourapp.daily.co/my-room \
    --model 500M \
    --prompt "Describe what you see in one sentence."
```
See `examples/foundational/smolvlm_basic.py` for the full source.
## Architecture notes

### Thread safety
`SmolVlmService.run_vision` runs the synchronous HuggingFace inference in `asyncio.to_thread`, so the event loop is never blocked. PyTorch releases the GIL inside its native kernels during `model.generate()`, so the worker thread doesn't starve other Python threads either.
### Frame flow
```
UserImageRawFrame
        │
SmolVlmService.run_vision()
        │
        ├─ VisionFullResponseStartFrame
        ├─ VisionTextFrame(text="…")
        └─ VisionFullResponseEndFrame
```
On error, a single `ErrorFrame` is yielded instead.
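Put together, the shape of `run_vision` is roughly the following. This is an illustrative sketch, not the actual source; `self._infer` stands in for the blocking HuggingFace call:

```python
import asyncio


async def run_vision(self, frame):
    """Sketch of the frame flow above; the error path yields a lone ErrorFrame."""
    try:
        # Blocking HF inference runs in a worker thread (see Thread safety).
        text = await asyncio.to_thread(self._infer, frame)
    except Exception as e:
        yield ErrorFrame(f"SmolVLM inference failed: {e}")
        return
    yield VisionFullResponseStartFrame()
    yield VisionTextFrame(text=text)
    yield VisionFullResponseEndFrame()
```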
### Image format handling
Incoming frames are converted to `RGB` before processing, so the service handles `RGBA`, `L`, and palette-mode images transparently.
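The normalization amounts to a Pillow mode conversion; a minimal sketch:

```python
from PIL import Image


def to_rgb(frame) -> Image.Image:
    # Rebuild a PIL image from the frame's raw bytes, then normalize
    # RGBA / L / P (palette) and friends down to plain RGB.
    img = Image.frombytes(frame.format, frame.size, frame.image)
    return img if img.mode == "RGB" else img.convert("RGB")
```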
## Development
```bash
git clone https://github.com/pipecat-ai/pipecat-community
cd pipecat-community/integrations/smolvlm
pip install -e ".[dev]"

# Lint
ruff check .

# Type check
mypy pipecat_smolvlm

# Tests
pytest
```
## License
BSD 2-Clause. See LICENSE for details.
The SmolVLM model weights are released by HuggingFace under the Apache 2.0 license.