SmolVLM vision service integration for the Pipecat voice agent framework


pipecat-smolvlm

A community integration that brings SmolVLM vision-language models into Pipecat voice agent pipelines.

SmolVLM is a family of compact, instruction-tuned vision-language models from HuggingFace that run entirely on your own hardware — no external API calls, no per-image billing, no data leaving your infrastructure.


Models

Key    HuggingFace ID                        Memory (fp32)   Notes
256M   HuggingFaceTB/SmolVLM-256M-Instruct   ~1 GB           Default. Fastest, lowest memory use.
500M   HuggingFaceTB/SmolVLM-500M-Instruct   ~2 GB           Better scene understanding.
2B     HuggingFaceTB/SmolVLM-2B-Instruct     ~8 GB           Highest quality. Needs a GPU.

All three checkpoints are interchangeable via the model setting.


Installation

# Core install
pip install pipecat-smolvlm

# With Flash Attention 2 (CUDA only — significant speedup)
pip install "pipecat-smolvlm[flash-attn]"

# Intel XPU support
pip install "pipecat-smolvlm[xpu]"

Model weights are downloaded from HuggingFace on first use and cached in ~/.cache/huggingface/.
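
Weights can also be fetched ahead of time, which is handy for container builds or offline machines. A minimal sketch using huggingface_hub (installed as a dependency of transformers); the repo ID shown is the package default:

from huggingface_hub import snapshot_download

# Pre-fetch the default checkpoint into the local HF cache
# (~/.cache/huggingface/) so the first pipeline run pays no download cost.
snapshot_download("HuggingFaceTB/SmolVLM-256M-Instruct")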


Quick start

from pipecat_smolvlm import SmolVlmService

# Default: SmolVLM-256M-Instruct, auto device detection
service = SmolVlmService()

# Custom model and generation settings
service = SmolVlmService(
    settings=SmolVlmService.Settings(
        model="HuggingFaceTB/SmolVLM-500M-Instruct",
        max_new_tokens=256,
        default_prompt="What objects can you see?",
        temperature=0.3,
        do_sample=True,
    )
)

# Force CPU (no GPU required)
service = SmolVlmService(use_cpu=True)

Drop SmolVlmService into any Pipecat Pipeline wherever a VisionService is expected:

pipeline = Pipeline([
    transport.input(),
    image_capture_processor,   # converts camera frames → UserImageRawFrame
    SmolVlmService(),
    vision_to_speech,          # VisionTextFrame → TextFrame
    tts_service,
    transport.output(),
])

Settings reference

Setting          Type           Default                               Description
model            str            HuggingFaceTB/SmolVLM-256M-Instruct   HuggingFace model ID or local path
max_new_tokens   int            500                                   Maximum tokens to generate per image
default_prompt   str            "Describe the given image."           Fallback prompt when the frame carries no text
temperature      float | None   None                                  Sampling temperature; None means greedy decoding
do_sample        bool           False                                 Enable sampling without a fixed temperature
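
temperature and do_sample follow the usual transformers generation semantics. As a quick sketch of the two decoding modes, using only the settings documented above:

from pipecat_smolvlm import SmolVlmService

# Greedy decoding (the default): deterministic output
precise = SmolVlmService.Settings()  # temperature=None, do_sample=False

# Stochastic decoding: sample from the token distribution, optionally
# with a fixed temperature (lower = more focused, higher = more varied)
varied = SmolVlmService.Settings(do_sample=True, temperature=0.7)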

Per-image prompts

Pass a custom prompt in the text field of UserImageRawFrame:

frame = UserImageRawFrame(
    image=raw_bytes,
    size=(width, height),
    format="RGB",
    text="Is there a person in this image? Answer yes or no.",
)

If frame.text is empty or None, the service falls back to settings.default_prompt.
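
Conceptually, the prompt selection reduces to a single fallback expression (an illustrative sketch, not the actual source):

# Empty string and None both fall through to the configured default.
prompt = frame.text or settings.default_prompt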


Device selection

The service automatically picks the best available device:

Priority   Device                                                  dtype
1          Intel XPU (if intel-extension-for-pytorch installed)    float32
2          CUDA GPU                                                bfloat16
3          Apple MPS (M-series)                                    bfloat16
4          CPU (fallback)                                          float32

Pass use_cpu=True to the constructor to override this and force CPU.
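
The selection order is roughly equivalent to the following sketch (illustrative, not the actual source; the hasattr guard keeps it safe on PyTorch builds without XPU support):

import torch

def pick_device() -> tuple[str, torch.dtype]:
    # Mirrors the priority table above: XPU, then CUDA, then MPS, then CPU.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu", torch.float32
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    if torch.backends.mps.is_available():
        return "mps", torch.bfloat16
    return "cpu", torch.float32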


Examples

Foundational — camera description over voice

Describes whatever a WebRTC participant's camera sees each time they speak.

export DAILY_API_KEY=...
export CARTESIA_API_KEY=...

python examples/foundational/smolvlm_basic.py \
    --room-url https://yourapp.daily.co/my-room \
    --model 500M \
    --prompt "Describe what you see in one sentence."

See examples/foundational/smolvlm_basic.py for the full source.


Architecture notes

Thread safety

SmolVlmService.run_vision off-loads the synchronous HuggingFace inference to asyncio.to_thread, so the event loop is never blocked. PyTorch releases the GIL inside its native kernels, so even during model.generate() the worker thread does not starve the rest of the pipeline.
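
This is the standard asyncio pattern for wrapping blocking work; a self-contained illustration, where the _generate stub stands in for the real preprocessing and model.generate() call:

import asyncio
import time

def _generate(prompt: str) -> str:
    # Stand-in for the blocking HuggingFace inference.
    time.sleep(1.0)
    return f"(description for: {prompt})"

async def describe(prompt: str) -> str:
    # The blocking call runs on a worker thread; the event loop keeps
    # servicing audio and transport frames in the meantime.
    return await asyncio.to_thread(_generate, prompt)

print(asyncio.run(describe("Describe the given image.")))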

Frame flow

UserImageRawFrame
        │
  SmolVlmService.run_vision()
        │
  ├─ VisionFullResponseStartFrame
  ├─ VisionTextFrame(text="…")
  └─ VisionFullResponseEndFrame

On error, the service yields a single ErrorFrame instead.
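
A downstream consumer can rely on that bracketing to reassemble the response. An illustrative sketch, assuming run_vision is consumed as an async generator per the yields above (frame class names as above, imports omitted, attribute names assumed; real pipelines receive these frames through downstream processors instead):

async def collect_description(service, frame) -> str:
    chunks = []
    async for out in service.run_vision(frame):
        if isinstance(out, VisionTextFrame):
            chunks.append(out.text)        # accumulate description text
        elif isinstance(out, ErrorFrame):
            raise RuntimeError(out.error)  # surface the failure
    return "".join(chunks)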

Image format handling

Incoming frames are converted to RGB before processing so the service handles RGBA, L, and palette-mode images transparently.
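
A minimal sketch of that normalization step with Pillow, assuming the frame bytes have already been decoded into a PIL image:

from PIL import Image

def ensure_rgb(img: Image.Image) -> Image.Image:
    # Normalize RGBA, grayscale ("L"), and palette ("P") inputs to
    # three-channel RGB so the model always sees the same layout.
    return img if img.mode == "RGB" else img.convert("RGB")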


Development

git clone https://github.com/pipecat-ai/pipecat-community
cd pipecat-community/integrations/smolvlm

pip install -e ".[dev]"

# Lint
ruff check .

# Type check
mypy pipecat_smolvlm

# Tests
pytest

License

BSD 2-Clause. See LICENSE for details.

The SmolVLM model weights are released by HuggingFace under the Apache 2.0 license.

