SmolVLM vision service integration for the Pipecat voice agent framework


pipecat-smolvlm

A community integration that brings SmolVLM vision-language models into Pipecat voice agent pipelines.

SmolVLM is a family of compact, instruction-tuned vision-language models from HuggingFace that run entirely on your own hardware — no external API calls, no per-image billing, no data leaving your infrastructure.


Models

Key    HuggingFace ID                        Memory (fp32)   Notes
256M   HuggingFaceTB/SmolVLM-256M-Instruct   ~1 GB           Default. Fastest, lowest memory use.
500M   HuggingFaceTB/SmolVLM-500M-Instruct   ~2 GB           Better scene understanding.
2B     HuggingFaceTB/SmolVLM-2B-Instruct     ~8 GB           Highest quality. Needs a GPU.

All three checkpoints are interchangeable via the model setting.


Installation

# Core install
pip install pipecat-smolvlm

# With Flash Attention 2 (CUDA only — significant speedup)
pip install "pipecat-smolvlm[flash-attn]"

# Intel XPU support
pip install "pipecat-smolvlm[xpu]"

Model weights are downloaded from HuggingFace on first use and cached in ~/.cache/huggingface/.
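
Weights can also be fetched ahead of time, which is handy for container builds or offline machines. A minimal sketch using huggingface_hub (installed as a dependency of transformers); the repo ID shown is the package default:

from huggingface_hub import snapshot_download

# Pre-fetch the default checkpoint into the local HF cache
# (~/.cache/huggingface/) so the first pipeline run pays no download cost.
snapshot_download("HuggingFaceTB/SmolVLM-256M-Instruct")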


Quick start

from pipecat_smolvlm import SmolVlmService

# Default: SmolVLM-256M-Instruct, auto device detection
service = SmolVlmService()

# Custom model and generation settings
service = SmolVlmService(
    settings=SmolVlmService.Settings(
        model="HuggingFaceTB/SmolVLM-500M-Instruct",
        max_new_tokens=256,
        default_prompt="What objects can you see?",
        temperature=0.3,
        do_sample=True,
    )
)

# Force CPU (no GPU required)
service = SmolVlmService(use_cpu=True)

Drop SmolVlmService into any Pipecat Pipeline wherever a VisionService is expected:

pipeline = Pipeline([
    transport.input(),
    image_capture_processor,   # converts camera frames → UserImageRawFrame
    SmolVlmService(),
    vision_to_speech,          # VisionTextFrame → TextFrame
    tts_service,
    transport.output(),
])

Settings reference

Setting          Type           Default                               Description
model            str            HuggingFaceTB/SmolVLM-256M-Instruct   HuggingFace model ID or local path
max_new_tokens   int            500                                   Maximum tokens to generate per image
default_prompt   str            "Describe the given image."           Fallback prompt when the frame carries no text
temperature      float | None   None                                  Sampling temperature; None means greedy decoding
do_sample        bool           False                                 Enable sampling without a fixed temperature
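
temperature and do_sample follow the usual transformers generation semantics. As a quick sketch of the two decoding modes, using only the settings documented above:

from pipecat_smolvlm import SmolVlmService

# Greedy decoding (the default): deterministic output
precise = SmolVlmService.Settings()  # temperature=None, do_sample=False

# Stochastic decoding: sample from the token distribution, optionally
# with a fixed temperature (lower = more focused, higher = more varied)
varied = SmolVlmService.Settings(do_sample=True, temperature=0.7)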

Per-image prompts

Pass a custom prompt in the text field of UserImageRawFrame:

frame = UserImageRawFrame(
    image=raw_bytes,
    size=(width, height),
    format="RGB",
    text="Is there a person in this image? Answer yes or no.",
)

If frame.text is empty or None, the service falls back to settings.default_prompt.
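
Conceptually, the prompt selection reduces to a single fallback expression (an illustrative sketch, not the actual source):

# Empty string and None both fall through to the configured default.
prompt = frame.text or settings.default_prompt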


Device selection

The service automatically picks the best available device:

Priority   Device                                                  dtype
1          Intel XPU (if intel-extension-for-pytorch installed)    float32
2          CUDA GPU                                                bfloat16
3          Apple MPS (M-series)                                    bfloat16
4          CPU (fallback)                                          float32

Pass use_cpu=True to the constructor to override this and force CPU.
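
The selection order is roughly equivalent to the following sketch (illustrative, not the actual source; the hasattr guard keeps it safe on PyTorch builds without XPU support):

import torch

def pick_device() -> tuple[str, torch.dtype]:
    # Mirrors the priority table above: XPU, then CUDA, then MPS, then CPU.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu", torch.float32
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    if torch.backends.mps.is_available():
        return "mps", torch.bfloat16
    return "cpu", torch.float32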


Examples

Foundational — camera description over voice

Describes whatever a WebRTC participant's camera sees each time they speak.

export DAILY_API_KEY=...
export CARTESIA_API_KEY=...

python examples/foundational/smolvlm_basic.py \
    --room-url https://yourapp.daily.co/my-room \
    --model 500M \
    --prompt "Describe what you see in one sentence."

See examples/foundational/smolvlm_basic.py for the full source.


Architecture notes

Thread safety

SmolVlmService.run_vision off-loads the synchronous HuggingFace inference to asyncio.to_thread, so the event loop is never blocked. PyTorch releases the GIL inside its native kernels, so even during model.generate() the worker thread does not starve the rest of the pipeline.
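
This is the standard asyncio pattern for wrapping blocking work; a self-contained illustration, where the _generate stub stands in for the real preprocessing and model.generate() call:

import asyncio
import time

def _generate(prompt: str) -> str:
    # Stand-in for the blocking HuggingFace inference.
    time.sleep(1.0)
    return f"(description for: {prompt})"

async def describe(prompt: str) -> str:
    # The blocking call runs on a worker thread; the event loop keeps
    # servicing audio and transport frames in the meantime.
    return await asyncio.to_thread(_generate, prompt)

print(asyncio.run(describe("Describe the given image.")))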

Frame flow

UserImageRawFrame
        │
  SmolVlmService.run_vision()
        │
  ├─ VisionFullResponseStartFrame
  ├─ VisionTextFrame(text="…")
  └─ VisionFullResponseEndFrame

On error, the service yields a single ErrorFrame instead.
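
A downstream consumer can rely on that bracketing to reassemble the response. An illustrative sketch, assuming run_vision is consumed as an async generator per the yields above (frame class names as above, imports omitted, attribute names assumed; real pipelines receive these frames through downstream processors instead):

async def collect_description(service, frame) -> str:
    chunks = []
    async for out in service.run_vision(frame):
        if isinstance(out, VisionTextFrame):
            chunks.append(out.text)        # accumulate description text
        elif isinstance(out, ErrorFrame):
            raise RuntimeError(out.error)  # surface the failure
    return "".join(chunks)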

Image format handling

Incoming frames are converted to RGB before processing so the service handles RGBA, L, and palette-mode images transparently.
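
A minimal sketch of that normalization step with Pillow, assuming the frame bytes have already been decoded into a PIL image:

from PIL import Image

def ensure_rgb(img: Image.Image) -> Image.Image:
    # Normalize RGBA, grayscale ("L"), and palette ("P") inputs to
    # three-channel RGB so the model always sees the same layout.
    return img if img.mode == "RGB" else img.convert("RGB")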


Development

git clone https://github.com/pipecat-ai/pipecat-community
cd pipecat-community/integrations/smolvlm

pip install -e ".[dev]"

# Lint
ruff check .

# Type check
mypy pipecat_smolvlm

# Tests
pytest

License

BSD 2-Clause. See LICENSE for details.

The SmolVLM model weights are released by HuggingFace under the Apache 2.0 license.

