The universal entrypoint to HuggingFace transformers for Strands agents - 100% task & modality coverage, zero hardcoding.

🤗 Strands Transformers

Run any HuggingFace transformers model from a Strands agent - as a tool, or as the agent's own brain.

Two entry points: use_transformers (a tool for all 24 transformers tasks) and TransformerModel (a local model provider that consumes image / video / audio / document content blocks).

use_transformers is one tool that exposes every transformers task. It reads transformers' task taxonomy at runtime, so a model or task added upstream works here without a code change - discover tasks, run a pipeline, or call any class or method directly.

TransformerModel plugs a local HF model in as a Strands Agent(model=…). It speaks the agent content-block protocol, so the model receives image, video, audio and document blocks directly. Vision-language models see images; audio models hear; Qwen2.5-Omni hears and replies with generated speech.

flowchart LR
    IN["📥 text · image · video<br/>audio · document · robot-state"]
    TOOL["🛠️ use_transformers<br/><i>tool</i>"]
    BRAIN["🧠 TransformerModel<br/><i>local agent brain</i>"]
    OUT["📤 text · speech · image<br/>labels · actions"]
    IN --> TOOL --> OUT
    IN --> BRAIN --> OUT
    classDef i fill:#7C4DFF,stroke:#5b34d6,color:#fff;
    classDef c fill:#FFD21E,stroke:#E68A00,color:#3a2d00;
    classDef o fill:#00E5FF,stroke:#00b3cc,color:#003844;
    class IN i;
    class TOOL,BRAIN c;
    class OUT o;

📖 Full documentation (built with MkDocs, see docs/)

Install

uv pip install strands-transformers        # from PyPI
# or from source:
uv pip install -e .                         # or: pip install -e .
PYTHONPATH=. python examples/smoke.py       # verify → "18/18 checks passed"

Optional extras (audio · vision · training · docs)

uv pip install -e ".[audio]"      # soundfile, librosa  (mp3/flac/ogg decode)
uv pip install -e ".[vision]"     # torchvision (VLMs!), opencv, av
uv pip install -e ".[training]"   # trl, peft, accelerate
uv pip install -e ".[docs]"       # mkdocs-material, mkdocstrings
uv pip install -e ".[all]"        # everything
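To see which extras are actually usable in the current environment, a small check along these lines can help. It is a sketch, not part of the package: the module names are taken from the comments in the install commands above, and mapping opencv to the import name `cv2` is an assumption about which OpenCV distribution the extra pulls in.

```python
from importlib.util import find_spec

# Import names per extra, taken from the install comments above.
# Assumption: the "opencv" dependency is importable as "cv2".
EXTRAS = {
    "audio": ["soundfile", "librosa"],
    "vision": ["torchvision", "cv2", "av"],
    "training": ["trl", "peft", "accelerate"],
}


def missing_extras(extras=EXTRAS):
    """Return {extra: [modules that cannot be found]} for incomplete extras only."""
    report = {}
    for extra, modules in extras.items():
        missing = [name for name in modules if find_spec(name) is None]
        if missing:
            report[extra] = missing
    return report


if __name__ == "__main__":
    for extra, modules in missing_extras().items():
        print(f"[{extra}] missing: {', '.join(modules)}")
```

An empty result means every listed module can be located; it does not prove the modules import cleanly.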

Vision models (SmolVLM, etc.) need the [vision] extra (torchvision). WAV audio works without extras. device="auto" picks cuda → mps → cpu (bf16 on GPU).
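The cuda → mps → cpu fallback can be pictured roughly as below. This is an illustrative sketch, not the package's code: `pick_device` and `probe` are hypothetical names, and the dtype choices for mps and cpu are assumptions (the text above only says "bf16 on GPU").

```python
def pick_device(cuda_available: bool, mps_available: bool) -> tuple:
    """Return (device, dtype_name) following the cuda -> mps -> cpu order."""
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        # Assumption: "bf16 on GPU" is read here as covering Apple GPUs too.
        return "mps", "bfloat16"
    # Assumption: the source does not state the CPU dtype; float32 is a guess.
    return "cpu", "float32"


def probe() -> tuple:
    """Ask torch what is available; fall back to CPU if torch is not installed."""
    try:
        import torch
    except ImportError:
        return pick_device(False, False)
    mps = getattr(torch.backends, "mps", None)
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))
```

Keeping the decision in a pure function makes the priority order easy to test without a GPU.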

60-second hello - a local vision agent

import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

buf = io.BytesIO()
Image.new("RGB", (64, 64), (20, 200, 40)).save(buf, "PNG")  # green square

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct")
agent = Agent(model=model, system_prompt="You are concise.")

print(agent([
    {"image": {"format": "png", "source": {"bytes": buf.getvalue()}}},
    {"text": "Color? One word."},
]))
# → Green.

A 256M-param model in the standard Strands loop, seeing pixels through a content block - no API key, no server. Swap model_path for any HF VLM.
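If you build these prompts often, a pair of small helpers keeps the nesting out of the way. The sketch below only reproduces the block shape used in the hello example; `image_block`, `image_block_from_file` and `vision_prompt` are hypothetical names, and normalizing `jpg` to `jpeg` is an assumption about the format strings the provider accepts.

```python
from pathlib import Path


def image_block(data: bytes, fmt: str = "png") -> dict:
    """Wrap raw image bytes in the content-block shape from the hello example."""
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def image_block_from_file(path: str) -> dict:
    """Read an image file and infer the format from its extension."""
    p = Path(path)
    fmt = p.suffix.lower().lstrip(".")
    if fmt == "jpg":
        fmt = "jpeg"  # assumption: the provider expects "jpeg", not "jpg"
    return image_block(p.read_bytes(), fmt)


def vision_prompt(data: bytes, question: str, fmt: str = "png") -> list:
    """Build the [image, text] message list that is passed to agent(...)."""
    return [image_block(data, fmt), {"text": question}]
```

With these, the call in the example becomes `agent(vision_prompt(buf.getvalue(), "Color? One word."))`.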

See it work

Every output below is a real model result (CUDA · transformers 5.12 · torch 2.10):

You give it	Script	It returns
🖼️ a green image + "Color?"	`examples/multimodal_agent.py`	`"Green."`
🎬 brightening frames	`examples/multimodal_advanced.py`	`"BRIGHTER."`
🧰 a tool screenshot (blue)	`examples/multimodal_advanced.py`	`"Blue."`
📄 a text document	`examples/document_and_audio.py`	recovers `BANANA-42`
🔊 a 440 Hz tone (Omni)	`examples/omni_audio.py`	`"It's a pure tone."`
💬 "say: …can speak" (Omni)	`examples/omni_audio.py`	🔊 real 24 kHz speech

Real agent outputs - detection boxes, depth, panoptic segmentation (one COCO photo):

[Figure: detection · depth · segmentation]

🎬 Video understanding - frames in, label out (media on the docs site).

🔊 Speech - text-to-audio, then re-transcribed by Whisper (the library narrating itself).

▶️ Hear it speak + play every example in the docs →

Featured models

The examples use tiny models so they run in seconds. In practice you point the same code at any current library_name: transformers model - swap the id, the plumbing is identical. A few strong open ones, by modality:

Modality	Model	How to use
Vision-language	`Qwen/Qwen3-VL-8B-Instruct` · `google/gemma-3-4b-it`	`TransformerModel` brain or `run` (image-text-to-text)
Speech → text	`openai/whisper-large-v3-turbo` · `Qwen/Qwen3-ASR-1.7B`	`run` (automatic-speech-recognition)
Audio in + speech out	`Qwen/Qwen2.5-Omni-3B`	`TransformerModel` brain (`speak=True`)
Multimodal (audio+vision+text)	`microsoft/Phi-4-multimodal-instruct`	`TransformerModel` brain
Robot actions (VLA)	`allenai/MolmoAct2` · `openvla/openvla-7b`	`call` → `predict_action`
Embodied reasoning	`nvidia/Cosmos-Reason2-2B`	`run` (image-text-to-text)

# swap the tiny demo model for a SOTA one - same code:
model = TransformerModel(model_path="Qwen/Qwen3-VL-8B-Instruct")

Two ways to use it

As a tool - use_transformers (discover · run · call)

from strands import Agent
from strands_transformers import use_transformers

agent = Agent(tools=[use_transformers])
agent("Transcribe recording.wav")                  # automatic-speech-recognition
agent("What's in scene.jpg?")                       # image-text-to-text
agent("Say 'hello from strands' as audio")          # text-to-audio
agent("Detect objects in https://.../street.jpg")   # object-detection

Discover everything at runtime (action="tasks" | "modalities" | "inspect" | …), run high-level pipelines, or call any class, function, or method for custom models. → The tool guide
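The "call" idea, reaching an arbitrary class, function or method by name, comes down to resolving a dotted path at runtime. The sketch below shows that general technique with the standard library only; `resolve` and `call` are hypothetical helpers and are not claimed to match how use_transformers implements it.

```python
import importlib


def resolve(dotted: str):
    """Resolve 'package.module.attr.subattr' to the object it names."""
    parts = dotted.split(".")
    # Try the longest importable module prefix first, then walk attributes.
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {dotted!r}")


def call(dotted: str, *args, **kwargs):
    """Resolve a dotted path and call the result with the given arguments."""
    return resolve(dotted)(*args, **kwargs)
```

For instance, `call("json.dumps", {"a": 1})` resolves `json.dumps` and invokes it; the same pattern extends to a model class and one of its methods.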

As the agent's brain - TransformerModel (multimodal content blocks)

Pass image / video / audio / document content blocks (and media inside a toolResult) - the provider auto-detects the model's processor and routes them. All outputs below are real results (CUDA, transformers 5.12 / torch 2.10):

Content block	Example	Verified output
`image`	`multimodal_agent.py`	`"Green."`
`video` (with `fps`)	`multimodal_advanced.py`	`"BRIGHTER."`
`image` in `toolResult`	`multimodal_advanced.py`	`"Blue."`
`document`	`document_and_audio.py`	recovers `BANANA-42`
`audio` (our schema extension)	`audio_content_block.py`	audio → text
`audio` in and speech out	`omni_audio.py`	hears + speaks (Qwen2.5-Omni)

→ Agent brain · Content blocks · Audio
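For trying the audio path without any recordings on hand, a test tone can be generated with the standard library, which fits the note that WAV audio works without extras. This sketch only produces the WAV bytes; `sine_wav` is a hypothetical helper, the 16 kHz sample rate is an arbitrary choice, and the exact shape of the `audio` content block is not shown here, so wrapping the bytes is left to the Audio guide.

```python
import io
import math
import struct
import wave


def sine_wav(freq_hz: float = 440.0, seconds: float = 1.0,
             rate: int = 16000, amplitude: float = 0.3) -> bytes:
    """Return a mono 16-bit PCM WAV file (as bytes) containing a pure sine tone."""
    n_frames = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)
        w.writeframes(frames)
    return buf.getvalue()
```

`sine_wav()` gives one second of a 440 Hz tone, the same kind of input the Omni example describes.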

Robotics / VLA - camera + instruction → robot actions

Two layers, both transformers-native and GPU-verified:

🧠 reason - Cosmos-Reason2-2B (a physical-AI VLM) plans over a scene via the run path: "the red cube is in the bottom left corner, so the arm should move there first."
⚙️ act - VLA models expose predict_action via the call path: MolmoAct2 returns an action tensor of shape [1,30,6]; OpenVLA-7b returns a 7-DoF action (4.x→5.x compatibility shims are applied automatically).

🔗 Full agentic loop (examples/robot_reason_act_agent.py): Cosmos-Reason plans over real RealSense frames → MolmoAct acts ([1,30,6]) - perception→plan→action through one tool.
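When consuming an action output like [1,30,6], it helps to check the shape and walk it step by step. The sketch below assumes that shape means 1 batch × 30 timesteps × 6 values per step, which the text above does not spell out; the data is placeholder zeros, not model output, and `shape` / `iter_steps` are hypothetical helpers.

```python
def shape(nested) -> tuple:
    """Return the dimensions of a regular nested list, e.g. (1, 30, 6)."""
    dims = []
    while isinstance(nested, (list, tuple)):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)


def iter_steps(chunk):
    """Yield (batch_index, step_index, action_vector) for each step in the chunk."""
    for b, trajectory in enumerate(chunk):
        for t, action in enumerate(trajectory):
            yield b, t, action


# Placeholder with the [1, 30, 6] shape; zeros only, NOT real model output.
chunk = [[[0.0] * 6 for _ in range(30)] for _ in range(1)]

if __name__ == "__main__":
    print(shape(chunk))
    for b, t, action in iter_steps(chunk):
        pass  # hand each action vector to the robot controller here
```

If the model returns a tensor rather than nested lists, converting it with `.tolist()` first keeps the same loop usable.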

LeRobot-ecosystem policies (SmolVLA, π0, ACT, GR00T) use their own runtimes - pair with use_lerobot. → Robotics guide

How it works

Nothing is hardcoded per task - core/registry.py reads transformers' own SUPPORTED_TASKS at runtime, so coverage tracks upstream automatically.
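As a rough picture of what "reading the taxonomy at runtime" means, the sketch below loads SUPPORTED_TASKS if transformers is installed and groups task names by a `type` field. It is not the contents of core/registry.py: `load_taxonomy` and `group_by_type` are hypothetical names, and both the import location (`transformers.pipelines`) and the presence of a `type` key per entry are assumptions that may vary across transformers versions.

```python
from collections import defaultdict


def load_taxonomy() -> dict:
    """Return transformers' task table, or {} if it cannot be imported."""
    try:
        # Assumption: SUPPORTED_TASKS is importable from transformers.pipelines.
        from transformers.pipelines import SUPPORTED_TASKS
    except Exception:
        return {}
    return dict(SUPPORTED_TASKS)


def group_by_type(taxonomy: dict) -> dict:
    """Group task names by their 'type' entry (assumed key), sorted per group."""
    groups = defaultdict(list)
    for task, spec in taxonomy.items():
        kind = spec.get("type", "unknown") if isinstance(spec, dict) else "unknown"
        groups[kind].append(task)
    return {kind: sorted(tasks) for kind, tasks in groups.items()}


# Usage (requires transformers to return anything non-empty):
#   for kind, tasks in group_by_type(load_taxonomy()).items():
#       print(kind, len(tasks))
```

Because nothing here names an individual task, a task added upstream shows up in the grouping without any change to this code.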

Project layout

strands_transformers/
├── tools/use_transformers.py   # the one @tool: discover · run · call
├── models/transformers.py      # TransformerModel - local multimodal agent brain
├── types/audio.py              # audio content-block extension
└── core/{registry,engine,io,compat}.py   # taxonomy · load/cache · I/O · legacy shims

→ Architecture · API reference

Examples

12 runnable, GPU-verified examples in examples/ - image, video, audio, document, Omni speech, VLA, and pipelines. Run any:

PYTHONPATH=. python examples/<name>.py

→ Examples & FAQ

License

MIT - built with Strands Agents SDK and HuggingFace Transformers.

If this saved you a pile of per-model glue code, consider giving it a ⭐.

Release history

0.4.0 - Jun 15, 2026
0.3.0 - Jun 15, 2026 (this version)
0.2.0 - Jun 14, 2026

Download files

Source Distribution

strands_transformers-0.3.0.tar.gz (1.5 MB) - uploaded Jun 15, 2026, Source

Built Distribution

strands_transformers-0.3.0-py3-none-any.whl (45.4 kB) - uploaded Jun 15, 2026, Python 3

File details

Details for the file strands_transformers-0.3.0.tar.gz.

File metadata

Download URL: strands_transformers-0.3.0.tar.gz
Upload date: Jun 15, 2026
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for strands_transformers-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`b9b5a22af079ea210d3b095478f5416ee7de8e7e7cb27433d7590fe894fc8d5e`
MD5	`e7d284911cd3549fb21d474a94cb1795`
BLAKE2b-256	`63cc0fad36c0fd4ab893b8230a73d11f8c4a9e64950d6e7b008aa0ca7a2d5112`

File details

Details for the file strands_transformers-0.3.0-py3-none-any.whl.

File metadata

Download URL: strands_transformers-0.3.0-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 45.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for strands_transformers-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9c03cfe6dfffbd90e346286de51824b153db5da33714c21e6284846c56d005e4`
MD5	`8922a66e15f9db65bab0c16eddf373a2`
BLAKE2b-256	`5eeb66a9dd37bd32b97704a65bffb2646607d6eff1b1607c237dfee49c27cd3b`
