The universal entrypoint to HuggingFace transformers for Strands agents - 100% task & modality coverage, zero hardcoding.
🤗 Strands Transformers
Run any HuggingFace transformers model from a Strands agent - as a tool, or as the agent's own brain.
Two entry points: use_transformers (a tool for all 24 transformers tasks) and TransformerModel (a local model provider that consumes image / video / audio / document content blocks).
use_transformers is one tool that exposes every transformers task. It reads
transformers' task taxonomy at runtime, so a model or task added upstream works
here without a code change - discover tasks, run a pipeline, or call any class/
method directly.
TransformerModel plugs a local HF model in as a Strands Agent(model=…).
It speaks the agent content-block protocol, so the model receives image,
video, audio and document blocks directly. Vision-language models see
images; audio models hear; Qwen2.5-Omni hears and replies with generated speech.
flowchart LR
IN["📥 text · image · video<br/>audio · document · robot-state"]
TOOL["🛠️ use_transformers<br/><i>tool</i>"]
BRAIN["🧠 TransformerModel<br/><i>local agent brain</i>"]
OUT["📤 text · speech · image<br/>labels · actions"]
IN --> TOOL --> OUT
IN --> BRAIN --> OUT
classDef i fill:#7C4DFF,stroke:#5b34d6,color:#fff;
classDef c fill:#FFD21E,stroke:#E68A00,color:#3a2d00;
classDef o fill:#00E5FF,stroke:#00b3cc,color:#003844;
class IN i;
class TOOL,BRAIN c;
class OUT o;
📖 Full documentation (built with MkDocs, see docs/)
Install
uv pip install strands-transformers # from PyPI
# or from source:
uv pip install -e . # or: pip install -e .
PYTHONPATH=. python examples/smoke.py # verify → "18/18 checks passed"
Optional extras (audio · vision · training · docs)
uv pip install -e ".[audio]" # soundfile, librosa (mp3/flac/ogg decode)
uv pip install -e ".[vision]" # torchvision (VLMs!), opencv, av
uv pip install -e ".[training]" # trl, peft, accelerate
uv pip install -e ".[docs]" # mkdocs-material, mkdocstrings
uv pip install -e ".[all]" # everything
Vision models (SmolVLM, etc.) need the [vision] extra (torchvision). WAV
audio works without extras. device="auto" picks cuda → mps → cpu (bf16 on GPU).
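The cuda → mps → cpu fallback can be sketched as below. This is an illustration only, not the package's code: `pick_device` and `detect_device` are hypothetical names, and the dtype split assumes that "bf16 on GPU" refers to CUDA, with MPS and CPU staying on float32.

```python
from typing import Tuple


def pick_device(cuda_available: bool, mps_available: bool) -> Tuple[str, str]:
    """Return (device, dtype) following the cuda -> mps -> cpu order.

    Hypothetical helper. ASSUMPTION: "bf16 on GPU" means CUDA only;
    MPS and CPU fall back to float32 in this sketch.
    """
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        return "mps", "float32"
    return "cpu", "float32"


def detect_device() -> Tuple[str, str]:
    """Probe torch if it is installed; otherwise report CPU."""
    try:
        import torch
    except ImportError:
        return pick_device(False, False)
    # torch.backends.mps may be absent on older torch builds.
    mps = getattr(torch.backends, "mps", None)
    return pick_device(
        torch.cuda.is_available(),
        bool(mps is not None and mps.is_available()),
    )


print(pick_device(cuda_available=False, mps_available=True))
```

Keeping the decision in a pure function makes the ordering easy to check on a machine with no accelerator at all.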
60-second hello - a local vision agent
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel
buf = io.BytesIO()
Image.new("RGB", (64, 64), (20, 200, 40)).save(buf, "PNG") # green square
model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct")
agent = Agent(model=model, system_prompt="You are concise.")
print(agent([
{"image": {"format": "png", "source": {"bytes": buf.getvalue()}}},
{"text": "Color? One word."},
]))
# → Green.
A 256M-param model in the standard Strands loop, seeing pixels through a content
block - no API key, no server. Swap model_path for any HF VLM.
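If the image comes from disk or the network rather than PIL, the same content block can be built from raw bytes. A minimal sketch, assuming the block shape shown in the hello example; `sniff_format`, `image_block` and `ask_about_image` are hypothetical helpers, not part of strands-transformers.

```python
from typing import Any, Dict, List

# Leading magic bytes for the formats this sketch recognizes.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_format(data: bytes) -> str:
    """Guess the image format from its first bytes."""
    for magic, fmt in _SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    raise ValueError("unrecognized image format")


def image_block(data: bytes) -> Dict[str, Any]:
    """Wrap raw image bytes in the image content-block shape used above."""
    return {"image": {"format": sniff_format(data), "source": {"bytes": data}}}


def ask_about_image(data: bytes, question: str) -> List[Dict[str, Any]]:
    """Build the [image, text] prompt list that the agent is called with."""
    return [image_block(data), {"text": question}]
```

With the `agent` and `buf` from the hello example, `agent(ask_about_image(buf.getvalue(), "Color? One word."))` would send the same two blocks.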
See it work
Every output below is a real model result (CUDA · transformers 5.12 · torch 2.10):
| You give it | Script | It returns |
|---|---|---|
| 🖼️ a green image + "Color?" | examples/multimodal_agent.py | "Green." |
| 🎬 brightening frames | examples/multimodal_advanced.py | "BRIGHTER." |
| 🧰 a tool screenshot (blue) | examples/multimodal_advanced.py | "Blue." |
| 📄 a text document | examples/document_and_audio.py | recovers BANANA-42 |
| 🔊 a 440 Hz tone (Omni) | examples/omni_audio.py | "It's a pure tone." |
| 💬 "say: …can speak" (Omni) | examples/omni_audio.py | 🔊 real 24 kHz speech |
Real agent outputs - detection boxes, depth, panoptic segmentation (one COCO photo).
🎬 Video understanding - frames in, label out.
🔊 Speech.
▶️ Hear it speak + play every example in the docs →
Featured models
The examples use tiny models so they run in seconds. In practice you point the
same code at any current library_name: transformers model - swap the id, the
plumbing is identical. A few strong open ones, by modality:
| Modality | Model | How to use |
|---|---|---|
| Vision-language | Qwen/Qwen3-VL-8B-Instruct · google/gemma-3-4b-it | TransformerModel brain or run (image-text-to-text) |
| Speech → text | openai/whisper-large-v3-turbo · Qwen/Qwen3-ASR-1.7B | run (automatic-speech-recognition) |
| Audio in + speech out | Qwen/Qwen2.5-Omni-3B | TransformerModel brain (speak=True) |
| Multimodal (audio+vision+text) | microsoft/Phi-4-multimodal-instruct | TransformerModel brain |
| Robot actions (VLA) | allenai/MolmoAct2 · openvla/openvla-7b | call → predict_action |
| Embodied reasoning | nvidia/Cosmos-Reason2-2B | run (image-text-to-text) |
# swap the tiny demo model for a SOTA one - same code:
model = TransformerModel(model_path="Qwen/Qwen3-VL-8B-Instruct")
Two ways to use it
As a tool - use_transformers (discover · run · call)
from strands import Agent
from strands_transformers import use_transformers
agent = Agent(tools=[use_transformers])
agent("Transcribe recording.wav") # automatic-speech-recognition
agent("What's in scene.jpg?") # image-text-to-text
agent("Say 'hello from strands' as audio") # text-to-audio
agent("Detect objects in https://.../street.jpg") # object-detection
Discover everything at runtime (action="tasks" | "modalities" | "inspect" | …),
run high-level pipelines, or call any class/fn/method for custom models.
→ The tool guide
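The call path, reaching any class, function or method by name, rests on resolving a dotted name at runtime. The sketch below shows that general technique with the standard library only; `resolve` and `call` are hypothetical names and this is not the tool's actual implementation.

```python
import importlib
from typing import Any


def resolve(dotted: str) -> Any:
    """Resolve 'package.module.attr' to the object it names."""
    parts = dotted.split(".")
    # Try the longest importable module prefix first, then walk attributes.
    for cut in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:cut]))
        except ImportError:
            continue
        for name in parts[cut:]:
            obj = getattr(obj, name)
        return obj
    raise ImportError(f"cannot resolve {dotted!r}")


def call(dotted: str, *args: Any, **kwargs: Any) -> Any:
    """Look up a callable by dotted name and invoke it."""
    return resolve(dotted)(*args, **kwargs)


print(call("json.dumps", {"task": "object-detection"}))
print(call("collections.Counter", "banana").most_common(1))
```

Nothing in the lookup is specific to a library, so with transformers installed `resolve("transformers.pipeline")` would return the pipeline factory the same way.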
As the agent's brain - TransformerModel (multimodal content blocks)
Pass image / video / audio / document content blocks (and media inside a
toolResult) - the provider auto-detects the model's processor and routes them.
All outputs below are real results (CUDA, transformers 5.12 / torch 2.10):
| Content block | Example | Verified output |
|---|---|---|
| image | multimodal_agent.py | "Green." |
| video (with fps) | multimodal_advanced.py | "BRIGHTER." |
| image in toolResult | multimodal_advanced.py | "Blue." |
| document | document_and_audio.py | recovers BANANA-42 |
| audio (our schema extension) | audio_content_block.py | audio → text |
| audio in and speech out | omni_audio.py | hears + speaks (Qwen2.5-Omni) |
→ Agent brain · Content blocks · Audio
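A document block can be assembled the same way as the image block. The sketch below assumes the Strands document shape (`format`, `name`, `source.bytes`); `document_block` and `build_prompt` are hypothetical helpers, and the document text is a made-up placeholder rather than the content of `document_and_audio.py`.

```python
from typing import Any, Dict, List


def document_block(text: str, name: str = "note") -> Dict[str, Any]:
    """Wrap plain text as a document content block.

    ASSUMPTION: the block follows the Strands document shape
    (format / name / source.bytes).
    """
    return {
        "document": {
            "format": "txt",
            "name": name,
            "source": {"bytes": text.encode("utf-8")},
        }
    }


def build_prompt(doc_text: str, question: str) -> List[Dict[str, Any]]:
    """Document first, then the question, as one list of content blocks."""
    return [document_block(doc_text), {"text": question}]


# Placeholder text for illustration only.
prompt = build_prompt("The access code is BANANA-42.", "What is the access code?")
print(prompt[0]["document"]["format"], len(prompt))
```

The resulting list is passed to the agent exactly like the image prompt in the hello example.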
Robotics / VLA - camera + instruction → robot actions
Two layers, both transformers-native and GPU-verified:
- 🧠 reason - Cosmos-Reason2-2B (a physical-AI VLM) plans over a scene via the run path: "the red cube is in the bottom left corner, so the arm should move there first."
- ⚙️ act - VLA models expose predict_action via the call path: MolmoAct2 → [1,30,6]; OpenVLA-7b → 7-DoF (auto 4.x→5.x shims).
🔗 Full agentic loop (examples/robot_reason_act_agent.py):
Cosmos-Reason plans over real RealSense frames → MolmoAct acts ([1,30,6]) -
perception→plan→action through one tool.
LeRobot-ecosystem policies (SmolVLA, π0, ACT, GR00T) use their own runtimes -
pair with use_lerobot.
→ Robotics guide
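An action chunk such as the [1,30,6] output above usually has to be unpacked before it reaches a controller. A minimal sketch, assuming the axes mean (batch, time steps, action dimensions), which the source does not state; `shape_of` and `iter_steps` are hypothetical helpers working on plain nested lists.

```python
from typing import Any, Iterator, List, Tuple


def shape_of(nested: Any) -> Tuple[int, ...]:
    """Shape of a rectangular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    dims = []
    while isinstance(nested, (list, tuple)):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)


def iter_steps(chunk: Any) -> Iterator[List[float]]:
    """Yield one action vector per step from a (1, steps, dims) chunk.

    ASSUMPTION: axis 0 is the batch, axis 1 is time, axis 2 is the action.
    """
    shape = shape_of(chunk)
    if len(shape) != 3 or shape[0] != 1:
        raise ValueError(f"expected shape (1, steps, dims), got {shape}")
    for step in chunk[0]:
        yield list(step)


# Placeholder chunk of zeros with the [1, 30, 6] shape quoted above.
chunk = [[[0.0] * 6 for _ in range(30)]]
print(shape_of(chunk), sum(1 for _ in iter_steps(chunk)))
```

A tensor returned by a real model could be converted with `.tolist()` before being passed in.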
How it works
Nothing is hardcoded per task - core/registry.py reads transformers' own
SUPPORTED_TASKS at runtime, so coverage tracks upstream automatically.
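The idea of deriving coverage from the upstream task table, rather than listing tasks by hand, can be sketched as follows. This is not the code in core/registry.py: `tasks_by_modality` is a hypothetical name, the `"type"` key is assumed from how transformers' `SUPPORTED_TASKS` entries have been laid out, and the demo uses a three-entry stand-in so it runs without transformers installed.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping


def tasks_by_modality(supported: Mapping[str, Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Group task names by their "type" field; no task is named in code."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for task, spec in supported.items():
        # ASSUMPTION: each entry carries a "type" key such as "text" or "image".
        groups[spec.get("type", "unknown")].append(task)
    return {kind: sorted(names) for kind, names in sorted(groups.items())}


# Stand-in for the upstream table. With transformers installed, the real
# mapping would come from `transformers.pipelines.SUPPORTED_TASKS`
# (import path assumed, not taken from this project's source).
STAND_IN = {
    "text-classification": {"type": "text"},
    "image-classification": {"type": "image"},
    "audio-classification": {"type": "audio"},
}

for kind, names in tasks_by_modality(STAND_IN).items():
    print(kind, names)
```

Because the function only iterates over whatever mapping it receives, a task added upstream shows up in the result with no change to this code.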
Project layout
strands_transformers/
├── tools/use_transformers.py # the one @tool: discover · run · call
├── models/transformers.py # TransformerModel - local multimodal agent brain
├── types/audio.py # audio content-block extension
└── core/{registry,engine,io,compat}.py # taxonomy · load/cache · I/O · legacy shims
Examples
12 runnable, GPU-verified examples in examples/ - image, video,
audio, document, Omni speech, VLA, and pipelines. Run any:
PYTHONPATH=. python examples/<name>.py
License
MIT - built with Strands Agents SDK and HuggingFace Transformers.