The universal entrypoint to HuggingFace transformers for Strands agents - 100% task & modality coverage, zero hardcoding.

Strands Transformers - every modality in, every modality out: one tool, one local brain, zero hardcoding

Run any HuggingFace transformers model from a Strands agent. A tool for every task, or the agent's own multimodal brain. Local. No API keys.

🤔 The idea

HuggingFace transformers already runs a huge range of open models. The missing piece is a clean, dynamic bridge into an agent loop that doesn't require per-model glue every time. This library is that bridge, in two forms:

| | What it is | You get |
|---|---|---|
| 🛠️ use_transformers | one tool exposing every transformers task | discover · run a pipeline · call any class/method |
| 🧠 TransformerModel | a local model as your Agent(model=…) brain | it sees, hears & speaks via content blocks |

Zero hardcoding. core/registry.py reads transformers' own SUPPORTED_TASKS taxonomy at runtime - the day a task or model lands upstream, it works here. No code change. No version bump.
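
A minimal sketch of the runtime-discovery idea described above, not the actual contents of core/registry.py. It assumes transformers exposes a SUPPORTED_TASKS mapping in transformers.pipelines whose entries carry a "type" key; build_registry and FALLBACK_TASKS are hypothetical names used only for illustration.

```python
from collections import defaultdict


def build_registry(supported_tasks):
    """Group task names by their declared modality without naming any task in code."""
    by_type = defaultdict(list)
    for task, spec in supported_tasks.items():
        by_type[spec.get("type", "unknown")].append(task)
    return {modality: sorted(tasks) for modality, tasks in by_type.items()}


# Stand-in shaped like the upstream task table (assumption: each entry has a "type" key).
FALLBACK_TASKS = {
    "audio-classification": {"type": "audio"},
    "image-classification": {"type": "image"},
    "text-generation": {"type": "text"},
}

try:
    # Read the live taxonomy when transformers is installed.
    from transformers.pipelines import SUPPORTED_TASKS as tasks
except Exception:  # transformers missing, or laid out differently than assumed
    tasks = FALLBACK_TASKS

registry = build_registry(tasks)
print({modality: len(names) for modality, names in registry.items()})
```

Because the function only iterates over whatever mapping it is given, a task added upstream shows up in the result without any change to this code.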

📦 Install

uv pip install strands-transformers          # from PyPI
PYTHONPATH=. python examples/smoke.py         # verify → "18/18 checks passed"

From source · optional extras (audio · vision · training):

uv pip install -e .                # editable from source
uv pip install -e ".[audio]"       # soundfile, librosa  (mp3/flac/ogg decode)
uv pip install -e ".[vision]"      # torchvision (needed by VLMs!), opencv, av
uv pip install -e ".[training]"    # trl, peft, accelerate
uv pip install -e ".[all]"         # everything

WAV audio works with no extras. Vision models (SmolVLM, Qwen-VL, …) need [vision]. device="auto" picks cuda → mps → cpu (bf16 on GPU).
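
The cuda → mps → cpu preference order can be sketched as below. This is an illustration rather than the library's own selection code: pick_device and probe are hypothetical names, and treating mps as a "GPU" that gets bf16 (with float32 on CPU) is an assumption the text above does not spell out.

```python
def pick_device(cuda_available, mps_available):
    """Return (device, dtype name) following the cuda → mps → cpu preference order."""
    if cuda_available:
        return "cuda", "bfloat16"
    if mps_available:
        return "mps", "bfloat16"  # assumption: "GPU" includes Apple MPS
    return "cpu", "float32"       # assumption: full precision on CPU


def probe():
    """Ask torch what is available; fall back to CPU if torch is not installed."""
    try:
        import torch
    except ImportError:
        return pick_device(False, False)
    mps = getattr(torch.backends, "mps", None)
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))


print(probe())
```

Keeping the decision in a pure function makes the ordering easy to check without any accelerator present.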

⚡ 60-second hello

A 256M-param vision model, seeing pixels in the standard Strands loop - no key, no server:

import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

buf = io.BytesIO(); Image.new("RGB", (64, 64), (20, 200, 40)).save(buf, "PNG")  # a green square

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct")
agent = Agent(model=model, system_prompt="You are concise.")

print(agent([
    {"image": {"format": "png", "source": {"bytes": buf.getvalue()}}},
    {"text": "Color? One word."},
]))
# → Green.

Swap model_path for any HF VLM and the code is identical.
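
The prompt in the hello example is a plain list of content blocks, so building it can be factored into small helpers. The sketch below only reproduces the dict shapes shown above; image_block, text_block and user_prompt are hypothetical helper names, not part of strands or strands_transformers.

```python
def image_block(data, fmt="png"):
    """Wrap raw image bytes in the image content-block shape used in the hello example."""
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def text_block(text):
    """Wrap a string in a text content block."""
    return {"text": text}


def user_prompt(text, *images, fmt="png"):
    """Images first, then the question, matching the order used in the hello example."""
    return [image_block(img, fmt) for img in images] + [text_block(text)]


prompt = user_prompt("Color? One word.", b"<png bytes>")
print([next(iter(block)) for block in prompt])  # → ['image', 'text']
```

The resulting list can be passed to agent(...) in place of the literal list in the example.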

👁️ See it work

Every result below is a real model output (CUDA · transformers 5.12 · torch 2.10):

| You give it | It returns | Example |
|---|---|---|
| 🖼️ a green image + "Color?" | "Green." | multimodal_agent.py |
| 🎬 brightening video frames | "BRIGHTER." | multimodal_advanced.py |
| 🧰 a blue tool screenshot (in toolResult) | "Blue." | multimodal_advanced.py |
| 📄 a text document | recovers BANANA-42 | document_and_audio.py |
| 🔊 a 440 Hz tone (Omni) | "It's a pure tone." | omni_audio.py |
| 💬 "say: …can speak" (Omni) | 🔊 real 24 kHz speech | omni_audio.py |
| 🦾 camera + "pick the cube" | actions [1, 30, 6] | molmoact_vla.py |

📋 Copy-paste & run - the snippet behind each row:

# 🖼️ image → "Green."   (local VLM brain, content blocks)
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

png = io.BytesIO(); Image.new("RGB", (224, 224), (20, 200, 40)).save(png, "PNG")
agent = Agent(model=TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct"))
print(agent([
    {"image": {"format": "png", "source": {"bytes": png.getvalue()}}},
    {"text": "What color is this image? One word."},
]))  # → Green.

# 🎬 video → "BRIGHTER."   (a video content block of brightening frames)
import asyncio, numpy as np
from PIL import Image
from strands_transformers import TransformerModel

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
                         params={"max_tokens": 48, "do_sample": False})
frames = [Image.fromarray(np.full((224, 224, 3), v, np.uint8)) for v in (10,40,80,120,160,200,230,250)]
msgs = [{"role": "user", "content": [
    {"video": {"format": "mp4", "fps": 2.0, "source": {"bytes": frames}}},
    {"text": "Does this video get brighter or darker? Answer brighter or darker."},
]}]
async def go():
    return "".join([e.get("contentBlockDelta",{}).get("delta",{}).get("text","")
                    async for e in model.stream(msgs)])
print(asyncio.run(go()))  # → ...brighter...

# 🧰 tool screenshot → "Blue."   (an image returned inside a toolResult)
import asyncio, io
from PIL import Image
from strands_transformers import TransformerModel

blue = io.BytesIO(); Image.new("RGB", (224, 224), (25, 25, 210)).save(blue, "PNG")
model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct",
                         params={"max_tokens": 32, "do_sample": False})
msgs = [
    {"role": "user", "content": [{"text": "Capture the screen, then name its color."}]},
    {"role": "assistant", "content": [{"toolUse": {"name": "capture", "toolUseId": "t1", "input": {}}}]},
    {"role": "user", "content": [{"toolResult": {"toolUseId": "t1", "status": "success", "content": [
        {"text": "Here is the captured screen:"},
        {"image": {"format": "png", "source": {"bytes": blue.getvalue()}}}]}}]},
    {"role": "user", "content": [{"text": "Dominant color of the captured screen? One word."}]},
]
async def go():
    return "".join([e.get("contentBlockDelta",{}).get("delta",{}).get("text","")
                    async for e in model.stream(msgs)])
print(asyncio.run(go()))  # → Blue.

# 📄 document → recovers "BANANA-42"   (a document content block → text LM prompt)
import asyncio
from strands_transformers import TransformerModel

model = TransformerModel(model_path="Qwen/Qwen3-0.6B", enable_thinking=False,
                         params={"max_tokens": 64, "do_sample": False})
body = b"The secret passphrase for the vault is BANANA-42. Keep it safe."
msgs = [{"role": "user", "content": [
    {"document": {"name": "secret", "format": "txt", "source": {"bytes": body}}},
    {"text": "What is the secret passphrase? Answer with just the passphrase."},
]}]
async def go():
    return "".join([e.get("contentBlockDelta",{}).get("delta",{}).get("text","")
                    async for e in model.stream(msgs)])
print(asyncio.run(go()))  # → BANANA-42

# 🔊 text → speech → text   (TTS then ASR, the tool path; the library narrating itself)
from strands_transformers import use_transformers

tts = use_transformers(action="run", task="text-to-audio",
                       model="facebook/mms-tts-eng",
                       inputs="the quick brown fox jumps over the lazy dog")
wav = tts["artifacts"][0]
asr = use_transformers(action="run", task="automatic-speech-recognition",
                       model="openai/whisper-tiny", inputs=wav)
print(asr["content"][0]["text"])  # → "...quick brown fox..."

# 🦾 camera + instruction → robot actions [1, 30, 6]   (VLA via the `call` path)
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download
from strands_transformers import use_transformers

REPO = "allenai/MolmoAct2-SO100_101"
top  = Image.open(hf_hub_download(REPO, "assets/sample_realsense_top_rgb.png")).convert("RGB")
side = Image.open(hf_hub_download(REPO, "assets/sample_realsense_side_rgb.png")).convert("RGB")
state = [-0.527, 189.14, 181.41, 60.64, -3.60, 1.097]

use_transformers(action="call", target="AutoProcessor.from_pretrained",
    parameters={"pretrained_model_name_or_path": REPO, "trust_remote_code": True}, cache_key="proc")
use_transformers(action="call", target="AutoModelForImageTextToText.from_pretrained",
    parameters={"pretrained_model_name_or_path": REPO, "trust_remote_code": True, "dtype": "float32"}, cache_key="vla")
print(use_transformers(action="call", target="cached:vla.predict_action",
    parameters={"processor": "cached:proc", "images": [top, side],
                "task": "Pick up the lemon and drop it in the red bowl.",
                "state": state, "norm_tag": "so100_so101_molmoact2",
                "inference_action_mode": "continuous", "num_steps": 10})["content"][0]["text"][:200])
# → MolmoAct2ActionOutput ... actions [1, 30, 6]

Omni audio-in + speech-out needs one bigger model - see examples/omni_audio.py.
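
The streaming snippets above repeat the same expression to join text deltas. A small helper can hold it in one place. This is a sketch: collect_text is a hypothetical name, and fake_stream is a stand-in for model.stream(msgs) so the helper runs without loading a model; the messageStart and messageStop events in it are illustrative.

```python
import asyncio


async def collect_text(events):
    """Join the text deltas from a stream of events, skipping other event types."""
    parts = []
    async for event in events:
        parts.append(event.get("contentBlockDelta", {}).get("delta", {}).get("text", ""))
    return "".join(parts)


async def fake_stream():
    """Stand-in for model.stream(msgs); yields events shaped like the ones read above."""
    yield {"messageStart": {"role": "assistant"}}
    yield {"contentBlockDelta": {"delta": {"text": "BANANA"}}}
    yield {"contentBlockDelta": {"delta": {"text": "-42"}}}
    yield {"messageStop": {"stopReason": "end_turn"}}


print(asyncio.run(collect_text(fake_stream())))  # → BANANA-42
```

With a real model, the call would be asyncio.run(collect_text(model.stream(msgs))).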

Vision tasks on one COCO photo - detection · depth · panoptic segmentation:

[Figure: object detection, depth estimation, and panoptic segmentation on a single photo]

[Figures: video understanding demo · generated speech waveform]

🎬 video → label  ·  🔊 text-to-audio then re-transcribed by whisper (the library narrating itself)
▶️ Hear it speak & play every example on the docs site →

🧩 Two ways to use it

🛠️ As a tool - use_transformers

from strands import Agent
from strands_transformers import use_transformers

agent = Agent(tools=[use_transformers])
agent("Transcribe recording.wav")                  # automatic-speech-recognition
agent("What's in scene.jpg?")                       # image-text-to-text
agent("Say 'hello from strands' as audio")          # text-to-audio
agent("Detect objects in https://.../street.jpg")   # object-detection

Discover everything at runtime (action="tasks" | "modalities" | "inspect" | …), run high-level pipelines, or call any class / fn / method for custom models. → The tool guide
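
To make the call path more concrete, here is one way dotted targets such as "AutoProcessor.from_pretrained" and "cached:vla.predict_action" could be resolved. It is a hypothetical sketch, not the tool's implementation: resolve, call, FakeLoader and the namespace object are illustrative stand-ins, and the real tool may handle caching and parameters differently.

```python
from types import SimpleNamespace

PREFIX = "cached:"


def resolve(target, namespace, cache):
    """Turn 'Name.attr' or 'cached:key.attr' into the Python object it names."""
    root, *path = target.removeprefix(PREFIX).split(".")
    obj = cache[root] if target.startswith(PREFIX) else getattr(namespace, root)
    for attr in path:
        obj = getattr(obj, attr)
    return obj


def call(target, namespace, cache, parameters=None, cache_key=None):
    """Resolve the target, swap 'cached:key' parameters for cached objects, then call it."""
    kwargs = {}
    for name, value in (parameters or {}).items():
        if isinstance(value, str) and value.startswith(PREFIX):
            value = cache[value.removeprefix(PREFIX)]
        kwargs[name] = value
    result = resolve(target, namespace, cache)(**kwargs)
    if cache_key is not None:
        cache[cache_key] = result  # keep the object for later 'cached:' references
    return result


class FakeLoader:
    """Stand-in for a transformers class; loads nothing."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def from_pretrained(cls, name):
        return cls(name)

    def predict_action(self, processor, task):
        return f"{self.name} via {processor.name}: {task}"


namespace = SimpleNamespace(AutoProcessor=FakeLoader, AutoModel=FakeLoader)
cache = {}
call("AutoProcessor.from_pretrained", namespace, cache, {"name": "proc-repo"}, cache_key="proc")
call("AutoModel.from_pretrained", namespace, cache, {"name": "vla-repo"}, cache_key="vla")
print(call("cached:vla.predict_action", namespace, cache,
           {"processor": "cached:proc", "task": "pick the cube"}))
# → vla-repo via proc-repo: pick the cube
```

Resolving names with getattr at call time is what lets a single tool reach classes and methods it has never seen by name.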

🧠 As the agent's brain - TransformerModel

Pass image / video / audio / document blocks (and media inside a toolResult) - the provider auto-detects the model's processor and routes them.

| Content block | Verified output | Example |
|---|---|---|
| image | "Green." | multimodal_agent.py |
| video (with fps) | "BRIGHTER." | multimodal_advanced.py |
| image in toolResult | "Blue." | multimodal_advanced.py |
| document | recovers BANANA-42 | document_and_audio.py |
| audio (our schema extension) | audio → text | audio_content_block.py |
| audio in and speech out | hears + speaks | omni_audio.py |
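
Routing starts with sorting content blocks by modality, including media nested inside a toolResult. The sketch below shows that first step only and is not the provider's real logic: split_blocks and MEDIA_KEYS are hypothetical names, and the block keys are the ones listed in the table above.

```python
MEDIA_KEYS = ("image", "video", "audio", "document")


def split_blocks(content):
    """Flatten a message's content (including toolResult media) into per-modality lists."""
    routed = {key: [] for key in ("text", *MEDIA_KEYS)}
    for block in content:
        if "toolResult" in block:
            # Recurse so media returned by a tool is routed like any other media.
            nested = split_blocks(block["toolResult"].get("content", []))
            for key, items in nested.items():
                routed[key].extend(items)
            continue
        for key in routed:
            if key in block:
                routed[key].append(block[key])
    return routed


content = [
    {"toolResult": {"toolUseId": "t1", "status": "success", "content": [
        {"text": "Here is the captured screen:"},
        {"image": {"format": "png", "source": {"bytes": b"..."}}},
    ]}},
    {"text": "Dominant color?"},
]
routed = split_blocks(content)
print({key: len(items) for key, items in routed.items()})
# → {'text': 2, 'image': 1, 'video': 0, 'audio': 0, 'document': 0}
```

Each list could then be handed to whichever part of the model's processor accepts that modality.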

Agent brain · Content blocks · Audio

🦾 Robotics / VLA - camera + instruction → actions

Two transformers-native layers, both GPU-verified:

  • 🧠 reason - Cosmos-Reason2-2B (a physical-AI VLM) plans over a scene via run: "the red cube is bottom-left, move the arm there first."
  • ⚙️ act - VLA models expose predict_action via call: MolmoAct2 → [1, 30, 6]; OpenVLA-7b → 7-DoF (auto 4.x→5.x shims).

🔗 Full agentic loop (robot_reason_act_agent.py): Cosmos plans over real RealSense frames → MolmoAct acts, so perception → plan → action all runs through one tool. (LeRobot policies like SmolVLA / π0 / GR00T run their own runtimes; pair them with use_lerobot.) → Robotics guide

🌟 Featured models

Examples use tiny models so they run in seconds. Point the same code at any current library_name: transformers model - swap the id, the plumbing is identical:

| Modality | Strong open model | How |
|---|---|---|
| Vision-language | Qwen/Qwen3-VL-8B-Instruct · google/gemma-3-4b-it | brain or run (image-text-to-text) |
| Speech → text | openai/whisper-large-v3-turbo | run (automatic-speech-recognition) |
| Audio in + speech out | Qwen/Qwen2.5-Omni-3B | brain (speak=True) |
| Multimodal (audio+vision+text) | microsoft/Phi-4-multimodal-instruct | brain |
| Robot actions (VLA) | allenai/MolmoAct2 · openvla/openvla-7b | call → predict_action |
| Embodied reasoning | nvidia/Cosmos-Reason2-2B | run (image-text-to-text) |

# swap the tiny demo model for a SOTA one - same code:
model = TransformerModel(model_path="Qwen/Qwen3-VL-8B-Instruct")

🏗️ How it works

strands_transformers/
├── tools/use_transformers.py            # the one @tool: discover · run · call
├── models/transformers.py               # TransformerModel - local multimodal brain
├── types/audio.py                       # audio content-block extension
└── core/{registry,engine,io,compat}.py  # taxonomy · load/cache · I/O · legacy shims

Nothing is hardcoded per task - registry.py reads transformers' SUPPORTED_TASKS at runtime, so coverage tracks upstream automatically. → Architecture · API reference

🧪 Examples

Runnable, GPU-verified examples in examples/ - image, video, audio, document, Omni speech, VLA, and pipelines. Run any:

PYTHONPATH=. python examples/<name>.py

Examples & FAQ

License

MIT - built with the Strands Agents SDK and HuggingFace Transformers.

If this saved you a pile of per-model glue code, consider giving it a ⭐
