Community vLLM provider utilities for Strands Agents (OpenAI-compatible).
Strands-vLLM
Community vLLM provider for Strands Agents SDK with Token-In/Token-Out (TITO) support and Agent Lightning integration.
Features
This package provides convenient utilities for using vLLM with the Strands Agents SDK, designed for training-ready agent rollouts:
- Token-In/Token-Out (TITO): capture token IDs directly from vLLM streaming responses (no retokenization drift)
- Agent Lightning integration: automatic OpenTelemetry span attributes for token IDs
- Tool calling support: validation hooks for vLLM's server-side tool call post-processing
- OpenAI-compatible API: works with vLLM's OpenAI-compatible endpoint
Requirements
- Python 3.10+
- Strands Agents SDK
- vLLM server running with your model
Installation
pip install strands-vllm
Or install from source with development dependencies:
git clone https://github.com/agents-community/strands-vllm.git
cd strands-vllm
pip install -e ".[dev]"
Quick Start
1. Start vLLM Server
vllm serve <MODEL_ID> \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
2. Basic Agent
from strands import Agent
from strands_vllm import VLLMModel
model = VLLMModel(
    base_url="http://localhost:8000/v1",
    model_id="AMead10/Llama-3.2-3B-Instruct-AWQ",
    return_token_ids=True,
)
agent = Agent(model=model)
result = agent("Say hello")
print(result)
3. Token IDs for RL Training
from strands import Agent
from strands.handlers.callback_handler import CompositeCallbackHandler, PrintingCallbackHandler
from strands_vllm import VLLMModel, VLLMTokenRecorder
model = VLLMModel(
    base_url="http://localhost:8000/v1",
    model_id="AMead10/Llama-3.2-3B-Instruct-AWQ",
    return_token_ids=True,
)
recorder = VLLMTokenRecorder()
printer = PrintingCallbackHandler(verbose_tool_use=False)
callback = CompositeCallbackHandler(printer, recorder)
agent = Agent(model=model, callback_handler=callback)
agent("What is 17 * 19?")
# Access TITO data for RL training
print(f"Prompt token IDs: {recorder.prompt_token_ids}")
print(f"Response token IDs: {recorder.token_ids}")
Note: VLLMTokenRecorder automatically adds token IDs as OpenTelemetry span attributes (llm.hosted_vllm.prompt_token_ids, llm.hosted_vllm.response_token_ids) for Agent Lightning compatibility.
Slime Training
For RL training with Slime, VLLMModel with VLLMTokenRecorder eliminates the retokenization step by capturing token IDs directly from vLLM streaming responses.
Note: This requires THUDM/slime installed from GitHub (not the slime package on PyPI):
pip install git+https://github.com/THUDM/slime.git
import logging

from strands import Agent, tool
from strands_vllm import VLLMModel, VLLMTokenRecorder, TokenManager
from slime.utils.types import Sample

logger = logging.getLogger(__name__)

SYSTEM_PROMPT = "..."
MAX_TOOL_ITERATIONS = ...  # e.g., 5


@tool
def execute_python_code(code: str):
    """Execute Python code and return the output."""
    ...


async def generate(args, sample: Sample, sampling_params) -> Sample:
    """Generate with TITO: tokens captured during generation, no retokenization."""
    assert not args.partial_rollout, "Partial rollout not supported."

    # Set up Agent with VLLMModel and VLLMTokenRecorder
    model = VLLMModel(
        base_url=args.vllm_base_url,
        model_id=args.hf_checkpoint.split("/")[-1],
        return_token_ids=True,
        params={k: sampling_params[k] for k in ["max_new_tokens", "temperature", "top_p"]},
    )
    recorder = VLLMTokenRecorder()
    agent = Agent(
        model=model,
        tools=[execute_python_code],
        callback_handler=recorder,
        system_prompt=SYSTEM_PROMPT,
    )

    # Run Agent Loop
    prompt = sample.prompt if isinstance(sample.prompt, str) else sample.prompt[0]["content"]
    try:
        await agent.invoke_async(prompt)
        sample.status = Sample.Status.COMPLETED
    except Exception as e:
        # Always use TRUNCATED instead of ABORTED because Slime doesn't properly
        # handle ABORTED samples in reward processing. See: https://github.com/THUDM/slime/issues/200
        sample.status = Sample.Status.TRUNCATED
        logger.warning(f"TRUNCATED: {type(e).__name__}: {e}")

    # TITO: extract trajectory from recorder and TokenManager
    tm = TokenManager()
    for entry in recorder.history:
        pti = entry.get("prompt_token_ids")
        ti = entry.get("token_ids")
        if pti:
            tm.add_prompt(pti)
        if ti:
            tm.add_response(ti)

    prompt_len = len(tm.segments[0])  # system + user are first segment
    sample.tokens = tm.token_ids
    sample.loss_mask = tm.loss_mask[prompt_len:]
    sample.rollout_log_probs = tm.logprobs[prompt_len:]
    sample.response_length = len(sample.tokens) - prompt_len

    # Extract response from agent messages (vLLM returns text directly, no tokenizer needed)
    response_text = ""
    for msg in reversed(agent.messages):
        if msg.get("role") == "assistant":
            content = msg.get("content", [])
            if isinstance(content, list):
                for block in content:
                    if isinstance(block, dict) and "text" in block:
                        response_text = block["text"]
                        break
            if response_text:
                break
    sample.response = response_text

    # Cleanup and return
    recorder.reset()
    agent.cleanup()
    return sample
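The slicing by prompt_len in the function above is easier to follow with concrete numbers. The sketch below uses a hypothetical ToyTrajectory class, which is not the package's TokenManager. It assumes that prompt-side tokens (system, user, tool results) get a loss mask of 0, that model-generated tokens get 1, and that each call appends only new tokens. Check the real TokenManager before relying on any of these assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrajectory:
    """Hypothetical stand-in for illustration; NOT strands_vllm.TokenManager."""

    token_ids: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)

    def add_prompt(self, ids):
        # Assumption: tokens the model did not generate are masked out (0).
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))

    def add_response(self, ids):
        # Assumption: model-generated tokens contribute to the loss (1).
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))


traj = ToyTrajectory()
traj.add_prompt([1, 2, 3])   # system + user prompt
traj.add_response([4, 5])    # first model turn (e.g. a tool call)
traj.add_prompt([6])         # tool result fed back to the model
traj.add_response([7, 8])    # final model turn

prompt_len = 3  # length of the initial prompt segment
response_mask = traj.loss_mask[prompt_len:]
response_length = len(traj.token_ids) - prompt_len

print(response_mask)     # [1, 1, 0, 1, 1]
print(response_length)   # 5
```

Under these assumptions, the tool-result token stays in the trajectory but is excluded from the loss, while the initial prompt is dropped from the mask entirely.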
Examples
All examples can be configured with environment variables:
export VLLM_BASE_URL="http://localhost:8000/v1"
export VLLM_MODEL_ID="AMead10/Llama-3.2-3B-Instruct-AWQ"
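A script might pick these variables up roughly as follows. This is a minimal sketch, not the code in examples/. The helper name load_vllm_settings is hypothetical, and the fallback values simply reuse the URL and model ID shown above.

```python
import os


def load_vllm_settings(env=None):
    """Return (base_url, model_id) from the environment, with fallbacks.

    Hypothetical helper for illustration; the fallback values are the ones
    used elsewhere in this README.
    """
    env = os.environ if env is None else env
    base_url = env.get("VLLM_BASE_URL", "http://localhost:8000/v1")
    model_id = env.get("VLLM_MODEL_ID", "AMead10/Llama-3.2-3B-Instruct-AWQ")
    return base_url, model_id


if __name__ == "__main__":
    base_url, model_id = load_vllm_settings()
    print(f"Using {model_id} at {base_url}")
```

The resulting values can then be passed as base_url and model_id when constructing VLLMModel.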
Math agent with tools
pip install strands-agents-tools
python examples/math_agent.py
Agent Lightning integration
Demonstrates token IDs in OpenTelemetry spans for Agent Lightning compatibility:
python examples/agent_lightning.py
Tool-call validation
vLLM tool calling can involve server-side post-processing. Use validation hooks to guard tool execution:
from strands import Agent
from strands_tools.calculator import calculator
from strands_vllm import VLLMModel, VLLMToolValidationHooks
model = VLLMModel(base_url="http://localhost:8000/v1", model_id="...", return_token_ids=True)
agent = Agent(model=model, tools=[calculator], hooks=[VLLMToolValidationHooks()])
print(agent("Compute 17 * 19 using the calculator tool."))
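To make "guarding tool execution" concrete, here is a generic sketch of the kind of check that can run before a tool is executed. It is not the implementation of VLLMToolValidationHooks, whose internals are not described here. The function validate_tool_call and the "expression" argument name are hypothetical and used only for illustration.

```python
import json


def validate_tool_call(name, arguments_json, known_tools):
    """Return None if the call looks executable, else a short error message.

    Hypothetical check for illustration; not part of strands-vllm.
    """
    if name not in known_tools:
        return f"unknown tool: {name!r}"
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return f"arguments are not valid JSON: {exc.msg}"
    if not isinstance(arguments, dict):
        return "arguments must be a JSON object"
    return None


known = {"calculator"}
print(validate_tool_call("calculator", '{"expression": "17 * 19"}', known))  # None
print(validate_tool_call("calc", "{}", known))  # unknown tool: 'calc'
```

Returning an error message instead of raising lets the caller feed the problem back to the model as a tool error, so the model can retry with a corrected call.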
Retokenization drift (educational)
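This demo shows why TITO matters: decoding token IDs to text and re-encoding that text does not always reproduce the original IDs, so encode(decode(tokens)) != tokens can happen.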
pip install "strands-vllm[drift]" strands-agents-tools
python examples/retokenization_drift.py
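The effect can be reproduced without a model or a real tokenizer. The sketch below is not the code in examples/retokenization_drift.py. It uses a made-up three-entry vocabulary and a greedy longest-match encoder, both assumptions chosen to keep the example small.

```python
# Toy vocabulary (an assumption for illustration): "ab" has its own token,
# but the same text can also be spelled as "a" followed by "b".
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT = {i: s for s, i in VOCAB.items()}


def decode(token_ids):
    """Concatenate the text of each token ID."""
    return "".join(ID_TO_TEXT[i] for i in token_ids)


def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


generated = [0, 1]           # what the model actually sampled: "a", then "b"
text = decode(generated)     # "ab"
retokenized = encode(text)   # the encoder prefers the single token "ab"

print(generated, "->", repr(text), "->", retokenized)  # [0, 1] -> 'ab' -> [2]
```

The text is identical in both cases, but a trainer that re-encodes it would compute log-probabilities for token 2, which the model never sampled. Capturing the IDs at generation time avoids that mismatch.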
Testing
# Unit tests
uv run pytest tests/unit/ -v
# Integration tests (requires vLLM server)
export VLLM_BASE_URL="http://localhost:8000/v1"
export VLLM_MODEL_ID="AMead10/Llama-3.2-3B-Instruct-AWQ"
uv run pytest tests/integration/ -v
Integration tests include:
- test_agent_math500.py - Agent tests with real MATH-500 problems and TITO consistency checks
- test_slime_integration.py - Slime training pattern using Slime's Sample type (requires pip install git+https://github.com/THUDM/slime.git)
Contributing
Contributions welcome! Install pre-commit hooks for code style and commit message validation:
pip install -e ".[dev]"
pre-commit install -t pre-commit -t commit-msg
This project uses Conventional Commits. Commit messages must follow the format:
<type>(<scope>): <description>
# Examples:
feat(recorder): add Agent Lightning span attributes
fix(vllm): handle empty response from server
docs: update TITO usage examples
Allowed types: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert
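To check a commit header locally before committing, a regular expression along these lines is one option. This is a rough sketch, not the project's actual commit-msg hook. It assumes the scope is optional (as in the "docs: ..." example above) and does not handle the "!" breaking-change marker or commit bodies.

```python
import re

# Types copied from the "Allowed types" list above.
ALLOWED_TYPES = (
    "feat", "fix", "docs", "style", "refactor", "perf",
    "test", "build", "ci", "chore", "revert",
)

# <type>(<scope>): <description>, with the scope treated as optional.
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r": (?P<description>\S.*)$"
)


def is_conventional(header):
    """Return True if the first line of a commit message matches the format."""
    return HEADER_RE.match(header) is not None


for header in [
    "feat(recorder): add Agent Lightning span attributes",
    "docs: update TITO usage examples",
    "update docs",
]:
    print(f"{header!r}: {is_conventional(header)}")
```

The first two headers pass and the third fails because it has no type prefix.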
Related Projects
- strands-sglang - SGLang provider for Strands Agents SDK
License
Apache License 2.0 - see LICENSE.