
{{ community_contribution_banner }}

!!! info "Language Support"
    This provider is only supported in Python.

strands-vllm is a vLLM model provider for the Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It integrates with vLLM's OpenAI-compatible API and is optimized for reinforcement learning workflows with Agent Lightning.

Features:

  • OpenAI-Compatible API: Uses vLLM’s OpenAI-compatible /v1/chat/completions endpoint with streaming
  • TITO Support: Captures prompt_token_ids and token_ids directly from vLLM - no retokenization drift
  • Tool Call Validation: Optional hooks for RL-friendly error messages (allowed tools list, schema validation)
  • Agent Lightning Integration: Automatically adds token IDs to OpenTelemetry spans for RL training data extraction
  • Streaming: Full streaming support with token ID capture via VLLMTokenRecorder

!!! tip "Why TITO?"
    Traditional retokenization can cause drift in RL training: the same text may tokenize differently during inference vs. training (e.g., "HAVING" → H+AVING vs. HAV+ING). TITO captures exact tokens from vLLM, eliminating this issue. See No More Retokenization Drift for details.
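To make the drift concrete, here is an illustrative sketch (not part of strands-vllm) that round-trips text through a HuggingFace tokenizer; it assumes the `transformers` package (installed by the optional `[drift]` extra below) and a tokenizer matching your served model:

```python
# Illustrative sketch only: retokenizing text during training can produce
# different IDs than the token stream seen at inference time.
import os
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"))

prompt = "SELECT name FROM users GROUP BY name"
completion = " HAVING COUNT(*) > 1"

# IDs as they would be produced separately at inference time...
inference_ids = (
    tokenizer.encode(prompt, add_special_tokens=False)
    + tokenizer.encode(completion, add_special_tokens=False)
)
# ...vs. IDs obtained by retokenizing the concatenated text during training.
retokenized_ids = tokenizer.encode(prompt + completion, add_special_tokens=False)

# Depending on the tokenizer, these may disagree at the prompt/completion boundary.
print(inference_ids == retokenized_ids)
```

With TITO, the exact IDs from vLLM are kept, so this comparison never needs to happen in your training pipeline.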

Install strands-vllm along with the Strands Agents SDK:

```bash
pip install strands-vllm strands-agents-tools
```

For retokenization drift demos (requires HuggingFace tokenizer):

pip install "strands-vllm[drift]" strands-agents-tools
Prerequisites:

  • vLLM server running with your model (v0.10.2+ for return_token_ids support)
  • For tool calling: vLLM must be started with tool calling enabled and an appropriate chat template

First, start a vLLM server with your model:

```bash
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8000
```

For tool calling support, add the appropriate flags for your model:

```bash
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser <PARSER>  # e.g., llama3_json, hermes, etc.
```

See vLLM tool calling documentation for supported parsers and chat templates.

```python
import os

from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

# Configure via environment variables or directly
base_url = os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1")
model_id = os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>")

model = VLLMModel(
    base_url=base_url,
    model_id=model_id,
    return_token_ids=True,
)

recorder = VLLMTokenRecorder()
agent = Agent(model=model, callback_handler=recorder)

result = agent("What is the capital of France?")
print(result)

# Access TITO data for RL training
print(f"Prompt tokens: {len(recorder.prompt_token_ids or [])}")
print(f"Response tokens: {len(recorder.token_ids or [])}")
```
## 3. Tool Call Validation (Optional, Recommended for RL)

Strands SDK already handles unknown tools and malformed JSON gracefully. VLLMToolValidationHooks adds RL-friendly enhancements:

```python
import os

from strands import Agent
from strands_tools.calculator import calculator
from strands_vllm import VLLMModel, VLLMToolValidationHooks

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

agent = Agent(
    model=model,
    tools=[calculator],
    hooks=[VLLMToolValidationHooks()],
)

result = agent("Compute 17 * 19 using the calculator tool.")
print(result)
```

What it adds beyond Strands defaults:

  • Unknown tool errors include allowed tools list — helps RL training learn valid tool names
  • Schema validation — catches missing required args and unknown args before tool execution

Invalid tool calls receive deterministic error messages, providing cleaner RL training signals.

VLLMTokenRecorder automatically adds token IDs to OpenTelemetry spans for Agent Lightning compatibility:

```python
import os

from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

# add_to_span=True (default) adds token IDs to OpenTelemetry spans
recorder = VLLMTokenRecorder(add_to_span=True)
agent = Agent(model=model, callback_handler=recorder)

result = agent("Hello!")
```

The following span attributes are set:

| Attribute | Description |
| --- | --- |
| `llm.token_count.prompt` | Token count for the prompt (OpenTelemetry semantic convention) |
| `llm.token_count.completion` | Token count for the completion (OpenTelemetry semantic convention) |
| `llm.hosted_vllm.prompt_token_ids` | Token ID array for the prompt |
| `llm.hosted_vllm.response_token_ids` | Token ID array for the response |
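As a rough illustration of how a training pipeline might read these attributes, the sketch below uses a custom OpenTelemetry span processor; the `TokenIdCollector` class and its registration are assumptions about your tracing setup, not part of strands-vllm:

```python
# Illustrative sketch: collect TITO data from finished spans.
# Assumes an OpenTelemetry SDK TracerProvider is configured for the agent.
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


class TokenIdCollector(SpanProcessor):
    """Gathers vLLM token IDs from span attributes as spans end."""

    def __init__(self) -> None:
        self.samples = []

    def on_end(self, span: ReadableSpan) -> None:
        attrs = span.attributes or {}
        if "llm.hosted_vllm.response_token_ids" in attrs:
            self.samples.append(
                {
                    "prompt_token_ids": attrs.get("llm.hosted_vllm.prompt_token_ids"),
                    "response_token_ids": attrs.get("llm.hosted_vllm.response_token_ids"),
                }
            )


# Register with your SDK tracer provider before running the agent, e.g.:
#   trace.get_tracer_provider().add_span_processor(TokenIdCollector())
```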

For building RL-ready trajectories with loss masks:

```python
import asyncio
import os

from strands import Agent, tool
from strands_tools.calculator import calculator as _calculator_impl
from strands_vllm import TokenManager, VLLMModel, VLLMTokenRecorder, VLLMToolValidationHooks


@tool
def calculator(expression: str) -> dict:
    return _calculator_impl(expression=expression)


async def main():
    model = VLLMModel(
        base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
        model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
        return_token_ids=True,
    )
    recorder = VLLMTokenRecorder()
    agent = Agent(
        model=model,
        tools=[calculator],
        hooks=[VLLMToolValidationHooks()],
        callback_handler=recorder,
    )
    await agent.invoke_async("What is 25 * 17?")

    # Build RL trajectory with loss mask
    tm = TokenManager()
    for entry in recorder.history:
        if entry.get("prompt_token_ids"):
            tm.add_prompt(entry["prompt_token_ids"])  # loss_mask=0
        if entry.get("token_ids"):
            tm.add_response(entry["token_ids"])  # loss_mask=1

    print(f"Total tokens: {len(tm)}")
    print(f"Prompt tokens: {sum(1 for m in tm.loss_mask if m == 0)}")
    print(f"Response tokens: {sum(1 for m in tm.loss_mask if m == 1)}")
    print(f"Token IDs: {tm.token_ids[:20]}...")  # First 20 tokens
    print(f"Loss mask: {tm.loss_mask[:20]}...")


asyncio.run(main())
```
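To show what the loss mask is for, here is a minimal sketch (assuming PyTorch, which strands-vllm does not require, and reusing `tm` from the example above) of packing the trajectory into tensors where only response tokens contribute to the objective:

```python
# Illustrative only: prompt positions (loss_mask == 0) are excluded from the loss.
import torch

token_ids = torch.tensor(tm.token_ids, dtype=torch.long)
loss_mask = torch.tensor(tm.loss_mask, dtype=torch.float)

# Given per-token log-probs from your policy model (same length as token_ids),
# a masked mean keeps only response tokens in the objective.
per_token_logprobs = torch.zeros_like(loss_mask)  # placeholder values
masked_objective = (per_token_logprobs * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```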

The VLLMModel accepts the following parameters:

| Parameter | Description | Example | Required |
| --- | --- | --- | --- |
| `base_url` | vLLM server URL | `"http://localhost:8000/v1"` | Yes |
| `model_id` | Model identifier | `"<YOUR_MODEL_ID>"` | Yes |
| `api_key` | API key (usually "EMPTY" for local vLLM) | `"EMPTY"` | No (default: `"EMPTY"`) |
| `return_token_ids` | Request token IDs from vLLM | `True` | No (default: `False`) |
| `disable_tools` | Remove tools/tool_choice from requests | `True` | No (default: `False`) |
| `params` | Additional generation parameters | `{"temperature": 0, "max_tokens": 256}` | No |
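For example, a fully specified configuration might look like the following (values are placeholders; only `base_url` and `model_id` are required):

```python
from strands_vllm import VLLMModel

model = VLLMModel(
    base_url="http://localhost:8000/v1",
    model_id="<YOUR_MODEL_ID>",
    api_key="EMPTY",                # default; local vLLM usually ignores it
    return_token_ids=True,          # needed for TITO capture
    disable_tools=False,            # set True to strip tools/tool_choice from requests
    params={"temperature": 0, "max_tokens": 256},  # extra generation parameters
)
```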
VLLMTokenRecorder accepts the following parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| `inner` | Inner callback handler to chain | `None` |
| `add_to_span` | Add token IDs to OpenTelemetry spans | `True` |
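For instance, the recorder can chain to another Strands-compatible callback handler via `inner`. The sketch below is an assumption about how you might combine handlers (it reuses the `model` from the examples above and uses a plain function as the inner handler):

```python
from strands import Agent
from strands_vllm import VLLMTokenRecorder


def log_text(**kwargs):
    # A simple Strands-style callback handler: print streamed text chunks.
    if "data" in kwargs:
        print(kwargs["data"], end="")


# The recorder captures token IDs and chains events to the inner handler.
recorder = VLLMTokenRecorder(inner=log_text)
agent = Agent(model=model, callback_handler=recorder)
```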
VLLMToolValidationHooks accepts the following parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| `include_allowed_tools_in_errors` | Include list of allowed tools in error messages | `True` |
| `max_allowed_tools_in_error` | Maximum tool names to show in error messages | `25` |
| `validate_input_shape` | Validate required/unknown args against schema | `True` |
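As an example, the hooks can be tuned like this (a sketch using only the parameters listed above; `model` and `calculator` are reused from the earlier examples):

```python
from strands import Agent
from strands_vllm import VLLMToolValidationHooks

hooks = VLLMToolValidationHooks(
    include_allowed_tools_in_errors=True,  # list valid tool names in error messages
    max_allowed_tools_in_error=10,         # cap how many tool names are shown
    validate_input_shape=True,             # check required/unknown args before execution
)
agent = Agent(model=model, tools=[calculator], hooks=[hooks])
```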

Example error messages (more informative than Strands defaults):

  • Unknown tool: `Error: unknown tool: fake_tool | allowed_tools=[calculator, search, ...]`
  • Missing argument: `Error: tool_name=<calculator> | missing required argument(s): expression`
  • Unknown argument: `Error: tool_name=<calculator> | unknown argument(s): invalid_param`

If the agent cannot reach the model, ensure your vLLM server is running and accessible:

```bash
# Check if server is responding
curl http://localhost:8000/health
```

If token IDs are not being captured, ensure:

  1. vLLM version is 0.10.2 or later
  2. `return_token_ids=True` is set on `VLLMModel`
  3. Your vLLM server supports `return_token_ids` in streaming mode
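A quick sanity check, reusing the quickstart `agent` and `recorder` from above (sketch):

```python
# Assumes the quickstart setup: an Agent with a VLLMTokenRecorder callback handler.
agent("ping")  # any short request
# If these print None (or empty), the server is not returning token IDs.
print("prompt_token_ids:", recorder.prompt_token_ids)
print("token_ids:", recorder.token_ids)
```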

Strands handles unknown tools gracefully, but for RL training you may want more informative errors. Add VLLMToolValidationHooks to get errors that include the list of allowed tools and validate argument schemas.

Some models/chat templates only support one tool call per message. If you see "This model only supports single tool-calls at once!", adjust your prompts to request one tool at a time.