{{ community_contribution_banner }}
!!! info "Language Support"
    This provider is only supported in Python.
strands-vllm is a vLLM model provider for the Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It integrates with vLLM's OpenAI-compatible API and is optimized for reinforcement learning workflows with Agent Lightning.
Features:
- OpenAI-Compatible API: Uses vLLM's OpenAI-compatible `/v1/chat/completions` endpoint with streaming
- TITO Support: Captures `prompt_token_ids` and `token_ids` directly from vLLM, so there is no retokenization drift
- Tool Call Validation: Optional hooks for RL-friendly error messages (allowed tools list, schema validation)
- Agent Lightning Integration: Automatically adds token IDs to OpenTelemetry spans for RL training data extraction
- Streaming: Full streaming support with token ID capture via `VLLMTokenRecorder`
!!! tip "Why TITO?"
    Traditional retokenization can cause drift in RL training: the same text may tokenize differently during inference vs. training (e.g., "HAVING" → H+AVING vs. HAV+ING). TITO captures exact tokens from vLLM, eliminating this issue. See No More Retokenization Drift for details.
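To make the drift concrete, here is a minimal sketch using a HuggingFace tokenizer (this is what the optional `[drift]` extra is for; the model ID is a placeholder):

```python
# A minimal sketch of retokenization drift, assuming a HuggingFace BPE tokenizer.
# The model ID is a placeholder; the [drift] extra pulls in transformers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("<YOUR_MODEL_ID>")

# Two different token sequences can decode to the same text...
ids_whole = tok.encode("HAVING", add_special_tokens=False)
ids_split = tok.encode("H", add_special_tokens=False) + tok.encode("AVING", add_special_tokens=False)
assert tok.decode(ids_whole) == tok.decode(ids_split) == "HAVING"

# ...so re-encoding the decoded text cannot recover which sequence the model
# actually produced. TITO sidesteps this by keeping the IDs themselves.
print(ids_whole, ids_split)  # typically differ for BPE vocabularies
```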
## Installation

Install strands-vllm along with the Strands Agents SDK:
```bash
pip install strands-vllm strands-agents-tools
```

For retokenization drift demos (requires a HuggingFace tokenizer):

```bash
pip install "strands-vllm[drift]" strands-agents-tools
```

## Requirements

- vLLM server running with your model (v0.10.2+ for `return_token_ids` support)
- For tool calling: vLLM must be started with tool calling enabled and an appropriate chat template
## 1. Start vLLM Server

First, start a vLLM server with your model:

```bash
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8000
```

For tool calling support, add the appropriate flags for your model:

```bash
vllm serve <MODEL_ID> \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser <PARSER>  # e.g., llama3_json, hermes, etc.
```

See the vLLM tool calling documentation for supported parsers and chat templates.
## 2. Basic Agent

```python
import os

from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

# Configure via environment variables or directly
base_url = os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1")
model_id = os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>")

model = VLLMModel(
    base_url=base_url,
    model_id=model_id,
    return_token_ids=True,
)

recorder = VLLMTokenRecorder()
agent = Agent(model=model, callback_handler=recorder)

result = agent("What is the capital of France?")
print(result)

# Access TITO data for RL training
print(f"Prompt tokens: {len(recorder.prompt_token_ids or [])}")
print(f"Response tokens: {len(recorder.token_ids or [])}")
```

## 3. Tool Call Validation (Optional, Recommended for RL)
Strands SDK already handles unknown tools and malformed JSON gracefully. `VLLMToolValidationHooks` adds RL-friendly enhancements:
```python
import os

from strands import Agent
from strands_tools.calculator import calculator
from strands_vllm import VLLMModel, VLLMToolValidationHooks

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

agent = Agent(
    model=model,
    tools=[calculator],
    hooks=[VLLMToolValidationHooks()],
)

result = agent("Compute 17 * 19 using the calculator tool.")
print(result)
```

What it adds beyond Strands defaults:
- Unknown tool errors include the allowed tools list, which helps RL training learn valid tool names
- Schema validation catches missing required args and unknown args before tool execution
Invalid tool calls receive deterministic error messages, providing cleaner RL training signals.
## 4. Agent Lightning Integration

`VLLMTokenRecorder` automatically adds token IDs to OpenTelemetry spans for Agent Lightning compatibility:

```python
import os

from strands import Agent
from strands_vllm import VLLMModel, VLLMTokenRecorder

model = VLLMModel(
    base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
    model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
    return_token_ids=True,
)

# add_to_span=True (default) adds token IDs to OpenTelemetry spans
recorder = VLLMTokenRecorder(add_to_span=True)
agent = Agent(model=model, callback_handler=recorder)

result = agent("Hello!")
```

The following span attributes are set:
| Attribute | Description |
|---|---|
| `llm.token_count.prompt` | Token count for the prompt (OpenTelemetry semantic convention) |
| `llm.token_count.completion` | Token count for the completion (OpenTelemetry semantic convention) |
| `llm.hosted_vllm.prompt_token_ids` | Token ID array for the prompt |
| `llm.hosted_vllm.response_token_ids` | Token ID array for the response |
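As a sketch of how these attributes can be read back, assuming Strands emits spans through the global OpenTelemetry tracer provider (see the Strands observability docs for the exact telemetry setup):

```python
# Sketch: capture spans in memory and read the token ID attributes.
# Assumes Strands publishes spans via the global OpenTelemetry tracer provider.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)  # must run before the agent is created

# ... run the agent as above ...

for span in exporter.get_finished_spans():
    ids = span.attributes.get("llm.hosted_vllm.prompt_token_ids")
    if ids is not None:
        print(span.name, len(ids), "prompt token IDs")
```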
## 5. RL Training with TokenManager

For building RL-ready trajectories with loss masks:
```python
import asyncio
import os

from strands import Agent, tool
from strands_tools.calculator import calculator as _calculator_impl
from strands_vllm import TokenManager, VLLMModel, VLLMTokenRecorder, VLLMToolValidationHooks

@tool
def calculator(expression: str) -> dict:
    return _calculator_impl(expression=expression)

async def main():
    model = VLLMModel(
        base_url=os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1"),
        model_id=os.getenv("VLLM_MODEL_ID", "<YOUR_MODEL_ID>"),
        return_token_ids=True,
    )

    recorder = VLLMTokenRecorder()
    agent = Agent(
        model=model,
        tools=[calculator],
        hooks=[VLLMToolValidationHooks()],
        callback_handler=recorder,
    )

    await agent.invoke_async("What is 25 * 17?")

    # Build RL trajectory with loss mask
    tm = TokenManager()
    for entry in recorder.history:
        if entry.get("prompt_token_ids"):
            tm.add_prompt(entry["prompt_token_ids"])  # loss_mask=0
        if entry.get("token_ids"):
            tm.add_response(entry["token_ids"])  # loss_mask=1

    print(f"Total tokens: {len(tm)}")
    print(f"Prompt tokens: {sum(1 for m in tm.loss_mask if m == 0)}")
    print(f"Response tokens: {sum(1 for m in tm.loss_mask if m == 1)}")
    print(f"Token IDs: {tm.token_ids[:20]}...")  # First 20 tokens
    print(f"Loss mask: {tm.loss_mask[:20]}...")

asyncio.run(main())
```
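One plausible next step is to turn the flat token list and loss mask into training tensors. This is a sketch assuming PyTorch and the usual causal-LM labeling convention; nothing here is part of strands-vllm:

```python
# Sketch: turn TokenManager output into training tensors (assumes PyTorch).
import torch

token_ids = torch.tensor(tm.token_ids, dtype=torch.long)
loss_mask = torch.tensor(tm.loss_mask, dtype=torch.long)

# Standard causal-LM convention: ignore prompt positions in the loss
labels = token_ids.clone()
labels[loss_mask == 0] = -100  # -100 is ignored by torch.nn.CrossEntropyLoss

batch = {"input_ids": token_ids.unsqueeze(0), "labels": labels.unsqueeze(0)}
```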
## Configuration

### Model Configuration
The `VLLMModel` accepts the following parameters:
| Parameter | Description | Example | Required |
|---|---|---|---|
| `base_url` | vLLM server URL | `"http://localhost:8000/v1"` | Yes |
| `model_id` | Model identifier | `"<YOUR_MODEL_ID>"` | Yes |
| `api_key` | API key (usually `"EMPTY"` for local vLLM) | `"EMPTY"` | No (default: `"EMPTY"`) |
| `return_token_ids` | Request token IDs from vLLM | `True` | No (default: `False`) |
| `disable_tools` | Remove `tools`/`tool_choice` from requests | `True` | No (default: `False`) |
| `params` | Additional generation parameters | `{"temperature": 0, "max_tokens": 256}` | No |
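For example, extra generation parameters go through `params` (values here are illustrative):

```python
# Illustrative configuration; adjust values for your workload.
from strands_vllm import VLLMModel

model = VLLMModel(
    base_url="http://localhost:8000/v1",
    model_id="<YOUR_MODEL_ID>",
    api_key="EMPTY",  # default; shown for completeness
    return_token_ids=True,
    params={"temperature": 0, "max_tokens": 256},
)
```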
### VLLMTokenRecorder Configuration

| Parameter | Description | Default |
|---|---|---|
| `inner` | Inner callback handler to chain | `None` |
| `add_to_span` | Add token IDs to OpenTelemetry spans | `True` |
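For example, `inner` lets the recorder wrap another handler so both run on each event. A minimal sketch, assuming the Strands convention that callback handlers are callables receiving keyword arguments with streamed text under `data`:

```python
# Sketch: chain the recorder in front of an existing callback handler.
from strands import Agent
from strands_vllm import VLLMTokenRecorder

def printing_handler(**kwargs):
    # Assumed Strands convention: streamed text deltas arrive under "data"
    if "data" in kwargs:
        print(kwargs["data"], end="")

# The recorder captures token IDs, then forwards each event to `inner`
recorder = VLLMTokenRecorder(inner=printing_handler, add_to_span=True)
agent = Agent(model=model, callback_handler=recorder)  # model as configured above
```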
### VLLMToolValidationHooks Configuration

| Parameter | Description | Default |
|---|---|---|
| `include_allowed_tools_in_errors` | Include list of allowed tools in error messages | `True` |
| `max_allowed_tools_in_error` | Maximum tool names to show in error messages | `25` |
| `validate_input_shape` | Validate required/unknown args against schema | `True` |
Example error messages (more informative than Strands defaults):

- Unknown tool: `Error: unknown tool: fake_tool | allowed_tools=[calculator, search, ...]`
- Missing argument: `Error: tool_name=<calculator> | missing required argument(s): expression`
- Unknown argument: `Error: tool_name=<calculator> | unknown argument(s): invalid_param`
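A minimal sketch of tuning these options (all three parameters are from the table above):

```python
from strands import Agent
from strands_vllm import VLLMToolValidationHooks

hooks = VLLMToolValidationHooks(
    include_allowed_tools_in_errors=True,
    max_allowed_tools_in_error=10,  # trim long allowed-tools lists in error text
    validate_input_shape=True,
)
agent = Agent(model=model, tools=[calculator], hooks=[hooks])  # model/calculator as above
```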
## Troubleshooting

### Connection errors to vLLM server

Ensure your vLLM server is running and accessible:

```bash
# Check if server is responding
curl http://localhost:8000/health
```

### No token IDs captured
Section titled “No token IDs captured”Ensure:
- vLLM version is 0.10.2 or later
return_token_ids=Trueis set onVLLMModel- Your vLLM server supports
return_token_idsin streaming mode
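To confirm server-side support directly, you can probe the endpoint with a raw request. This is a sketch: it assumes vLLM's OpenAI-compatible chat endpoint accepts a `return_token_ids` field in the request body, per the requirement above:

```python
# Sketch: probe the server for token ID support (assumes the
# return_token_ids extension of vLLM's OpenAI-compatible endpoint).
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps({
        "model": "<YOUR_MODEL_ID>",
        "messages": [{"role": "user", "content": "Hi"}],
        "return_token_ids": True,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# If supported, token IDs should appear somewhere in the payload
print("prompt_token_ids" in json.dumps(body))
```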
### RL training needs cleaner error signals

Strands handles unknown tools gracefully, but for RL training you may want more informative errors. Add `VLLMToolValidationHooks` to get errors that include the list of allowed tools and validate argument schemas.
### Model only supports single tool calls

Some models/chat templates support only one tool call per message. If you see `"This model only supports single tool-calls at once!"`, adjust your prompts to request one tool at a time.
## References

- strands-vllm Repository
- vLLM Documentation
- Agent Lightning GitHub - The absolute trainer to light up AI agents
- Agent Lightning Blog Post - No More Retokenization Drift
- Strands Agents API