# Strands Evaluation Quickstart
This quickstart guide shows you how to create your first evaluation experiment, use built-in evaluators to assess agent performance, generate test cases automatically, and analyze results. You’ll learn to evaluate output quality, tool usage patterns, and agent helpfulness.
After completing this guide, you’ll be able to create custom evaluators, implement trace-based evaluation, build comprehensive test suites, and integrate evaluation into your development workflow.
## Install the SDK
First, ensure that you have Python 3.10+ installed.
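You can check your installed version with:

```bash
python --version
```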
We’ll create a virtual environment to install the Strands Evaluation SDK and its dependencies.
```bash
python -m venv .venv
```

And activate the virtual environment:
- macOS / Linux: `source .venv/bin/activate`
- Windows (CMD): `.venv\Scripts\activate.bat`
- Windows (PowerShell): `.venv\Scripts\Activate.ps1`
Next, we’ll install the `strands-agents-evals` SDK package:
```bash
pip install strands-agents-evals
```

You’ll also need the core Strands Agents SDK and tools for this guide:

```bash
pip install strands-agents strands-agents-tools
```

## Configuring Credentials
Strands Evaluation uses the same model providers as Strands Agents. By default, evaluators use Amazon Bedrock with Claude 4 as the judge model.
To use the examples in this guide, configure your AWS credentials with permissions to invoke Claude 4. You can set up credentials using:
- Environment variables: Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN` (see the example below)
- AWS credentials file: Configure credentials using the `aws configure` CLI command
- IAM roles: If running on AWS services like EC2, ECS, or Lambda
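For example, to supply credentials through environment variables (the values shown are placeholders; replace them with your own):

```bash
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_SESSION_TOKEN="your-session-token"  # only needed for temporary credentials
```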
Make sure to enable model access in the Amazon Bedrock console following the AWS documentation.
## Project Setup
Create a directory structure for your evaluation project:
```
my_evaluation/
├── __init__.py
├── basic_eval.py
├── trajectory_eval.py
└── requirements.txt
```

Create the directory:

```bash
mkdir my_evaluation
```
Create `my_evaluation/requirements.txt`:

```
strands-agents>=1.0.0
strands-agents-tools>=0.2.0
strands-agents-evals>=1.0.0
```

Create the `my_evaluation/__init__.py` file:

```python
from . import basic_eval, trajectory_eval
```

## Basic Output Evaluation
Let’s start with a simple output evaluation using the `OutputEvaluator`. Create `my_evaluation/basic_eval.py`:
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Define your task function
def get_response(case: Case) -> str:
    agent = Agent(
        system_prompt="You are a helpful assistant that provides accurate information.",
        callback_handler=None  # Disable console output for cleaner evaluation
    )
    response = agent(case.input)
    return str(response)

# Create test cases
test_cases = [
    Case[str, str](
        name="knowledge-1",
        input="What is the capital of France?",
        expected_output="The capital of France is Paris.",
        metadata={"category": "knowledge"}
    ),
    Case[str, str](
        name="knowledge-2",
        input="What is 2 + 2?",
        expected_output="4",
        metadata={"category": "math"}
    ),
    Case[str, str](
        name="reasoning-1",
        input="If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?",
        expected_output="5 minutes",
        metadata={"category": "reasoning"}
    )
]

# Create evaluator with custom rubric
evaluator = OutputEvaluator(
    rubric="""
    Evaluate the response based on:
    1. Accuracy - Is the information factually correct?
    2. Completeness - Does it fully answer the question?
    3. Clarity - Is it easy to understand?

    Score 1.0 if all criteria are met excellently.
    Score 0.5 if some criteria are partially met.
    Score 0.0 if the response is inadequate or incorrect.
    """,
    include_inputs=True
)

# Create and run experiment
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)

# Display results
print("=== Basic Output Evaluation Results ===")
reports[0].run_display()

# Save experiment for later analysis
experiment.to_file("basic_evaluation", "json")
print("\nExperiment saved to ./experiment_files/basic_evaluation.json")
```

## Tool Usage Evaluation
Now let’s evaluate how well agents use tools. Create `my_evaluation/trajectory_eval.py`:
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor
from strands_tools import calculator, current_time

# Define task function that captures tool usage
def get_response_with_tools(case: Case) -> dict:
    agent = Agent(
        tools=[calculator, current_time],
        system_prompt="You are a helpful assistant. Use tools when appropriate.",
        callback_handler=None
    )
    response = agent(case.input)

    # Extract trajectory efficiently to prevent context overflow
    trajectory = tools_use_extractor.extract_agent_tools_used_from_messages(agent.messages)

    return {"output": str(response), "trajectory": trajectory}

# Create test cases with expected tool usage
test_cases = [
    Case[str, str](
        name="calculation-1",
        input="What is 15% of 230?",
        expected_trajectory=["calculator"],
        metadata={"category": "math", "expected_tools": ["calculator"]}
    ),
    Case[str, str](
        name="time-1",
        input="What time is it right now?",
        expected_trajectory=["current_time"],
        metadata={"category": "time", "expected_tools": ["current_time"]}
    ),
    Case[str, str](
        name="complex-1",
        input="What time is it and what is 25 * 48?",
        expected_trajectory=["current_time", "calculator"],
        metadata={"category": "multi_tool", "expected_tools": ["current_time", "calculator"]}
    )
]

# Create trajectory evaluator
evaluator = TrajectoryEvaluator(
    rubric="""
    Evaluate the tool usage trajectory:
    1. Correct tool selection - Were the right tools chosen for the task?
    2. Proper sequence - Were tools used in a logical order?
    3. Efficiency - Were unnecessary tools avoided?

    Use the built-in scoring tools to verify trajectory matches:
    - exact_match_scorer for exact sequence matching
    - in_order_match_scorer for ordered subset matching
    - any_order_match_scorer for unordered matching

    Score 1.0 if optimal tools used correctly.
    Score 0.5 if correct tools used but suboptimal sequence.
    Score 0.0 if wrong tools used or major inefficiencies.
    """,
    include_inputs=True
)

# Update evaluator with tool descriptions to prevent context overflow
sample_agent = Agent(tools=[calculator, current_time])
tool_descriptions = tools_use_extractor.extract_tools_description(sample_agent, is_short=True)
evaluator.update_trajectory_description(tool_descriptions)

# Create and run experiment
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response_with_tools)

# Display results
print("=== Tool Usage Evaluation Results ===")
reports[0].run_display()

# Save experiment
experiment.to_file("trajectory_evaluation", "json")
print("\nExperiment saved to ./experiment_files/trajectory_evaluation.json")
```

## Trace-based Helpfulness Evaluation
For more advanced evaluation, let’s assess agent helpfulness using execution traces:
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_tools import calculator

# Setup telemetry for trace capture
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def user_task_function(case: Case) -> dict:
    # Clear previous traces
    telemetry.memory_exporter.clear()

    agent = Agent(
        tools=[calculator],
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )
    response = agent(case.input)

    # Map spans to session for evaluation
    finished_spans = telemetry.memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)

    return {"output": str(response), "trajectory": session}

# Create test cases for helpfulness evaluation
test_cases = [
    Case[str, str](
        name="helpful-1",
        input="I need help calculating the tip for a $45.67 restaurant bill with 18% tip.",
        metadata={"category": "practical_help"}
    ),
    Case[str, str](
        name="helpful-2",
        input="Can you explain what 2^8 equals and show the calculation?",
        metadata={"category": "educational"}
    )
]

# Create helpfulness evaluator (uses seven-level scoring)
evaluator = HelpfulnessEvaluator()

# Run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(user_task_function)

print("=== Helpfulness Evaluation Results ===")
reports[0].run_display()
```

## Running Evaluations
Run your evaluations using Python:
```bash
# Run basic output evaluation
python -u my_evaluation/basic_eval.py

# Run trajectory evaluation
python -u my_evaluation/trajectory_eval.py
```

You’ll see detailed results showing:
- Individual test case scores and reasoning
- Overall experiment statistics
- Pass/fail rates by category
- Detailed judge explanations
## Async Evaluation
For improved performance, you can run evaluations asynchronously using `run_evaluations_async`. This is particularly useful when evaluating multiple test cases, as it allows concurrent execution and significantly reduces total evaluation time.
### Basic Async Example (Applies to Trace-based evaluators)
Here’s how to convert the basic output evaluation to use async:
```python
import asyncio
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Define async task function
async def get_response_async(case: Case) -> str:
    agent = Agent(
        system_prompt="You are a helpful assistant that provides accurate information.",
        callback_handler=None
    )
    response = await agent.invoke_async(case.input)
    return str(response)

# Create test cases (same as before)
test_cases = [
    Case[str, str](
        name="knowledge-1",
        input="What is the capital of France?",
        expected_output="The capital of France is Paris.",
        metadata={"category": "knowledge"}
    ),
    Case[str, str](
        name="knowledge-2",
        input="What is 2 + 2?",
        expected_output="4",
        metadata={"category": "math"}
    ),
]

# Create evaluator
evaluator = OutputEvaluator(
    rubric="""
    Evaluate the response based on:
    1. Accuracy - Is the information factually correct?
    2. Completeness - Does it fully answer the question?
    3. Clarity - Is it easy to understand?

    Score 1.0 if all criteria are met excellently.
    Score 0.5 if some criteria are partially met.
    Score 0.0 if the response is inadequate or incorrect.
    """,
    include_inputs=True
)

# Run async evaluation
async def run_async_evaluation():
    experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
    reports = await experiment.run_evaluations_async(get_response_async)

    reports[0].run_display()

    return reports[0]

# Execute the async evaluation
if __name__ == "__main__":
    report = asyncio.run(run_async_evaluation())
```

## Understanding Evaluation Results
Each evaluation returns comprehensive results:
```python
# Access individual case results
for case_result in report.case_results:
    print(f"Case: {case_result.case.name}")
    print(f"Score: {case_result.evaluation_output.score}")
    print(f"Passed: {case_result.evaluation_output.test_pass}")
    print(f"Reason: {case_result.evaluation_output.reason}")
    print("---")

# Get summary statistics
summary = report.get_summary()
print(f"Overall pass rate: {summary['pass_rate']:.2%}")
print(f"Average score: {summary['average_score']:.2f}")
```

## Automated Experiment Generation
Generate test cases automatically from context descriptions:
```python
import asyncio

from strands_evals.generators import ExperimentGenerator
from strands_evals.evaluators import TrajectoryEvaluator

# Define tool context
tool_context = """
Available tools:
- calculator(expression: str) -> float: Evaluate mathematical expressions
- current_time() -> str: Get the current date and time
- file_read(path: str) -> str: Read file contents
"""

# Generate experiment automatically
async def generate_experiment():
    generator = ExperimentGenerator[str, str](str, str)

    experiment = await generator.from_context_async(
        context=tool_context,
        num_cases=5,
        evaluator=TrajectoryEvaluator,
        task_description="Assistant with calculation and time tools",
        num_topics=2  # Distribute across multiple topics
    )

    # Save generated experiment
    experiment.to_file("generated_experiment", "json")
    print("Generated experiment saved!")

    return experiment

# Run the generator
generated_exp = asyncio.run(generate_experiment())
```

## Custom Evaluators
Create domain-specific evaluation logic:
```python
from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput

class SafetyEvaluator(Evaluator[str, str]):
    """Evaluates responses for safety and appropriateness."""

    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> EvaluationOutput:
        response = evaluation_case.actual_output.lower()

        # Check for safety issues
        unsafe_patterns = ["harmful", "dangerous", "illegal", "inappropriate"]
        safety_violations = [pattern for pattern in unsafe_patterns if pattern in response]

        if not safety_violations:
            return EvaluationOutput(
                score=1.0,
                test_pass=True,
                reason="Response is safe and appropriate",
                label="safe"
            )
        else:
            return EvaluationOutput(
                score=0.0,
                test_pass=False,
                reason=f"Safety concerns: {', '.join(safety_violations)}",
                label="unsafe"
            )

# Use custom evaluator
safety_evaluator = SafetyEvaluator()
experiment = Experiment[str, str](cases=test_cases, evaluators=[safety_evaluator])
```

## Best Practices
### Evaluation Strategy
- Start Simple: Begin with output evaluation before moving to complex trajectory analysis
- Use Multiple Evaluators: Combine output, trajectory, and helpfulness evaluators for comprehensive assessment (see the sketch after this list)
- Create Diverse Test Cases: Cover different categories, difficulty levels, and edge cases
- Regular Evaluation: Run evaluations frequently during development
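As a minimal sketch of combining evaluators, the snippet below reuses the trajectory test cases and `get_response_with_tools` task function from earlier, and assumes `run_evaluations` returns one report per evaluator; the rubrics here are illustrative only:

```python
from strands_evals import Experiment
from strands_evals.evaluators import OutputEvaluator, TrajectoryEvaluator

# Each evaluator scores every case; the task function already returns
# both "output" and "trajectory", so both evaluators have what they need.
experiment = Experiment[str, str](
    cases=test_cases,  # the trajectory test cases defined above
    evaluators=[
        OutputEvaluator(rubric="Score 1.0 if the answer is accurate, 0.0 otherwise."),
        TrajectoryEvaluator(rubric="Score 1.0 if the expected tools were used, 0.0 otherwise."),
    ],
)
reports = experiment.run_evaluations(get_response_with_tools)
for report in reports:
    report.run_display()
```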
### Performance Optimization
- Use Extractors: Always use `tools_use_extractor` functions to prevent context overflow
- Batch Processing: Process multiple test cases efficiently
- Choose Appropriate Models: Use stronger judge models for complex evaluations
- Cache Results: Save experiments to avoid re-running expensive evaluations
### Experiment Management
- Version Control: Save experiments with descriptive names and timestamps (see the sketch after this list)
- Document Rubrics: Write clear, specific evaluation criteria
- Track Changes: Monitor how evaluation scores change as you improve your agents
- Share Results: Use saved experiments to collaborate with team members
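A minimal sketch of a timestamped save, using the `to_file` method shown earlier; the filename convention is only an illustration, and it assumes an `experiment` object from one of the examples above:

```python
from datetime import datetime, timezone

# Tag the saved experiment with a UTC timestamp so successive runs
# can be compared later. `experiment` comes from an earlier example.
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
experiment.to_file(f"basic_evaluation-{timestamp}", "json")
```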
## Next Steps
Ready to dive deeper? Explore these resources:
- Output Evaluator - Detailed guide to LLM-based output evaluation
- Trajectory Evaluator - Comprehensive tool usage and sequence evaluation
- Helpfulness Evaluator - Seven-level helpfulness assessment
- Custom Evaluators - Build domain-specific evaluation logic
- Experiment Generator - Automatically generate comprehensive test suites
- Serialization - Save, load, and version your evaluation experiments