
Evaluators assess the quality and performance of conversational agents by analyzing their outputs, behaviors, and goal achievement. The Strands Evals SDK provides a comprehensive set of evaluators that can assess different aspects of agent performance, from individual response quality to multi-turn conversation success.

Evaluating conversational agents requires more than simple accuracy metrics. Agents must be assessed across multiple dimensions:

Traditional Metrics:

  • Limited to exact match or similarity scores
  • Don’t capture subjective qualities like helpfulness
  • Can’t assess multi-turn conversation flow
  • Miss goal-oriented success patterns

Strands Evaluators:

  • Assess subjective qualities using LLM-as-a-judge
  • Evaluate multi-turn conversations and trajectories
  • Measure goal completion and user satisfaction
  • Provide structured reasoning for evaluation decisions
  • Support both synchronous and asynchronous evaluation

Use evaluators when you need to:

  • Assess Response Quality: Evaluate helpfulness, faithfulness, and appropriateness
  • Measure Goal Achievement: Determine if user objectives were met
  • Analyze Tool Usage: Evaluate tool selection and parameter accuracy
  • Track Conversation Success: Assess multi-turn interaction effectiveness
  • Compare Agent Configurations: Benchmark different prompts or models
  • Monitor Production Performance: Continuously evaluate deployed agents

Evaluators operate at different levels of granularity:

Level          | Scope              | Use Case
OUTPUT_LEVEL   | Single response    | Quality of individual outputs
TRACE_LEVEL    | Single turn        | Turn-by-turn conversation analysis
SESSION_LEVEL  | Full conversation  | End-to-end goal achievement

OutputEvaluator

  • Level: OUTPUT_LEVEL
  • Purpose: Flexible LLM-based evaluation with custom rubrics
  • Use Case: Assess any subjective quality (safety, relevance, tone)
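
As a minimal sketch, a single response can be scored against an inline rubric, mirroring the evaluation patterns shown later on this page (the rubric text and sample strings are illustrative, and the EvaluationData import is omitted because its path is not shown here):

from strands_evals.evaluators import OutputEvaluator  # import path assumed to match the other evaluators

# EvaluationData is used exactly as in the examples later on this page
evaluator = OutputEvaluator(rubric="Score 1.0 if the response is on-topic, accurate, and politely worded; 0.0 otherwise.")
result = evaluator.evaluate(EvaluationData(
    input="Can you summarize my order status?",
    output="Your order shipped yesterday and should arrive in 3-5 business days.",
))
print(result.score, result.reasoning)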

HelpfulnessEvaluator

  • Level: TRACE_LEVEL
  • Purpose: Evaluate response helpfulness from user perspective
  • Use Case: Measure user satisfaction and response utility

FaithfulnessEvaluator

  • Level: TRACE_LEVEL
  • Purpose: Assess factual accuracy and groundedness
  • Use Case: Verify responses are truthful and well-supported
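
Both trace-level evaluators can be run over the same input/output pair, as in the response-quality helper later on this page; a brief sketch with illustrative strings (evaluator classes assumed to be imported as shown elsewhere on this page):

for evaluator in (HelpfulnessEvaluator(), FaithfulnessEvaluator()):
    result = evaluator.evaluate(EvaluationData(
        input="What is the capital of Australia?",
        output="The capital of Australia is Canberra.",
    ))
    print(evaluator.__class__.__name__, result.score)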

ToolSelectionEvaluator

  • Level: TRACE_LEVEL
  • Purpose: Evaluate whether correct tools were selected
  • Use Case: Assess tool choice accuracy in multi-tool scenarios

ToolParameterEvaluator

  • Level: TRACE_LEVEL
  • Purpose: Evaluate accuracy of tool parameters
  • Use Case: Verify correct parameter values for tool calls
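
Both tool-focused evaluators score the mapped session rather than a single string; a sketch, assuming `session` was produced by StrandsInMemorySessionMapper as shown later on this page:

for evaluator in (ToolSelectionEvaluator(), ToolParameterEvaluator()):
    result = evaluator.evaluate(EvaluationData(trajectory=session))
    print(evaluator.__class__.__name__, result.score, result.reasoning)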

TrajectoryEvaluator

  • Level: SESSION_LEVEL
  • Purpose: Assess sequence of actions and tool usage patterns
  • Use Case: Evaluate multi-step reasoning and workflow adherence

InteractionsEvaluator

  • Level: SESSION_LEVEL
  • Purpose: Analyze conversation patterns and interaction quality
  • Use Case: Assess conversation flow and engagement patterns

GoalSuccessRateEvaluator

  • Level: SESSION_LEVEL
  • Purpose: Determine if user goals were successfully achieved
  • Use Case: Measure end-to-end task completion success
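
Session-level evaluators likewise consume the full mapped session; a sketch that combines goal success with a trajectory rubric (the rubric text is illustrative, and `session` is assumed to come from the session mapper shown later on this page):

session_evaluators = [
    GoalSuccessRateEvaluator(),
    TrajectoryEvaluator(rubric="Penalize redundant tool calls and dead-end steps."),
]
for evaluator in session_evaluators:
    result = evaluator.evaluate(EvaluationData(trajectory=session))
    print(evaluator.__class__.__name__, result.score)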

Create domain-specific evaluators by extending the base Evaluator class:

CustomEvaluator

  • Purpose: Implement specialized evaluation logic
  • Use Case: Domain-specific requirements not covered by built-in evaluators
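
A minimal sketch of the subclassing pattern, assuming the base class lives alongside the built-in evaluators and exposes the same evaluate(data) hook used throughout this page; the class name, keyword screen, and returned shape are placeholders to be adapted to the SDK's actual result type:

from strands_evals.evaluators import Evaluator  # base-class import path assumed

class PolicyComplianceEvaluator(Evaluator):  # hypothetical domain-specific evaluator
    """Flags responses that make prohibited commitments."""

    def evaluate(self, data):
        # Placeholder domain logic: a simple keyword screen over the output
        violations = [term for term in ("guaranteed refund", "legal advice")
                      if term in str(data.output).lower()]
        score = 0.0 if violations else 1.0
        reasoning = (f"Prohibited phrasing found: {violations}" if violations
                     else "No policy violations detected.")
        # Return shape is an assumption; align it with the SDK's result type
        return {"score": score, "reasoning": reasoning}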

Understanding when to use evaluators versus simulators:

Aspect    | Evaluators          | Simulators
Role      | Assess quality      | Generate interactions
Timing    | Post-conversation   | During conversation
Purpose   | Score/judge         | Drive/participate
Output    | Evaluation scores   | Conversation turns
Use Case  | Quality assessment  | Interaction generation

Use Together: Evaluators and simulators complement each other. Use simulators to generate realistic multi-turn conversations, then use evaluators to assess the quality of those interactions.

Evaluators work seamlessly with simulator-generated conversations:

from strands import Agent
from strands_evals import Case, Experiment, ActorSimulator
from strands_evals.evaluators import (
    HelpfulnessEvaluator,
    GoalSuccessRateEvaluator,
    ToolSelectionEvaluator,
    TrajectoryEvaluator,
)
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# memory_exporter is assumed to be an in-memory span exporter configured
# through StrandsEvalsTelemetry during test setup (setup code not shown here).

def task_function(case: Case) -> dict:
    # Generate a multi-turn conversation with the user simulator
    simulator = ActorSimulator.from_case_for_user_simulator(case=case, max_turns=10)
    agent = Agent(trace_attributes={"session.id": case.session_id})

    # Collect conversation data turn by turn
    all_spans = []
    user_message = case.input
    while simulator.has_next():
        agent_response = agent(user_message)
        turn_spans = list(memory_exporter.get_finished_spans())
        all_spans.extend(turn_spans)
        user_result = simulator.act(str(agent_response))
        user_message = str(user_result.structured_output.message)

    # Map the collected spans to a session for evaluation
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(all_spans, session_id=case.session_id)
    return {"output": str(agent_response), "trajectory": session}

# Use multiple evaluators to assess different aspects
evaluators = [
    HelpfulnessEvaluator(),             # Response quality
    GoalSuccessRateEvaluator(),         # Goal achievement
    ToolSelectionEvaluator(),           # Tool usage
    TrajectoryEvaluator(rubric="..."),  # Action sequences
]

# test_cases is a list of Case instances defined elsewhere
experiment = Experiment(cases=test_cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)

Match evaluator level to your assessment needs:

# For individual response quality
evaluators = [OutputEvaluator(rubric="Assess response clarity")]
# For turn-by-turn analysis
evaluators = [HelpfulnessEvaluator(), FaithfulnessEvaluator()]
# For end-to-end success
evaluators = [GoalSuccessRateEvaluator(), TrajectoryEvaluator(rubric="...")]

Assess different aspects comprehensively:

evaluators = [
    HelpfulnessEvaluator(),       # User experience
    FaithfulnessEvaluator(),      # Accuracy
    ToolSelectionEvaluator(),     # Tool usage
    GoalSuccessRateEvaluator(),   # Success rate
]

For custom evaluators, define specific criteria:

rubric = """
Score 1.0 if the response:
- Directly answers the user's question
- Provides accurate information
- Uses appropriate tone
Score 0.5 if the response partially meets criteria
Score 0.0 if the response fails to meet criteria
"""
evaluator = OutputEvaluator(rubric=rubric)
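
The rubric-driven evaluator is then applied like any other; a short sketch with illustrative strings (EvaluationData as used elsewhere on this page):

result = evaluator.evaluate(EvaluationData(
    input="How do I reset my password?",
    output="Open Settings, choose Security, then select Reset Password and follow the prompts.",
))
print(result.score, result.reasoning)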

For better performance with multiple evaluators:

import asyncio

async def run_evaluations(data):
    # Run several evaluators concurrently over the same evaluation data
    evaluators = [HelpfulnessEvaluator(), FaithfulnessEvaluator()]
    tasks = [evaluator.aevaluate(data) for evaluator in evaluators]
    results = await asyncio.gather(*tasks)
    return results
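
A usage sketch, assuming `data` is an EvaluationData instance prepared for the turn under test:

scores = asyncio.run(run_evaluations(data))
print([result.score for result in scores])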

Combine evaluators to score a single response across several quality dimensions:

def assess_response_quality(case: Case, agent_output: str) -> dict:
    evaluators = [
        HelpfulnessEvaluator(),
        FaithfulnessEvaluator(),
        OutputEvaluator(rubric="Assess professional tone"),
    ]
    results = {}
    for evaluator in evaluators:
        result = evaluator.evaluate(EvaluationData(
            input=case.input,
            output=agent_output,
        ))
        results[evaluator.__class__.__name__] = result.score
    return results

Analyze tool usage across a full session:

def analyze_tool_usage(session: Session) -> dict:
    evaluators = [
        ToolSelectionEvaluator(),
        ToolParameterEvaluator(),
        TrajectoryEvaluator(rubric="Assess tool usage efficiency"),
    ]
    results = {}
    for evaluator in evaluators:
        result = evaluator.evaluate(EvaluationData(trajectory=session))
        results[evaluator.__class__.__name__] = {
            "score": result.score,
            "reasoning": result.reasoning,
        }
    return results

Compare agent configurations against a shared set of cases:

def compare_agent_versions(cases: list, agents: dict) -> dict:
    evaluators = [HelpfulnessEvaluator(), GoalSuccessRateEvaluator()]
    results = {}
    for agent_name, agent in agents.items():
        agent_scores = []
        for case in cases:
            output = agent(case.input)
            for evaluator in evaluators:
                result = evaluator.evaluate(EvaluationData(
                    input=case.input,
                    output=str(output),
                ))
                agent_scores.append(result.score)
        results[agent_name] = {
            "average_score": sum(agent_scores) / len(agent_scores),
            "scores": agent_scores,
        }
    return results