
The InteractionsEvaluator is designed for evaluating interactions between agents or components in multi-agent systems or complex workflows. It assesses each interaction step-by-step, considering dependencies, message flow, and the overall sequence of interactions.

  • Interaction-Level Evaluation: Evaluates each interaction in a sequence
  • Multi-Agent Support: Designed for evaluating multi-agent systems and workflows
  • Node-Specific Rubrics: Supports different evaluation criteria for different nodes/agents
  • Sequential Context: Maintains context across interactions using a sliding window
  • Dependency Tracking: Considers dependencies between interactions
  • Async Support: Supports both synchronous and asynchronous evaluation

Use the InteractionsEvaluator when you need to:

  • Evaluate multi-agent system interactions
  • Assess workflow execution across multiple components
  • Validate message passing between agents
  • Ensure proper dependency handling in complex systems
  • Track interaction quality in agent orchestration
  • Debug multi-agent coordination issues

The evaluator accepts the following configuration parameters (a construction sketch follows this list):

  • Type: str | dict[str, str]
  • Description: Evaluation criteria (the rubric). Can be a single string applied to all nodes or a dictionary mapping node names to node-specific rubrics.

  • Type: dict | None
  • Default: None
  • Description: A dictionary describing the available interactions. Can be updated dynamically using update_interaction_description().

  • Type: Union[Model, str, None]
  • Default: None (uses the default Bedrock model)
  • Description: The model to use as the judge. Can be a model ID string or a Model instance.

  • Type: str
  • Default: Built-in template
  • Description: Custom system prompt to guide the judge model’s behavior.

  • Type: bool
  • Default: True
  • Description: Whether to include inputs in the evaluation context.
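
For instance, a minimal construction might look like the sketch below. Only the rubric argument and the update_interaction_description() method are taken from this page; the shape of the dictionary passed to update_interaction_description() is an assumption for illustration, not a documented signature.

from strands_evals.evaluators import InteractionsEvaluator

# Construct the evaluator with a single rubric string; other parameters keep their defaults.
evaluator = InteractionsEvaluator(
    rubric="Score 1.0 if the interaction respects its declared dependencies, otherwise 0.0."
)

# Refresh the interaction description at runtime. The dictionary shape here
# (node names mapped to short descriptions) is assumed for illustration.
evaluator.update_interaction_description({
    "planner": "Creates the execution plan",
    "executor": "Runs the plan steps",
})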

Each interaction should contain:

  • node_name: Name of the agent/component involved
  • dependencies: List of nodes this interaction depends on
  • messages: Messages exchanged in this interaction
from strands_evals import Case, Experiment
from strands_evals.evaluators import InteractionsEvaluator

# Define a task function that returns interactions
def multi_agent_task(case: Case) -> dict:
    # Execute the multi-agent workflow
    # ...

    # Return the interactions produced by the workflow
    interactions = [
        {
            "node_name": "planner",
            "dependencies": [],
            "messages": "Created execution plan"
        },
        {
            "node_name": "executor",
            "dependencies": ["planner"],
            "messages": "Executed plan steps"
        },
        {
            "node_name": "validator",
            "dependencies": ["executor"],
            "messages": "Validated results"
        }
    ]
    return {
        "output": "Task completed",
        "interactions": interactions
    }
# Create test cases
test_cases = [
    Case[str, str](
        name="workflow-1",
        input="Process data pipeline",
        expected_interactions=[
            {"node_name": "planner", "dependencies": [], "messages": "Plan created"},
            {"node_name": "executor", "dependencies": ["planner"], "messages": "Executed"},
            {"node_name": "validator", "dependencies": ["executor"], "messages": "Validated"}
        ],
        metadata={"category": "workflow"}
    ),
]
# Create an evaluator with a single rubric for all nodes
evaluator = InteractionsEvaluator(
    rubric="""
    Evaluate the interaction based on:
    1. Correct node execution order
    2. Proper dependency handling
    3. Clear message communication
    Score 1.0 if all criteria are met.
    Score 0.5 if some issues exist.
    Score 0.0 if the interaction is incorrect.
    """
)

# Or use node-specific rubrics
evaluator = InteractionsEvaluator(
    rubric={
        "planner": "Evaluate if planning is thorough and logical",
        "executor": "Evaluate if execution follows the plan correctly",
        "validator": "Evaluate if validation is comprehensive"
    }
)

# Run the evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(multi_agent_task)
reports[0].run_display()

The InteractionsEvaluator returns a list of EvaluationOutput objects (one per interaction) with:

  • score: Float between 0.0 and 1.0 for each interaction
  • test_pass: Boolean indicating if the interaction passed
  • reason: Step-by-step reasoning for the evaluation
  • label: Optional label categorizing the result

The final interaction’s evaluation includes context from all previous interactions.
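
A sketch of inspecting these results for a single case is shown below. It assumes the per-interaction outputs are available as a list named outputs and that the fields above are exposed as attributes; adjust to how your report object actually surfaces them.

# Assumed: `outputs` is the list of EvaluationOutput objects for one case,
# with score, test_pass, reason, and label exposed as attributes.
for i, output in enumerate(outputs):
    status = "PASS" if output.test_pass else "FAIL"
    print(f"Interaction {i}: {status} (score={output.score:.2f})")
    print(f"  Reason: {output.reason}")
    if output.label:
        print(f"  Label: {output.label}")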

For each interaction, the evaluator examines:

  1. Current Interaction: Node name, dependencies, and messages
  2. Expected Sequence: Overview of the expected interaction sequence
  3. Relevant Expected Interactions: Window of expected interactions around current position
  4. Previous Evaluations: Context from earlier interactions (for later interactions)
  5. Final Output: Overall output (only for the last interaction)
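
The sliding-window selection in item 3 can be pictured with a small sketch. This is an illustration of the idea only, not the library's implementation, and the window radius is an assumed value.

# Illustration only: selecting the expected interactions around the current
# position. A radius of 1 is assumed, not taken from the library.
def expected_window(expected: list[dict], position: int, radius: int = 1) -> list[dict]:
    start = max(0, position - radius)
    end = min(len(expected), position + radius + 1)
    return expected[start:end]

# For the third interaction (position 2) in a five-step sequence, this returns
# the expected interactions at positions 1 through 3.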

Follow these best practices when using the InteractionsEvaluator:

  1. Define Clear Interaction Structure: Ensure interactions have consistent node_name, dependencies, and messages fields
  2. Use Node-Specific Rubrics: Provide tailored evaluation criteria for different agent types
  3. Track Dependencies: Clearly specify which nodes depend on others
  4. Update Descriptions: Use update_interaction_description() to provide context about the available interactions
  5. Test Sequences: Include test cases with various interaction patterns, such as the examples below

A sequential pipeline, where each node depends on the previous one:

interactions = [
    {"node_name": "input_validator", "dependencies": [], "messages": "Input validated"},
    {"node_name": "processor", "dependencies": ["input_validator"], "messages": "Data processed"},
    {"node_name": "output_formatter", "dependencies": ["processor"], "messages": "Output formatted"}
]

A fan-out/fan-in pattern, where parallel workers depend on a coordinator and an aggregator depends on all workers:

interactions = [
    {"node_name": "coordinator", "dependencies": [], "messages": "Tasks distributed"},
    {"node_name": "worker_1", "dependencies": ["coordinator"], "messages": "Task 1 completed"},
    {"node_name": "worker_2", "dependencies": ["coordinator"], "messages": "Task 2 completed"},
    {"node_name": "aggregator", "dependencies": ["worker_1", "worker_2"], "messages": "Results aggregated"}
]

A decision-driven pattern, where an action is executed only after analysis and an explicit decision:

interactions = [
    {"node_name": "analyzer", "dependencies": [], "messages": "Analysis complete"},
    {"node_name": "decision_maker", "dependencies": ["analyzer"], "messages": "Decision: proceed"},
    {"node_name": "executor", "dependencies": ["decision_maker"], "messages": "Action executed"}
]
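
As a quick pre-check on patterns like these, you can verify outside the evaluator that every declared dependency appears earlier in the sequence. The helper below is a minimal sketch and not part of strands_evals.

def dependencies_respected(interactions: list[dict]) -> bool:
    """Return True if every declared dependency appears earlier in the sequence."""
    seen: set[str] = set()
    for interaction in interactions:
        if any(dep not in seen for dep in interaction["dependencies"]):
            return False
        seen.add(interaction["node_name"])
    return True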

Scenario 1: Successful Multi-Agent Workflow

# Task: Research and summarize a topic
interactions = [
    {
        "node_name": "researcher",
        "dependencies": [],
        "messages": "Found 5 relevant sources"
    },
    {
        "node_name": "analyzer",
        "dependencies": ["researcher"],
        "messages": "Extracted key points from sources"
    },
    {
        "node_name": "writer",
        "dependencies": ["analyzer"],
        "messages": "Created comprehensive summary"
    }
]
# Evaluation: Each interaction is scored on quality and dependency adherence

Scenario 2: Workflow with Incorrect Dependencies

# Task: Process data pipeline
interactions = [
    {
        "node_name": "validator",
        "dependencies": [],
        "messages": "Validation skipped"  # Should depend on a data_loader node
    },
    {
        "node_name": "processor",
        "dependencies": ["validator"],
        "messages": "Processing failed"
    }
]
# Evaluation: Low scores due to incorrect dependency handling

Issue 1: Missing Interaction Fields

Problem: Interactions are missing required keys (node_name, dependencies, messages). Solution: Ensure every interaction includes all three required fields (see the pre-flight check sketched below).

Issue 2: Incorrect Dependency Specification

Problem: Dependencies don’t match the actual execution order. Solution: Verify that dependency lists accurately reflect the workflow.

Issue 3: Incomplete Rubric Dictionary

Problem: The node-specific rubric dictionary is missing keys for some nodes. Solution: Ensure the rubric dictionary contains an entry for every node name, or use a single string rubric.
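
For Issue 1, a small pre-flight check can catch malformed interactions before you run the evaluation. This helper is a sketch and not part of strands_evals.

REQUIRED_KEYS = {"node_name", "dependencies", "messages"}

def find_malformed_interactions(interactions: list[dict]) -> list[str]:
    """Report interactions that are missing any of the required fields."""
    problems = []
    for i, interaction in enumerate(interactions):
        missing = REQUIRED_KEYS - interaction.keys()
        if missing:
            problems.append(f"Interaction {i} is missing: {sorted(missing)}")
    return problems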

Typical applications include:

  • Multi-agent coordination: Evaluate coordination between multiple specialized agents.
  • Workflow execution: Assess execution of complex, multi-step workflows.
  • Message quality: Measure the quality of information transfer between agents.
  • Dependency validation: Verify that agents respect declared dependencies.