AgentCore Evaluation Dashboard Configuration
This guide explains how to configure AWS Distro for OpenTelemetry (ADOT) to send Strands evaluation results to Amazon CloudWatch, enabling visualization in the GenAI Observability: Bedrock AgentCore Observability dashboard.
Overview
The Strands Evals SDK integrates with Amazon Bedrock AgentCore's observability infrastructure to provide comprehensive evaluation metrics and dashboards. By configuring ADOT environment variables, you can:
- Send evaluation results to CloudWatch Logs in EMF (Embedded Metric Format)
- View evaluation metrics in the GenAI Observability dashboard
- Track evaluation scores, pass/fail rates, and detailed explanations
- Correlate evaluations with agent traces and sessions
Prerequisites
Before configuring the evaluation dashboard, ensure you have:
- AWS Account with appropriate permissions for CloudWatch and Bedrock AgentCore
- CloudWatch Transaction Search enabled (one-time setup)
- ADOT SDK installed in your environment (see Step 3 below)
- Strands Evals SDK installed (`pip install strands-evals`)
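To confirm these prerequisites from the command line, a quick check such as the following can help (standard AWS CLI and pip commands; package names match the install step in Step 3):

```bash
# Verify AWS credentials and region resolve correctly
aws sts get-caller-identity

# Verify the required Python packages are installed
pip show aws-opentelemetry-distro strands-evals
```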
Step 1: Enable CloudWatch Transaction Search
CloudWatch Transaction Search must be enabled to view evaluation data in the GenAI Observability dashboard. This is a one-time setup per AWS account and region.
Using the CloudWatch Console
- Open the CloudWatch console
- In the navigation pane, expand Application Signals (APM) and choose Transaction search
- Choose Enable Transaction Search
- Select the checkbox to ingest spans as structured logs
- Choose Save
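If you prefer to script this setup, Transaction Search can also be enabled through the X-Ray CLI. The following is a sketch based on the Transaction Search commands documented at the time of writing; verify the exact commands and any required CloudWatch Logs resource policy against the current AWS documentation.

```bash
# Route X-Ray trace segments to CloudWatch Logs as structured spans
# (the CLI equivalent of enabling Transaction Search in the console)
aws xray update-trace-segment-destination --destination CloudWatchLogs

# Confirm the change; the Destination field should report CloudWatchLogs
aws xray get-trace-segment-destination
```

Note that the console flow also configures the CloudWatch Logs resource policy that allows X-Ray to deliver spans; when scripting the setup, that policy may need to be created separately.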
Step 2: Configure Environment Variables
Configure the following environment variables to enable ADOT integration and send evaluation results to CloudWatch.
Complete Environment Variable Configuration
Section titled “Complete Environment Variable Configuration”# Enable agent observabilityexport AGENT_OBSERVABILITY_ENABLED="true"
# Configure ADOT for Pythonexport OTEL_PYTHON_DISTRO="aws_distro"export OTEL_PYTHON_CONFIGURATOR="aws_configurator"
# Set log level for debugging (optional, use "info" for production)export OTEL_LOG_LEVEL="debug"
# Configure exportersexport OTEL_METRICS_EXPORTER="awsemf"export OTEL_TRACES_EXPORTER="otlp"export OTEL_LOGS_EXPORTER="otlp"
# Set OTLP protocolexport OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
# Configure service name and log groupexport OTEL_RESOURCE_ATTRIBUTES="service.name=my-evaluation-service,aws.log.group.names=/aws/bedrock-agentcore/runtimes/my-eval-logs"
# Enable Python logging auto-instrumentationexport OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED="true"
# Capture GenAI message contentexport OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT="true"
# Disable AWS Application Signals (not needed for evaluations)export OTEL_AWS_APPLICATION_SIGNALS_ENABLED="true"
# Configure OTLP endpointsexport OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://xray.us-east-1.amazonaws.com/v1/traces"export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="https://logs.us-east-1.amazonaws.com/v1/logs"
# Configure log export headersexport OTEL_EXPORTER_OTLP_LOGS_HEADERS="x-aws-log-group=/aws/bedrock-agentcore/runtimes/my-eval-logs,x-aws-log-stream=default,x-aws-metric-namespace=my-evaluation-namespace"
# Disable unnecessary instrumentations for better performanceexport OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="http,sqlalchemy,psycopg2,pymysql,sqlite3,aiopg,asyncpg,mysql_connector,urllib3,requests,system_metrics,google-genai"
# Configure evaluation results log group (used by Strands Evals)export EVALUATION_RESULTS_LOG_GROUP="my-evaluation-results"
# AWS configurationexport AWS_REGION="us-east-1"export AWS_DEFAULT_REGION="us-east-1"Environment Variable Descriptions
| Variable | Description | Example Value |
|---|---|---|
| AGENT_OBSERVABILITY_ENABLED | Enables CloudWatch logging for evaluations | true |
| OTEL_PYTHON_DISTRO | Specifies ADOT distribution | aws_distro |
| OTEL_PYTHON_CONFIGURATOR | Configures ADOT for AWS | aws_configurator |
| OTEL_LOG_LEVEL | Sets OpenTelemetry log level | debug or info |
| OTEL_METRICS_EXPORTER | Metrics exporter type | awsemf |
| OTEL_TRACES_EXPORTER | Traces exporter type | otlp |
| OTEL_LOGS_EXPORTER | Logs exporter type | otlp |
| OTEL_EXPORTER_OTLP_PROTOCOL | OTLP protocol format | http/protobuf |
| OTEL_RESOURCE_ATTRIBUTES | Service name and log group for resource attributes | service.name=my-service,aws.log.group.names=/aws/bedrock-agentcore/runtimes/logs |
| OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED | Auto-instrument Python logging | true |
| OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT | Capture GenAI message content | true |
| OTEL_AWS_APPLICATION_SIGNALS_ENABLED | Enable AWS Application Signals | false |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | X-Ray traces endpoint | https://xray.us-east-1.amazonaws.com/v1/traces |
| OTEL_EXPORTER_OTLP_LOGS_ENDPOINT | CloudWatch logs endpoint | https://logs.us-east-1.amazonaws.com/v1/logs |
| OTEL_EXPORTER_OTLP_LOGS_HEADERS | CloudWatch log destination headers | x-aws-log-group=/aws/bedrock-agentcore/runtimes/logs,x-aws-log-stream=default,x-aws-metric-namespace=namespace |
| OTEL_PYTHON_DISABLED_INSTRUMENTATIONS | Disable unnecessary instrumentations | http,sqlalchemy,psycopg2,... |
| EVALUATION_RESULTS_LOG_GROUP | Base name for evaluation results log group | my-evaluation-results |
| AWS_REGION | AWS region for CloudWatch | us-east-1 |
Step 3: Install ADOT SDK
Install the AWS Distro for OpenTelemetry SDK in your Python environment:
```bash
pip install "aws-opentelemetry-distro>=0.10.0" boto3
```

Or add to your requirements.txt:
```
aws-opentelemetry-distro>=0.10.0
boto3
strands-evals
```

Step 4: Run Evaluations with ADOT
Execute your evaluation script using the OpenTelemetry auto-instrumentation command:
```bash
opentelemetry-instrument python my_evaluation_script.py
```

Complete Setup and Execution Script
```bash
#!/bin/bash

# AWS Configuration
export AWS_REGION="us-east-1"
export AWS_DEFAULT_REGION="us-east-1"

# Enable Agent Observability
export AGENT_OBSERVABILITY_ENABLED="true"

# ADOT Configuration
export OTEL_LOG_LEVEL="debug"
export OTEL_METRICS_EXPORTER="awsemf"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_LOGS_EXPORTER="otlp"
export OTEL_PYTHON_DISTRO="aws_distro"
export OTEL_PYTHON_CONFIGURATOR="aws_configurator"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"

# Service Configuration
SERVICE_NAME="test-agent-3"
LOG_GROUP="/aws/bedrock-agentcore/runtimes/strands-agents-tests"
METRIC_NAMESPACE="test-strands-agentcore"

export OTEL_RESOURCE_ATTRIBUTES="service.name=${SERVICE_NAME},aws.log.group.names=${LOG_GROUP}"
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED="true"
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT="true"
export OTEL_AWS_APPLICATION_SIGNALS_ENABLED="false"

# OTLP Endpoints
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://xray.${AWS_REGION}.amazonaws.com/v1/traces"
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="https://logs.${AWS_REGION}.amazonaws.com/v1/logs"
export OTEL_EXPORTER_OTLP_LOGS_HEADERS="x-aws-log-group=${LOG_GROUP},x-aws-log-stream=default,x-aws-metric-namespace=${METRIC_NAMESPACE}"

# Disable Unnecessary Instrumentations
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="http,sqlalchemy,psycopg2,pymysql,sqlite3,aiopg,asyncpg,mysql_connector,urllib3,requests,system_metrics,google-genai"

# Evaluation Results Configuration
export EVALUATION_RESULTS_LOG_GROUP="strands-agents-tests"

# Run evaluations with ADOT instrumentation
opentelemetry-instrument python evaluation_agentcore_dashboard.py
```

Example Evaluation Script
```python
from strands_evals import Experiment, Case
from strands_evals.evaluators import OutputEvaluator

# Create evaluation cases
cases = [
    Case(
        name="Knowledge Test",
        input="What is the capital of France?",
        expected_output="The capital of France is Paris.",
        metadata={"category": "knowledge"}
    ),
    Case(
        name="Math Test",
        input="What is 2+2?",
        expected_output="2+2 equals 4.",
        metadata={"category": "math"}
    )
]

# Create evaluator
evaluator = OutputEvaluator(
    rubric="The output is accurate and complete. Score 1 if correct, 0 if incorrect."
)

# Create experiment
experiment = Experiment(cases=cases, evaluator=evaluator)

# Define your task function
def my_agent_task(case: Case) -> str:
    # Your agent logic here
    # This should return the agent's response
    return f"Response to: {case.input}"

# Run evaluations
report = experiment.run_evaluations(my_agent_task)

print(f"Overall Score: {report.overall_score}")
print(f"Pass Rate: {sum(report.test_passes)}/{len(report.test_passes)}")
```

For Containerized Environments (Docker)
Add the OpenTelemetry instrumentation to your Dockerfile CMD:
```dockerfile
FROM python:3.11

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV AGENT_OBSERVABILITY_ENABLED=true \
    OTEL_PYTHON_DISTRO=aws_distro \
    OTEL_PYTHON_CONFIGURATOR=aws_configurator \
    OTEL_METRICS_EXPORTER=awsemf \
    OTEL_TRACES_EXPORTER=otlp \
    OTEL_LOGS_EXPORTER=otlp \
    OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# Run with ADOT instrumentation
CMD ["opentelemetry-instrument", "python", "evaluation_agentcore_dashboard.py"]
```

Step 5: View Evaluation Results in CloudWatch
Once your evaluations are running with ADOT configured, you can view the results in multiple locations:
GenAI Observability Dashboard
- Open the CloudWatch GenAI Observability page
- Navigate to Bedrock AgentCore Observability section
- View evaluation metrics including:
- Evaluation scores by service name
- Pass/fail rates by label
- Evaluation trends over time
- Detailed evaluation explanations
CloudWatch Logs
Evaluation results are stored in the log group:
```
/aws/bedrock-agentcore/evaluations/results/{EVALUATION_RESULTS_LOG_GROUP}
```

Each log entry contains:
- Evaluation score and label (YES/NO)
- Evaluator name (e.g., `Custom.OutputEvaluator`)
- Trace ID for correlation
- Session ID
- Detailed explanation
- Input/output data
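To inspect raw entries from the terminal, you can tail the results log group directly. The log group name below assumes the `EVALUATION_RESULTS_LOG_GROUP="my-evaluation-results"` example from Step 2; substitute your own value.

```bash
# Stream recent evaluation result entries; --follow keeps the tail open
aws logs tail "/aws/bedrock-agentcore/evaluations/results/my-evaluation-results" \
  --since 1h \
  --follow
```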
CloudWatch Metrics
Metrics are published to the namespace specified in x-aws-metric-namespace with dimensions:
- service.name: Your service name
- label: Evaluation label (YES/NO)
- onlineEvaluationConfigId: Configuration identifier
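As a quick check that metrics are arriving, you can list what has been published under your namespace. The namespace and service name below reuse the example values from Step 2; the dimension name follows the list above.

```bash
# List metrics published under the configured EMF namespace
aws cloudwatch list-metrics --namespace "my-evaluation-namespace"

# Optionally narrow the listing to one service via the service.name dimension
aws cloudwatch list-metrics \
  --namespace "my-evaluation-namespace" \
  --dimensions Name=service.name,Value=my-evaluation-service
```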
Advanced Configuration
Custom Service Names

Set a custom service name to organize evaluations:
```bash
export OTEL_RESOURCE_ATTRIBUTES="service.name=my-custom-agent,aws.log.group.names=/aws/bedrock-agentcore/runtimes/custom-logs"
```

Session ID Propagation
To correlate evaluations with agent sessions, set the session ID in your cases:
```python
case = Case(
    name="Test Case",
    input="Test input",
    expected_output="Expected output",
    session_id="my-session-123"  # Links evaluation to agent session
)
```

Async Evaluations
For better performance with multiple test cases, use async evaluations:
```python
import asyncio

async def run_async_evaluations():
    report = await experiment.run_evaluations_async(
        my_agent_task,
        max_workers=10  # Parallel execution
    )
    return report

# Run async evaluations
report = asyncio.run(run_async_evaluations())
```

Custom Evaluators
Create custom evaluators with specific scoring logic:
```python
from strands_evals.evaluators import Evaluator
from strands_evals.types.evaluation import EvaluationData, EvaluationOutput

class CustomEvaluator(Evaluator):
    def __init__(self, threshold: float = 0.8):
        super().__init__()
        self.threshold = threshold
        self._score_mapping = {"PASS": 1.0, "FAIL": 0.0}

    def evaluate(self, data: EvaluationData) -> list[EvaluationOutput]:
        # Your custom evaluation logic
        score = 1.0 if self._check_quality(data.actual_output) else 0.0
        label = "PASS" if score >= self.threshold else "FAIL"

        return [EvaluationOutput(
            score=score,
            passed=(score >= self.threshold),
            reason=f"Quality check: {label}"
        )]

    def _check_quality(self, output) -> bool:
        # Implement your quality check
        return True
```

Performance Optimization
Disable unnecessary instrumentations to improve performance:
```bash
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="http,sqlalchemy,psycopg2,pymysql,sqlite3,aiopg,asyncpg,mysql_connector,urllib3,requests,system_metrics,google-genai"
```

This disables instrumentation for libraries that aren't needed for evaluation telemetry, reducing overhead.
Troubleshooting
Evaluations Not Appearing in Dashboard
- Verify CloudWatch Transaction Search is enabled:

  ```bash
  aws xray get-trace-segment-destination
  ```

  Should return:

  ```json
  {"Destination": "CloudWatchLogs"}
  ```

- Check environment variables are set correctly:

  ```bash
  echo $AGENT_OBSERVABILITY_ENABLED
  echo $OTEL_RESOURCE_ATTRIBUTES
  echo $OTEL_EXPORTER_OTLP_LOGS_ENDPOINT
  ```

- Verify the log group exists:

  ```bash
  aws logs describe-log-groups \
    --log-group-name-prefix "/aws/bedrock-agentcore"
  ```

- Check IAM permissions - ensure your execution role has the following actions (see the example policy below):
  - logs:CreateLogGroup
  - logs:CreateLogStream
  - logs:PutLogEvents
  - xray:PutTraceSegments
  - xray:PutTelemetryRecords
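As a sketch only, the permissions above could be attached as an inline policy along these lines (the role and policy names are placeholders, and the Resource should be scoped down to your log groups and traces where possible):

```bash
# Hypothetical inline policy covering the actions listed above;
# replace the role name and restrict Resource for production use.
aws iam put-role-policy \
  --role-name my-evaluation-role \
  --policy-name evaluation-observability-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "xray:PutTraceSegments",
          "xray:PutTelemetryRecords"
        ],
        "Resource": "*"
      }
    ]
  }'
```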
Missing Metrics
If metrics aren't appearing in CloudWatch:
- Verify that `OTEL_EXPORTER_OTLP_LOGS_HEADERS` includes `x-aws-metric-namespace`
- Check that `OTEL_METRICS_EXPORTER="awsemf"` is set
- Ensure evaluations are completing successfully (no exceptions)
- Wait 5-10 minutes for metrics to propagate to CloudWatch
Log Format Issues
If logs aren't in the correct format:
- Ensure `OTEL_PYTHON_DISTRO=aws_distro` is set
- Verify `OTEL_PYTHON_CONFIGURATOR=aws_configurator` is set
- Check that `aws-opentelemetry-distro>=0.10.0` is installed
- Verify `OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf` is set
Debug Mode
Enable debug logging to troubleshoot issues:
```bash
export OTEL_LOG_LEVEL="debug"
```

This will output detailed ADOT logs to help identify configuration problems.
Best Practices
- Use Consistent Service Names: Use the same service name across related evaluations for easier filtering and analysis

- Include Session IDs: Always include session IDs in your test cases to correlate evaluations with agent interactions

- Set Appropriate Sampling: For high-volume evaluations, adjust the X-Ray sampling percentage to balance cost and visibility (see the sketch after this list)

- Monitor Log Group Size: Evaluation logs can grow quickly; set up log retention policies:

  ```bash
  aws logs put-retention-policy \
    --log-group-name "/aws/bedrock-agentcore/evaluations/results/my-eval" \
    --retention-in-days 30
  ```

- Use Descriptive Evaluator Names: Custom evaluators should have clear, descriptive names that appear in the dashboard

- Optimize Performance: Disable unnecessary instrumentations to reduce overhead in production environments

- Tag Evaluations: Use metadata in test cases to add context:

  ```python
  Case(
      name="Test",
      input="...",
      expected_output="...",
      metadata={
          "environment": "production",
          "version": "v1.2.3",
          "category": "accuracy"
      }
  )
  ```

- Use Info Log Level in Production: Set `OTEL_LOG_LEVEL="info"` in production to reduce log volume
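For the sampling best practice above, Transaction Search exposes the percentage of spans it indexes as an indexing rule. The command below is a sketch based on the X-Ray Transaction Search CLI as documented at the time of writing; the rule name "Default" and the 10% value are illustrative, so confirm the syntax against the current AWS reference before using it.

```bash
# Index roughly 10% of spans for Transaction Search to balance cost and visibility
aws xray update-indexing-rule \
  --name "Default" \
  --rule '{"Probabilistic": {"DesiredSamplingPercentage": 10}}'
```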