Faithfulness Evaluator
Overview
The FaithfulnessEvaluator evaluates whether agent responses are grounded in the conversation history. It assesses whether the agent's statements are faithful to the information available in the preceding context, helping detect hallucinations and unsupported claims. A complete example can be found here.
Key Features
- Trace-Level Evaluation: Evaluates the most recent turn in the conversation
- Context Grounding: Checks if responses are based on conversation history
- Categorical Scoring: Five-level scale from “Not At All” to “Completely Yes”
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
- Async Support: Supports both synchronous and asynchronous evaluation
- Hallucination Detection: Identifies fabricated or unsupported information
When to Use
Use the FaithfulnessEvaluator when you need to:
- Detect hallucinations in agent responses
- Verify that responses are grounded in available context
- Ensure agents don’t fabricate information
- Validate that claims are supported by conversation history
- Assess information accuracy in multi-turn conversations
- Debug issues with context adherence
Evaluation Level
This evaluator operates at the TRACE_LEVEL, meaning it evaluates the most recent turn in the conversation (the last agent response and its context).
Parameters
model (optional)
- Type: Union[Model, str, None]
- Default: None (uses the default Bedrock model)
- Description: The model to use as the judge. Can be a model ID string or a Model instance.
system_prompt (optional)
- Type: str | None
- Default: None (uses the built-in template)
- Description: Custom system prompt to guide the judge model's behavior.
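For example, a minimal sketch of constructing the evaluator with these parameters; the model ID and prompt text below are illustrative, not defaults:

from strands_evals.evaluators import FaithfulnessEvaluator

# Default judge: default Bedrock model and built-in prompt template
evaluator = FaithfulnessEvaluator()

# Custom judge model and system prompt (both values are illustrative)
strict_evaluator = FaithfulnessEvaluator(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    system_prompt=(
        "You are a strict faithfulness judge. Mark a response as faithful "
        "only if every claim is explicitly supported by the conversation history."
    ),
)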
Scoring System
The evaluator uses a five-level categorical scoring system:
- Not At All (0.0): Response contains significant fabrications or unsupported claims
- Not Generally (0.25): Response is mostly unfaithful with some grounded elements
- Neutral/Mixed (0.5): Response has both faithful and unfaithful elements
- Generally Yes (0.75): Response is mostly faithful with minor issues
- Completely Yes (1.0): Response is completely grounded in conversation history
A response passes the evaluation if the score is >= 0.5.
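The label-to-score mapping and the pass rule can be summarized as follows (an illustrative sketch, not the evaluator's internal code):

# Illustrative only: categorical labels, their scores, and the pass threshold
FAITHFULNESS_SCORES = {
    "Not At All": 0.0,
    "Not Generally": 0.25,
    "Neutral/Mixed": 0.5,
    "Generally Yes": 0.75,
    "Completely Yes": 1.0,
}

def passes(label: str) -> bool:
    return FAITHFULNESS_SCORES[label] >= 0.5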
Basic Usage
Section titled “Basic Usage”from strands import Agentfrom strands_evals import Case, Experimentfrom strands_evals.evaluators import FaithfulnessEvaluatorfrom strands_evals.mappers import StrandsInMemorySessionMapperfrom strands_evals.telemetry import StrandsEvalsTelemetry
# Setup telemetrytelemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()memory_exporter = telemetry.in_memory_exporter
# Define task functiondef user_task_function(case: Case) -> dict: memory_exporter.clear()
agent = Agent( trace_attributes={ "gen_ai.conversation.id": case.session_id, "session.id": case.session_id }, callback_handler=None ) agent_response = agent(case.input)
# Map spans to session finished_spans = memory_exporter.get_finished_spans() mapper = StrandsInMemorySessionMapper() session = mapper.map_to_session(finished_spans, session_id=case.session_id)
return {"output": str(agent_response), "trajectory": session}
# Create test casestest_cases = [ Case[str, str]( name="knowledge-1", input="What is the capital of France?", metadata={"category": "knowledge"} ), Case[str, str]( name="knowledge-2", input="What color is the ocean?", metadata={"category": "knowledge"} ),]
# Create evaluatorevaluator = FaithfulnessEvaluator()
# Run evaluationexperiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])reports = experiment.run_evaluations(user_task_function)reports[0].run_display()Evaluation Output
The FaithfulnessEvaluator returns EvaluationOutput objects with:
- score: Float between 0.0 and 1.0 (0.0, 0.25, 0.5, 0.75, or 1.0)
- test_pass: True if score >= 0.5, False otherwise
- reason: Step-by-step reasoning explaining the evaluation
- label: One of the categorical labels (e.g., "Completely Yes", "Neutral/Mixed")
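A rough sketch of reading these fields after an experiment run; the evaluation_outputs attribute name is an assumption, since this page only shows run_display():

reports = experiment.run_evaluations(user_task_function)

# Attribute name `evaluation_outputs` is assumed, not confirmed by this page
for output in reports[0].evaluation_outputs:
    print(f"{output.label} (score={output.score}, pass={output.test_pass})")
    print(output.reason)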
What Gets Evaluated
The evaluator examines:
- Conversation History: All prior messages and tool executions
- Assistant’s Response: The most recent agent response
- Context Grounding: Whether claims in the response are supported by the history
The judge determines if the agent’s statements are faithful to the available information or if they contain fabrications, assumptions, or unsupported claims.
Best Practices
- Use with Proper Telemetry Setup: The evaluator requires trajectory information captured via OpenTelemetry
- Provide Complete Context: Ensure full conversation history is captured in traces
- Test with Known Facts: Include test cases with verifiable information
- Monitor Hallucination Patterns: Track which types of queries lead to unfaithful responses
- Combine with Other Evaluators: Use alongside output quality evaluators for comprehensive assessment
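For the last practice, a minimal sketch of running several evaluators in one experiment; it assumes HelpfulnessEvaluator can be imported from the same evaluators module, which this page does not confirm:

from strands_evals import Experiment
from strands_evals.evaluators import FaithfulnessEvaluator, HelpfulnessEvaluator  # HelpfulnessEvaluator location assumed

experiment = Experiment[str, str](
    cases=test_cases,
    evaluators=[FaithfulnessEvaluator(), HelpfulnessEvaluator()],
)
reports = experiment.run_evaluations(user_task_function)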
Common Patterns
Pattern 1: Detecting Fabrications
Identify when agents make up information not present in the context.
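For instance, a test case like the one below (prompt text is illustrative) asks about details the agent has no grounded source for, so any specific figures in the answer should pull the faithfulness score down:

fabrication_case = Case[str, str](
    name="fabrication-check",
    input="What did our earlier search results say about the company's 2023 revenue?",
    metadata={"category": "fabrication"},
)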
Pattern 2: Validating Tool Results
Ensure agents accurately represent information from tool calls.
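One way to exercise this is a task function whose agent calls a tool with a known, fixed output, so the judge can compare the final response against the tool result. The sketch below assumes the Strands @tool decorator and the tools parameter of Agent, and reuses memory_exporter and StrandsInMemorySessionMapper from Basic Usage:

from strands import Agent, tool  # @tool decorator assumed available in the Strands SDK

@tool
def search_docs(query: str) -> str:
    """Return a canned snippet so the expected grounding is known in advance."""
    return "Python is a high-level programming language."

def tool_task_function(case: Case) -> dict:
    memory_exporter.clear()
    agent = Agent(
        tools=[search_docs],
        trace_attributes={"gen_ai.conversation.id": case.session_id, "session.id": case.session_id},
        callback_handler=None,
    )
    agent_response = agent(case.input)
    finished_spans = memory_exporter.get_finished_spans()
    session = StrandsInMemorySessionMapper().map_to_session(finished_spans, session_id=case.session_id)
    return {"output": str(agent_response), "trajectory": session}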
Pattern 3: Multi-Turn Consistency
Check that agents maintain consistency across conversation turns.
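A sketch of a multi-turn task function: the same Agent instance is invoked twice, so the earlier turn becomes part of the conversation history that the final response is judged against (prompts are illustrative):

def multi_turn_task_function(case: Case) -> dict:
    memory_exporter.clear()
    agent = Agent(
        trace_attributes={"gen_ai.conversation.id": case.session_id, "session.id": case.session_id},
        callback_handler=None,
    )
    agent("My order number is 4312 and it still has not arrived.")  # illustrative first turn
    agent_response = agent(case.input)  # e.g. "What was my order number again?"
    finished_spans = memory_exporter.get_finished_spans()
    session = StrandsInMemorySessionMapper().map_to_session(finished_spans, session_id=case.session_id)
    return {"output": str(agent_response), "trajectory": session}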
Example Scenarios
Scenario 1: Faithful Response
User: "What did the search results say about Python?"
Agent: "The search results indicated that Python is a high-level programming language."
Evaluation: Completely Yes (1.0) - Response accurately reflects search results

Scenario 2: Unfaithful Response
User: "What did the search results say about Python?"
Agent: "Python was created in 1991 by Guido van Rossum and is the most popular language."
Evaluation: Not Generally (0.25) - Response adds information not in search results

Scenario 3: Mixed Response
User: "What did the search results say about Python?"
Agent: "The search results showed Python is a programming language. It's also the fastest language."
Evaluation: Neutral/Mixed (0.5) - First part faithful, second part unsupported

Common Issues and Solutions
Issue 1: No Evaluation Returned
Problem: Evaluator returns empty results.
Solution: Ensure the trajectory contains at least one agent invocation span.
Issue 2: Overly Strict Evaluation
Problem: Evaluator marks reasonable inferences as unfaithful.
Solution: Review the system prompt and consider whether the agent is expected to make reasonable inferences.
Issue 3: Context Not Captured
Problem: Evaluation doesn't consider the full conversation history.
Solution: Verify that the telemetry setup captures all messages and tool executions.
Related Evaluators
- HelpfulnessEvaluator: Evaluates helpfulness from the user's perspective
- OutputEvaluator: Evaluates overall output quality
- ToolParameterAccuracyEvaluator: Evaluates if tool parameters are grounded in context
- GoalSuccessRateEvaluator: Evaluates if overall goals were achieved