# Tool Parameter Accuracy Evaluator

## Overview
The ToolParameterAccuracyEvaluator is a specialized evaluator that assesses whether tool call parameters faithfully use information from the preceding conversation context. It evaluates each tool call individually to ensure parameters are grounded in available information rather than hallucinated or incorrectly inferred. A complete example can be found here.
## Key Features

- Tool-Level Evaluation: Evaluates each tool call independently
- Context Faithfulness: Checks if parameters are derived from conversation history
- Binary Scoring: Simple Yes/No evaluation for clear pass/fail criteria
- Structured Reasoning: Provides step-by-step reasoning for each evaluation
- Async Support: Supports both synchronous and asynchronous evaluation
- Multiple Evaluations: Returns one evaluation result per tool call
## When to Use

Use the ToolParameterAccuracyEvaluator when you need to:
- Verify that tool parameters are based on actual conversation context
- Detect hallucinated or fabricated parameter values
- Ensure agents don’t make assumptions beyond available information
- Validate that agents correctly extract information for tool calls
- Debug issues with incorrect tool parameter usage
- Ensure data integrity in tool-based workflows
## Evaluation Level

This evaluator operates at the TOOL_LEVEL, meaning it evaluates each individual tool call in the trajectory separately. If an agent makes 3 tool calls, you'll receive 3 evaluation results.
## Parameters

### model (optional)

- Type: `Union[Model, str, None]`
- Default: `None` (uses default Bedrock model)
- Description: The model to use as the judge. Can be a model ID string or a Model instance.
### system_prompt (optional)

- Type: `str | None`
- Default: `None` (uses built-in template)
- Description: Custom system prompt to guide the judge model's behavior.
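Both parameters can be overridden when the evaluator is constructed. A minimal sketch, assuming a Bedrock model ID string is accepted for `model`; the specific model ID and prompt text below are illustrative, not recommendations:

```python
from strands_evals.evaluators import ToolParameterAccuracyEvaluator

# Default: built-in prompt template and the default Bedrock judge model.
default_evaluator = ToolParameterAccuracyEvaluator()

# Custom judge model and system prompt (sketch; the model ID is illustrative).
custom_evaluator = ToolParameterAccuracyEvaluator(
    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    system_prompt=(
        "You are a strict judge. Answer Yes only if every tool parameter value "
        "can be traced to the conversation history; otherwise answer No."
    ),
)
```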
## Scoring System

The evaluator uses a binary scoring system:
- Yes (1.0): Parameters faithfully use information from the context
- No (0.0): Parameters contain hallucinated, fabricated, or incorrectly inferred values
## Basic Usage

```python
from strands import Agent
from strands_tools import calculator
from strands_evals import Case, Experiment
from strands_evals.evaluators import ToolParameterAccuracyEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

# Define task function
def user_task_function(case: Case) -> dict:
    memory_exporter.clear()

    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        tools=[calculator],
        callback_handler=None
    )
    agent_response = agent(case.input)

    # Map spans to session
    finished_spans = memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Create test cases
test_cases = [
    Case[str, str](
        name="simple-calculation",
        input="Calculate the square root of 144",
        metadata={"category": "math", "difficulty": "easy"}
    ),
]

# Create evaluator
evaluator = ToolParameterAccuracyEvaluator()

# Run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
```

## Evaluation Output
The ToolParameterAccuracyEvaluator returns a list of EvaluationOutput objects (one per tool call) with:
- score: `1.0` (Yes) or `0.0` (No)
- test_pass: `True` if score is 1.0, `False` otherwise
- reason: Step-by-step reasoning explaining the evaluation
- label: "Yes" or "No"
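For example, you might print these fields for each tool call. A sketch with a hypothetical helper name; how the list of EvaluationOutput objects is pulled out of a report is not covered here, so the `results` argument stands in for that list:

```python
from typing import Sequence

def summarize_tool_parameter_results(results: Sequence) -> None:
    """Print a summary line per tool call.

    `results` is the list of EvaluationOutput objects for a single case,
    one entry per tool call in the trajectory.
    """
    for i, result in enumerate(results):
        status = "PASS" if result.test_pass else "FAIL"
        print(f"Tool call {i}: {status}")
        print(f"  score:  {result.score}")   # 1.0 (Yes) or 0.0 (No)
        print(f"  label:  {result.label}")   # "Yes" or "No"
        print(f"  reason: {result.reason}")  # step-by-step judge reasoning
```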
## What Gets Evaluated

The evaluator examines:
- Available Tools: The tools that were available to the agent
- Previous Conversation History: All prior messages and tool executions
- Target Tool Call: The specific tool call being evaluated, including:
  - Tool name
  - All parameter values
The judge determines if each parameter value can be traced back to information in the conversation history.
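To make this concrete, here is a hypothetical illustration (as Python comments) of a parameter the judge would consider grounded versus one it would consider hallucinated; the tool and parameter names are illustrative:

```python
# Conversation history:
#   User: "Calculate the square root of 144"
#
# Grounded tool call -> judged "Yes" (score 1.0):
#   calculator(expression="sqrt(144)")   # 144 comes directly from the user's request
#
# Hallucinated tool call -> judged "No" (score 0.0):
#   calculator(expression="sqrt(169)")   # 169 appears nowhere in the conversation
```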
## Best Practices

- Use with Proper Telemetry Setup: The evaluator requires trajectory information captured via OpenTelemetry
- Test Edge Cases: Include test cases that challenge parameter accuracy (missing info, ambiguous info, etc.)
- Combine with Other Evaluators: Use alongside tool selection and output evaluators for comprehensive assessment (see the sketch after this list)
- Review Reasoning: Always review the reasoning provided in evaluation results
- Use Appropriate Models: Consider using stronger models for evaluation
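Combining evaluators amounts to passing several of them to the same Experiment. A sketch, reusing `test_cases` and `user_task_function` from the Basic Usage example; the import path for ToolSelectionAccuracyEvaluator is an assumption based on the Related Evaluators list below:

```python
from strands_evals import Experiment
from strands_evals.evaluators import (
    ToolParameterAccuracyEvaluator,
    ToolSelectionAccuracyEvaluator,  # import path assumed, see Related Evaluators
)

# Run parameter-accuracy and tool-selection checks in one experiment.
evaluators = [
    ToolParameterAccuracyEvaluator(),
    ToolSelectionAccuracyEvaluator(),
]
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
```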
## Common Issues and Solutions

### Issue 1: No Evaluations Returned

Problem: The evaluator returns an empty list or no results.

Solution: Ensure the trajectory is properly captured and includes tool calls.
### Issue 2: False Negatives

Problem: The evaluator marks valid parameters as inaccurate.

Solution: Ensure the conversation history is complete and the context is clear.
### Issue 3: Inconsistent Results

Problem: The same test case produces different evaluation results.

Solution: This is expected due to LLM non-determinism. Run the evaluation multiple times and aggregate the scores (see the sketch below).
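One way to aggregate, sketched below: re-run the experiment several times and average the per-tool-call scores. The `extract_scores` callback is hypothetical, since how scores are read back from a report object is not covered on this page:

```python
from statistics import mean
from typing import Callable, Sequence

def averaged_score(
    run_experiment: Callable[[], object],
    extract_scores: Callable[[object], Sequence[float]],
    num_runs: int = 5,
) -> float:
    """Average the 1.0/0.0 parameter-accuracy scores over several runs to
    smooth out judge (LLM) non-determinism.

    `run_experiment` should execute the experiment and return a report;
    `extract_scores` (hypothetical helper) should pull the per-tool-call
    scores out of that report in your setup.
    """
    all_scores: list[float] = []
    for _ in range(num_runs):
        report = run_experiment()
        all_scores.extend(extract_scores(report))
    return mean(all_scores) if all_scores else 0.0
```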
## Related Evaluators

- ToolSelectionAccuracyEvaluator: Evaluates if correct tools were selected
- TrajectoryEvaluator: Evaluates the overall sequence of tool calls
- FaithfulnessEvaluator: Evaluates if responses are grounded in context
- OutputEvaluator: Evaluates the quality of final outputs