# Experiment Management
## Overview
Test cases in Strands Evals are organized into `Experiment` objects. This guide covers practical patterns for managing experiments and test cases.
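For orientation, here is a minimal end-to-end sketch using the pieces covered below. The `OutputEvaluator` import path is an assumption (only the `HelpfulnessEvaluator` import is shown later in this guide), and `task_function` stands in for your own agent under test:

```python
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator  # import path assumed

# Minimal flow: define cases, wrap them in an Experiment, run evaluations
cases = [Case(name="smoke-test", input="What is 2 + 2?")]
experiment = Experiment(cases=cases, evaluators=[OutputEvaluator()])
reports = experiment.run_evaluations(task_function)  # task_function: your agent under test
```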
## Organizing Test Cases
### Using Metadata for Organization
```python
from strands_evals import Case
# Add metadata for filtering and organization
cases = [
    Case(
        name="easy-math",
        input="What is 2 + 2?",
        metadata={
            "category": "math",
            "difficulty": "easy",
            "tags": ["arithmetic"]
        }
    ),
    Case(
        name="hard-math",
        input="Solve x^2 + 5x + 6 = 0",
        metadata={
            "category": "math",
            "difficulty": "hard",
            "tags": ["algebra"]
        }
    )
]
# Filter by metadata
easy_cases = [c for c in cases if c.metadata.get("difficulty") == "easy"]
```
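Tags and categories work the same way. A minimal sketch for slicing by tag and grouping by category, using plain Python over the `metadata` dict (nothing library-specific):

```python
from collections import defaultdict

# Select cases carrying a specific tag
algebra_cases = [c for c in cases if "algebra" in c.metadata.get("tags", [])]

# Group cases by category for per-category runs or reporting
by_category = defaultdict(list)
for c in cases:
    by_category[c.metadata.get("category", "uncategorized")].append(c)
```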
### Naming Conventions

```python
# Pattern: {category}-{subcategory}-{number}
Case(name="knowledge-geography-001", input="..."),
Case(name="math-arithmetic-001", input="..."),
```
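If you generate cases programmatically, a small helper keeps names consistent with this pattern. This is a sketch; `make_case_name` is a hypothetical convenience, not a library function:

```python
def make_case_name(category: str, subcategory: str, number: int) -> str:
    """Hypothetical helper: format names as {category}-{subcategory}-{number}."""
    return f"{category}-{subcategory}-{number:03d}"

cases = [
    Case(name=make_case_name("math", "arithmetic", i + 1), input=question)
    for i, question in enumerate(["What is 2 + 2?", "What is 7 * 8?"])
]
```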
## Managing Multiple Experiments

### Experiment Collections
```python
from strands_evals import Experiment
experiments = {
    "baseline": Experiment(cases=baseline_cases, evaluators=[...]),
    "with_tools": Experiment(cases=tool_cases, evaluators=[...]),
    "edge_cases": Experiment(cases=edge_cases, evaluators=[...])
}
# Run all
for name, exp in experiments.items():
    print(f"Running {name}...")
    reports = exp.run_evaluations(task_function)
```
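To compare runs afterwards, it helps to keep the returned reports keyed by experiment name. A minimal sketch (what each report contains depends on the evaluators you configured):

```python
# Keep reports keyed by experiment name for side-by-side comparison
all_reports = {
    name: exp.run_evaluations(task_function)
    for name, exp in experiments.items()
}
```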
### Combining Experiments

```python
# Merge cases from multiple experiments
combined = Experiment(
    cases=exp1.cases + exp2.cases + exp3.cases,
    evaluators=[OutputEvaluator()]
)
```
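Plain concatenation keeps duplicates if the same case appears in more than one experiment. A sketch that deduplicates by case name (`merge_unique_cases` is a hypothetical helper, not part of the library):

```python
def merge_unique_cases(*experiments):
    """Hypothetical helper: merge cases, keeping the first occurrence of each name."""
    seen, merged = set(), []
    for exp in experiments:
        for case in exp.cases:
            if case.name not in seen:
                seen.add(case.name)
                merged.append(case)
    return merged

combined = Experiment(
    cases=merge_unique_cases(exp1, exp2, exp3),
    evaluators=[OutputEvaluator()]
)
```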
## Modifying Experiments

### Adding Cases
```python
# Add a single case
experiment.cases.append(new_case)
# Add multiple cases
experiment.cases.extend(additional_cases)
```
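Because `experiment.cases` behaves like a plain list here, removing cases is ordinary list filtering; for example:

```python
# Remove a case by name ("deprecated-case" is a placeholder)
experiment.cases = [c for c in experiment.cases if c.name != "deprecated-case"]
```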
### Updating Evaluators

```python
from strands_evals.evaluators import HelpfulnessEvaluator
# Replace evaluators
experiment.evaluators = [
    OutputEvaluator(),
    HelpfulnessEvaluator()
]
```
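Since `experiment.evaluators` is assigned as a list above, you can also extend it in place instead of replacing it wholesale:

```python
# Add an evaluator without discarding the existing ones
experiment.evaluators.append(HelpfulnessEvaluator())
```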
## Session IDs

Each case gets a unique session ID automatically:
```python
case = Case(input="test")
print(case.session_id)  # Auto-generated UUID
# Or provide a custom one
case = Case(input="test", session_id="custom-123")
```
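One pattern worth considering (a workflow suggestion, not a library requirement) is deriving session IDs from case names, so repeated runs of the same case are easy to correlate in downstream logs:

```python
# Stable, human-readable session IDs derived from case names
cases = [
    Case(name=name, input=prompt, session_id=f"eval-{name}")
    for name, prompt in [
        ("easy-math", "What is 2 + 2?"),
        ("hard-math", "Solve x^2 + 5x + 6 = 0"),
    ]
]
```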
## Best Practices

### 1. Use Descriptive Names
```python
# Good
Case(name="customer-service-refund-request", input="...")
# Less helpful
Case(name="test1", input="...")
```
### 2. Include Rich Metadata

```python
Case(
    name="complex-query",
    input="...",
    metadata={
        "category": "customer_service",
        "difficulty": "medium",
        "expected_tools": ["search_orders"],
        "created_date": "2025-01-15"
    }
)
```
### 3. Version Your Experiments

```python
experiment.to_file("experiment_v1.json")
experiment.to_file("experiment_v2.json")
# Or with timestamps
from datetime import datetime
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
experiment.to_file(f"experiment_{timestamp}.json")
```
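To track what changed between snapshots, you can pair `to_file` with a small manifest. A sketch using only the standard library; `save_versioned` and the file layout are hypothetical, not part of Strands Evals:

```python
import json
from datetime import datetime
from pathlib import Path

def save_versioned(experiment, note: str, directory: str = "experiments") -> str:
    """Hypothetical helper: snapshot an experiment and log a note in a manifest."""
    Path(directory).mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{directory}/experiment_{timestamp}.json"
    experiment.to_file(path)

    manifest = Path(directory) / "manifest.json"
    entries = json.loads(manifest.read_text()) if manifest.exists() else []
    entries.append({"file": path, "timestamp": timestamp, "note": note})
    manifest.write_text(json.dumps(entries, indent=2))
    return path
```

For loading snapshots back, see the Serialization guide linked below.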
## Related Documentation

- Serialization: Save and load experiments
- Experiment Generator: Generate experiments automatically
- Quickstart Guide: Get started with experiments