
Why Evaluate?

AI agents are non-deterministic. You can’t just run a test suite and call it done. Instead, you need systematic evaluation to:
  • Catch regressions when you change prompts, models, or tools
  • Measure improvements objectively when iterating on your agent
  • Identify failure patterns across different types of inputs
  • Build confidence before deploying to production
  • Monitor quality continuously as your agent serves real users
Evaluation is not a one-time task. It’s a continuous process that runs throughout development and in production.

The Evaluation Foundation: Datasets

Before you can evaluate, you need a dataset—a collection of test cases with inputs and (optionally) expected outputs.

Anatomy of a Dataset

Each example in your dataset should include:
  • Inputs: The question or request the user makes
  • Reference outputs (optional): The expected or ideal response
  • Metadata: Tags, categories, or context about the test case

Creating Datasets

You can create datasets in several ways:

From Real Traffic

Export actual user interactions from your traces. This ensures your evaluations reflect real-world usage.

Handcrafted Test Cases

Write specific scenarios to test edge cases, known failure modes, or requirements from your PRD.

Synthetic Generation

Use an LLM to generate diverse test cases based on your agent’s capabilities.

From Bugs

When users report issues, add them to your dataset to prevent regressions.
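Converting a bug report into a dataset example is mechanical. Here's a minimal sketch; the helper name is illustrative, and the field layout mirrors the inputs/reference/metadata anatomy described above:

```python
def bug_to_example(question: str, expected_behavior: str,
                   category: str, tools: list[str]) -> dict:
    """Turn a reported bug into a dataset example (illustrative helper;
    field names follow the inputs/reference/metadata anatomy above)."""
    return {
        "inputs": {"question": question},
        "outputs": {"reference_answer": expected_behavior},
        "metadata": {"category": category, "requires_tools": tools},
    }

# A user reported that the agent guessed at stock levels:
example = bug_to_example(
    "How many staplers are in stock?",
    "Should check database and report the actual count",
    "inventory_check",
    ["query_database"],
)
```

Each converted bug becomes a permanent regression test in your dataset.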

Example: OfficeFlow Dataset

The OfficeFlow agent uses a dataset with questions like:
{
  "question": "Do you have printer paper in stock?",
  "reference_answer": "Should check database and provide current inventory",
  "metadata": {
    "category": "inventory_check",
    "requires_tools": ["query_database"]
  }
}
{
  "question": "What's your return policy?",
  "reference_answer": "Should search knowledge base for return policy",
  "metadata": {
    "category": "policy_question",
    "requires_tools": ["search_knowledge_base"]
  }
}
This dataset covers different agent capabilities and exercises both tools (query_database and search_knowledge_base).

Running Experiments

An experiment runs your agent against a dataset and applies evaluators to measure performance.

Basic Experiment Structure

import asyncio
from pathlib import Path
from langsmith import aevaluate
from agent_v1 import chat, load_knowledge_base

async def chat_wrapper(inputs: dict) -> dict:
    """Wrapper to adapt dataset inputs to chat function signature."""
    question = inputs.get("question", "")
    result = await chat(question)
    return {"answer": result["output"], "messages": result["messages"]}

async def main():
    # Load any required resources (like knowledge base)
    await load_knowledge_base(kb_dir="./knowledge_base")
    
    # Run evaluation
    results = await aevaluate(
        chat_wrapper,
        data="officeflow-dataset",  # Dataset name in LangSmith
        experiment_prefix="agent-v1"
    )
    
    print(f"Evaluation complete! Results: {results}")

if __name__ == "__main__":
    asyncio.run(main())
This runs your agent on every example in the dataset and creates a new experiment in LangSmith.
Use experiment_prefix to organize experiments by version. This makes it easy to compare “agent-v1” vs “agent-v2” later.

Types of Evaluators

Evaluators are functions that score agent outputs. There are three main types:

1. Code-Based Evaluators

Deterministic evaluators that check specific conditions. These work like unit tests.

Example: Schema-Before-Query Evaluator

This evaluator checks that the agent inspects the database schema before running queries:
import re

SCHEMA_PATTERNS = [
    r"PRAGMA\s+table_info",
    r"SELECT\s+.*FROM\s+sqlite_master",
    r"PRAGMA\s+database_list",
]

def _is_schema_query(sql: str) -> bool:
    """Return True if the SQL is a schema-inspection query."""
    for pattern in SCHEMA_PATTERNS:
        if re.search(pattern, sql, re.IGNORECASE):
            return True
    return False

def _extract_tool_calls(run) -> list[dict]:
    """Collect tool invocations from the run. Assumes tool calls appear as
    child runs with run_type == "tool" in the LangSmith run tree."""
    return [
        {"name": child.name, "arguments": str(child.inputs)}
        for child in (getattr(run, "child_runs", None) or [])
        if child.run_type == "tool"
    ]

def schema_before_query(run, example) -> dict:
    """Score 1 if agent checks DB schema before querying data, 0 otherwise."""
    # Extract tool calls from the run
    tool_calls = _extract_tool_calls(run)
    db_calls = [tc for tc in tool_calls if tc["name"] == "query_database"]
    
    if not db_calls:
        return {"score": 1, "comment": "No database calls - N/A"}
    
    # Check if schema query appears before first data query
    seen_schema_check = False
    for tc in db_calls:
        sql = tc.get("arguments", "")
        if _is_schema_query(sql):
            seen_schema_check = True
        else:
            # First data query - was schema checked first?
            if not seen_schema_check:
                return {
                    "score": 0,
                    "comment": "Agent queried data without checking schema first."
                }
            break
    
    return {"score": 1, "comment": "Agent checked schema before querying"}
Use cases for code-based evaluators:
  • Checking tool usage patterns
  • Validating output format (JSON schema, specific fields)
  • Verifying security requirements (no PII in logs)
  • Measuring response length or token usage
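An output-format check from this list might look like the following sketch; `REQUIRED_FIELDS` and the output shape are assumptions, and the `run.outputs` access follows the same pattern as the evaluators above:

```python
import json

REQUIRED_FIELDS = {"answer"}  # assumed contract for the agent's output

def valid_json_output(run, example) -> dict:
    """Code-based evaluator (sketch): score 1 if the agent's output parses
    as a JSON object containing every required field."""
    raw = (run.outputs or {}).get("answer", "")
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return {"score": 0, "comment": "Output is not valid JSON"}
    if not isinstance(parsed, dict):
        return {"score": 0, "comment": "Output JSON is not an object"}
    missing = REQUIRED_FIELDS - set(parsed)
    if missing:
        return {"score": 0, "comment": f"Missing fields: {sorted(missing)}"}
    return {"score": 1, "comment": "Output matches expected format"}
```

Like all code-based evaluators, this runs instantly and deterministically, so it's cheap to apply on every experiment.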

2. LLM-as-Judge Evaluators

Use an LLM to evaluate subjective criteria that are hard to code.

Example: Correctness Evaluator

from openai import OpenAI

client = OpenAI()

CORRECTNESS_PROMPT = """You are evaluating an AI customer support agent's response.

User Question: {question}

Agent Response: {response}

Reference Context: {reference}

Is the agent's response correct and helpful? Consider:
- Does it answer the question accurately?
- Does it use appropriate tools if needed?
- Is the information factually correct?

Respond with a score from 0 to 1:
- 1.0: Completely correct and helpful
- 0.5: Partially correct but missing important details
- 0.0: Incorrect or unhelpful

Output only the numeric score."""

def evaluate_correctness(run, example) -> dict:
    """Use GPT to evaluate response correctness."""
    question = example.inputs.get("question", "")
    response = run.outputs.get("answer", "")
    reference = example.outputs.get("reference_answer", "N/A")
    
    result = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "You are an evaluator. Respond with only a number."},
            {"role": "user", "content": CORRECTNESS_PROMPT.format(
                question=question,
                response=response,
                reference=reference
            )}
        ]
    )
    
    score = float(result.choices[0].message.content.strip())
    
    return {
        "score": score,
        "comment": f"LLM judge scored {score} for correctness"
    }
Use cases for LLM-as-judge:
  • Evaluating helpfulness or tone
  • Checking semantic correctness when exact wording varies
  • Assessing whether the response follows specific guidelines
  • Measuring conciseness or clarity
Best practice: Use more powerful models (like GPT-4) as judges, even if your agent uses smaller models. The judge’s accuracy is critical for reliable evaluation.
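Judges also sometimes reply with extra text ("Score: 0.5") despite instructions, which breaks a bare `float(...)` parse. A defensive parser, as a sketch:

```python
import re

def parse_judge_score(raw: str, default: float = 0.0) -> float:
    """Extract a 0-1 score from a judge's reply, tolerating surrounding
    text like "Score: 0.5". Falls back to `default` when no number is
    found, and clamps out-of-range values into [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return default
    return max(0.0, min(1.0, float(match.group())))
```

Routing judge output through a parser like this keeps one malformed reply from crashing an entire experiment.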

3. Pairwise Evaluators

Compare two versions of your agent side-by-side to measure relative improvement.

Example: Conciseness Comparison

from openai import OpenAI
from langsmith import evaluate

client = OpenAI()

CONCISENESS_PROMPT = """You are evaluating two responses to the same customer question.
Determine which response is MORE CONCISE while still providing all crucial information.

**Question:** {question}

**Response A:**
{response_a}

**Response B:**
{response_b}

Output your verdict as a single number:
1 if Response A is more concise while preserving crucial information
2 if Response B is more concise while preserving crucial information
0 if they are roughly equal"""

def conciseness_evaluator(inputs: dict, outputs: list[dict]) -> list[int]:
    """Compare two responses for conciseness."""
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "You are a conciseness evaluator. Respond with only: 0, 1, or 2."},
            {"role": "user", "content": CONCISENESS_PROMPT.format(
                question=inputs["question"],
                response_a=outputs[0].get("answer", "N/A"),
                response_b=outputs[1].get("answer", "N/A"),
            )}
        ],
    )
    
    preference = int(response.choices[0].message.content.strip())
    
    if preference == 1:
        return [1, 0]  # A wins
    elif preference == 2:
        return [0, 1]  # B wins
    else:
        return [0, 0]  # Tie

# Run pairwise evaluation
evaluate(
    ("agent-v4-experiment", "agent-v5-experiment"),
    evaluators=[conciseness_evaluator],
    randomize_order=True,  # Prevent position bias
)
Use cases for pairwise evaluation:
  • A/B testing different prompts or models
  • Measuring incremental improvements
  • Evaluating subjective qualities where absolute scoring is hard
  • Detecting regressions when refactoring
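The per-example votes returned by a pairwise evaluator are easy to roll up yourself. A small aggregation sketch (the vote encoding matches `conciseness_evaluator` above):

```python
def win_rates(votes: list[list[int]]) -> tuple[float, float]:
    """Aggregate per-example pairwise votes ([1, 0] = A wins,
    [0, 1] = B wins, [0, 0] = tie) into (A win rate, B win rate)."""
    n = len(votes)
    if n == 0:
        return (0.0, 0.0)
    a_wins = sum(v[0] for v in votes)
    b_wins = sum(v[1] for v in votes)
    return (a_wins / n, b_wins / n)
```

Ties count against both win rates, so the two numbers need not sum to 1.0.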

Combining Evaluators

Real-world evaluation uses multiple evaluators to measure different aspects:
from langsmith import aevaluate

results = await aevaluate(
    chat_wrapper,
    data="officeflow-dataset",
    evaluators=[
        schema_before_query,      # Code-based: tool usage correctness
        evaluate_correctness,      # LLM-judge: response quality
        measure_latency,           # Code-based: performance
        check_hallucination,       # LLM-judge: factual accuracy
    ],
    experiment_prefix="agent-v2"
)
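`measure_latency` and `check_hallucination` above are stand-ins for evaluators you'd write yourself. A minimal latency evaluator might use the run's timestamps, as in this sketch (the 5-second budget is an assumption):

```python
LATENCY_THRESHOLD_SECONDS = 5.0  # assumed budget; tune for your agent

def measure_latency(run, example) -> dict:
    """Code-based evaluator (sketch): pass if the run finished within the
    latency budget, based on the run's start/end timestamps."""
    if run.start_time is None or run.end_time is None:
        return {"score": 0, "comment": "Missing timing data"}
    elapsed = (run.end_time - run.start_time).total_seconds()
    passed = elapsed <= LATENCY_THRESHOLD_SECONDS
    return {"score": 1 if passed else 0,
            "comment": f"Completed in {elapsed:.2f}s"}
```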
Each evaluator provides a different lens on agent performance:

Functional Correctness

Does the agent produce correct outputs and use tools properly?

Quality & Style

Is the response helpful, concise, and appropriate in tone?

Performance

Does the agent respond quickly enough for production use?

Safety

Does the agent avoid hallucinations, PII leaks, or harmful outputs?

Interpreting Results

After running an experiment, LangSmith shows:
  • Aggregate scores for each evaluator
  • Pass rate across the dataset
  • Individual run details with inputs, outputs, and scores
  • Comparison view when you select multiple experiments

What to Look For

  1. Overall trends: Is your new version improving or regressing?
  2. Failure patterns: Do certain types of questions consistently fail?
  3. Trade-offs: Did you improve accuracy but hurt latency?
  4. Edge cases: Which specific examples are still failing?
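A simple way to spot trends and regressions is to diff the mean evaluator scores between two experiments. An illustrative helper (the score dicts are assumed to map evaluator name to mean score, taken from your experiment summaries):

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Flag evaluators whose mean score dropped by more than `tolerance`
    between a baseline experiment and a candidate experiment."""
    return sorted(
        name for name, base in baseline.items()
        if name in candidate and candidate[name] < base - tolerance
    )
```

If the list is non-empty, you have a trade-off to investigate before shipping the new version.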

Iteration Loop

1. Run experiment on current agent version
2. Analyze results - identify failures
3. Hypothesize improvements (better prompt, different tool, etc.)
4. Implement changes
5. Run new experiment
6. Compare with previous version
7. Repeat until metrics meet your targets
Save every experiment. Even failed iterations provide valuable data about what doesn’t work. You’ll often revisit old experiments when debugging new issues.

Real-World Example: OfficeFlow Evolution

The course demonstrates iterative improvement through 6 versions:
  • v0: Baseline (no tracing)
  • v1: + LangSmith tracing
    Evaluation: Can now see what agent is doing
  • v2: + Enhanced tool instructions
    Evaluation: Tool usage improves from 60% to 85% accuracy
  • v3: + Stock information policy
    Evaluation: Reduces hallucinations about inventory
  • v4: + No-chunking RAG
    Evaluation: Knowledge base retrieval accuracy improves
  • v5: + Conciseness improvements
    Evaluation: Pairwise comparison shows 70% prefer v5 responses
Each version is validated with experiments before moving to the next iteration.

Best Practices

Start Simple

Begin with a small dataset (10-20 examples) and basic evaluators. Add complexity as you learn what matters.

Test Edge Cases

Include challenging examples: ambiguous questions, multi-step reasoning, tool failures, and adversarial inputs.

Version Your Datasets

As your agent evolves, your evaluation needs change. Create new datasets for new capabilities.

Automate Everything

Run evaluations automatically on every code change (CI/CD) to catch regressions immediately.
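A CI quality gate can be as simple as comparing mean evaluator scores against fixed thresholds and failing the build on any miss. A sketch, with assumed threshold values:

```python
import sys

QUALITY_THRESHOLDS = {  # assumed targets; set your own
    "correctness": 0.85,
    "schema_before_query": 0.95,
}

def ci_gate(mean_scores: dict[str, float],
            thresholds: dict[str, float] = QUALITY_THRESHOLDS) -> bool:
    """Return False (fail the build) if any evaluator's mean score is
    below its threshold. Wire this to your experiment results in CI."""
    failures = {name: mean_scores.get(name, 0.0)
                for name, floor in thresholds.items()
                if mean_scores.get(name, 0.0) < floor}
    if failures:
        print(f"Quality gate failed: {failures}", file=sys.stderr)
        return False
    return True
```

Exiting non-zero when `ci_gate` returns False blocks the merge until the regression is fixed.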

Balance Speed and Quality

Use fast evaluators (code-based) for rapid iteration, and slower ones (LLM-judge) for final validation.

Continuous Monitoring

In production, run evaluators on random samples of traffic to detect quality drift over time.

Common Pitfalls

Pitfall 1: Dataset Overfitting

If you optimize only for your test dataset, you might hurt generalization. Solution: Maintain separate datasets for development and final validation.

Pitfall 2: Flaky Evaluators

LLM-as-judge evaluators can be inconsistent. Solution: Run evaluations multiple times and look at variance. Use temperature=0 for judges to reduce randomness.
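To quantify flakiness, score the same example several times and check the spread. A sketch, where the `max_stdev` stability bar is an assumption:

```python
import statistics

def judge_stability(scores: list[float], max_stdev: float = 0.1) -> dict:
    """Given repeated judge scores for one example, report the mean,
    population standard deviation, and whether the judge is stable
    enough (spread within the assumed `max_stdev` bar)."""
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    return {"mean": mean, "stdev": stdev, "stable": stdev <= max_stdev}
```

A judge that fails this check on many examples needs a tighter prompt or a stronger model before its scores can be trusted.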

Pitfall 3: Ignoring Latency

Focusing only on quality can lead to slow agents. Solution: Always include performance evaluators alongside quality metrics.

Pitfall 4: Binary Thinking

Not every improvement is clear-cut. Sometimes v2 is better at X but worse at Y. Solution: Use multiple evaluators and make trade-offs explicit. Document why you chose one version over another.

From Evaluation to Production

Once you’re confident in your agent’s performance:
  1. Run final validation on a held-out dataset
  2. Set quality thresholds for continuous monitoring
  3. Configure online evaluators to score production traffic
  4. Set up alerts for when metrics drop below thresholds
  5. Schedule periodic re-evaluation to catch model drift

Next Steps

Creating Datasets

Learn techniques for building comprehensive test datasets

Writing Evaluators

Deep dive into building custom evaluators for your use case

Analyzing Results

Master the LangSmith UI for experiment comparison and debugging

Production Monitoring

Set up continuous evaluation for deployed agents
