The OfficeFlow agent uses a dataset with questions like:
```json
{"question": "Do you have printer paper in stock?", "reference_answer": "Should check database and provide current inventory", "metadata": {"category": "inventory_check", "requires_tools": ["query_database"]}}
{"question": "What's your return policy?", "reference_answer": "Should search knowledge base for return policy", "metadata": {"category": "policy_question", "requires_tools": ["search_knowledge_base"]}}
```
This dataset covers different agent capabilities and ensures both tools get exercised.
This evaluator checks that the agent inspects the database schema before running queries:
```python
import re

SCHEMA_PATTERNS = [
    r"PRAGMA\s+table_info",
    r"SELECT\s+.*FROM\s+sqlite_master",
    r"PRAGMA\s+database_list",
]

def _is_schema_query(sql: str) -> bool:
    """Return True if the SQL is a schema-inspection query."""
    for pattern in SCHEMA_PATTERNS:
        if re.search(pattern, sql, re.IGNORECASE):
            return True
    return False

def schema_before_query(run, example) -> dict:
    """Score 1 if the agent checks the DB schema before querying data, 0 otherwise."""
    # Extract tool calls from the run
    # (_extract_tool_calls is a helper, defined elsewhere, that pulls tool calls out of the run)
    tool_calls = _extract_tool_calls(run)
    db_calls = [tc for tc in tool_calls if tc["name"] == "query_database"]
    if not db_calls:
        return {"score": 1, "comment": "No database calls - N/A"}

    # Check whether a schema query appears before the first data query
    seen_schema_check = False
    for tc in db_calls:
        sql = tc.get("arguments", "")
        if _is_schema_query(sql):
            seen_schema_check = True
        else:
            # First data query - was the schema checked first?
            if not seen_schema_check:
                return {
                    "score": 0,
                    "comment": "Agent queried data without checking schema first.",
                }
            break
    return {"score": 1, "comment": "Agent checked schema before querying"}
```
Use cases for code-based evaluators:
Checking tool usage patterns
Validating output format (JSON schema, specific fields)
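As a sketch of the second use case, a code-based evaluator that checks whether the agent's output is valid JSON containing required fields — the field names in `REQUIRED_FIELDS` are illustrative, not part of OfficeFlow:

```python
import json

REQUIRED_FIELDS = {"answer", "sources"}  # illustrative schema

def json_format_evaluator(run_output: str) -> dict:
    """Score 1 if the output parses as JSON and contains all required fields."""
    try:
        parsed = json.loads(run_output)
    except json.JSONDecodeError:
        return {"score": 0, "comment": "Output is not valid JSON"}
    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        return {"score": 0, "comment": f"Missing fields: {sorted(missing)}"}
    return {"score": 1, "comment": "Output matches expected format"}
```

Because this check is deterministic, it costs nothing to run on every example and never disagrees with itself — the main advantage code-based evaluators have over LLM judges.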
```python
from openai import OpenAI

client = OpenAI()

CORRECTNESS_PROMPT = """You are evaluating an AI customer support agent's response.

User Question: {question}
Agent Response: {response}
Reference Context: {reference}

Is the agent's response correct and helpful? Consider:
- Does it answer the question accurately?
- Does it use appropriate tools if needed?
- Is the information factually correct?

Respond with a score from 0 to 1:
- 1.0: Completely correct and helpful
- 0.5: Partially correct but missing important details
- 0.0: Incorrect or unhelpful

Output only the numeric score."""

def evaluate_correctness(run, example) -> dict:
    """Use an LLM judge to evaluate response correctness."""
    question = example.inputs.get("question", "")
    response = run.outputs.get("answer", "")
    reference = example.outputs.get("reference_answer", "N/A")

    result = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "You are an evaluator. Respond with only a number."},
            {"role": "user", "content": CORRECTNESS_PROMPT.format(
                question=question, response=response, reference=reference
            )},
        ],
    )
    score = float(result.choices[0].message.content.strip())
    return {
        "score": score,
        "comment": f"LLM judge scored {score} for correctness",
    }
```
Use cases for LLM-as-judge:
Evaluating helpfulness or tone
Checking semantic correctness when exact wording varies
Assessing whether the response follows specific guidelines
Measuring conciseness or clarity
Best practice: Use a more powerful model as the judge, even if your agent runs on a smaller one. The judge's accuracy is critical for reliable evaluation, and judging is far cheaper than serving production traffic.
```python
from openai import OpenAI
from langsmith import evaluate

client = OpenAI()

CONCISENESS_PROMPT = """You are evaluating two responses to the same customer question.
Determine which response is MORE CONCISE while still providing all crucial information.

**Question:** {question}

**Response A:**
{response_a}

**Response B:**
{response_b}

Output your verdict as a single number:
1 if Response A is more concise while preserving crucial information
2 if Response B is more concise while preserving crucial information
0 if they are roughly equal"""

def conciseness_evaluator(inputs: dict, outputs: list[dict]) -> list[int]:
    """Compare two responses for conciseness."""
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "You are a conciseness evaluator. Respond with only: 0, 1, or 2."},
            {"role": "user", "content": CONCISENESS_PROMPT.format(
                question=inputs["question"],
                response_a=outputs[0].get("answer", "N/A"),
                response_b=outputs[1].get("answer", "N/A"),
            )},
        ],
    )
    preference = int(response.choices[0].message.content.strip())
    if preference == 1:
        return [1, 0]  # A wins
    elif preference == 2:
        return [0, 1]  # B wins
    else:
        return [0, 0]  # Tie

# Run the pairwise evaluation
evaluate(
    ("agent-v4-experiment", "agent-v5-experiment"),
    evaluators=[conciseness_evaluator],
    randomize_order=True,  # Prevent position bias
)
```
Use cases for pairwise evaluation:
A/B testing different prompts or models
Measuring incremental improvements
Evaluating subjective qualities where absolute scoring is hard
1. Run an experiment on the current agent version
2. Analyze the results and identify failures
3. Hypothesize improvements (better prompt, different tool, etc.)
4. Implement the changes
5. Run a new experiment
6. Compare with the previous version
7. Repeat until metrics meet your targets
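Step 6 above can be sketched as a simple per-metric diff between two experiments; here each experiment's results are assumed to already be summarized as a dict of evaluator names to mean scores:

```python
def compare_experiments(baseline: dict[str, float],
                        candidate: dict[str, float]) -> dict[str, float]:
    """Return the per-metric score delta (candidate minus baseline)."""
    return {
        metric: round(candidate.get(metric, 0.0) - baseline.get(metric, 0.0), 3)
        for metric in sorted(set(baseline) | set(candidate))
    }
```

For example, comparing `{"correctness": 0.72, "schema_before_query": 0.90}` against `{"correctness": 0.81, "schema_before_query": 0.85}` yields `{"correctness": 0.09, "schema_before_query": -0.05}` — a gain on one metric and a regression on the other, which is exactly the trade-off situation discussed below.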
Save every experiment. Even failed iterations provide valuable data about what doesn’t work. You’ll often revisit old experiments when debugging new issues.
LLM-as-judge evaluators can be inconsistent. Solution: Run each evaluation multiple times and examine the variance, and set temperature=0 for judge models to reduce randomness.
Not every improvement is clear-cut. Sometimes v2 is better at X but worse at Y. Solution: Use multiple evaluators and make the trade-offs explicit. Document why you chose one version over another.
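One way to make those trade-offs explicit is a weighted composite of the individual evaluator scores, where the weights themselves document your priorities — the metric names and weights below are illustrative, not prescribed:

```python
WEIGHTS = {"correctness": 0.6, "conciseness": 0.2, "tool_usage": 0.2}  # illustrative priorities

def composite_score(scores: dict[str, float],
                    weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-evaluator scores into one weighted number for version comparison."""
    total = sum(weights.values())
    return round(sum(scores[m] * w for m, w in weights.items()) / total, 3)
```

The composite makes the decision reproducible: if v5 wins on the weighted score, the weights record exactly why a regression on a low-priority metric was acceptable.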