Evals for Taste: How to Measure and Hill-Climb Your Agent
Without evals you're flying blind — reactive to complaints, unable to verify improvements. This lesson shows how to build code and model graders, run QA loops, and turn subjective quality into something you can act on.
This lesson is original educational writing based on this video by Claude (published May 23, 2026). All credit for the original content goes to the creators.
1. The reactive loop
Without evals, every quality problem looks the same: a customer reports something feels off, you look at logs, you try to reproduce it, you make a change, and you hope. The next problem comes the same way.
This workshop — built around a slide-generation agent — makes the stakes concrete. You can watch a slide deck get measurably better across three iterations, each one driven by what the previous round of evals revealed. By the end, you have a repeatable loop instead of vibes.
The argument for evals is four-fold:
- Clarity: Building an eval forces you to define what “good” actually means. If you can’t articulate success criteria, you can’t reliably produce success.
- Verification: You can confirm that changes made things better rather than just different. Without evals, prompt tweaks are a game of hope.
- Model adoption: When a new model drops, you can score it against your benchmark in hours rather than weeks of intuition-building.
- Pre-launch visibility: Catch the problems before your users do.
2. Three types of graders
Not all quality dimensions can be measured the same way.
Code graders
A code grader is a deterministic check: count emoji, verify slide count, assert a file exists. It is fast and cheap, and it can run thousands of times at negligible cost.
In the slide-generation demo:
- emoji_count: count emoji characters in the deck
- slide_count: assert exactly 5 slides produced
- cluttered_slides: count shapes per slide; flag if over threshold
- small_font_slides: check min font size against a threshold
These catch objective failures. They can’t tell you whether the slide looks good.
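The four checks above can be sketched in a few lines. This is a minimal illustration, not the workshop's code: the deck shape (a list of dicts with `texts`, `shape_count`, and `min_font_pt`), the thresholds of 8 shapes and 14 pt, and the rough emoji ranges are all assumptions made for the example.

```python
import re

# Assumed deck shape: a list of slides, each a dict with "texts" (list of str),
# "shape_count" (int) and "min_font_pt" (number). A real deck needs a parser.
# Rough emoji ranges only; this is not a complete Unicode emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_count(deck):
    """Count emoji characters across all text on all slides."""
    return sum(len(EMOJI_RE.findall(t)) for s in deck for t in s["texts"])


def slide_count_ok(deck, expected=5):
    """True when the deck has exactly the expected number of slides."""
    return len(deck) == expected


def cluttered_slides(deck, max_shapes=8):
    """Number of slides with more shapes than the (assumed) threshold."""
    return sum(1 for s in deck if s["shape_count"] > max_shapes)


def small_font_slides(deck, min_pt=14):
    """Number of slides whose smallest font is under the (assumed) threshold."""
    return sum(1 for s in deck if s["min_font_pt"] < min_pt)


def run_code_graders(deck):
    """Run every deterministic check and return one flat result dict."""
    return {
        "emoji_count": emoji_count(deck),
        "slide_count_ok": slide_count_ok(deck),
        "cluttered_slides": cluttered_slides(deck),
        "small_font_slides": small_font_slides(deck),
    }
```

Returning one flat dict per deck keeps the results easy to log and compare across iterations.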
Model graders
A model grader sends the output to Claude and asks for a scored evaluation according to a rubric. In the workshop:
- color_judge: rates color contrast 0–5 with reasoning
- layout_judge: rates slide density and breathing room 0–5
- text_judge: evaluates readability and conciseness
Model graders are where calibration matters enormously. The workshop reveals a common trap: graders that score everything 4–5 because the rubric doesn’t anchor what “bad” looks like. A rubric that says “text should be concise” gives the model nothing to compare against. A rubric that says “a score of 1 means walls of text with no whitespace; a score of 5 means each point is one sentence with clear hierarchy” produces useful signal.
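One way to make anchoring hard to forget is to build the rubric prompt from a function that refuses to run without a description for the lowest and highest score. The sketch below is an assumption about how you might structure that; `build_rubric_prompt` is a hypothetical helper, not part of the workshop or of any SDK.

```python
def build_rubric_prompt(dimension, anchors, scale=(1, 5)):
    """Build a grader prompt that describes what specific scores look like.

    `anchors` maps a score to a description of an output earning that score.
    The lowest and highest scores must both be anchored.
    """
    lo, hi = scale
    missing = {lo, hi} - set(anchors)
    if missing:
        raise ValueError(f"rubric needs anchors for scores {sorted(missing)}")
    lines = [f"Score the output on {dimension} from {lo} to {hi}.", "ANCHORS:"]
    for score in sorted(anchors):
        lines.append(f"- {score}: {anchors[score]}")
    # Reasoning before the score, matching the output format used in this lesson.
    lines.append("Give your reasoning first, then the score.")
    return "\n".join(lines)


prompt = build_rubric_prompt(
    "text conciseness",
    {
        1: "walls of text with no whitespace",
        5: "each point is one sentence with clear hierarchy",
    },
)
```

The resulting string can be used as the grader's system prompt. Adding a mid-scale anchor (for example a description of a 3) is optional here but often helps when scores cluster at the top.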
Pairwise comparison
An underrated technique for cases where you can’t define absolute quality: show the grader two outputs and ask which is better and why. This sidesteps the calibration problem entirely — the model doesn’t need to know what a “4” means, it just needs to prefer one over the other.
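A pairwise grader can be sketched as a win-rate calculation over matched outputs. In this sketch `judge` is a hypothetical callable that would wrap a model call and return "first" or "second"; shuffling which output is shown first is a common precaution against a judge favoring one position, and is an addition here rather than something the workshop prescribes.

```python
import random


def pairwise_winrate(pairs, judge, seed=0):
    """Fraction of comparisons won by the candidate output.

    `pairs` is a list of (baseline_output, candidate_output) tuples.
    `judge(first, second)` is a hypothetical callable returning "first" or
    "second"; in practice it would wrap a model call.
    """
    if not pairs:
        raise ValueError("need at least one pair to compare")
    rng = random.Random(seed)
    wins = 0
    for baseline, candidate in pairs:
        # Randomize presentation order so position alone cannot decide the winner.
        if rng.random() < 0.5:
            candidate_won = judge(baseline, candidate) == "second"
        else:
            candidate_won = judge(candidate, baseline) == "first"
        wins += candidate_won
    return wins / len(pairs)
```

A win rate near 0.5 means the judge sees no consistent difference between the two versions; a rate well above it suggests the candidate is a real improvement on this test set.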
Check your understanding
1. Why does the workshop reverse the order of model grader output — reasons first, then score?
2. When should you use pairwise comparison instead of a scored rubric?
3. In a QA loop, what specific instruction makes the critic agent more effective?
3. The hill-climbing loop
The workshop starts from a baseline and runs through three iterations of a slide-generation agent, each one driven by what the previous round revealed:
Iteration 0 — Baseline (vanilla Sonnet prompt):
- Emoji all over the deck (emoji_count: 4)
- Many slides with tiny text (small_font_slides: 4)
- Overlapping text boxes
- Inconsistent colors
Iteration 1 — Prompt fix (add typography rules): The updated system prompt specifies font sizes for title, body, and caption text, adds density rules, and tells the model to avoid “AI-generated tells” such as emoji icons and thin accent lines.
Result: cleaner layout and consistent sizing. But emoji_count jumped from 4 to 20, suggesting the model had found a different surface for emoji (possibly metadata fields). This is the calibration catch: the grader flagged something, and human review is needed to understand what.
Iteration 2 — QA loop: Added a second agent whose sole job is to criticize the slide deck after creation. Prompt: “Assume there are problems. Approach QA as a bug hunt, not a confirmation step. Inspect every slide image. Fix issues, re-render, re-inspect. Do not stop until at least one fix-and-verify cycle completes.”
Result: all judge scores moved into the 4.2–4.4 range. Slides became noticeably more readable with better image sizing and clearer structure.
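The "fix, re-render, re-inspect" instruction in that prompt amounts to a loop that always ends on an inspection rather than on a fix. A minimal sketch of that control flow, assuming two hypothetical callables (`critic` and `fixer`) that would wrap model calls and a re-render step in a real agent:

```python
def qa_until_clean(draft, critic, fixer, max_rounds=3):
    """Run critique -> fix -> re-inspect cycles.

    `critic(draft)` returns a list of issue strings (empty when it finds none).
    `fixer(draft, issues)` returns a revised draft.
    Both are hypothetical callables standing in for model calls.
    Returns (final_draft, remaining_issues, fix_rounds_used).
    """
    issues = critic(draft)
    rounds = 0
    while issues and rounds < max_rounds:
        draft = fixer(draft, issues)
        rounds += 1
        # Every fix is followed by a fresh inspection, so no fix goes unverified.
        issues = critic(draft)
    return draft, issues, rounds
```

Returning the remaining issues makes it visible when the loop hit `max_rounds` without converging, which is itself a signal worth logging.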
Iteration 3 — Model upgrade: Switched to Opus 4.7 with the same simple baseline prompt (no typography rules, no QA loop). Opus scored higher than the Sonnet+QA-loop combination on layout and color, and produced zero emoji by default — it has implicit taste about what belongs in a professional deck.
The lesson: evals help you decide when to engineer prompts versus when to upgrade models. Without concrete scores, you can’t make that call reliably.
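Making that call is easier when each iteration's scores are diffed against the previous one automatically. The sketch below flags metrics that moved in the wrong direction. The direction table and the function name are assumptions for illustration; the emoji_count values 4 and 20 come from the workshop, while the second small_font_slides value is a placeholder.

```python
# Assumed direction per metric: True means higher is better.
HIGHER_IS_BETTER = {
    "emoji_count": False,
    "small_font_slides": False,
    "layout_judge": True,
    "color_judge": True,
}


def find_regressions(previous, current, higher_is_better=HIGHER_IS_BETTER):
    """Return {metric: (old, new)} for metrics that got worse between runs."""
    regressions = {}
    for metric, new in current.items():
        if metric not in previous or metric not in higher_is_better:
            continue  # nothing to compare against, or direction unknown
        old = previous[metric]
        worse = new < old if higher_is_better[metric] else new > old
        if worse:
            regressions[metric] = (old, new)
    return regressions


baseline = {"emoji_count": 4, "small_font_slides": 4}
# emoji_count=20 is from the workshop; small_font_slides=0 is a placeholder.
iteration_1 = {"emoji_count": 20, "small_font_slides": 0}
print(find_regressions(baseline, iteration_1))  # {'emoji_count': (4, 20)}
```

A flagged regression is a prompt for human review, not a verdict: as the emoji case shows, the grader may be counting something unexpected.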
4. Evals are living artifacts
One of the most honest moments in the workshop: after running iteration 1, the emoji_count grader reports 20, but the presenter can’t find the emoji in the slides. The grader was counting something the humans weren’t measuring.
This happens. Graders need the same iterative treatment as the agents they evaluate:
- A grader that’s always high means it’s not discriminating — recalibrate or replace it
- A grader counting the wrong thing means the definition needs to be tightened
- “Saturation” means a dimension is solved — drop it and add a new one
Don’t build evals once and treat them as ground truth forever. The eval suite should evolve alongside the agent.
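The first and third cases in that list can be checked mechanically from a grader's score history. This is a rough sketch with assumed thresholds (`min_spread=0.5`, "high" meaning within one point of the maximum); telling a solved dimension from a miscalibrated grader still takes a human look at the outputs.

```python
from statistics import mean, pstdev


def grader_health(scores, max_score=5, min_spread=0.5):
    """Classify a model grader from its scores across a test set.

    Thresholds are assumptions for illustration, not workshop values.
    """
    if not scores:
        raise ValueError("need at least one score")
    if all(s == max_score for s in scores):
        # Every output gets top marks: the dimension may be solved.
        return "saturated"
    if pstdev(scores) < min_spread and mean(scores) >= max_score - 1:
        # Consistently high with almost no spread: likely not discriminating.
        return "not discriminating"
    return "ok"
```

Running this over each grader after every eval round gives an early warning that a rubric needs a low-score anchor or that a dimension can be retired.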
Build it yourself
Follow these exact steps to reproduce it yourself · estimated time: ~60 minutes
Prerequisites
- Python 3.10+ and an Anthropic API key
- A task where your agent produces an artifact you can score (code, a document, a data extraction, a response)
Build a minimal eval harness for your agent: one code grader, one model grader, and one QA loop.
Step 1 — Define your task and collect 5 test inputs
Pick something concrete. A support-ticket classifier, a data-extraction agent, a code generator. Gather 5 representative inputs — include at least one edge case.
TEST_INPUTS = [
    "App crashed during checkout, third time this week",
    "How do I change my email address?",
    "The dark mode toggle doesn't persist after refresh",
    "Charged twice for the same order",
    "Feature request: add export to CSV",
]
Step 2 — Write a code grader
Start with something deterministic that catches objective failures:
import json


def code_grade(output: str) -> dict:
    """Deterministic checks on agent output."""
    return {
        # Did the agent produce a JSON response?
        "is_json": _is_valid_json(output),
        # Does the response mention the required fields? (substring check, not a schema check)
        "has_severity": '"severity"' in output,
        "has_summary": '"summary"' in output,
        # Is it a reasonable length? (not a hallucinated essay)
        "reasonable_length": 50 < len(output) < 2000,
    }


def _is_valid_json(s: str) -> bool:
    """Return True if the whole string parses as JSON."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False
Step 3 — Write a model grader
import json
import re

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GRADER_SYSTEM = """You are a quality evaluator for a support ticket classifier.
Score the following output on two dimensions.

DIMENSIONS:
- accuracy: Does the severity classification match the described problem? (1=wrong, 5=exactly right)
- helpfulness: Is the summary useful for a support agent picking this up cold? (1=useless, 5=immediately actionable)

EXAMPLES OF LOW SCORES:
- accuracy=1: "P3 cosmetic" for a checkout crash that blocks purchases
- helpfulness=1: "User has a problem" with no detail

EXAMPLES OF HIGH SCORES:
- accuracy=5: "P1 blocks checkout" for a crash during purchase flow
- helpfulness=5: "Recurring crash (3rd time this week) during checkout. Likely P1."

OUTPUT FORMAT: Give your reasoning first, then output JSON: {"accuracy": N, "helpfulness": N}"""


def model_grade(ticket: str, output: str) -> dict:
    """Ask the grader model to score one agent output against the rubric."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,  # room for the reasoning plus the JSON scores
        system=GRADER_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"TICKET: {ticket}\n\nAGENT OUTPUT: {output}"
        }],
    )
    text = response.content[0].text
    # The grader reasons first, so take the last flat {...} object in the reply.
    # A greedy \{.*\} would also swallow any braces quoted in the reasoning.
    matches = re.findall(r"\{[^{}]*\}", text)
    try:
        return json.loads(matches[-1])
    except (IndexError, json.JSONDecodeError):
        # 0 marks a parse failure; it is not a real score.
        return {"accuracy": 0, "helpfulness": 0}
Step 4 — Run your baseline
def run_agent(ticket: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system="Classify this support ticket. Output JSON with: severity (P1/P2/P3), summary.",
        messages=[{"role": "user", "content": ticket}],
    )
    return response.content[0].text


results = []
for ticket in TEST_INPUTS:
    output = run_agent(ticket)
    code = code_grade(output)
    model = model_grade(ticket, output)
    results.append({"ticket": ticket, "output": output, "code": code, "model": model})
    print(f"accuracy={model['accuracy']}, helpfulness={model['helpfulness']} | {ticket[:40]}")
Step 5 — Add a QA loop
QA_SYSTEM = """You are a QA critic for a support ticket classifier.
There are problems with the output below. Your job is to find them.
Check: Is severity too high? Too low? Is the summary missing context?
Does it misread the urgency? Output your critique, then a corrected version."""


def qa_loop(ticket: str, initial_output: str) -> str:
    """Critique the draft with a second model call, then ask the agent to revise."""
    critique = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=QA_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"TICKET: {ticket}\n\nDRAFT OUTPUT: {initial_output}"
        }],
    ).content[0].text
    # Feed the critique back to the agent, using the same system prompt as the baseline
    final = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system="Classify this support ticket. Output JSON with: severity (P1/P2/P3), summary.",
        messages=[
            {"role": "user", "content": ticket},
            {"role": "assistant", "content": initial_output},
            {"role": "user", "content": f"QA review found issues:\n{critique}\n\nRevise your output."},
        ],
    ).content[0].text
    return final
Step 6 — Compare baseline vs QA loop
Run both approaches on the same inputs and compare average scores. If the QA loop consistently improves model_grade scores, keep it. If it’s inconsistent, the critic prompt needs tightening.
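"Consistently improves" is easier to judge per input than from averages alone. The sketch below assumes you have two lists of model_grade results in the same ticket order, one for the baseline and one for the QA loop; `compare_runs` is a hypothetical helper name.

```python
def compare_runs(baseline, with_qa, dimensions=("accuracy", "helpfulness")):
    """Per-dimension mean delta plus counts of inputs that improved or regressed.

    Both arguments are lists of score dicts (as returned by model_grade),
    aligned so that index i refers to the same ticket in each list.
    """
    if not baseline or len(baseline) != len(with_qa):
        raise ValueError("runs must be non-empty and the same length")
    report = {}
    for dim in dimensions:
        deltas = [q[dim] - b[dim] for b, q in zip(baseline, with_qa)]
        report[dim] = {
            "mean_delta": sum(deltas) / len(deltas),
            "improved": sum(d > 0 for d in deltas),
            "regressed": sum(d < 0 for d in deltas),
        }
    return report
```

A positive mean delta with several regressed inputs is the "inconsistent" case: the critic is helping some tickets and hurting others. With only 5 test inputs, treat small deltas as noise and rerun before drawing conclusions.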
What to iterate on
- Grader calibration: If model_grade is always 4–5, add a low-quality anchor example to the grader system prompt
- Code grader brittleness: If is_json fails because the agent wraps JSON in markdown, update the code grader or the agent’s output instructions
- QA loop over-correction: If the critic makes things worse, constrain the QA system prompt to specific failure modes
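For the brittleness case, one option is to make the parser tolerant of a markdown code fence around the JSON before deciding the output is invalid. A minimal sketch, assuming the agent wraps its whole reply in a single fence; `parse_agent_json` is a hypothetical helper, and the fence string is built with `"`" * 3` only to keep literal backtick runs out of the example.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence marker
FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_agent_json(output):
    """Parse agent output as JSON, tolerating one surrounding markdown fence.

    Returns the parsed value, or None when the text is not valid JSON.
    """
    text = output.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1)  # keep only what is inside the fence
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

The is_json check can then become `parse_agent_json(output) is not None`. The alternative is to tighten the agent's output instructions and keep the grader strict, which is the better choice when downstream code also expects bare JSON.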
Where to go next
- Watch the original workshop to see the slide-generation agent and eval scores evolve in real time
- Building Effective Agents — the foundational patterns that your evals should cover
- Anthropic’s model evaluation guide covers benchmark methodology for API consumers