Evals for Taste: How to Measure and Hill-Climb Your Agent

1. The reactive loop

Without evals, every quality problem looks the same: a customer reports something feels off, you look at logs, you try to reproduce it, you make a change, and you hope. The next problem comes the same way.

This workshop — built around a slide-generation agent — makes the stakes concrete. You can watch a slide deck get measurably better across three iterations, each one driven by what the previous round of evals revealed. By the end, you have a repeatable loop instead of vibes.

The argument for evals is four-fold:

Clarity: Building an eval forces you to define what “good” actually means. If you can’t articulate success criteria, you can’t reliably produce success.
Verification: You can confirm that changes made things better rather than just different. Without evals, prompt tweaks are a game of hope.
Model adoption: When a new model drops, you can score it against your benchmark in hours rather than weeks of intuition-building.
Pre-launch visibility: Catch the problems before your users do.

2. Three types of graders

Not all quality dimensions can be measured the same way.

The grader spectrum. Deterministic code graders are fast and cheap but brittle; model graders handle nuance but require calibration; human graders are ground truth but expensive. Most production eval suites use all three.

Code graders

A code grader is a deterministic check: count emoji, verify slide count, assert a file exists. It is fast and cheap, and it can run thousands of times at negligible cost.

In the slide-generation demo:

emoji_count — count emoji characters in the deck
slide_count — assert exactly 5 slides produced
cluttered_slides — count shapes per slide; flag if over threshold
small_font_slides — check min font size against a threshold

These catch objective failures. They can’t tell you whether the slide looks good.
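As a rough illustration, graders like these might look as follows. This is a minimal sketch, not the workshop's implementation: it assumes the deck has already been parsed into a list of slides, each a dict with a `shapes` list whose entries carry `text` and `font_pt`. That structure, the thresholds, and the emoji ranges are all assumptions made for the example.

```python
import re

# Approximate emoji ranges (an assumption; not a complete Unicode emoji list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_count(deck):
    """Count emoji characters across all shape text in the deck."""
    return sum(
        len(EMOJI_RE.findall(shape.get("text", "")))
        for slide in deck
        for shape in slide["shapes"]
    )


def slide_count_ok(deck, expected=5):
    """True if the deck has exactly the expected number of slides."""
    return len(deck) == expected


def cluttered_slides(deck, max_shapes=6):
    """Number of slides with more shapes than the (hypothetical) threshold."""
    return sum(1 for slide in deck if len(slide["shapes"]) > max_shapes)


def small_font_slides(deck, min_pt=14):
    """Number of slides whose smallest font is below the threshold."""
    count = 0
    for slide in deck:
        sizes = [s["font_pt"] for s in slide["shapes"] if s.get("font_pt") is not None]
        if sizes and min(sizes) < min_pt:
            count += 1
    return count
```

Each function returns a plain number or boolean, so results from successive runs can be logged and compared directly.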

Model graders

A model grader sends the output to Claude and asks for a scored evaluation according to a rubric. In the workshop:

color_judge — rates color contrast 0–5 with reasoning
layout_judge — rates slide density and breathing room 0–5
text_judge — evaluates readability and conciseness

Model graders are where calibration matters enormously. The workshop reveals a common trap: graders that score everything 4–5 because the rubric doesn’t anchor what “bad” looks like. A rubric that says “text should be concise” gives the model nothing to compare against. A rubric that says “a score of 1 means walls of text with no whitespace; a score of 5 means each point is one sentence with clear hierarchy” produces useful signal.

Pairwise comparison

An underrated technique for cases where you can’t define absolute quality: show the grader two outputs and ask which is better and why. This sidesteps the calibration problem entirely — the model doesn’t need to know what a “4” means, it just needs to prefer one over the other.
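A pairwise grader can be sketched in a few lines. The version below is illustrative only: the prompt wording, the `WINNER:` line convention, and the `judge` callable (any function that takes a prompt string and returns the model's reply) are assumptions, not the workshop's code.

```python
from typing import Callable, Optional, Sequence

PAIRWISE_PROMPT = """Compare two outputs for the same task.

OUTPUT A:
{a}

OUTPUT B:
{b}

Explain which is better and why, then end with a final line:
WINNER: A or WINNER: B"""


def parse_winner(reply: str) -> Optional[str]:
    """Return 'A' or 'B' from the last WINNER line, or None if there is none."""
    for line in reversed(reply.strip().splitlines()):
        line = line.strip().upper()
        if line.startswith("WINNER:"):
            verdict = line.split(":", 1)[1].strip()
            return verdict if verdict in ("A", "B") else None
    return None


def win_rate(
    candidates: Sequence[str],
    baselines: Sequence[str],
    judge: Callable[[str], str],
) -> float:
    """Fraction of decided pairs in which the candidate (shown as A) won."""
    wins = decided = 0
    for cand, base in zip(candidates, baselines):
        verdict = parse_winner(judge(PAIRWISE_PROMPT.format(a=cand, b=base)))
        if verdict is None:
            continue  # skip replies with no usable verdict
        decided += 1
        wins += verdict == "A"
    return wins / decided if decided else 0.0
```

In practice `judge` would wrap a model call. Judges may favor whichever output appears first, so it can be worth running each pair in both orders and checking that the verdicts agree.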

Check your understanding


1. Why does the workshop reverse the order of model grader output (reasons first, then score)?

If a model outputs "4" first, it then generates reasons that support a 4, even for a 1-quality output. Outputting pros and cons first forces genuine evaluation before the score is anchored.

2. When should you use pairwise comparison instead of a scored rubric?

Pairwise comparison bypasses calibration entirely: you don't need to define what a "3" means. It's ideal for subjective quality dimensions like "does this slide look professional?"

3. In a QA loop, what specific instruction makes the critic agent more effective?

Instructing the critic "there are problems, and your job is to find them" shifts its default behavior from confirming acceptability to adversarial search. The phrasing matters more than model choice.

3. The hill-climbing loop

The workshop runs through a baseline and three iterations of a slide-generation agent, each one driven by what the previous round revealed:

Iteration 0 — Baseline (vanilla Sonnet prompt):

Emoji all over the deck (emoji_count: 4)
Many slides with tiny text (small_font_slides: 4)
Overlapping text boxes
Inconsistent colors

Iteration 1 — Prompt fix (add typography rules): Updated system prompt: specific font sizes for title/body/caption, density rules, instruction to avoid “AI-generated tells” like emoji icons and thin accent lines.

Result: cleaner layout, consistent sizing. But emoji_count jumped to 20, suggesting the model found a different surface for emoji (possibly in metadata fields). This is the calibration catch: the grader flagged something, but human review is needed to understand what.

Iteration 2 — QA loop: Added a second agent whose sole job is to criticize the slide deck after creation. Prompt: “Assume there are problems. Approach QA as a bug hunt, not a confirmation step. Inspect every slide image. Fix issues, re-render, re-inspect. Do not stop until at least one fix-and-verify cycle completes.”

Result: all judge scores moved into the 4.2–4.4 range. Slides became noticeably more readable with better image sizing and clearer structure.

Iteration 3 — Model upgrade: Switched to Opus 4.7 with the same simple baseline prompt (no typography rules, no QA loop). Opus scored higher than the Sonnet+QA-loop combination on layout and color, and produced zero emoji by default — it has implicit taste about what belongs in a professional deck.

The lesson: evals help you decide when to engineer prompts versus when to upgrade models. Without concrete scores, you can’t make that call reliably.

4. Evals are living artifacts

One of the most honest moments in the workshop: after running iteration 1, the emoji_count grader reports 20, but the presenter can’t find the emoji in the slides. The grader was counting something the humans weren’t measuring.

This happens. Graders need the same iterative treatment as the agents they evaluate:

A grader that always scores high is not discriminating: recalibrate or replace it
A grader that counts the wrong thing needs a tighter definition
A saturated grader, one that every run now passes, means the dimension is solved: drop it and add a new one

Don’t build evals once and treat them as ground truth forever. The eval suite should evolve alongside the agent.
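One way to spot graders that need attention is to look at how their scores are distributed across runs. The sketch below is an assumption-laden illustration: the function name, the thresholds, and the status labels are hypothetical, and the numbers only flag a grader for review. Deciding whether an always-high grader is miscalibrated or the dimension is truly solved still takes a human look at the outputs.

```python
from statistics import mean, pstdev


def grader_health(scores, max_score=5, high_frac=0.9, min_spread=0.3):
    """Flag graders whose scores barely vary (thresholds are assumptions)."""
    avg = mean(scores)
    spread = pstdev(scores)  # population standard deviation of the scores
    if spread < min_spread and avg >= high_frac * max_score:
        # Either the rubric lacks anchors for "bad" or the dimension is solved.
        status = "always_high"
    elif spread < min_spread:
        status = "not_discriminating"
    else:
        status = "ok"
    return {"mean": round(avg, 2), "spread": round(spread, 2), "status": status}
```

Running this over each grader's scores after every iteration gives a quick list of dimensions to recalibrate, redefine, or retire.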

Build it yourself

Follow these exact steps to reproduce it yourself · estimated time: ~60 minutes

Prerequisites

Python 3.10+ and an Anthropic API key
A task where your agent produces an artifact you can score (code, a document, a data extraction, a response)

Build a minimal eval harness for your agent: one code grader, one model grader, and one QA loop.

Step 1 — Define your task and collect 5 test inputs

Pick something concrete. A support-ticket classifier, a data-extraction agent, a code generator. Gather 5 representative inputs — include at least one edge case.

TEST_INPUTS = [
    "App crashed during checkout, third time this week",
    "How do I change my email address?",
    "The dark mode toggle doesn't persist after refresh",
    "Charged twice for the same order",
    "Feature request: add export to CSV",
]

Step 2 — Write a code grader

Start with something deterministic that catches objective failures:

import json

def code_grade(output: str) -> dict:
    """Deterministic checks on agent output."""
    return {
        # Did the agent produce a JSON response?
        "is_json": _is_valid_json(output),
        # Does the response mention the required fields? (substring check only)
        "has_severity": '"severity"' in output,
        "has_summary": '"summary"' in output,
        # Is it a reasonable length? (not a hallucinated essay)
        "reasonable_length": 50 < len(output) < 2000,
    }

def _is_valid_json(s: str) -> bool:
    """Return True if s parses as JSON."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

Step 3 — Write a model grader

import anthropic

client = anthropic.Anthropic()

GRADER_SYSTEM = """You are a quality evaluator for a support ticket classifier.
Score the following output on two dimensions.

DIMENSIONS:
- accuracy: Does the severity classification match the described problem? (1=wrong, 5=exactly right)
- helpfulness: Is the summary useful for a support agent picking this up cold? (1=useless, 5=immediately actionable)

EXAMPLES OF LOW SCORES:
- accuracy=1: "P3 cosmetic" for a checkout crash that blocks purchases
- helpfulness=1: "User has a problem" with no detail

EXAMPLES OF HIGH SCORES:
- accuracy=5: "P1 blocks checkout" for a crash during purchase flow
- helpfulness=5: "Recurring crash (3rd time this week) during checkout. Likely P1."

OUTPUT FORMAT: Give your reasoning first, then output JSON: {"accuracy": N, "helpfulness": N}"""

def model_grade(ticket: str, output: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        system=GRADER_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"TICKET: {ticket}\n\nAGENT OUTPUT: {output}"
        }],
    )
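    text = response.content[0].text
    # Reasoning comes first, so take the last flat {...} object in the reply.
    # (A greedy r'\{.*\}' would also swallow any braces in the reasoning.)
    import json, re
    candidates = re.findall(r'\{[^{}]*\}', text)
    try:
        return json.loads(candidates[-1])
    except (IndexError, json.JSONDecodeError):
        # 0 is outside the 1-5 rubric, so it marks a grader parse failure.
        return {"accuracy": 0, "helpfulness": 0}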

Step 4 — Run your baseline

def run_agent(ticket: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system="Classify this support ticket. Output JSON with: severity (P1/P2/P3), summary.",
        messages=[{"role": "user", "content": ticket}],
    )
    return response.content[0].text

results = []
for ticket in TEST_INPUTS:
    output = run_agent(ticket)
    code = code_grade(output)
    model = model_grade(ticket, output)
    results.append({"ticket": ticket, "output": output, "code": code, "model": model})
    print(f"accuracy={model['accuracy']}, helpfulness={model['helpfulness']} | {ticket[:40]}")

Step 5 — Add a QA loop

QA_SYSTEM = """You are a QA critic for a support ticket classifier.
There are problems with the output below — your job is to find them.
Check: Is severity too high? Too low? Is the summary missing context?
Does it misread the urgency? Output your critique, then a corrected version."""

def qa_loop(ticket: str, initial_output: str) -> str:
    critique = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=QA_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"TICKET: {ticket}\n\nDRAFT OUTPUT: {initial_output}"
        }],
    ).content[0].text

    # Feed critique back to the agent
    final = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        # Same system prompt as the baseline agent, so only the QA feedback differs.
        system="Classify this support ticket. Output JSON with: severity (P1/P2/P3), summary.",
        messages=[
            {"role": "user", "content": ticket},
            {"role": "assistant", "content": initial_output},
            {"role": "user", "content": f"QA review found issues:\n{critique}\n\nRevise your output."},
        ],
    ).content[0].text
    return final

Step 6 — Compare baseline vs QA loop

Run both approaches on the same inputs and compare average scores. If the QA loop consistently improves model_grade scores, keep it. If it’s inconsistent, the critic prompt needs tightening.
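A minimal sketch of that comparison, assuming you have collected one model_grade result per test input for each approach, in the same input order. The `compare_runs` helper and its report fields are hypothetical names introduced for illustration.

```python
from statistics import mean


def compare_runs(baseline, candidate, dims=("accuracy", "helpfulness")):
    """Per-dimension means plus counts of inputs that improved or regressed.

    `baseline` and `candidate` are lists of grader score dicts, one per
    test input, aligned by position.
    """
    report = {}
    for dim in dims:
        deltas = [c[dim] - b[dim] for b, c in zip(baseline, candidate)]
        report[dim] = {
            "baseline_mean": mean(b[dim] for b in baseline),
            "candidate_mean": mean(c[dim] for c in candidate),
            "improved": sum(d > 0 for d in deltas),
            "regressed": sum(d < 0 for d in deltas),
        }
    return report
```

With only five inputs, a shift in the mean can come from a single ticket, so the improved and regressed counts are worth reading alongside it.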

What to iterate on

Grader calibration: If model_grade is always 4–5, add a low-quality anchor example to the grader system prompt
Code grader brittleness: If is_json fails because the agent wraps JSON in markdown, update the code grader or the agent’s output instructions
QA loop over-correction: If the critic makes things worse, constrain the QA system prompt to specific failure modes
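For the code grader brittleness case, one option is to unwrap a markdown code fence before parsing. The helper names below are hypothetical, and the sketch only handles a single fence wrapping the whole output.

```python
import json
import re

FENCE = "`" * 3  # a markdown code-fence marker (three backticks)
FENCE_RE = re.compile(
    r"^" + FENCE + r"[a-zA-Z0-9]*\s*\n(.*?)\n" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(output: str) -> str:
    """Return the fenced body if the whole output is one code fence."""
    text = output.strip()
    match = FENCE_RE.match(text)
    return match.group(1).strip() if match else text


def is_valid_json_lenient(output: str) -> bool:
    """Like a strict JSON check, but tolerant of a wrapping markdown fence."""
    try:
        json.loads(strip_code_fence(output))
        return True
    except json.JSONDecodeError:
        return False
```

Whether to loosen the grader like this or tighten the agent's output instructions depends on what consumes the output downstream: if the next system needs raw JSON, the strict check is the honest one.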

Where to go next

Watch the original workshop to see the slide-generation agent and eval scores evolve in real time
Building Effective Agents — the foundational patterns that your evals should cover
Anthropic’s model evaluation guide covers benchmark methodology for API consumers
