The Prompting Playbook

1. Why evals have to come first

Most prompt engineering starts with intuition: read the prompt, spot something that feels off, tweak it, eyeball a few outputs, and ship. That works until it doesn’t — and when it stops working you have no way to tell whether your latest change helped, hurt, or did nothing at all.

The insight at the centre of this talk by Margo van Laar (Anthropic) is disarmingly simple: never touch a prompt you can’t measure. That means building your evaluation set before you change anything — not after. The eval set lets you establish a baseline, compare every subsequent iteration against it, and stop guessing.

What counts as an eval?

An eval doesn’t have to be elaborate. At minimum you need:

A representative sample of real inputs — a handful of examples that cover the normal case, the edge case, and the failure case you’re currently worried about.
A scoring function — even a simple pass / fail label per example, manually annotated or LLM-graded, is enough to turn “I think this is better” into “this moved the score from 0.61 to 0.79”.
A baseline score — run the current, unchanged prompt against the eval set and record the result. That number is your anchor. Every change either beats it or doesn’t.
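
The three ingredients above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the `EvalCase` type, the `pass_rate` helper, and the sample labels are all hypothetical, and the pass/fail values stand in for whatever manual or LLM-graded judgement you use.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    label: str    # which situation this case covers, e.g. "edge case"
    passed: bool  # manual or LLM-graded pass/fail judgement for the output


def pass_rate(cases: list[EvalCase]) -> float:
    """Return the fraction of eval cases that passed, between 0.0 and 1.0."""
    if not cases:
        raise ValueError("eval set is empty")
    return sum(case.passed for case in cases) / len(cases)


# Hypothetical judgements for the current, unchanged prompt.
baseline_cases = [
    EvalCase("normal case", True),
    EvalCase("second normal case", True),
    EvalCase("edge case", False),
    EvalCase("known failure", False),
]
baseline_score = pass_rate(baseline_cases)
print(f"baseline pass rate: {baseline_score:.2f}")
```

Whatever scoring function you choose, the point is that it returns one number you can compare across prompt versions.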

2. Scenario 1 — Maintaining and migrating an existing prompt

Once you have a baseline, there are five high-ROI techniques to apply in roughly this order. Each one is cheap, reversible, and observable through the eval.

Technique 1: Apply XML structure

Wrap logically distinct sections of your prompt in descriptive XML tags:

<instructions>
  Classify the customer message into one of the categories below.
</instructions>

<categories>
  billing, technical-support, account-access, other
</categories>

<context>
  The customer has a Pro subscription and contacted us 3 days ago about the same issue.
</context>

<message>
  {{customer_message}}
</message>

Without the tags, the model reads everything as a continuous stream of text and must infer where instructions end and data begins. Tags remove that ambiguity. They’re also self-documenting — future you (or a teammate) knows what each block is for.
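
If the prompt is assembled in code, one way to keep that separation is to build each block through a small helper. The sketch below assumes hypothetical helper names (`tag`, `build_classifier_prompt`) and reuses the tag names from the example above.

```python
def tag(name: str, body: str) -> str:
    """Wrap body in <name>...</name>, indented like the example above."""
    return f"<{name}>\n  {body.strip()}\n</{name}>"


def build_classifier_prompt(customer_message: str, context: str) -> str:
    """Assemble the prompt so instructions and data never share a block."""
    sections = [
        ("instructions", "Classify the customer message into one of the categories below."),
        ("categories", "billing, technical-support, account-access, other"),
        ("context", context),
        ("message", customer_message),
    ]
    return "\n\n".join(tag(name, body) for name, body in sections)


prompt = build_classifier_prompt(
    customer_message="I was charged twice this month.",
    context="The customer has a Pro subscription.",
)
print(prompt)
```

Note that this sketch does not escape the inputs, so a customer message that itself contains `</message>` would blur the boundary again. Whether that matters depends on where your inputs come from.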

Technique 2: Remove old-model patches

Every prompt that has been in production for more than a few months carries scar tissue: phrases added to work around a specific model’s quirks. Things like:

“Always respond in exactly five sentences.”
“Do not use bullet points under any circumstances.”
“Repeat the user’s question before answering.”

These were fixes for weaknesses in older models. When you migrate to a newer model, they become friction — or actively hurt performance. During a migration, audit every instruction and ask: “Is this here because of a real product requirement, or because the old model needed it?” Remove anything in the second category and let the eval tell you whether the new model needs it.

Technique 3: Give tools, not instructions

If your prompt contains instructions like “search the database for the customer’s order history before answering”, you’re asking the model to simulate a capability it doesn’t have. Convert those instructions into real tools:

tools = [
    {
        "name": "get_order_history",
        "description": "Retrieve the last N orders for a customer by ID.",
        "input_schema": { ... }
    }
]

Then the prompt becomes simply: “Use get_order_history to look up the customer’s orders before answering.” The model’s job is to decide when and with what arguments to call the tool — which it’s good at — rather than to hallucinate the tool’s output — which it isn’t.
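
On the application side, something still has to execute the call the model requests. The sketch below shows one possible dispatcher. It assumes the request arrives as a dict shaped like a Messages API `tool_use` content block (`id`, `name`, `input`) and that the reply is a `tool_result` block. The in-memory order data, the `customer_id` and `limit` parameters, and the helper names are hypothetical.

```python
import json
from typing import Any, Callable

# Hypothetical in-memory data standing in for a real order database.
_ORDERS = {"cust_42": ["order-1001", "order-1002", "order-1003"]}


def get_order_history(customer_id: str, limit: int = 5) -> list[str]:
    """Return up to `limit` most recent order IDs for a customer."""
    return _ORDERS.get(customer_id, [])[-limit:]


# Map tool names (as declared to the model) to the functions that implement them.
TOOL_IMPLS: dict[str, Callable[..., Any]] = {"get_order_history": get_order_history}


def run_tool_call(block: dict) -> dict:
    """Execute one tool_use-shaped block and build the matching tool_result block."""
    impl = TOOL_IMPLS.get(block["name"])
    if impl is None:
        content, is_error = f"unknown tool: {block['name']}", True
    else:
        content, is_error = json.dumps(impl(**block["input"])), False
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": content,
        "is_error": is_error,
    }


# A request shaped the way the model would emit it (values are made up).
request = {
    "type": "tool_use",
    "id": "toolu_01",
    "name": "get_order_history",
    "input": {"customer_id": "cust_42", "limit": 2},
}
result = run_tool_call(request)
print(result["content"])
```

The model supplies the name and arguments; the data in the result comes from your code, not from the model's guess at what the data might be.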

Technique 4: Give both sides of every trade-off

One-sided instructions create predictable failure modes. If you say “be concise”, the model optimises for brevity and drops critical details. The fix is to state both sides explicitly:

“Be concise, but include every detail the user needs to act — even if that makes the response longer than you’d otherwise choose.”

The two-sided form forces the model to make a judgment call rather than blindly optimising a single dimension. It surfaces the real intent: be concise when you can, complete when you must.

Apply this pattern anywhere you’re giving a directional instruction: “be friendly, but don’t sacrifice accuracy for tone”, “be specific, but don’t overwhelm with detail the user didn’t ask for”.

Technique 5: Re-run the eval after each change

Apply changes one at a time. Re-run the eval after each. This is not bureaucracy — it’s the only way to know which change caused which effect. Bundling all five techniques into one big diff and then running the eval once tells you the bundle worked or didn’t, not which individual technique drove the result.
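
The one-change-at-a-time discipline can be sketched as a small loop. Everything here is illustrative: `apply_one_at_a_time` is a hypothetical helper, and `toy_score` is a stand-in for running your real eval set, chosen only so the example runs without an API call.

```python
from typing import Callable

Change = tuple[str, Callable[[str], str]]  # (name, function that edits the prompt)


def apply_one_at_a_time(
    baseline_prompt: str,
    changes: list[Change],
    score: Callable[[str], float],
) -> tuple[str, list[tuple[str, float, bool]]]:
    """Try each change on its own, keeping it only if it beats the best score so far."""
    current, best = baseline_prompt, score(baseline_prompt)
    log = []
    for name, change in changes:
        candidate = change(current)
        candidate_score = score(candidate)
        kept = candidate_score > best
        log.append((name, candidate_score, kept))
        if kept:
            current, best = candidate, candidate_score
    return current, log


def toy_score(prompt: str) -> float:
    """Stand-in scorer; a real one would run the eval set against the model."""
    value = 0.5
    if "<instructions>" in prompt:
        value += 0.2
    if "exactly five sentences" in prompt:
        value -= 0.1
    return round(value, 2)


baseline = "Always respond in exactly five sentences. Classify the message."
changes = [
    ("add XML structure", lambda p: f"<instructions>{p}</instructions>"),
    ("remove old-model patch", lambda p: p.replace("Always respond in exactly five sentences. ", "")),
    ("upper-case everything", lambda p: p.upper()),
]
final_prompt, log = apply_one_at_a_time(baseline, changes, toy_score)
for name, value, kept in log:
    print(f"{name}: score={value} kept={kept}")
```

Because each change is scored separately, the log attributes every movement in the score to exactly one edit, and a change that does not help is reverted before the next one is tried.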

Check your understanding

1. You are migrating a production prompt to a newer model. You find the instruction "Never use bullet points." What should you do first?

Instructions accumulate as patches for old model weaknesses. The right move is to audit intent before deciding. If it was a workaround, remove it and let the eval tell you if the new model needs it.

2. Your prompt says "search the knowledge base for relevant articles before answering." Which technique addresses this most directly?

Instructions that ask the model to simulate a capability belong in a tool definition. The model decides when and how to call the tool; it doesn't simulate the result.

3. Why should you apply the five optimisation techniques one at a time rather than all at once?

Bundling changes loses the signal. One change at a time with an eval in between is the only way to attribute results to causes.

4. A team wants to add the rule "prefer British English for UK customers" to their prompt. What is the best approach according to the generate-evaluate-repair playbook?

The evaluation step is the right place for context-dependent soft constraints. The generator produces a draft; the evaluator checks it against customer-specific rules without those rules cluttering the main prompt.

5. In the talk's code-review experiment, which approach won on the combined metric of quality, token count, and latency?

The generate-evaluate-repair loop on Sonnet passed the eval, used fewer tokens than Opus+thinking, and had lower latency, showing that architecture beats raw model power.

3. Scenario 2 — Building a prompt from scratch

When you have no existing prompt to refine, the instinct is to jump to the most capable model and write the best prompt you can. The experiment in the talk tests that instinct directly, using a complex code-review task as the benchmark.

The model comparison experiment

Approach	Eval result	Notes
Sonnet 3.7 + plain prompt	Fails	Baseline starting point
Opus + plain prompt	Partial pass	Better reasoning, still misses edge cases
Opus + adaptive thinking	Passes	But uses 3× the tokens and higher latency
Sonnet + improved prompt	Still fails	Prompt alone wasn’t enough
Generate-evaluate-repair (Sonnet)	Passes	Fewer tokens, lower latency than Opus+thinking

The headline result: architectural improvement beats model upgrade. Reaching for a larger model is the obvious move and sometimes the correct one, but it is not always the highest-ROI one. The loop kept Sonnet-level cost and speed while meeting the quality bar that otherwise required Opus with adaptive thinking.

4. The generate-evaluate-repair loop

The loop is the key architectural pattern for building high-quality agentic prompts from scratch. It separates concerns cleanly: the generator focuses on producing good output, the evaluator focuses on checking it, and the repair step focuses on fixing specific failures.

The generate-evaluate-repair loop: the generator produces a draft, the evaluator scores it against hard and soft constraints, and the repair step fixes failures — repeating until the quality bar is met or the iteration limit is reached.

How the loop works

Generate — the main agent prompt produces a draft output. Keep this prompt focused on the primary task. Don’t pollute it with context-specific rules that apply only to some requests.

Evaluate — a separate evaluation call (or function) scores the draft. This is where the loop earns its power:

Score against hard constraints: “Did the output include all required sections?”
Score against soft constraints: context-dependent rules you inject at evaluation time rather than baking into the generator. A rule like “for this customer, prefer British English” lives here, not in the system prompt — because it’s only relevant sometimes.

If the score meets the threshold, the output is returned. If not, the evaluation step emits a structured critique explaining what failed.
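
Hard constraints are often checkable with plain code rather than a model call. The following sketch shows one way to do that and to emit a structured critique; the `Critique` type, the `check_required_sections` helper, and the section names are hypothetical, and the substring match is a deliberately naive check.

```python
from dataclasses import dataclass, field


@dataclass
class Critique:
    passed: bool
    failures: list[str] = field(default_factory=list)  # one entry per failed check


def check_required_sections(draft: str, required: list[str]) -> Critique:
    """Hard-constraint check: every required section name must appear in the draft."""
    missing = [name for name in required if name.lower() not in draft.lower()]
    failures = [f"Missing required section: {name}" for name in missing]
    return Critique(passed=not failures, failures=failures)


draft = "Summary: the handler swallows exceptions.\nRisks: silent data loss."
critique = check_required_sections(draft, ["Summary", "Risks", "Suggested fix"])
print(critique.passed)
print(critique.failures)
```

A critique in this shape names each failure individually, which gives the repair step something specific to fix.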

Repair — the repair call receives the original input, the draft, and the critique. Its job is narrow: fix the specific failures the evaluator identified, without changing anything that was already correct. This focused repair is more reliable than asking the generator to try again from scratch.

Loop — the repaired output goes back to Evaluate. Repeat until the quality bar is met or a maximum iteration count is reached (always set a maximum — unbounded loops are bugs).

Why soft constraints belong in the evaluator

Injecting soft constraints at evaluation time keeps the generator prompt clean and general. The generator doesn’t need to know about every possible customer preference or locale rule — it just needs to produce a good draft. The evaluator, which runs after the draft exists, has full context about the specific request and can check for anything.

This separation also makes the system easier to maintain: adding a new soft constraint means editing the evaluation step, not the generator prompt. And you can test the evaluator independently with known-good and known-bad examples.
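
Testing the evaluator on its own could look like the sketch below. It assumes an evaluator with the shape `evaluate(draft, soft_constraints) -> {"score": ...}` and a pass threshold of 0.8. `fake_evaluate` is a stand-in so the example runs offline; in practice you would pass your real evaluation call.

```python
from typing import Callable

# (draft, soft constraints, should this draft pass?)
LabelledExample = tuple[str, list[str], bool]


def evaluator_agreement(
    evaluate: Callable[[str, list[str]], dict],
    labelled: list[LabelledExample],
    threshold: float = 0.8,
) -> float:
    """Fraction of labelled examples where the evaluator's verdict matches the label."""
    hits = 0
    for draft, constraints, should_pass in labelled:
        verdict = evaluate(draft, constraints)["score"] >= threshold
        hits += verdict == should_pass
    return hits / len(labelled)


def fake_evaluate(draft: str, constraints: list[str]) -> dict:
    """Stand-in evaluator: fails any draft containing the US spelling 'color'."""
    failed = "color" in draft.lower()
    return {"score": 0.0 if failed else 1.0, "critique": "US spelling" if failed else ""}


british = ["Use British English spelling"]
examples = [
    ("The colour on the invoice is wrong.", british, True),   # known-good
    ("The color on the invoice is wrong.", british, False),   # known-bad
]
agreement = evaluator_agreement(fake_evaluate, examples)
print(f"evaluator agreement: {agreement:.2f}")
```

An evaluator that disagrees with your labelled examples will steer the repair step wrongly, so it is worth checking this number before trusting the loop.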

5. Choosing your playbook

The two scenarios map to two real situations developers face. Knowing which one you’re in determines everything that follows.

Use Scenario 1 (maintain/migrate) when:

You have a prompt in production that mostly works but has regressed, plateaued, or needs to move to a new model.
The task is a single-turn or simple multi-turn interaction (classify, summarise, extract, answer).
The performance gap is modest — the current prompt is in the right ballpark.

Use Scenario 2 (build from scratch with the loop) when:

You’re building something new, or the existing approach has fundamentally hit a ceiling.
The task requires multiple sub-steps, conditional logic, or quality checks that vary by request.
You need to inject per-request soft constraints without maintaining dozens of prompt variants.
You’ve tried prompt-only improvements and they’re not moving the eval enough.

The common thread

Both playbooks share the same foundation: evals first, changes second. Whether you’re iterating on an existing prompt or building a new architecture, you need a way to measure before you can improve. The prompting playbook is not a collection of clever phrasings — it’s a measurement-driven engineering discipline applied to language model inputs.

Build it yourself

Follow these steps to reproduce the workflow yourself. Estimated time: about 30 minutes.

Prerequisites

An Anthropic API key (set as ANTHROPIC_API_KEY)
Python 3.10+ with the anthropic package installed (pip install anthropic)
A task you want to improve — something with clear pass/fail criteria works best

Step 1 — Build a minimal eval set

Create a file eval_cases.json with at least five examples that cover normal inputs, edge cases, and the failure you’re worried about. Two are shown here to illustrate the shape:

[
  {
    "input": "Summarise this support ticket in one sentence: ...",
    "expected": "pass",
    "label": "normal case"
  },
  {
    "input": "Summarise this support ticket in one sentence: [very long ticket] ...",
    "expected": "pass",
    "label": "long input edge case"
  }
]

Even a simple pass/fail label per case beats no eval at all.
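
Before spending API calls, it can help to sanity-check the eval file. The sketch below is one possible check, assuming the three keys used in the example above. `validate_cases` is a hypothetical helper, and the second sample case is deliberately missing a key to show the output.

```python
REQUIRED_KEYS = {"input", "expected", "label"}


def validate_cases(cases: list[dict], minimum: int = 5) -> list[str]:
    """Return human-readable problems; an empty list means the set looks usable."""
    problems = []
    if len(cases) < minimum:
        problems.append(f"only {len(cases)} cases, expected at least {minimum}")
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
    return problems


# In practice this list would come from json.load on eval_cases.json.
sample_cases = [
    {"input": "Summarise this support ticket in one sentence: ...", "expected": "pass", "label": "normal case"},
    {"input": "Summarise this support ticket in one sentence: ...", "label": "long input edge case"},
]
problems = validate_cases(sample_cases)
for problem in problems:
    print(problem)
```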

Step 2 — Record a baseline

Run your current prompt against all eval cases and record the score:

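import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_prompt(system, user_input):
    """Send one system prompt plus user input and return the text of the reply."""
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": user_input}],
    )
    return msg.content[0].text

current_prompt = "Summarise the following support ticket in one sentence."
with open("eval_cases.json") as f:
    cases = json.load(f)

results = []
for case in cases:
    output = run_prompt(current_prompt, case["input"])
    results.append({"label": case["label"], "output": output})

# Save the raw outputs so they can be scored now and compared against later.
# The filename is arbitrary.
with open("baseline_outputs.json", "w") as f:
    json.dump(results, f, indent=2)

print("Baseline outputs saved to baseline_outputs.json. Review and score them manually.")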

Score the outputs, note the baseline, and only then start changing things.

Step 3 — Apply XML structure

Rewrite your prompt using XML tags and re-run the eval:

xml_prompt = """
<instructions>
  Summarise the support ticket in the <ticket> tag in exactly one sentence.
  Include the main issue and the customer's desired outcome.
</instructions>
"""

def run_with_xml(user_input):
    return run_prompt(xml_prompt, f"<ticket>{user_input}</ticket>")

Compare the score to the baseline. If it improved, keep it. If not, revert and try the next technique.

Step 4 — Build the generate-evaluate-repair loop

def generate(user_input: str) -> str:
    return run_prompt(xml_prompt, f"<ticket>{user_input}</ticket>")

def evaluate(draft: str, soft_constraints: list[str]) -> dict:
    constraint_block = "\n".join(f"- {c}" for c in soft_constraints)
    eval_prompt = f"""
<draft>{draft}</draft>
<constraints>
{constraint_block}
</constraints>
Score the draft. Respond with JSON: {{"score": 0-1, "critique": "..."}}
"""
    result = run_prompt("You are a strict output evaluator.", eval_prompt)
    return json.loads(result)

def repair(user_input: str, draft: str, critique: str) -> str:
    repair_prompt = f"""
<original_input>{user_input}</original_input>
<draft>{draft}</draft>
<critique>{critique}</critique>
Fix only the issues described in the critique. Keep everything else unchanged.
"""
    return run_prompt("You are a precise editor.", repair_prompt)

def generate_evaluate_repair(user_input: str, soft_constraints: list[str], max_iters: int = 3) -> str:
    draft = generate(user_input)
    for _ in range(max_iters):
        result = evaluate(draft, soft_constraints)
        if result["score"] >= 0.8:  # quality threshold
            return draft
        # Below threshold: fix only what the critique names, then re-evaluate.
        draft = repair(user_input, draft, result["critique"])
    # Iteration limit reached: return the last repaired draft (it has not been re-scored).
    return draft

Step 5 — Inject a soft constraint and verify

output = generate_evaluate_repair(
    user_input="Customer in London: my invoice shows the wrong amount ...",
    soft_constraints=["Use British English spelling (e.g. 'colour', 'summarise')"],
)
print(output)

Run the same case without the soft constraint and compare. The generator prompt is identical in both cases — only the evaluator’s context changes.

Expected result: the loop produces an output that meets the quality threshold without adding customer-specific rules to the generator prompt. If a step fails, check that the evaluator prompt returns valid JSON — a json.loads error usually means the model included explanation text outside the JSON object; tighten the eval prompt to prevent that.
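
Tightening the eval prompt helps, but a more tolerant parser is a useful second line of defence. The sketch below is one possible replacement for the bare json.loads call: it scans the model output for the first JSON object that carries both a "score" and a "critique" key. The `parse_evaluator_json` name is hypothetical; `json.JSONDecoder.raw_decode` is the standard-library call that parses a JSON value starting at a given index.

```python
import json


def parse_evaluator_json(raw: str) -> dict:
    """Return the first JSON object in `raw` that has 'score' and 'critique' keys."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict) and "score" in obj and "critique" in obj:
            return obj
        start = raw.find("{", start + 1)
    raise ValueError("no evaluator JSON object found in model output")


# Example of a reply with explanation text around the JSON (made-up content).
noisy = 'Here is my assessment:\n{"score": 0.4, "critique": "Uses US spelling."}\nHope that helps!'
parsed = parse_evaluator_json(noisy)
print(parsed["score"], parsed["critique"])
```

Raising a ValueError when nothing parses keeps the failure loud, so a malformed evaluation is never silently treated as a pass.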

Where to go next

Watch the original talk by Margo van Laar for the live demos and full experiment walkthrough.
Read the Anthropic prompt engineering guide for a comprehensive reference on XML, tool use, and multi-turn patterns.
Explore Mastering Claude Code for how the same eval-and-loop thinking applies to agentic coding workflows.
