AI Learning
advanced ⏱️ 14 min read · 🎬 ~16 min video

Self-Improving Prompts: How Metaview Automates Prompt Optimization

How Metaview built an application-review system whose prompts learn from every recruiter decision — a practical blueprint for automated prompt critique and rewriting.

This lesson is original educational writing based on this video by Anthropic (published May 22, 2026). All credit for the original content goes to the creators.

#prompting #evaluation #agents

1. The problem with prompts that stand still

Metaview helps recruiters sift through thousands of resumes a day. Early in the life of its AI-review system, the team did what almost everyone does: a human wrote a careful prompt, tested it on a handful of examples, and shipped it. It worked, until it didn’t.

Recruiter preferences drift over time. A new hiring manager joins with different standards. A role evolves mid-cycle. The market shifts and the definition of “strong candidate” quietly changes. Every time preferences changed, someone had to manually revisit the prompt, run a new batch, compare outputs, and decide whether the new version was better. That is expensive, slow, and easy to skip.

The deeper issue is that prompts encode a point-in-time snapshot of human judgment. Human judgment is not a snapshot — it is a continuous, evolving signal. Any static prompt is already slightly out of date by the time it ships.

Most teams respond to this by treating prompts as configuration: update them manually when complaints pile up. Metaview chose a different path. They asked: what if the system could watch every decision a recruiter makes, notice where its own output diverged from that decision, and rewrite the prompt to close the gap?

2. The self-improvement loop

The core architecture Metaview built has five stages that form a closed loop:

  1. Run — the current prompt processes a batch of applications and produces structured outputs (a score, a summary, a recommendation).
  2. Collect signal — recruiters act on those outputs. When a recruiter overrides the AI decision, that override is a labeled training example: here is what the system said, here is what the human actually decided.
  3. Critique — a second LLM call (the “critic”) looks at the divergences and identifies patterns. It is not just comparing individual cases; it is synthesising a hypothesis about why the current prompt keeps getting these cases wrong.
  4. Rewrite — armed with the critic’s hypothesis, a third LLM call produces a candidate replacement prompt. The rewrite targets the failure mode without destroying what the current prompt already does well.
  5. Evaluate — the candidate prompt is run on a held-out test set of past decisions (where the ground truth is known) and its accuracy is compared against the current prompt. Only if the candidate wins on that benchmark does it get promoted to production.
[Figure: loop diagram. Run (current prompt) → Collect signal (recruiter overrides) → Critique (find failure patterns) → Rewrite (candidate prompt) → Eval (A/B test). If the candidate wins, promote to production; if it loses, collect more signal and retry.]
The self-improvement loop: recruiter overrides generate signal, a critic identifies failure patterns, a rewriter proposes a new prompt, and evaluation gates promotion to production.

The loop runs continuously in the background. Recruiters never see it — they just see the AI getting gradually more accurate over time, without anyone manually editing a config file.

3. Capturing implicit feedback

The trickiest design challenge is the signal collection stage. Asking users to give explicit ratings (“was this AI output good? 1–5 stars”) sounds easy but fails in practice: users ignore rating prompts, gaming and fatigue corrupt the signal, and a recruiter with 200 applications to review is not going to rate each AI summary.

Metaview’s insight was to use implicit, behavioural signal. When a recruiter overrides the AI’s recommendation — moves an applicant the AI ranked low into a phone screen, or declines someone the AI rated highly — that override is a high-quality, naturally-occurring label. It represents a moment where the human disagreed enough to act.

Three properties make implicit feedback compelling:

  1. High fidelity — the recruiter is expressing a real preference, not just filling out a form.
  2. Zero additional burden — no change to the recruiter’s workflow is required.
  3. Automatic accumulation — the dataset grows with every normal usage session.

The important caveat is that this signal is sparse and noisy. Not every application gets overridden, and not every override is informative (sometimes a recruiter clicks the wrong thing). The system needs to tolerate noisy labels and work with only the subset of examples that have reliable ground truth.
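One hedged way to handle noisy overrides is to filter them before they reach the critic. The sketch below is an assumption, not Metaview's implementation: the `Event` fields, the `reliable_overrides` helper, and the 10-minute `SETTLE_WINDOW` are all hypothetical. It keeps only each application's final recruiter action (so a corrected misclick never becomes a label) and skips actions too recent to have been corrected yet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical: an action must stand this long before it is trusted as a label.
SETTLE_WINDOW = timedelta(minutes=10)

@dataclass
class Event:
    application_id: str
    ai_recommendation: str   # e.g. "advance" or "decline"
    recruiter_action: str    # what the recruiter chose
    at: datetime             # when the recruiter acted

def reliable_overrides(
    events: list[Event],
    now: datetime,
    settle: timedelta = SETTLE_WINDOW,
) -> list[Event]:
    """Return settled final actions that disagree with the AI recommendation."""
    latest: dict[str, Event] = {}
    for event in sorted(events, key=lambda e: e.at):
        # Later actions replace earlier ones, so a corrected misclick is dropped.
        latest[event.application_id] = event
    return [
        e for e in latest.values()
        if e.recruiter_action != e.ai_recommendation and now - e.at >= settle
    ]
```

A misclick that is never corrected still gets through, so this reduces label noise rather than eliminating it.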

4. Writing the critic prompt

The critique step is where most teams underinvest. The naive implementation asks the LLM: “Here are cases where the AI was wrong. What should the prompt do differently?” That produces vague, unhelpful output like “be more nuanced about experience levels.”

Metaview structures the critic prompt to do three things:

A. Force pattern recognition before recommendations. The critic is instructed to group failures into clusters before suggesting any changes. A single outlier gets ignored; a cluster of five failures where the AI consistently undervalued candidates with non-linear career paths is a real signal worth acting on.

B. Require contrastive analysis. The critic must also look at the successes — cases where the AI matched the recruiter decision. This prevents the rewriter from “fixing” something that is already working. The critic produces a differential: the prompt handles X well, but fails on Y because of Z.

C. Produce a structured hypothesis. The critic output is not free text. It follows a schema: failure mode (string), affected examples (list), hypothesised cause (string), suggested prompt delta (string). This schema makes the rewriter’s job tractable and makes the output auditable by a human if needed.

# Fill with CRITIC_PROMPT.format(current_prompt=..., failures=..., successes=...).
# Literal braces are doubled so str.format leaves them in place.
CRITIC_PROMPT = """
You are evaluating where an AI application-review prompt diverges from recruiter decisions.

INPUT:
- Current prompt: {current_prompt}
- Failure cases (each with application, ai_output, recruiter_decision): {failures}
- Success cases (same structure): {successes}

TASK:
1. Group the failure cases into clusters by failure mode (max 5 clusters).
2. For each cluster, identify what the current prompt does or does NOT say that causes it.
3. Identify what the successes share in common that the failures lack.
4. For the largest cluster only, propose a minimal edit to the prompt.

OUTPUT FORMAT (JSON):
{{
  "clusters": [...],
  "success_pattern": "...",
  "primary_failure_mode": "...",
  "hypothesised_cause": "...",
  "suggested_delta": "..."
}}
"""

The schema is enforced with structured output so downstream code can rely on it without fragile parsing.
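The source does not say which structured-output mechanism Metaview uses. As a minimal stdlib-only sketch of what "downstream code can rely on it" might mean in practice, the check below validates a critic reply against the keys listed in the prompt's output format. The names `REQUIRED_KEYS` and `parse_critic_output` are hypothetical.

```python
import json

# Top-level keys from the critic prompt's OUTPUT FORMAT, with their expected types.
REQUIRED_KEYS = {
    "clusters": list,
    "success_pattern": str,
    "primary_failure_mode": str,
    "hypothesised_cause": str,
    "suggested_delta": str,
}

def parse_critic_output(raw: str) -> dict:
    """Parse critic JSON and check that every required key has the expected type."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("critic output must be a JSON object")
    for key, expected in REQUIRED_KEYS.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} must be of type {expected.__name__}")
    # The prompt caps the critic at five clusters.
    if len(data["clusters"]) > 5:
        raise ValueError("critic returned more than 5 clusters")
    return data
```

Rejecting a malformed reply here, before the rewriter runs, keeps a bad critique from turning into a bad candidate prompt.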

5. Evaluation and promotion gates

Generating a candidate prompt is cheap. Promoting a bad prompt to production is expensive — it silently degrades results for every recruiter until someone notices. The evaluation gate is the most important guardrail in the system.

Metaview maintains a golden dataset: a curated set of historical application/decision pairs where the ground truth is known and trusted. Every candidate prompt is scored against this dataset before promotion. Promotion requires meeting three conditions simultaneously:

  1. Net accuracy improvement — the candidate must outperform the current prompt on overall accuracy across the golden set.
  2. No regression on known-good clusters — if the current prompt already handles a cluster well (e.g. “candidates with a PhD in a relevant field”), the candidate must not perform worse on that cluster.
  3. Minimum sample coverage — the golden set must contain enough examples of the failure mode being targeted for the result to be statistically meaningful, not noise.

If the candidate fails any condition, it is discarded. The system logs the failure reason, which feeds back into the next critique cycle — the critic can see that its last suggestion did not work and try a different angle.
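The three conditions can be sketched as a single gate that also returns the failure reason for logging. This is an illustration under stated assumptions, not Metaview's code: `ClusterScore`, `promotion_decision`, the 30-example coverage floor, and the 0.9 cutoff for a "known-good" cluster are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ClusterScore:
    name: str
    n_examples: int       # golden-set examples in this cluster
    current_acc: float    # production prompt accuracy on the cluster
    candidate_acc: float  # candidate prompt accuracy on the cluster

def promotion_decision(
    current_overall: float,
    candidate_overall: float,
    clusters: list[ClusterScore],
    target_cluster: str,
    min_target_examples: int = 30,  # hypothetical coverage floor
    known_good_acc: float = 0.9,    # hypothetical cutoff for "handled well"
) -> tuple[bool, str]:
    """Return (promote, reason); the reason can feed the next critique cycle."""
    # Condition 1: net accuracy improvement across the golden set.
    if candidate_overall <= current_overall:
        return False, "no net accuracy improvement"
    # Condition 2: no regression on clusters the current prompt handles well.
    for c in clusters:
        if c.current_acc >= known_good_acc and c.candidate_acc < c.current_acc:
            return False, f"regression on known-good cluster: {c.name}"
    # Condition 3: enough golden-set examples of the targeted failure mode.
    target = next((c for c in clusters if c.name == target_cluster), None)
    if target is None or target.n_examples < min_target_examples:
        return False, "too few golden-set examples of the targeted failure mode"
    return True, "all three conditions met"
```

Under these assumptions, a candidate that lifts overall accuracy but drops a well-handled cluster from 95% to 92% is rejected with a reason that names the cluster.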

Check your understanding

  1. Why does Metaview use recruiter overrides as feedback rather than explicit star ratings?

  2. What does the critic LLM produce in Metaview's pipeline?

  3. A candidate prompt improves accuracy on the failure cluster it was targeting, but scores 3% lower on a previously well-handled cluster. What should the system do?

  4. Which property of the self-improvement loop ensures the system does not get stuck in a local optimum?

  5. What is the minimum information the critic must include to enable contrastive analysis?

6. Lessons that generalise beyond recruiting

Metaview’s approach is domain-specific in its data source (recruiter decisions on job applications) but the architecture is general. Any LLM pipeline that:

  • processes a high volume of items,
  • has downstream human decisions that can be observed, and
  • runs on an ongoing basis rather than a one-time task

…is a candidate for a self-improving prompt loop.

Document review in legal or compliance is an obvious parallel: attorneys override AI recommendations on which documents are relevant in discovery. Those overrides are labelled examples. A critic can identify what language patterns the prompt is consistently misclassifying and propose a refinement.

Customer support triage is another: support agents who reassign or escalate tickets the AI routed are implicitly labelling the AI’s routing errors. The loop can improve routing prompts without any survey or manual annotation.

The generalisable checklist:

  1. Identify where humans make decisions after seeing your AI’s output — those are your override opportunities.
  2. Build infrastructure to capture those overrides with the AI context (what did the AI say, what did the human decide, what was the input).
  3. Write a structured critic prompt that clusters failures and contrasts them with successes.
  4. Build a golden dataset from your most trusted historical decisions.
  5. Gate promotion on accuracy improvement and no regression.
  6. Feed evaluation failures back to the critic.
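Item 2 of the checklist is mostly plumbing. A minimal sketch, assuming an append-only JSONL file is acceptable storage: every name here (`DecisionRecord`, `record_decision`, `load_records`) is hypothetical, and `prompt_version` is an added field, not something the source describes, included so each label can be traced to the prompt that produced the AI output.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    item_id: str
    input_text: str      # what the AI was shown
    ai_output: str       # what the AI said
    human_decision: str  # what the human decided
    prompt_version: str  # assumption: identifies the prompt that produced ai_output
    recorded_at: str     # ISO 8601 timestamp, UTC

def record_decision(path: str, item_id: str, input_text: str, ai_output: str,
                    human_decision: str, prompt_version: str) -> DecisionRecord:
    """Append one decision, with its AI context, to a JSONL log."""
    record = DecisionRecord(
        item_id, input_text, ai_output, human_decision, prompt_version,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record

def load_records(path: str) -> list[DecisionRecord]:
    """Read the log back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [DecisionRecord(**json.loads(line)) for line in f if line.strip()]
```

Logging agreements as well as overrides matters, because the critic needs success cases for its contrastive analysis.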

Build it yourself

Follow these exact steps to reproduce it yourself · estimated time: ~45 min

Prerequisites

  • Python 3.10+ and pip
  • Anthropic Python SDK: pip install anthropic
  • A small dataset of (input, ai_output, human_decision) triples — 20–50 rows is enough to start

Step 1 — Define your data structures

Create a file prompt_optimizer.py and define the core types:

from dataclasses import dataclass

@dataclass
class Example:
    input_text: str      # the item being processed (e.g. a resume)
    ai_output: str       # what the current prompt produced
    human_decision: str  # what the human actually decided (ground truth)
    is_override: bool    # True if human disagreed with AI

@dataclass
class CriticOutput:
    clusters: list[dict]          # failure clusters
    success_pattern: str          # what successes share
    primary_failure_mode: str     # biggest cluster summary
    hypothesised_cause: str       # why the prompt fails
    suggested_delta: str          # proposed prompt edit

Step 2 — Write the critic

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def _as_json(examples: list[Example]) -> str:
    """Serialise examples for inclusion in the critic prompt."""
    rows = [
        {"input": e.input_text, "ai_output": e.ai_output, "human_decision": e.human_decision}
        for e in examples
    ]
    return json.dumps(rows, indent=2)

def run_critic(current_prompt: str, examples: list[Example]) -> CriticOutput:
    # Cap each list at 20 examples to keep the critic prompt short.
    failures = [e for e in examples if e.is_override][:20]
    successes = [e for e in examples if not e.is_override][:20]

    critic_prompt = f"""
You are evaluating where an AI prompt diverges from human decisions.

CURRENT PROMPT:
{current_prompt}

FAILURE CASES (human overrode the AI):
{_as_json(failures)}

SUCCESS CASES (human agreed with the AI):
{_as_json(successes)}

TASK:
1. Group failure cases into clusters (max 5) by failure mode.
2. Identify what the successes share that the failures lack.
3. For the largest failure cluster only, propose a minimal edit to the prompt.

Return only a JSON object with keys: clusters, success_pattern, primary_failure_mode,
hypothesised_cause, suggested_delta.
"""

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": critic_prompt}],
    )
    # The reply may wrap the JSON in prose or a code fence, so parse only the
    # outermost {...} span instead of the raw text.
    text = response.content[0].text
    payload = json.loads(text[text.find("{") : text.rfind("}") + 1])
    # Ignore unexpected keys; a missing key still raises TypeError.
    known = CriticOutput.__dataclass_fields__
    return CriticOutput(**{k: v for k, v in payload.items() if k in known})

Step 3 — Write the rewriter

def rewrite_prompt(current_prompt: str, critic: CriticOutput) -> str:
    rewriter_prompt = f"""
You are rewriting an AI prompt to fix a specific failure mode.

CURRENT PROMPT:
{current_prompt}

CRITIC ANALYSIS:
- Primary failure mode: {critic.primary_failure_mode}
- Hypothesised cause: {critic.hypothesised_cause}
- Suggested delta: {critic.suggested_delta}
- Do not break successes: {critic.success_pattern}

Write the updated prompt. Make the smallest edit that addresses the failure mode.
Do not change sections that handle successful cases. Return only the updated prompt text.
"""

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content": rewriter_prompt}],
    )
    return response.content[0].text.strip()

Step 4 — Build the evaluation gate

def evaluate_prompt(prompt: str, golden_set: list[Example]) -> float:
    """Run prompt against golden set; return accuracy (0.0–1.0)."""
    if not golden_set:
        raise ValueError("golden_set is empty; accuracy is undefined")
    correct = 0
    for example in golden_set:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=256,
            system=prompt,
            messages=[{"role": "user", "content": example.input_text}],
        )
        ai_output = response.content[0].text
        # Naive substring match: an output that mentions both labels
        # ("Decline rather than Accept") counts as correct for either one.
        # Replace with your domain-specific scorer.
        if example.human_decision.lower() in ai_output.lower():
            correct += 1
    return correct / len(golden_set)

def should_promote(
    current_accuracy: float,
    candidate_accuracy: float,
    threshold: float = 0.02,
) -> bool:
    return candidate_accuracy >= current_accuracy + threshold

Step 5 — Wire the loop

def run_improvement_cycle(
    current_prompt: str,
    training_examples: list[Example],
    golden_set: list[Example],
) -> str:
    """Run one improvement cycle. Returns the new prompt (or current if no improvement)."""
    overrides = [e for e in training_examples if e.is_override]
    if len(overrides) < 5:
        print("Not enough overrides to critique — collecting more signal.")
        return current_prompt

    print("Running critic...")
    critic = run_critic(current_prompt, training_examples)
    print(f"Primary failure mode: {critic.primary_failure_mode}")

    print("Rewriting prompt...")
    candidate = rewrite_prompt(current_prompt, critic)

    print("Evaluating...")
    current_acc = evaluate_prompt(current_prompt, golden_set)
    candidate_acc = evaluate_prompt(candidate, golden_set)
    print(f"Current: {current_acc:.2%}  Candidate: {candidate_acc:.2%}")

    if should_promote(current_acc, candidate_acc):
        print("Candidate wins — promoting.")
        return candidate
    else:
        print("Candidate rejected — keeping current prompt.")
        return current_prompt
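The cycle above discards a rejected candidate without telling the critic, whereas section 5 describes feeding failure reasons into the next critique. A minimal sketch of that feedback, assuming you keep a list of rejected attempts between cycles and append the rendered text to the critic prompt: `RejectedAttempt` and `format_rejections` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class RejectedAttempt:
    suggested_delta: str  # the edit the critic proposed
    reason: str           # why the evaluation gate rejected it

def format_rejections(history: list[RejectedAttempt], limit: int = 3) -> str:
    """Render the most recent rejected edits as a text block for the critic prompt."""
    if not history:
        return "PREVIOUS ATTEMPTS: none"
    lines = ["PREVIOUS ATTEMPTS (rejected by the evaluation gate; try a different angle):"]
    for attempt in history[-limit:]:
        lines.append(f"- Edit: {attempt.suggested_delta} | Rejected because: {attempt.reason}")
    return "\n".join(lines)
```

Capping the history at a few entries keeps the critic prompt short while still steering it away from edits that have already failed.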

Step 6 — Run it

if __name__ == "__main__":
    current_prompt = "Review the following job application and recommend Accept or Decline."
    # Load your examples — replace with real data
    training_examples = [...]  # list[Example]
    golden_set = [...]         # list[Example] — held-out, trusted labels

    new_prompt = run_improvement_cycle(current_prompt, training_examples, golden_set)
    print("\nActive prompt:\n", new_prompt)

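Expected result: the cycle runs once training_examples contains at least 5 overrides (the check in Step 5), though 10–15 give the critic more to cluster. The critic identifies the most common failure pattern and proposes a targeted edit, and the gate promotes the candidate only if it beats the current prompt by at least 2 percentage points of golden-set accuracy. This gate checks overall accuracy only; the per-cluster regression and coverage conditions from section 5 are not implemented here, so its verdict is only as trustworthy as the golden set and the scorer. To keep the loop running, schedule run_improvement_cycle with cron or trigger it after every N new overrides.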
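Step 6 leaves training_examples and golden_set as placeholders. Before you have a curated golden set, one hedged starting point is a deterministic hash split, so the same item always lands in the same set across runs and never leaks from training into evaluation. `split_examples` and the 30% default are hypothetical; the `key` argument would be something like `lambda e: e.input_text` for the Example type above.

```python
import hashlib
from typing import Callable, TypeVar

T = TypeVar("T")

def split_examples(
    examples: list[T],
    key: Callable[[T], str],
    golden_fraction: float = 0.3,  # hypothetical share held out for evaluation
) -> tuple[list[T], list[T]]:
    """Split into (training, golden) by hashing each example's key."""
    training: list[T] = []
    golden: list[T] = []
    for example in examples:
        digest = hashlib.sha256(key(example).encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (golden if bucket < golden_fraction else training).append(example)
    return training, golden
```

A hash split is not a substitute for curation: it guarantees separation, not label quality, so trusted historical decisions should replace it as they become available.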
