Agent Battle: Build the Best Diamond-Mining Agent

1. Why Mine Diamonds in Minecraft?

The setup sounds frivolous: Anthropic’s applied AI team runs a workshop where participants build AI agents that mine diamonds in Minecraft and compete on a live leaderboard, with 35 minutes to build and test. But the choice of environment is deliberate. Minecraft diamond mining works well as an agent evaluation problem because it has a single clear objective metric (diamonds mined), a constrained time window for each run (five minutes of agent execution), a reproducible starting state (same seed and start kit every reset), and enough complexity in the decision space that different agent architectures produce meaningfully different results.

These are properties you want in any agent benchmark. A good eval needs a measurable outcome, reproducible starting conditions, and enough degrees of freedom that the agent’s choices actually matter. The Minecraft wrapper provides all three, and it does so in a domain that is fun and tangible — participants can watch their agent move around, dig, get stuck, find coal instead of diamonds, and try again. The concreteness of the task makes the abstractions of agent design more intuitive.

Ben and Jeff from Anthropic’s applied AI team frame the workshop around three learning objectives: build and deploy a managed agent, understand the impact of agent configuration, and learn to hill climb on evals. These three skills — deploy, configure, evaluate — are exactly the skills required to build production agents in any domain. The diamond mining wrapper is a teaching device for lessons that transfer directly to real agent engineering.

The agent battle loop: configure agent, run eval, observe result, update configuration, repeat within a fixed time window.

2. The Three Levers: System Prompt, Model, and Tools

The agent configuration space in the workshop is explicitly bounded. Participants work with a my_agent.py file and have three main levers to adjust: the system prompt, the model string, and the MCP tools (or custom skills) available to the agent. Everything else (the Minecraft environment, the starting seed, the Mineflayer bot interface, the evaluation harness) is fixed. This constraint is pedagogically important.

In real-world agent development, the configuration space can feel overwhelmingly large. There are hundreds of decisions you could make about architecture, tooling, prompting strategy, model choice, and evaluation design. By reducing the space to three levers and fixing everything else, the workshop makes the causal structure visible: when you change the system prompt and the agent gets more diamonds, you know why. When you swap model versions and performance improves, you can attribute it cleanly.

The system prompt is perhaps the highest-leverage of the three. The prompt starts empty, a blank configuration that produces some default behavior. Participants who think carefully about what the agent needs to know (how diamonds are distributed in the game, what tools are available, what strategies tend to work) and encode that knowledge into the system prompt will outperform those who rely on the model’s default behavior. The model’s reasoning capability gets applied to whatever context you give it; a richer, more strategically informed context produces better decisions.
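As a rough illustration, the three levers can be pictured as three fields on one configuration object. The workshop's actual my_agent.py is not shown in this lesson, so the class name, field names, model string, and tool name below are placeholders invented for the sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container for the three levers; not the workshop's real schema."""
    system_prompt: str = ""            # lever 1: starts empty, as in the workshop
    model: str = "placeholder-model"   # lever 2: substitute a real model string
    tools: tuple[str, ...] = ()        # lever 3: names of MCP tools or custom skills


baseline = AgentConfig()

# Change exactly one lever per experiment so a score difference can be attributed.
with_prompt = replace(baseline, system_prompt="Diamonds are deep underground. Dig down before searching.")
with_tool = replace(baseline, tools=("hypothetical_dig_tool",))
```

Making the configuration immutable and deriving each variant with replace keeps the baseline intact, which makes it easy to revert a change that did not help.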

3. Hill Climbing on Evals: The Core Improvement Loop

Ben describes hill climbing on evals as “how we improve all of our agents internally” at Anthropic. The process is straightforward: measure the agent’s performance on a defined metric, make a change to the configuration, measure again, keep the change if it improved performance, discard it if it did not. Repeat. Strictly speaking, this is a greedy local search rather than gradient ascent: no gradient is computed, and each step simply compares two measured scores in agent configuration space.
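The loop described above can be sketched in a few lines. This is a generic illustration, not the workshop's harness: the function name hill_climb, its parameters, and the toy scoring function are all assumptions made for the example.

```python
from typing import Callable, Iterable, TypeVar

C = TypeVar("C")


def hill_climb(
    start: C,
    candidates: Iterable[Callable[[C], C]],
    evaluate: Callable[[C], float],
) -> tuple[C, float]:
    """Greedy hill climbing: keep a change only if it strictly improves the score."""
    best, best_score = start, evaluate(start)
    for change in candidates:
        trial = change(best)           # apply one change to the current best
        trial_score = evaluate(trial)  # measure again
        if trial_score > best_score:   # keep improvements, discard everything else
            best, best_score = trial, trial_score
    return best, best_score


# Toy stand-in for an eval: the "configuration" is a number and the score peaks at 7.
def toy_score(x: int) -> float:
    return -abs(x - 7)


best, score = hill_climb(0, [lambda x: x + 5, lambda x: x + 5, lambda x: x + 2], toy_score)
```

In the toy run, the first change is kept, the second overshoots and is discarded, and the third lands on the peak. With a real agent, evaluate would launch an eval run and return its score, and each candidate would be a single edit to the prompt, model, or tool list.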

The key insight is that this process requires a fast, reliable measurement. If each evaluation takes twenty minutes and produces noisy results, the feedback loop is too slow and too uncertain to drive meaningful improvement within the workshop’s time constraint — or within the timeline of most real agent development projects. The one-minute eval set in the workshop is designed precisely for this: fast enough to run ten or more times in the 35-minute window, reliable enough that a genuine improvement registers consistently.

In a competitive setting, the speed of the eval loop itself becomes a competitive advantage. Someone who can run fifteen configure-eval-update cycles in 35 minutes will explore much more of the configuration space than someone running three. A team that builds a fast, reliable eval first, before optimizing the agent, will usually outperform a team that spends all its time on the agent without building the measurement infrastructure.
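The effect of eval duration on iteration count is simple arithmetic. The sketch below assumes, purely for illustration, that each cycle also costs one minute of editing time; the function name and that one-minute figure are not from the workshop.

```python
def cycles_in_window(window_min: float, eval_min: float, edit_min: float) -> int:
    """Number of complete edit-then-eval cycles that fit in the window."""
    return int(window_min // (eval_min + edit_min))


# 35-minute window, one minute of editing per cycle (assumed).
fast = cycles_in_window(35, eval_min=1, edit_min=1)  # one-minute eval
slow = cycles_in_window(35, eval_min=5, edit_min=1)  # five-minute run
```

Under these assumptions the one-minute eval allows 17 cycles and the five-minute run allows 5, which is why the short eval is the one to iterate on.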

4. What Separates Winning Architectures

The live leaderboard in the workshop creates a natural experiment: thirty-plus participants, same environment, same time window, dramatically different results. The fact that 19 diamonds seemed like the ceiling before someone broke through it suggests that the problem has structure — there are better and worse approaches to diamond mining that can be discovered through systematic exploration.

What separates high-performing agents from low-performing ones in this kind of benchmark typically comes down to three things. First, strategic specificity in the system prompt: rather than generic instructions (“mine for diamonds”), high-performing prompts encode domain knowledge about how to find diamonds efficiently (go deep, avoid surface mining, use specific block patterns to identify diamond-containing layers). Second, appropriate tool use: an agent that knows which MCP tools to call in which sequence, and when to switch strategies, will outperform one relying on default behavior. Third, failure recovery: an agent that gets stuck (runs into a wall, falls in lava, runs out of tools) and can recover gracefully will accumulate more diamonds over a five-minute run than one that terminates early.

The winner who broke 19 diamonds “with only one minute twenty seconds to go” presumably did so through a combination of all three. Their system prompt probably encoded a specific diamond-hunting strategy. Their tool use was probably economical enough to keep token consumption low, which matters because ties on the leaderboard are broken by token efficiency. And their agent probably handled the failures that others got stuck on.

Check your understanding

1. What are the three main learning objectives of the agent battle workshop?

The workshop has three explicit objectives: (1) build and deploy a managed agent, (2) understand the impact of agent configuration, and (3) learn to hill climb on evals, the core loop for iterative agent improvement.

2. How are ties broken in the agent battle, and what does this reveal about agent design?

Ties are broken by token efficiency (best diamonds-to-tokens ratio), which directly incentivizes honing system prompts for precision rather than using verbose instructions or heavy models unnecessarily.

3. Why does the workshop use two different evaluation modes, a fast one-minute eval and a five-minute production run?

The two-tier design serves different purposes: the one-minute eval enables fast iteration (run many cycles of configure-evaluate-update), while the five-minute production run provides more reliable final measurement. Fast feedback loops drive improvement; longer runs reduce variance.

4. What does 'hill climbing on evals' mean in agent development?

Hill climbing on evals is a greedy, measure-and-compare search over agent configurations: measure the current performance, change one thing, measure again, and keep the change only if it improved results. This is described as how Anthropic's applied AI team improves its agents internally.

5. What does the workshop's fixed environment (same seed, same start kit) reveal about good agent benchmarks?

The fixed starting conditions mean that when one agent gets more diamonds than another, you know it's because of different configuration (system prompt, model, tools), not because one agent got a luckier starting position. Reproducibility is essential for eval-driven improvement.

Build it yourself

Follow these steps to apply the same process to your own agent. Estimated time: about 60 minutes.

Prerequisites

Access to Claude API or Claude Code
A clearly defined, measurable task for your agent (not Minecraft, but something with a numeric score)

Step 1 — Define your evaluation metric

Before writing any agent code, specify exactly how you will measure success. The metric must be: (a) numeric, (b) computed automatically without human judgment, and (c) reproducible from a fixed starting state. Write this down before touching any code.
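One way to check a metric against criteria (a) to (c) is to write it as a pure function over fixed data. The task here (field extraction from an invoice), the function name score_run, and the sample values are hypothetical and stand in for whatever your agent actually does.

```python
def score_run(expected: dict[str, str], produced: dict[str, str]) -> int:
    """Count the fields the agent got exactly right.

    Numeric, computed without human judgment, and deterministic for fixed inputs.
    """
    return sum(1 for key, value in expected.items() if produced.get(key) == value)


# Fixed starting state: the same expected answers on every run (illustrative values).
EXPECTED = {"invoice_id": "A-17", "total": "42.00", "currency": "EUR"}

score = score_run(EXPECTED, {"invoice_id": "A-17", "total": "42.00", "currency": "USD"})
```

Here the agent's output matches two of the three expected fields, so the score is 2. If the metric cannot be written as a function like this, it probably still depends on human judgment.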

Step 2 — Build the fast eval loop first

Create a script that runs your agent and returns the score in under two minutes. Do not optimize the agent yet — just get the measurement infrastructure working. A fast, reliable eval is the most important tool in agent development.
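A minimal eval runner might look like the following sketch. It assumes the agent can be started as a command-line process that prints its score as the last line of standard output; that convention, the function name run_eval, and the choice to score a timed-out or failed run as zero are all assumptions for this example.

```python
import subprocess
import sys


def run_eval(command: list[str], time_limit_s: float = 120.0) -> float:
    """Run the agent as a subprocess and parse its score from the last stdout line.

    Returns 0.0 if the run exceeds the time limit, exits with an error,
    or prints nothing that parses as a number.
    """
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=time_limit_s
        )
    except subprocess.TimeoutExpired:
        return 0.0
    lines = done.stdout.strip().splitlines()
    if done.returncode != 0 or not lines:
        return 0.0
    try:
        return float(lines[-1])
    except ValueError:
        return 0.0


# Stand-in "agent" that just prints a score; replace with your own entry point.
score = run_eval([sys.executable, "-c", "print(3)"])
```

Enforcing the time limit in the harness rather than trusting the agent to stop keeps every run comparable and guarantees the loop stays under the two-minute budget.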

Step 3 — Start with an empty system prompt

Run your agent with an empty or minimal system prompt and record the baseline score. This is your starting point on the hill you are about to climb.

Step 4 — Make one change at a time

Add one piece of domain knowledge to your system prompt, change the model, or add one tool. Run the eval. If the score went up, keep the change. If it went down or stayed the same, revert. Never change two things at once — you cannot attribute the result.
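The one-change rule can be enforced mechanically by comparing the trial configuration to the current best before running it. The Config class and both helper functions below are hypothetical, written only to illustrate the check.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical agent configuration; field names are illustrative."""
    system_prompt: str = ""
    model: str = "placeholder-model"
    tools: tuple[str, ...] = ()


def changed_fields(old: Config, new: Config) -> list[str]:
    """Names of the fields whose values differ between two configurations."""
    return [f.name for f in fields(old) if getattr(old, f.name) != getattr(new, f.name)]


def assert_single_change(old: Config, new: Config) -> None:
    """Refuse to run an experiment that changes anything other than exactly one field."""
    changed = changed_fields(old, new)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one change, got {changed}")


base = Config()
trial = replace(base, system_prompt="Dig down before searching.")
assert_single_change(base, trial)  # passes: only system_prompt differs
```

This check works at the level of whole fields, so it will not catch a prompt edit that adds two pieces of knowledge at once. That part still takes discipline.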

Step 5 — Track token efficiency

After each successful configuration change, note both the score and the approximate tokens consumed per run. A change that improves your score by 10% while doubling token consumption may not be worth it. Track both dimensions.
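Tracking both dimensions can be as simple as recording a score and a token count per run and deriving a ratio. The record type and the numbers below are made up for illustration; they mirror the case of a 10% score gain that doubles token consumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    label: str
    score: float
    tokens: int

    @property
    def score_per_1k_tokens(self) -> float:
        """Score earned per thousand tokens consumed (0.0 if no tokens were recorded)."""
        return 1000 * self.score / self.tokens if self.tokens else 0.0


# Illustrative numbers only, not measurements.
runs = [
    RunRecord("baseline", score=10, tokens=50_000),
    RunRecord("longer prompt", score=11, tokens=100_000),
]

# Rank by score first, then use efficiency to break ties.
ranked = sorted(runs, key=lambda r: (r.score, r.score_per_1k_tokens), reverse=True)
```

In this example the longer prompt ranks first on raw score, but its efficiency falls from 0.2 to 0.11 points per thousand tokens. Whether that trade is worth it depends on how your task weighs cost against score.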

Step 6 — Run your best configuration for final scoring

When time is up (or you have converged), run your highest-scoring configuration on a longer evaluation (five minutes or more) to get a more reliable final measurement. Compare to your baseline — the gap is the value of your systematic hill-climbing process.
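If time allows, repeating the longer evaluation a few times for both the baseline and the final configuration gives a sense of run-to-run spread as well as the gap. Repetition is a suggestion beyond the steps above, and the scores in this sketch are placeholders, not results.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of repeated runs (needs at least two scores)."""
    return mean(scores), stdev(scores)


# Placeholder numbers for illustration only; substitute your own measurements.
baseline_scores = [2.0, 3.0, 4.0]
final_scores = [8.0, 9.0, 10.0]

base_mean, base_sd = summarize(baseline_scores)
final_mean, final_sd = summarize(final_scores)
gain = final_mean - base_mean
```

A gain that is large relative to the standard deviations is more likely to reflect a real improvement than a lucky run.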
