AI Learning
intermediate ⏱️ 18 min read · 🎬 ~45 min video

Tool, Skill, or Subagent? Decomposing an Agent

The decision framework for knowing when agent logic belongs in a tool, a skill, or a subagent — illustrated through a live decomposition of a 400-line inventory agent.

This lesson is original educational writing based on this video by Anthropic (published May 23, 2026). All credit for the original content goes to the creators.

#agentic-workflows #best-practices

1. The problem: your agent grew up

You built an agent. It worked. You added features. Then more features. Somewhere around the third rewrite of your system prompt you started to notice the agent getting slower, stranger, and harder to debug. Sound familiar?

This is the natural lifecycle of an agent that hasn’t been deliberately structured. The symptoms are predictable:

  • A system prompt that reads like a legal document
  • Tools that contain branching business logic
  • Eval scores that plateau in the low 80s and resist improvement
  • Changes in one area mysteriously breaking behaviour in another

The StockPilot inventory management agent — the case study in this workshop — exhibited every one of these symptoms before decomposition. Its starting state was a 400-line system prompt, 12 tools (three of which were thin wrappers calling sub-processes), and an eval score of 83%. Not bad. Not good enough. And impossible to reason about.

The workshop walks through a full decomposition into three clean primitives: tools, skills, and subagents. Each has a specific job. Mixing them is the root cause of most overgrown agents.

2. The StockPilot case study

StockPilot manages inventory for a retail operation: it fields supplier queries, checks stock levels, generates purchase orders, and escalates edge cases. A perfectly reasonable agent scope — but the implementation had collapsed all of its logic into one place.

Before: the monolith

The 400-line system prompt contained three distinct categories of content, all mixed together:

  1. Identity and tone — what the agent is, how it speaks (about 20 lines, appropriate)
  2. Business rules — reorder thresholds, supplier priority tiers, approval workflows, escalation criteria (about 300 lines — the problem)
  3. Procedural instructions — step-by-step sequences for handling specific situation types (about 80 lines, also problematic)

The 12 tools included a check_reorder_eligibility tool that, on inspection, contained the entire reorder decision tree in its implementation. It was not fetching data — it was making decisions. That is not a tool’s job.

The decomposition journey

Phase 1 — Strip to primitives. The first move felt counterintuitive: reduce the system prompt to 15 lines, replace every complex tool with the three human-like primitives (bash, file read/write), and move all business logic out of both. The result? Eval score: 62%.

That temporary regression is the most instructive data point in the whole workshop. A 15-line system prompt is too sparse: the agent lost the knowledge it needed to act correctly. The goal is not a minimal system prompt but clean separation. The business knowledge didn’t disappear; it moved into skills, and the skills hadn’t been wired up yet.

Phase 2 — Restore with structure. Business rules went into skill files, loaded on demand. Procedural sequences became structured skill playbooks. The system prompt grew back to about 40 lines — but now every line in it was genuinely architectural, not situational. Final eval: 92%.

Before (eval: 83%)
  • 400-line system prompt
      - Identity + tone (~20 lines): appropriate
      - Business rules (~300 lines): always loaded, bloats context
      - Procedures (~80 lines): mixed with identity
  • 12 tools (3 = logic, not I/O)

After (eval: 92%)
  • System prompt: ~40 lines, identity only
  • Tools: bash, read, write + 3 custom
  • Skills: loaded on demand, business rules + playbooks
  • Subagents: parallel review + escalation

StockPilot before and after: a 400-line monolith decomposed into a lean system prompt, targeted tools, on-demand skills, and purposeful subagents.

3. Tools — for external interactions

A tool is anything that reaches outside the agent’s reasoning loop and touches the world: databases, APIs, file systems, browsers, code execution, search engines.

The defining test: if removing the tool means the agent can no longer interact with some external system, it is a tool. If removing it just means the agent doesn’t know something or can’t follow a procedure, it is not a tool.

Start with human-like primitives

Before reaching for custom tools, reach for the ones that mirror how a skilled human works:

  • bash — run shell commands, invoke scripts, call CLIs
  • File read / write — access and update persistent state
  • Web search — retrieve current information
  • Code execution — compute, transform, validate
  • Todo list — structured task tracking across a long session

These primitives compose into almost anything. The workshop recommendation: add a custom tool only when an eval shows that the primitive version falls short. Every additional tool adds complexity to the agent’s decision space.

What belongs in a tool

  • Database queries that return data for the agent to reason about
  • API calls to external services (payment processors, shipping APIs, supplier portals)
  • File reads that retrieve current state
  • Code execution for computation that would be unreliable as text reasoning

What does not belong in a tool

The most common mistake in the StockPilot case was check_reorder_eligibility. It was called like a tool but behaved like a decision engine: it took a product ID, internally applied the reorder threshold rules, checked supplier priority tiers, and returned a yes/no. That is business logic. Business logic belongs in skills.

Tools should be stateless operations with well-defined inputs and outputs. When possible, they should be deterministic. The agent reasons; tools fetch and execute.
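
One way to keep a tool stateless with well-defined inputs and outputs is to pair a plain handler function with an explicit declaration of what it accepts. The sketch below is illustrative only: the `Tool` dataclass, the `dispatch` helper, the `read_inventory` handler, and the in-memory `_INVENTORY` data are hypothetical stand-ins, not StockPilot code or any particular agent SDK.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical in-memory stand-in for an inventory database.
_INVENTORY = {
    "SKU-001": {"current_stock": 25, "turnover_class": "fast-moving", "monthly_volume": 200},
}


@dataclass(frozen=True)
class Tool:
    """A stateless operation: name, description, required inputs, handler."""
    name: str
    description: str
    required_inputs: tuple[str, ...]
    handler: Callable[..., dict[str, Any]]


def read_inventory(sku: str) -> dict[str, Any]:
    """Fetch raw inventory data. No thresholds or decisions are applied here."""
    record = _INVENTORY.get(sku)
    if record is None:
        return {"sku": sku, "found": False}
    return {"sku": sku, "found": True, **record}


READ_INVENTORY = Tool(
    name="read_inventory",
    description="Return current stock, turnover class, and monthly volume for a SKU.",
    required_inputs=("sku",),
    handler=read_inventory,
)


def dispatch(tool: Tool, arguments: dict[str, Any]) -> dict[str, Any]:
    """Check the arguments against the tool's declared inputs, then call it."""
    missing = [key for key in tool.required_inputs if key not in arguments]
    if missing:
        return {"error": f"missing required input(s): {', '.join(missing)}"}
    return tool.handler(**arguments)


if __name__ == "__main__":
    print(dispatch(READ_INVENTORY, {"sku": "SKU-001"}))
    print(dispatch(READ_INVENTORY, {}))
```

The handler returns data and nothing else; whether 25 units is "low" is left to the agent, guided by a skill.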

MCP: defer until necessary

Model Context Protocol lets you expose tools to multiple clients through a standardised interface. It is genuinely useful — but it adds overhead. The workshop guidance is direct: do not introduce MCP until you have multiple clients that need the same tools. For a single-agent, single-codebase setup, MCP is premature infrastructure.

4. Skills — for progressive disclosure of knowledge

Skills are the primitive most developers under-use. They solve a specific problem: your agent needs deep domain knowledge, but not all of it at the same time.

In the StockPilot case, the business rules (reorder thresholds, supplier tiers, approval workflows, escalation criteria) were stuffed into the system prompt because the agent needed them. But it didn’t need all of them on every turn. It needed reorder rules when handling reorder queries, supplier priority rules when selecting vendors, approval workflows when generating purchase orders above a certain value.

The key insight: on-demand loading

Skills are loaded into context when a situation calls for them, not at startup. This is the mechanism that makes them different from “a longer system prompt”. A 300-line block of business rules loaded on every turn occupies the context window on every turn. The same 300 lines as skills, loaded only when relevant, preserve that context for the actual task at hand.

Think of skills as the playbook. When a situation type arises, the agent picks up the relevant playbook section, reads the rules that apply, and acts accordingly. Once that situation resolves, those rules leave the context.
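
On-demand loading can be sketched with nothing more than a directory of files. In the hypothetical example below, only a short index of skill names sits in the system prompt, and the full text is read when the agent asks for it. The function names `build_skill_index` and `load_skill`, and the one-file-per-skill layout, are assumptions made for illustration; real agent frameworks may handle this differently.

```python
import tempfile
from pathlib import Path


def build_skill_index(skills_dir: Path) -> str:
    """Return one line per skill: its name and first heading.

    This short index is what would live in the system prompt
    in place of the full rule text.
    """
    lines = []
    for path in sorted(skills_dir.glob("*.md")):
        text = path.read_text(encoding="utf-8").strip()
        first_line = text.splitlines()[0] if text else ""
        summary = first_line.lstrip("# ").strip()
        lines.append(f"- {path.stem}: {summary}")
    return "\n".join(lines)


def load_skill(skills_dir: Path, name: str) -> str:
    """Read the full text of one skill, only when a situation calls for it."""
    path = skills_dir / f"{name}.md"
    if not path.is_file():
        raise FileNotFoundError(f"no skill named {name!r} in {skills_dir}")
    return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skills = Path(tmp)
        (skills / "reorder-thresholds.md").write_text(
            "## Reorder threshold rules\n\n- Standard: reorder at 10%\n",
            encoding="utf-8",
        )
        print(build_skill_index(skills))                  # always in context
        print(load_skill(skills, "reorder-thresholds"))   # loaded on demand
```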

What belongs in a skill

  • Business rules and thresholds specific to your domain (reorder levels, discount tiers)
  • Domain knowledge that took expertise to encode and changes infrequently
  • Step-by-step procedures for handling specific situation types
  • Compliance requirements, escalation matrices, approval hierarchies

What does not belong in a skill

Skills that make API calls or touch external systems are tools in disguise. A skill should contain knowledge and reasoning patterns, not I/O operations. If your skill file says “call the supplier API and check availability”, that API call should be a tool invocation inside a skill-guided procedure, not logic embedded in the skill itself.

supplier-query.skill.md
------------------------
When a supplier query arrives:
1. Read the current stock level for the SKU (use read_inventory tool)
2. Apply the reorder threshold for this product category:
   - Fast-moving (turnover > 30/month): reorder at 15% of monthly volume
   - Standard: reorder at 10%
   - Slow-moving: reorder at 20%, flag for manual review
3. If reorder is indicated, select the primary supplier using the tier matrix below...

The skill contains the threshold knowledge. The tool call retrieves the data. The agent applies the skill’s reasoning with the tool’s data.
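
To write eval cases for this skill you need known correct answers, so it can help to work the arithmetic once outside the agent. The sketch below is a hypothetical reference calculation for building expected answers, not something to put back into a tool. It assumes "reorder at N%" means "reorder when stock falls below N% of monthly volume", and that the manual-review flag applies only when a slow-moving item actually triggers a reorder; both readings are assumptions.

```python
# Threshold percentages taken from the skill excerpt above.
REORDER_PERCENT = {"fast-moving": 15, "standard": 10, "slow-moving": 20}


def expected_reorder(current_stock: int, monthly_volume: int, turnover_class: str) -> dict:
    """Compute the answer a correct agent should reach for one eval case."""
    threshold_units = REORDER_PERCENT[turnover_class] * monthly_volume / 100
    reorder = current_stock < threshold_units
    return {
        "threshold_units": threshold_units,
        "reorder": reorder,
        # Assumption: only flag slow-moving items when a reorder is triggered.
        "manual_review": reorder and turnover_class == "slow-moving",
    }


if __name__ == "__main__":
    # Fast-moving, 200 units/month: threshold is 15 * 200 / 100 = 30 units.
    # Stock of 25 is below 30, so a reorder is expected.
    print(expected_reorder(current_stock=25, monthly_volume=200, turnover_class="fast-moving"))
```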

5. Subagents — for parallelism and fresh perspective

Subagents are full agent invocations spawned by the main agent. They are powerful and expensive — use them for the two specific situations where they are genuinely superior:

When to use a subagent

1. Parallel independent exploration. When you have multiple independent lines of work that can proceed simultaneously, spawning subagents is the agentic equivalent of parallelising a task. “Check these five supplier contracts for non-standard clauses” can become five concurrent subagent calls, each examining one contract, with results merged by the main agent.

2. Fresh perspective / review. A subagent that hasn’t read your entire conversation context notices different things than the main agent. This is architecturally useful: after the main agent drafts a purchase order, a review subagent that sees only the purchase order and the approval criteria can catch errors that the main agent, primed by its own accumulated context, tends to overlook.

The workshop describes this as “throwing lots of Claude at a problem” — not as a workaround for limited intelligence, but as deliberate use of independent perspectives and parallel execution.
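
The fan-out pattern from the contract example can be sketched with `asyncio.gather`. Everything domain-specific here is hypothetical: `run_subagent` is a stub that stands in for a real subagent invocation, and its "flagging" is a trivial string check used only so the example runs.

```python
import asyncio

BRIEFING = (
    "Check this supplier contract for non-standard clauses. "
    "Return the contract ID and whether anything was flagged."
)


async def run_subagent(briefing: str, contract_id: str, contract_text: str) -> dict:
    """Hypothetical stand-in for a real subagent call.

    A real implementation would start a fresh agent with only the
    briefing and this one contract in its context.
    """
    await asyncio.sleep(0.01)  # simulate the latency of a model call
    return {"contract_id": contract_id, "flagged": "NON-STANDARD" in contract_text}


async def review_contracts(contracts: dict[str, str]) -> list[dict]:
    """Spawn one subagent per contract and run them concurrently."""
    tasks = [run_subagent(BRIEFING, cid, text) for cid, text in contracts.items()]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    sample = {
        "contract-A": "Standard 30-day payment terms.",
        "contract-B": "NON-STANDARD: supplier may change prices without notice.",
    }
    for result in asyncio.run(review_contracts(sample)):
        print(result)
```

Each subagent sees one contract and the briefing, nothing else; the main agent only merges the returned results.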

The handoff requirement

A subagent is only valuable if it has a clear, self-contained task with a well-defined output. If you cannot write a two-sentence briefing for the subagent that fully describes its job, the decomposition is premature. The right handoff point is where the task can be stated without the surrounding conversation and the expected output is concrete enough for the main agent to act on.
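
The two-sentence test can be made concrete by writing the handoff down as data before spawning anything. The `Handoff` class below is a hypothetical sketch: its sentence count is a crude heuristic (it splits on `.`, `!`, and `?`), and the rendered briefing deliberately contains only the task, the named inputs, and the expected output, with no conversation history.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    task: str                # what the subagent must do, in one or two sentences
    inputs: dict[str, str]   # only the material the subagent needs
    output_format: str       # what the main agent expects back

    def problems(self) -> list[str]:
        """Return reasons this handoff is not ready; an empty list means OK."""
        issues = []
        sentences = [s for s in re.split(r"[.!?]+", self.task) if s.strip()]
        if not sentences:
            issues.append("task is empty")
        elif len(sentences) > 2:
            issues.append("task briefing is longer than two sentences")
        if not self.output_format.strip():
            issues.append("no output format defined")
        return issues

    def render(self) -> str:
        """Build the subagent's entire starting context from this handoff."""
        sections = [f"Task: {self.task}"]
        for name, content in self.inputs.items():
            sections.append(f"{name}:\n{content}")
        sections.append(f"Return: {self.output_format}")
        return "\n\n".join(sections)


if __name__ == "__main__":
    review = Handoff(
        task="Review the purchase order against the approval criteria. Report each violation.",
        inputs={
            "Purchase order": "PO draft: 40 units of SKU-001",
            "Approval criteria": "Orders above a set value need manager sign-off.",
        },
        output_format="A list of violations, or the word OK.",
    )
    print(review.problems())
    print(review.render())
```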

Observability

Claude Managed Agents includes a “callable agents” feature that gives native observability into subagent chains — you can see what each subagent was asked, what it returned, and where the chain branched. This is important for debugging multi-agent systems in production. Before building a custom orchestration layer, check whether the managed infrastructure already gives you what you need.

6. The decision framework

With all three primitives understood, the decision becomes a flowchart applied to each piece of logic in your agent.

Piece of agent logic: what should this become?

  1. Does it need to interact with an external system?
       Yes → Tool (I/O + execution)
       No  → next question
  2. Is it specialised knowledge only needed sometimes?
       Yes → Skill (on-demand knowledge)
       No  → next question
  3. Does it need parallel execution or a fresh, independent review?
       Yes → Subagent (parallel + fresh eyes)
       No  → Belongs in system prompt or inline reasoning

The tool-skill-subagent decision framework: three questions that route any piece of agent logic to the right primitive.

Applying the framework

Walk every non-trivial piece of logic in your system prompt or tool implementations through these three questions in order:

  1. Does it interact with an external system? If it calls a database, reads a file, hits an API, executes code — it is a tool. Move it.

  2. Is it specialised knowledge needed only in some situations? Business rules, domain thresholds, procedural playbooks — it is a skill. Move it and define the trigger condition.

  3. Does it benefit from parallel execution or independent review? Fan-out tasks, review steps, validation passes with fresh context — it is a subagent. Define the handoff clearly.

  4. None of the above? Then it is genuinely architectural context and belongs in the system prompt. After running this process, your system prompt should be short.
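
The ordering of the questions matters, which is easy to see when the framework is written as a function. This is a minimal sketch: the `LogicPiece` fields and the `route` function are hypothetical names, and answering the three yes/no questions for a real piece of logic still takes human judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Primitive(Enum):
    TOOL = "tool"
    SKILL = "skill"
    SUBAGENT = "subagent"
    SYSTEM_PROMPT = "system prompt or inline reasoning"


@dataclass(frozen=True)
class LogicPiece:
    description: str
    touches_external_system: bool = False
    situational_knowledge: bool = False
    needs_parallel_or_fresh_review: bool = False


def route(piece: LogicPiece) -> Primitive:
    """Apply the three questions in order; the first 'yes' wins."""
    if piece.touches_external_system:
        return Primitive.TOOL
    if piece.situational_knowledge:
        return Primitive.SKILL
    if piece.needs_parallel_or_fresh_review:
        return Primitive.SUBAGENT
    return Primitive.SYSTEM_PROMPT


if __name__ == "__main__":
    pieces = [
        LogicPiece("query stock level from the database", touches_external_system=True),
        LogicPiece("reorder thresholds by category", situational_knowledge=True),
        LogicPiece("independent review of a drafted purchase order",
                   needs_parallel_or_fresh_review=True),
        LogicPiece("agent identity and tone"),
    ]
    for piece in pieces:
        print(f"{piece.description} -> {route(piece).value}")
```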

System prompt philosophy

Start sparse. Give the agent identity, scope, and the three human-like primitives. Add complexity only when evals show you need it. The temptation is to pre-load everything the agent might ever need; the practice that works is to add things incrementally, measuring after each addition whether the eval score actually improves.

The StockPilot experience is a useful calibration: a 15-line system prompt scored 62%, while a 40-line system prompt with well-structured skills scored 92%. The additional 25 lines were architectural facts the agent needed at all times. The remaining 360 or so lines from the original became skills.

Check your understanding

5 questions

  1. The StockPilot agent temporarily dropped from 83% to 62% after Phase 1. What does this tell you?

  2. A tool in your agent applies reorder threshold rules and supplier priority tiers to return a yes/no reorder recommendation. What should you do?

  3. What is the primary architectural advantage of skills over a longer system prompt?

  4. Which of the following is a valid reason to spawn a subagent?

  5. When should you introduce MCP (Model Context Protocol) into your agent architecture?

Build it yourself

Follow these steps to apply the same decomposition to your own agent · estimated time: ~30 minutes

Prerequisites

  • An existing agent with a system prompt longer than 100 lines
  • An eval suite you can run to measure performance (even a small one)
  • A way to load skill files into the agent context (file read tool or equivalent)

Step 1 — Run your current eval baseline

Before changing anything, record your current eval score. You need a baseline to know whether decomposition is helping or hurting at each step.

# Example: run your eval suite and capture the score
python3 run_evals.py --agent current --output baseline.json

If you do not have an eval suite, build a minimal one first: 10–20 representative inputs with known correct outputs. Decomposing without evals is flying blind.
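
A minimal eval harness can be very small. The sketch below is an assumption about what such a harness might look like, not the workshop's `run_evals.py`: `run_evals` scores by case-insensitive exact match, it expects each prompt to be unique, and `stub_agent` is a placeholder for a call to your real agent.

```python
from typing import Callable


def run_evals(agent: Callable[[str], str], cases: list[tuple[str, str]]) -> dict:
    """Run each (input, expected output) pair and report the pass rate."""
    if not cases:
        raise ValueError("need at least one eval case")
    results = {}
    for prompt, expected in cases:
        answer = agent(prompt)
        # Assumption: exact match after trimming and lowercasing is good enough.
        results[prompt] = answer.strip().lower() == expected.strip().lower()
    return {"score": sum(results.values()) / len(cases), "results": results}


def stub_agent(prompt: str) -> str:
    """Placeholder agent; replace with a call to the agent under test."""
    return "reorder" if "stock 25" in prompt else "no action"


CASES = [
    ("SKU-001: stock 25, reorder below 30", "reorder"),
    ("SKU-002: stock 90, reorder below 30", "no action"),
    ("SKU-003: stock 10, reorder below 30", "reorder"),
]

if __name__ == "__main__":
    report = run_evals(stub_agent, CASES)
    print(f"score: {report['score']:.0%}")
    for prompt, passed in report["results"].items():
        print("PASS" if passed else "FAIL", prompt)
```

Keeping per-case results, not just the overall score, makes it possible to see which cases a later change broke.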

Step 2 — Audit your system prompt by category

Copy your system prompt into a scratch file. Label every section with one of four tags:

  • [IDENTITY] — who the agent is, its scope, its tone
  • [BUSINESS] — domain rules, thresholds, decision criteria, approval matrices
  • [PROCEDURE] — step-by-step sequences for specific situation types
  • [TOOL-LOGIC] — anything that describes what a tool should decide
You are StockPilot, an inventory agent for...          [IDENTITY]
Reorder thresholds by category:                        [BUSINESS]
  - Fast-moving: reorder at 15% of monthly volume      [BUSINESS]
When a supplier query arrives:                         [PROCEDURE]
  1. Check current stock                               [PROCEDURE]
check_reorder_eligibility: returns yes if...           [TOOL-LOGIC]
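
Once the scratch file is tagged, counting lines per tag shows how much of the prompt is a candidate for extraction. This is a small hypothetical helper; it assumes each tag sits at the end of its line, as in the example above, and counts any non-blank line without a tag as `UNTAGGED`.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"\[(IDENTITY|BUSINESS|PROCEDURE|TOOL-LOGIC)\]\s*$")


def count_tags(annotated_prompt: str) -> Counter:
    """Count non-blank lines per trailing tag."""
    counts = Counter()
    for line in annotated_prompt.splitlines():
        if not line.strip():
            continue
        match = TAG_PATTERN.search(line)
        counts[match.group(1) if match else "UNTAGGED"] += 1
    return counts


SAMPLE = """\
You are StockPilot, an inventory agent for...          [IDENTITY]
Reorder thresholds by category:                        [BUSINESS]
  - Fast-moving: reorder at 15% of monthly volume      [BUSINESS]
When a supplier query arrives:                         [PROCEDURE]
  1. Check current stock                               [PROCEDURE]
check_reorder_eligibility: returns yes if...           [TOOL-LOGIC]
"""

if __name__ == "__main__":
    for tag, lines in count_tags(SAMPLE).most_common():
        print(f"{tag}: {lines}")
```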

Step 3 — Extract business rules into skill files

Create a skills/ directory. For each [BUSINESS] section, create a named skill file:

mkdir -p skills

# skills/reorder-thresholds.md

## Reorder threshold rules

Apply these thresholds when evaluating whether to initiate a reorder:

- Fast-moving (turnover > 30 units/month): reorder when stock falls below 15% of monthly volume
- Standard: reorder at 10%
- Slow-moving (turnover < 5 units/month): reorder at 20%, flag for manual review

To determine turnover category, read the product's `turnover_class` field from inventory.

Remove the corresponding lines from your system prompt. Add a brief reference: “When evaluating reorder eligibility, load the reorder-thresholds skill.”

Step 4 — Extract procedures into skill playbooks

For each [PROCEDURE] section, create a playbook skill:

# skills/supplier-query-playbook.md

## Handling a supplier query

1. Identify the SKU(s) referenced in the query
2. Read current inventory levels (use read_inventory tool)
3. Apply reorder threshold rules (load reorder-thresholds skill)
4. If reorder is indicated: select supplier using supplier-tier skill, generate PO draft
5. If unclear escalation: load escalation-criteria skill before deciding

Step 5 — Fix tools that contain business logic

Find every tool where the implementation makes a domain decision. Refactor: the tool fetches raw data, the agent applies skill-guided reasoning to that data.

# Before: tool makes the decision
def check_reorder_eligibility(sku: str) -> bool:
    stock = get_stock_level(sku)
    threshold = get_threshold_for_category(sku)  # business logic here
    return stock < threshold

# After: tool returns raw data only
def get_inventory_status(sku: str) -> dict:
    return {
        "sku": sku,
        "current_stock": get_stock_level(sku),
        "turnover_class": get_turnover_class(sku),
        "monthly_volume": get_monthly_volume(sku),
    }

Step 6 — Run evals and iterate

Run your eval suite after each change. Expect a temporary dip if you extracted a lot at once. Add skills back in as needed, following the eval signal rather than intuition.

python3 run_evals.py --agent decomposed --output phase1.json
python3 compare_evals.py baseline.json phase1.json
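
A comparison step is most useful when it reports which cases regressed, not just the score difference. The sketch below shows one way a script like `compare_evals.py` might work; it is not the workshop's script. It assumes each JSON file has the shape `{"score": <float>, "results": {"<case id>": <true|false>}}`, which is an assumption made for this example.

```python
import json
from pathlib import Path


def compare(baseline: dict, candidate: dict) -> dict:
    """Report the score change and the cases that flipped between two runs."""
    base = baseline["results"]
    cand = candidate["results"]
    shared = sorted(set(base) & set(cand))
    return {
        "score_delta": candidate["score"] - baseline["score"],
        "regressions": [c for c in shared if base[c] and not cand[c]],
        "fixes": [c for c in shared if not base[c] and cand[c]],
    }


def compare_files(baseline_path: str, candidate_path: str) -> dict:
    """Load two eval result files (assumed JSON shape above) and compare them."""
    baseline = json.loads(Path(baseline_path).read_text(encoding="utf-8"))
    candidate = json.loads(Path(candidate_path).read_text(encoding="utf-8"))
    return compare(baseline, candidate)


if __name__ == "__main__":
    before = {"score": 0.5, "results": {"case-1": True, "case-2": False}}
    after = {"score": 0.5, "results": {"case-1": False, "case-2": True}}
    # Same score, but one case regressed and another was fixed.
    print(compare(before, after))
```

An unchanged score can hide a regression in one area offset by a fix in another, which is why the per-case lists matter.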

Expected result: after completing all steps, your system prompt should be under 50 lines and contain only [IDENTITY] content. Your eval score should exceed the original baseline. If a step causes a large regression, check whether a skill is missing its trigger condition or whether a tool is missing data the agent previously inferred from business rules.
