Agent Harness Engineering: Chasing Friction
AirOps's hard-won lessons from shipping Claude agents to non-technical enterprise users: intentional scoping, specialized tools over primitive exploration, and sub-agents for context isolation.
This lesson is original educational writing based on this video by Anthropic (published May 22, 2026). All credit for the original content goes to the creators.
1. The challenge: agents for non-technical users
AirOps builds a growth marketing platform that helps companies optimize for AI search. When Claude agents became powerful enough to replace their node-based workflow builder, Dylan and his team made the transition — and immediately ran into a wall that pure developer-tooling builders rarely encounter.
Their users are content marketers, not engineers.
That one fact reshapes every design decision. A developer reviewing a coding agent’s output brings intuitions about correctness: does it compile, do tests pass, does the diff look reasonable? A content marketer reviewing agent output has a very different set of intuitions — does this match our brand voice, is the brief going in the right strategic direction, will the final article actually rank?
The gap between what agents can do and what non-technical users can trust is not closed by making the agent smarter. It is closed by engineering the harness around it.
This lesson is about that harness: the structures, tools, and sub-agent patterns you build to make agent output reliable enough for people who won’t debug prompts.
2. Friction point 1: intentionality and human review
The endless use-case spiral
The first thing most teams feel when they start building with agents is a kind of vertigo: the capability feels limitless. You can handle this use case. And that one. And that edge case too. Without discipline, the product sprawls — and a sprawling agent is an unreliable agent.
AirOps’s answer was to pin everything to a single user persona: marketers, marketers, marketers. Every feature decision runs through that filter. If a capability doesn’t serve the marketer’s daily workflow, it doesn’t ship, no matter how technically interesting it is.
From that persona, they mapped the actual workflow a content marketer runs today:
- Discover — find content opportunities and gaps
- Research — gather competitive intelligence, audience signals
- Draft brief — define angle, structure, target keywords
- Write — produce the article
- SEO / AEO optimization — tune for search and AI answer engines
Mapping this workflow was not a UX exercise — it was an architecture exercise. The workflow defines where computation belongs, where humans belong, and how those hand-offs work.
Human review at the right moment
Here is the insight that separates content agents from coding agents.
For a coding agent, the natural human review point is the end: look at the PR, approve or request changes. The artifact is easy to inspect and the cost of a revision is bounded.
For a content agent, waiting until the end is expensive. If the brief went in the wrong direction, the entire article is wrong. Changing strategic direction mid-article means rewriting everything. The correct review points are earlier, at moments when course corrections are cheap:
- After the brief is drafted: Is this the right angle?
- After the outline is approved: Does this structure serve our SEO goals?
- Before the writing begins: Are all research inputs accurate?
AirOps built three components to operationalize this:
Assigned reviewers per playbook section. Rather than routing all review to the same person, specific users are designated as approvers for specific steps. The brief goes to the strategist; the keyword plan goes to the SEO lead.
Inbox. Every human review moment surfaces in one place, with the agent’s thought trace and intermediate artifacts visible. Reviewers see not just the output but how the agent got there — which builds trust and makes feedback more precise.
Grid. For scale, the grid view lets teams orchestrate many content runs in parallel, see status across all of them, and leave cell-level feedback that the agent picks up in its next pass.
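The review components above can be sketched as a small data model. This is a minimal illustration, not AirOps's implementation: the class names, the `REVIEWERS` mapping, and the status values are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of playbook step -> assigned approver.
REVIEWERS = {"brief": "strategist", "keyword_plan": "seo_lead"}


@dataclass
class ReviewItem:
    run_id: str
    step: str
    reviewer: str
    artifact: str            # the intermediate output to review
    trace: list              # the agent's visible reasoning steps
    status: str = "pending"  # pending | approved | changes_requested
    feedback: str = ""


@dataclass
class Inbox:
    items: list = field(default_factory=list)

    def submit(self, run_id, step, artifact, trace):
        # Route the artifact to whoever is assigned to this playbook step.
        item = ReviewItem(run_id, step, REVIEWERS[step], artifact, trace)
        self.items.append(item)
        return item

    def pending_for(self, reviewer):
        return [i for i in self.items
                if i.reviewer == reviewer and i.status == "pending"]


def can_proceed(item):
    # The run only continues past a checkpoint once it is approved.
    return item.status == "approved"


inbox = Inbox()
brief = inbox.submit("run-1", "brief", "Angle: AI search for B2B blogs",
                     ["scanned content gaps", "picked highest-traffic gap"])
```

The point of the structure is that the run blocks at the cheap checkpoint (the brief) instead of at the finished article, and that the reviewer sees the trace alongside the artifact.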
Why a document builder beats a node builder
Node-based workflow builders (boxes connected by lines) are intuitive for developers. They’re not intuitive for marketers. Marketers know documents. AirOps switched from a visual node graph to a document-based playbook builder — the workflow is written as a structured doc, not drawn as a flowchart.
The payoff was immediate adoption. Non-technical users could read, edit, and understand their own playbooks without hand-holding.
3. The car/engine metaphor
Before diving into the technical solutions, it helps to have the right frame.
Claude is the engine. A great engine matters enormously — but drop a Formula 1 engine into a rusted chassis with bad tires and worn brakes, and the race performance reflects the chassis, not the engine.
The harness you build around Claude — the tools you give it, the context you put in its window, the sub-agents you use to isolate concerns — is the chassis. Anthropic keeps improving the engine. Your job is to keep improving the chassis.
This frame is useful because it explains why you can’t solve consistency problems by switching models or writing longer system prompts. Consistency problems usually live in the chassis: the wrong tools, too much noise in the context window, or the absence of isolation between concerns.
The next two sections are chassis engineering.
4. Friction point 2: specialized tools over primitive exploration
The primitive tool anti-pattern
A natural first instinct when building a content agent is to give it the primitives it might need: a traffic data API, a citation checker, a web scraper, a keyword tool. Then let the agent figure out how to use them.
The agent does figure it out — but inefficiently. AirOps called this the safari trip: the agent makes 20+ sequential tool calls to gather context before it can start the actual task. Each call gathers a piece of information, which suggests a follow-up call, which suggests another. Tokens accumulate. Time accumulates. And because the path through those tool calls varies run to run, output consistency suffers.
Specialized compound tools
The fix is to identify the tasks the agent does repeatedly and wrap the common pattern into a single higher-level tool.
AirOps’s example is a page health tool. Before, assessing one URL required several calls: fetch traffic data, check citation rate, pull competitor comparison, identify target keywords. The agent decided how to combine them — and decided differently each time.
After: one call to the page health tool returns a structured response with all of that data, pre-aggregated and formatted exactly as the agent needs it. The agent doesn’t explore; it just asks for what it needs.
Results from this single change:
- 8% fewer tokens consumed per content run
- Dramatically faster execution (20 tool calls → 1)
- More consistent output, because the data gathering path is no longer variable
The test for whether a tool should be specialized: does the agent call a cluster of tools in the same sequence, for the same kind of task, repeatedly? If yes, that cluster is a compound tool waiting to be built.
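A compound tool of the kind described above might look like the following sketch. The four primitive fetchers are hypothetical stubs with made-up return values; in a real system each would wrap an actual data source. Only the shape matters: one call in, one pre-aggregated structure out.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical primitive fetchers, stubbed with illustrative data.
def get_traffic(url):
    return {"monthly_visits": 1200}


def get_citations(url):
    return {"citation_rate": 0.18}


def get_keywords(url):
    return ["ai search", "answer engine optimization"]


def get_competitor_comparison(url):
    return {"rank_vs_competitors": 3}


PRIMITIVES = {
    "traffic": get_traffic,
    "citations": get_citations,
    "keywords": get_keywords,
    "competitor_comparison": get_competitor_comparison,
}


def get_page_health(url):
    """Run every primitive for one URL and return a single structured report."""
    with ThreadPoolExecutor() as pool:
        # The primitives are independent, so they can run concurrently.
        futures = {name: pool.submit(fn, url) for name, fn in PRIMITIVES.items()}
        report = {name: future.result() for name, future in futures.items()}
    report["url"] = url
    return report


report = get_page_health("https://example.com/post")
```

Because the aggregation order and format are fixed in code, the agent no longer chooses a data-gathering path, which is where the run-to-run variance came from.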
5. Friction point 2 (continued): sub-agents for context isolation
Compound tools fix the gathering problem. But there is a second problem: even after the agent has good data, it may not use it well — because its context window is too crowded.
The attention quality problem
Large context windows are one of Claude’s great strengths. They are also a trap. It is tempting to dump everything into one context — brand guidelines, research data, compliance rules, previous drafts, user instructions — and let the model sort it out.
This approach has a ceiling. Attention quality is not uniform across a context window. Material buried in the middle of a long context can get less attention than material near the start or end. The model attends to what seems most relevant in the moment, which is not always what you intended to be most salient.
A million-token context window is not an excuse to be sloppy about what you put in it.
The better mental model: treat the context window like a meeting agenda. Every item you add competes for attention with every other item. The discipline is to put in only what serves the current task, nothing more.
AirOps’s sub-agent architecture
AirOps uses three specialized sub-agents, each with an isolated context window:
1. Brand kit sub-agent (runs first)
Before any content is generated, a dedicated sub-agent fetches all relevant brand context from the knowledge base and stores it as an artifact. Every subsequent sub-agent in the run references the same pre-fetched artifact.
This solves two problems at once: it prevents brand context from being fetched inconsistently across different parts of the run, and it keeps brand knowledge out of the main orchestrator’s context window.
2. Writing sub-agent (focused context)
The writing sub-agent receives the brand artifact and the content brief. That is all. It does not receive the raw research, the SEO analysis, or any compliance rules. Those would be distractions — the agent should attend to one thing: writing.
Narrowing the context window to only what is needed for writing produces measurably better prose. The model isn’t splitting attention between “sound like this brand” and “incorporate this research finding” and “don’t violate this compliance rule” simultaneously.
3. Compliance sub-agent (isolated rule context)
After a draft is produced, a compliance sub-agent receives the draft and a complete context window dedicated to brand compliance rules. It returns a structured response: a score plus specific violations with references to the relevant rules.
The main orchestrator receives this structured report and asks the writing sub-agent to revise — without either the writing sub-agent or the compliance sub-agent ever holding each other’s context.
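The flow between the three sub-agents can be sketched with plain functions standing in for model calls. Everything here is an assumption made for illustration: the function names, the rule table, and the stub behavior (the writing stub deliberately produces a banned phrase on its first pass so the revision loop has something to fix). What the sketch does show faithfully is the isolation: each function receives only the inputs named in the text.

```python
# Hypothetical rule table held ONLY by the compliance sub-agent.
COMPLIANCE_RULES = {"R1": "game-changing"}  # rule id -> banned phrase


def brand_kit_agent(knowledge_base):
    # Runs first; returns an artifact that later steps share.
    return {"voice": knowledge_base["voice"]}


def writing_agent(brand_artifact, brief, violations=()):
    # Sees the brand artifact, the brief, and (on revision) the reported
    # violations. It never sees the compliance rule table itself.
    draft = f"{brief} A game-changing guide, written in a {brand_artifact['voice']} voice."
    for violation in violations:
        draft = draft.replace(violation["text"], "practical")
    return draft


def compliance_agent(draft):
    # Sees only the draft plus its own rules; returns a structured report.
    violations = [{"rule": rule_id, "text": phrase}
                  for rule_id, phrase in COMPLIANCE_RULES.items()
                  if phrase in draft]
    return {"score": 1.0 if not violations else 0.5, "violations": violations}


def orchestrate(knowledge_base, brief, max_rounds=3):
    brand = brand_kit_agent(knowledge_base)
    draft = writing_agent(brand, brief)
    report = compliance_agent(draft)
    rounds = 0
    while report["violations"] and rounds < max_rounds:
        # Only the structured report crosses between the two sub-agents.
        draft = writing_agent(brand, brief, report["violations"])
        report = compliance_agent(draft)
        rounds += 1
    return draft, report


final_draft, final_report = orchestrate({"voice": "plain"}, "How to rank in AI search.")
```

The `max_rounds` cap is a design choice added for the sketch so that a draft that never passes cannot loop forever.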
Check your understanding
1. AirOps found that giving agents primitive tools (traffic API, scraper, etc.) caused "safari trips." What is a safari trip?
2. Why does AirOps run the brand kit sub-agent FIRST, before writing or compliance?
3. The writing sub-agent deliberately does NOT receive research context or compliance rules. Why?
4. For a content workflow, when should human review happen compared to a coding workflow?
5. What is the "car/engine" metaphor arguing about agent development?
6. Friction always moves: the discipline of chasing it
Here is the honest version of what building production agents looks like: you solve a problem, and the solving reveals the next problem.
AirOps shipped human review checkpoints — and discovered that brand consistency across runs was the new bottleneck. They built compound tools — and noticed that context window discipline was now the constraint. They isolated sub-agents — and now their frontier is self-improvement loops and content benchmarking.
Chasing friction is not a phase of development. It is the development process. The teams that ship reliable agents are not the ones that solved all the problems at the start; they are the ones that kept moving to the next problem after each solve.
Two open frontiers AirOps is exploring:
Self-improvement and feedback loops. Inspired by research on AI “dreaming” sequences and episodic memory, the question is whether an agent can observe its own outputs over time and update its own playbook — identifying patterns in what worked and what didn’t, without human intervention.
Benchmarking content agents. Evaluating coding agents is relatively tractable: tests pass or they don’t, code compiles or it doesn’t. Evaluating whether an article is good is fundamentally subjective. Building evaluation infrastructure for content quality is an unsolved problem, and AirOps is investing in it.
Build it yourself
Follow these exact steps to reproduce it yourself · estimated time: ~45 minutes
Prerequisites
- An existing agent or multi-step LLM pipeline (even a simple one)
- Access to the agent's tool call logs or traces
- Basic familiarity with the Claude API or the Claude Agent SDK
Step 1 — Audit your tool call patterns
Pull the last 10–20 runs of your agent and look at the tool call sequences. You are looking for clusters: groups of 3+ tool calls that appear together, in roughly the same order, to accomplish the same intermediate goal.
Questions to ask:
- Does the agent call tool A, then tool B, then tool C before doing any real work?
- Do those calls happen on almost every run?
- Does the agent combine the results of those calls in the same way each time?
Each cluster you find is a candidate compound tool.
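One way to run this audit mechanically is to count which consecutive tool-call sequences show up in most runs. The sketch below assumes each run's log has already been reduced to an ordered list of tool names; the function name, the `min_share` threshold, and the sample logs are all illustrative.

```python
from collections import Counter


def find_clusters(runs, size=3, min_share=0.8):
    """Return (sequence, run_count) pairs for tool-call sequences of length
    `size` that appear in at least `min_share` of the runs."""
    counts = Counter()
    for calls in runs:
        # Use a set so a sequence counts once per run, however often it repeats.
        seen = {tuple(calls[i:i + size]) for i in range(len(calls) - size + 1)}
        counts.update(seen)
    threshold = min_share * len(runs)
    return [(seq, n) for seq, n in counts.most_common() if n >= threshold]


# Made-up logs from three runs.
runs = [
    ["get_traffic", "get_citations", "get_keywords", "write_draft"],
    ["get_traffic", "get_citations", "get_keywords", "scrape_page", "write_draft"],
    ["get_traffic", "get_citations", "get_keywords", "write_draft"],
]

clusters = find_clusters(runs)
# Only the traffic -> citations -> keywords sequence appears in all three runs.
```

Exact-sequence matching is strict; if your agent reorders calls between runs, comparing sorted windows instead would catch clusters that appear "in roughly the same order".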
Step 2 — Build one compound tool
Pick the most common cluster. Write a single function or API endpoint that:
- Accepts the same inputs the agent was assembling across those 3+ calls
- Returns a single structured response with all the data pre-aggregated
Register it as a new tool with a clear, task-oriented name (“get_page_health”, “fetch_competitor_summary”). Remove the individual primitive tools from the agent’s tool list for that task.
# Before: agent calls three tools in sequence
traffic = call_tool("get_traffic", url=url)
citations = call_tool("get_citations", url=url)
keywords = call_tool("get_keywords", url=url)
# After: one compound tool
page_health = call_tool("get_page_health", url=url)
# Returns: { traffic, citations, keywords, competitor_comparison }
Run 10 tasks and compare token counts and output consistency. Expect a noticeable reduction in both tokens and variance.
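Registering the compound tool and retiring the primitives could look like the sketch below. The dictionary follows the `name` / `description` / `input_schema` shape used for tool definitions in the Claude Messages API, but the description text, the `PRIMITIVE_NAMES` set, and the `tools_for_task` helper are assumptions for illustration; check the current API documentation before relying on the exact fields.

```python
# Tool definition: a name, a description the model reads, and a JSON Schema
# describing the input.
PAGE_HEALTH_TOOL = {
    "name": "get_page_health",
    "description": (
        "Return traffic, citation rate, target keywords, and competitor "
        "comparison for one URL in a single structured response."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Page to assess"},
        },
        "required": ["url"],
    },
}

# Hypothetical names of the primitives the compound tool replaces.
PRIMITIVE_NAMES = {"get_traffic", "get_citations", "get_keywords"}


def tools_for_task(all_tools):
    """Drop the replaced primitives and make sure the compound tool is present."""
    kept = [tool for tool in all_tools if tool["name"] not in PRIMITIVE_NAMES]
    if all(tool["name"] != PAGE_HEALTH_TOOL["name"] for tool in kept):
        kept.append(PAGE_HEALTH_TOOL)
    return kept


existing = [{"name": "get_traffic"}, {"name": "web_search"}]
tool_names = [tool["name"] for tool in tools_for_task(existing)]
```

Removing the primitives matters as much as adding the compound tool: if both remain available, the agent can still wander down the old path.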
Step 3 — Identify context isolation candidates
Review your main agent’s system prompt or context window. Look for large blocks of content that serve distinct, separable concerns — for example:
- A large block of compliance or brand rules
- A set of writing guidelines or style constraints
- A knowledge base that only matters for one step of the pipeline
Each block that serves a distinct concern is a candidate for a dedicated sub-agent.
Step 4 — Extract one sub-agent
Pick the compliance or quality-checking concern (it tends to be the easiest to isolate). Create a sub-agent whose context contains:
- The rules or guidelines relevant to that concern — and nothing else
- A system prompt scoped to the checking task: “Review the following draft for X violations. Return a structured list of specific issues with rule references.”
Call this sub-agent from your main orchestrator after a draft is produced. Feed its structured output back to the main agent (or a revision agent) as instructions.
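Feeding the structured output back works best when the orchestrator validates it before trusting it. The sketch below assumes the sub-agent returns JSON with a `score` between 0 and 1 and a `violations` list whose entries carry `rule`, `excerpt`, and `issue` fields; that schema is an assumption for this example, not something the source specifies.

```python
import json

REQUIRED_FIELDS = {"rule", "excerpt", "issue"}  # assumed report schema


def parse_report(raw):
    """Parse and validate the compliance sub-agent's JSON output."""
    report = json.loads(raw)
    score = float(report["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    for violation in report["violations"]:
        missing = REQUIRED_FIELDS - set(violation)
        if missing:
            raise ValueError(f"violation missing fields: {sorted(missing)}")
    return {"score": score, "violations": report["violations"]}


def revision_instructions(report):
    """Turn the violations into instructions for the revising agent."""
    if not report["violations"]:
        return ""
    lines = ['- Rule {rule}: {issue} (excerpt: "{excerpt}")'.format(**v)
             for v in report["violations"]]
    return "Revise the draft to fix these issues:\n" + "\n".join(lines)


raw_output = json.dumps({
    "score": 0.82,
    "violations": [{"rule": "R3", "excerpt": "best in the world",
                    "issue": "unsupported superlative"}],
})
parsed = parse_report(raw_output)
instructions = revision_instructions(parsed)
```

Passing only these instructions to the revising agent keeps the full rule set out of its context, which is the isolation this step is meant to achieve.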
# Compliance sub-agent call
compliance_report = compliance_agent.run(
    draft=current_draft,
    # No research context. No brand kit. Only the draft.
)
# Returns: { score: 0.82, violations: [...] }
Step 5 — Pre-fetch shared context as an artifact
If multiple parts of your pipeline need the same context (brand guidelines, user preferences, account configuration), do NOT let each step fetch it independently. Instead:
- Create a “setup” step that runs first and fetches all shared context
- Store it as a structured artifact (a dictionary, a JSON object)
- Pass the artifact explicitly to each downstream step
This prevents inconsistency when the same context is fetched multiple times and eliminates redundant API calls.
Step 6 — Measure and log
Before you start this process, log a baseline: average token count per run, average latency, and a subjective output quality rating (even a 1–3 scale is fine). After each change, measure again.
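The before/after comparison can be as simple as the sketch below. The run records are invented numbers used only to exercise the functions; the field names (`tokens`, `latency_s`, `quality`) are assumptions about what your logs contain.

```python
from statistics import mean, pstdev


def summarize(runs):
    """Average the logged metrics and report the spread in token counts."""
    tokens = [run["tokens"] for run in runs]
    return {
        "mean_tokens": mean(tokens),
        "stdev_tokens": pstdev(tokens),  # spread is a rough consistency proxy
        "mean_latency_s": mean([run["latency_s"] for run in runs]),
        "mean_quality": mean([run["quality"] for run in runs]),
    }


def percent_change(before, after):
    return (after - before) / before * 100


# Invented sample data: three runs before a change, three after.
baseline = [
    {"tokens": 9000, "latency_s": 40, "quality": 2},
    {"tokens": 10000, "latency_s": 50, "quality": 2},
    {"tokens": 11000, "latency_s": 60, "quality": 3},
]
after_change = [
    {"tokens": 8900, "latency_s": 30, "quality": 3},
    {"tokens": 9000, "latency_s": 31, "quality": 2},
    {"tokens": 9100, "latency_s": 32, "quality": 3},
]

before_stats = summarize(baseline)
after_stats = summarize(after_change)
token_delta = percent_change(before_stats["mean_tokens"], after_stats["mean_tokens"])
```

Tracking the standard deviation alongside the mean matters here, because the goal of the earlier steps is lower variance as well as lower cost.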
Expected results after completing all steps:
- 5–15% reduction in tokens per run
- More consistent outputs across similar inputs
- Faster identification of the next friction point — because the current bottleneck is visible
Keep a friction log. Write down what the current bottleneck is after each improvement. That document is your roadmap.
Where to go next
- Watch the original talk by Dylan from AirOps — the live demo of the playbook builder and inbox is worth seeing directly.
- Continue with Prompting for Agents to understand how to write effective instructions for sub-agents.
- Read the Claude Agent SDK documentation for the primitives used to build multi-agent systems like AirOps’s.