Agentic Analytics: How Omni Built a Production Harness with Claude Code
How Omni's CTO Chris Merrick designed a multi-agent analytics system powered entirely by Claude — covering coordinator architecture, tool sizing, the semantic layer as CLAUDE.md analogy, evaluation with LLM-as-Judge, and the critical design pivots that drove 86x token growth.
This lesson is original educational writing based on this video by Anthropic (published May 21, 2026). All credit for the original content goes to the creators.
1. Starting point: from Q&A to agentic reasoning
Omni is a business intelligence platform — it gives analysts and executives a way to query their company data in plain English. What makes it unusual is that 99% of its platform code was written using Claude Code, and its AI layer (“Blobby”) runs on Claude models hosted via AWS Bedrock.
When Chris Merrick, Omni’s cofounder and CTO, presented at Anthropic’s “Code with Claude” event in May 2026, he opened with an honest origin story. Blobby started as a simple Q&A chatbot: one prompt in, one answer out. It used Haiku, the smallest and fastest Claude model, and it handled straightforward lookups reasonably well.
Two things broke the simple model. First, customers started asking questions that couldn’t be answered in a single query — questions like “Compare our churn rate this quarter to last quarter, and break it down by region” require multiple data pulls, intermediate reasoning, and synthesis. Second, when Omni launched its MCP Server (letting users query Omni data directly from Claude, ChatGPT, and Cursor), they saw how users actually interacted with an agentic tool. The demand for multi-step reasoning was undeniable.
The team rebuilt Blobby around an agentic architecture — and then spent 18 months refining it through a series of hard-won pivots they internally named “Blobotomies.”
2. The two-tier loop architecture
The production architecture that emerged from the Blobotomies is a two-tier loop:
Outer loop — manages the overall agentic session with up to 50 iterations. It handles durable execution: checkpoints are written at each iteration, so if a step fails mid-run the system can recover from the last good state rather than starting over. The outer loop is where goal-level planning lives.
Inner loop — handles parallel tool invocation within each iteration. At peak, Blobby has access to 45 tools: query generation, dashboard creation, visualization, validation, data modeling, and more. The inner loop decides which tools to call in parallel, processes their results, and feeds the output back into the outer loop’s reasoning.
This two-tier design solves a real scaling problem. A flat single-loop agent trying to manage both session state and tool execution simultaneously ends up with a context window cluttered with implementation details. The outer loop focuses on what to accomplish; the inner loop focuses on how to execute individual steps.
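The split can be illustrated with a minimal sketch. It is not Omni's implementation: the planner callback `plan_next`, the `on_checkpoint` callback, and the tool registry shape are all hypothetical, and only the 50-iteration cap comes from the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_ITERATIONS = 50  # outer-loop cap quoted in the talk

def run_inner_loop(tools: dict[str, Callable[[dict], dict]],
                   calls: list[tuple[str, dict]]) -> list[dict]:
    """Run one iteration's tool calls in parallel; results keep the call order."""
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        futures = [pool.submit(tools[name], args) for name, args in calls]
        return [f.result() for f in futures]

def run_session(goal: str, plan_next: Callable, tools: dict, on_checkpoint: Callable) -> list[dict]:
    """Outer loop: plan, delegate to the inner loop, checkpoint, repeat."""
    findings: list[dict] = []
    for i in range(MAX_ITERATIONS):
        calls = plan_next(goal, findings)  # hypothetical planner; returns [] when done
        if not calls:
            break
        findings.extend(run_inner_loop(tools, calls))
        on_checkpoint({"iteration": i, "goal": goal, "findings": list(findings)})
    return findings
```

Threads are used here on the assumption that tool calls are I/O-bound (database queries, HTTP requests); the planner only ever sees `findings`, never the mechanics of how the calls were dispatched.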
3. The semantic layer as CLAUDE.md
One of the most transferable insights from Chris Merrick’s talk is the analogy between the semantic layer in data systems and the CLAUDE.md file in Claude Code.
In Claude Code, the CLAUDE.md file lives at the root of a repository and provides project-specific context to the coding agent: what conventions to follow, what tools are available, what the codebase is for. Context lives near its usage site — in the repo itself.
Omni’s semantic layer works the same way. Every data field in the system has not just a label but a rich set of machine-readable context:
- ai_context: LLM-specific guidance for how this field should be interpreted. “Revenue” might note “excludes refunds; use only for closed-won opportunities.”
- sample_queries: concrete question patterns that map to this field, helping the agent understand when to reach for it.
- all_values: an enumerated list of valid values, enabling fuzzy matching and typo tolerance. If a user asks for “New York” and the data contains “NYC”, the field values tool resolves the mapping.
The semantic layer lives in YAML files stored in Git. It constrains the agent to relevant data, enforces row- and column-level permissions, and — crucially — eliminates the need to teach the agent your business vocabulary at prompt time. The vocabulary is already encoded where the data lives.
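To make the typo-tolerance idea concrete, here is a minimal sketch of resolving user text against a field's enumerated values with Python's standard `difflib`. The function name `resolve_value`, the 0.6 cutoff, and the sample list are assumptions for illustration; the talk does not describe how Omni's field values tool is implemented.

```python
import difflib
from typing import Optional

def resolve_value(user_text: str, all_values: list[str], cutoff: float = 0.6) -> Optional[str]:
    """Map free-form user text onto one of a field's enumerated values, or None."""
    by_lower = {v.lower(): v for v in all_values}
    key = user_text.strip().lower()
    if key in by_lower:  # exact match, ignoring case and surrounding whitespace
        return by_lower[key]
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None

# Illustrative enumerated values for a hypothetical deal-stage field
DEAL_STAGES = ["Prospecting", "Qualification", "Proposal",
               "Negotiation", "Closed Won", "Closed Lost"]
```

With this list, `resolve_value("closed wonn", DEAL_STAGES)` returns `"Closed Won"`. Pure string similarity will not bridge synonyms such as “New York” and “NYC”; that kind of mapping would need an explicit alias table, which this sketch leaves out.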
The principle generalizes: whether you are building a coding agent or an analytics agent, context should live near the thing it describes, not in a single monolithic system prompt. A system prompt that tries to encode all business logic in one place becomes a maintenance nightmare and an attention burden.
4. Three architectural pivots (“Blobotomies”)
The most instructive part of the talk is the series of hard architectural reversals Omni made during the 18-month evolution. They call each major refactor a “Blobotomy.” Three stand out:
Pivot 1: Haiku → Sonnet (accepting token cost for capability)
Blobby launched on Claude Haiku — fast, cheap, good enough for single-question lookups.
When the team tried to make multi-step reasoning work on Haiku, it couldn’t hold the plan together across iterations. Complex questions that needed three or four data pulls kept losing track of earlier findings. The team moved to Sonnet and accepted the cost. Monthly token consumption grew from 1.2 billion tokens (August 2025) to 103.3 billion tokens (April 2026) — an 86x increase. The questions that were previously unanswerable became answerable. The economics followed from that capability gain, not the other way around.
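The growth multiple follows directly from the two monthly figures quoted above; a quick check of the arithmetic:

```python
aug_2025_tokens = 1.2e9    # monthly tokens, August 2025
apr_2026_tokens = 103.3e9  # monthly tokens, April 2026

growth = apr_2026_tokens / aug_2025_tokens
print(f"{growth:.1f}x")  # prints 86.1x, quoted in the talk as 86x
```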
Lesson: Don’t size down your model to control costs if the capability gap prevents the core task from working. Get the task working first, then optimize.
Pivot 2: Custom JSON queries → direct SQL
Omni initially had the agent produce queries in a proprietary JSON format that mapped to their internal query engine. This seemed reasonable — the agent wouldn’t need to know SQL, just the abstraction layer.
In practice, the agent struggled. Claude’s training data is saturated with SQL; it has deep, reliable intuitions about SQL syntax and semantics. The custom JSON format was an abstraction it had to learn from scratch, with no pre-training support. Instead of succeeding on the first attempt, the agent needed 3–4 iterations on average to produce a correct JSON query.
After switching to direct SQL generation, single-attempt success rates jumped sharply. The lesson is a broader one about LLM affordances: use the formats the model was trained on. Don’t invent proprietary representations for things that already have well-established syntaxes that Claude knows deeply.
Pivot 3: Sub-agent delegation → direct tool integration
An early Blobotomy separated planning from execution into distinct sub-agents: one sub-agent maintained the task plan, another generated queries. This felt architecturally clean — separation of concerns.
What it produced was a “split-brain” problem. The planning sub-agent knew which datasets were available; the query sub-agent knew what a single query could retrieve. But neither had the full picture. The query sub-agent would make an assumption about available data that the planning sub-agent hadn’t confirmed, producing a query that couldn’t be satisfied. The planning sub-agent would update the plan based on incomplete information from the query sub-agent. The result was infinite error loops.
Solution: collapse both responsibilities into the main harness. Direct tool integration — with 45 tools accessible from the main coordinator — proved more reliable than distributing responsibility across agents that couldn’t share grounded state.
5. Error recovery as the highest-leverage investment
Among all the improvements Omni made to Blobby, one had a disproportionate impact on their evaluation scores: high-quality error messages.
The team found that LLMs are “optimistic” — when given incomplete or ambiguous context, the model tries to help anyway rather than stopping. This produces confident-sounding outputs built on faulty assumptions. The fix is not to make the model more cautious; it’s to make the environment more informative when things go wrong.
Omni invested heavily in error messages that explain:
- What happened — exactly which step failed and why, not a generic failure code
- How to fix it — actionable guidance the agent can act on in the next iteration
A query that fails because the field name “total_revenue” doesn’t exist in the selected topic should not return QueryError: field not found. It should return something like:
Field 'total_revenue' not found in topic 'Sales'. Available revenue fields: 'gross_revenue', 'net_revenue'. Did you mean one of these?
With that message, the outer loop’s coordinator can route the inner loop to the correct field on the next iteration. Without it, the agent may retry with the same incorrect field or hallucinate a different one.
The single investment in error message quality improved evaluation scores more than any other harness change. This makes sense in hindsight: an agent that can self-correct from clear failures converges on correct answers; an agent operating in an opaque error environment amplifies failures across iterations.
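One way to produce a “did you mean” style message is to rank the topic's real field names by similarity to the one the agent asked for. The sketch below uses the standard library's `difflib`; the function name, the 0.6 cutoff, and the wording of the hint are assumptions, not Omni's actual error format.

```python
import difflib

def field_not_found_message(field: str, topic: str, available: list[str]) -> str:
    """Build an error string that names the failure and offers a next step."""
    close = difflib.get_close_matches(field, available, n=3, cutoff=0.6)
    if close:
        hint = f"Closest matches: {close}. Retry with one of these."
    else:
        hint = f"No similar field names. Fields in this topic: {sorted(available)}."
    return f"Field '{field}' not found in topic '{topic}'. {hint}"
```

Called with `"total_revenue"` against a topic containing `gross_revenue`, `net_revenue`, and `order_count`, the message suggests the two revenue fields and leaves out the unrelated one, which keeps the agent's next attempt focused.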
6. Evaluation: three pillars
Blobby runs across approximately 200 customer organizations. At that scale, ad-hoc manual review of output quality doesn’t work. Omni built a three-pillar evaluation framework:
Pillar 1: LLM-as-Judge benchmarks
A secondary Claude agent evaluates the primary agent’s responses for correctness. This is not the system prompt asking the model to double-check itself — it’s a separate agent with an independent context window, a structured scoring rubric, and no access to the chain-of-thought that produced the answer it is judging.
LLM-as-Judge scales in a way human evaluation doesn’t. It runs on every session, not a sampled subset.
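The isolation requirement can be enforced in code rather than by convention. The following is a sketch under stated assumptions: the session keys (`question`, `final_answer`), the judge's JSON reply shape, and the 0 to 3 scale are all hypothetical choices, and the model call itself is omitted.

```python
import json

def build_judge_input(session: dict, rubric: str) -> dict:
    """Keep only what the judge should see; the agent's reasoning trace is dropped."""
    return {
        "question": session["question"],
        "answer": session["final_answer"],
        "rubric": rubric,
    }

def parse_verdict(raw: str, max_score: int = 3) -> dict:
    """Parse the judge's JSON reply and reject malformed or out-of-range scores."""
    verdict = json.loads(raw)
    score = verdict.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= max_score:
        raise ValueError(f"Judge returned an invalid score: {score!r}")
    return {"score": score, "reasoning": str(verdict.get("reasoning", ""))}
```

Building the judge's input from an explicit allow-list of keys means a new field added to the session record later cannot leak into the judge's context by accident.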
Pillar 2: Structured benchmark construction
Benchmark questions are scored on eight criteria before being admitted to the eval suite:
| Criterion | What it checks |
|---|---|
| Evaluability | Can a judge definitively score this question correct/incorrect? |
| Coverage | Does it test a meaningful part of the system’s capability? |
| Realism | Would a real customer actually ask this? |
| Difficulty | Does it challenge the system in a useful way? |
| Non-redundancy | Does it add information the suite doesn’t already have? |
| Discriminative power | Would a bad agent fail this where a good one succeeds? |
| Semantic clarity | Is the question unambiguous? |
| Data selection | Are the required data fields actually present in the system? |
The eight-criteria filter prevents eval suite inflation: questions that pass the filter are actually diagnostic, not just easy wins or near-duplicates.
Pillar 3: Daily trace triage
The team runs a daily process using Claude Code to analyze session traces from the previous day and identify root causes in failed user interactions. Automated scoring tells you that something failed; trace triage tells you why.
Merrick noted that engineers found reading raw traces more valuable than watching aggregate metrics move. A score going from 0.81 to 0.83 is hard to act on. A trace showing “agent called the field values tool with ‘New York’ but received no results because the field enumerates ‘NY’, ‘NYC’, ‘New York City’ — not ‘New York’” is immediately actionable.
Check your understanding
1. Omni's agent grew from 1.2B to 103.3B monthly tokens — an 86x increase. What drove this growth and what was the team's stance on it?
2. Why did Omni abandon custom JSON query definitions in favor of direct SQL generation?
3. Omni's early architecture used separate sub-agents for planning and query generation. What problem did this cause?
4. What single harness improvement had the largest impact on Omni's evaluation scores?
5. How does Omni's semantic layer relate to Claude Code's CLAUDE.md pattern?
Build it yourself
Follow these exact steps to reproduce it yourself · estimated time: ~45 min
Prerequisites
- A working single-agent or single-query LLM pipeline
- Access to your agent's tool call logs or session traces
- A dataset or API with queryable fields (the pattern applies beyond analytics)
- Basic familiarity with the Claude API or the Agent SDK
Step 1 — Separate outer and inner loop concerns
Review your current agent. Identify which responsibilities belong to goal-level planning and which belong to step-level execution:
Outer loop (goal-level):
- What is the user trying to accomplish?
- What have we learned so far?
- Should we continue, try a different approach, or stop?
- Is the session state checkpointed?
Inner loop (step-level):
- Which tools do we call right now?
- How do we handle tool responses?
- What do we retry if a tool fails?
If your agent mixes these in a single context, refactor so the outer loop produces a plan (a structured intent), passes it to the inner loop, and receives a structured result back. Don’t let the inner loop’s tool call details pollute the outer loop’s planning context.
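The “structured intent in, structured result out” boundary might look like the sketch below. The class names `StepIntent` and `StepResult`, and the `error` and `row_count` keys on tool outputs, are hypothetical; adapt them to whatever your tools actually return.

```python
from dataclasses import dataclass, field

@dataclass
class StepIntent:
    """What the outer loop asks for, stated without tool-level detail."""
    objective: str
    constraints: list[str] = field(default_factory=list)

@dataclass
class StepResult:
    """What the inner loop hands back: a short summary, not raw tool payloads."""
    objective: str
    succeeded: bool
    summary: str

def summarize(intent: StepIntent, tool_outputs: list[dict]) -> StepResult:
    """Collapse raw tool outputs into the compact result the planner sees."""
    errors = [o["error"] for o in tool_outputs if o.get("error")]
    if errors:
        return StepResult(intent.objective, False, "; ".join(errors))
    rows = sum(o.get("row_count", 0) for o in tool_outputs)
    return StepResult(intent.objective, True,
                      f"{len(tool_outputs)} tool call(s) returned {rows} row(s)")
```

Only `StepResult` crosses back into the planning context, so the outer loop's prompt grows by one short summary per step instead of by every raw tool response.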
Step 2 — Build a semantic/context layer
For each major data source, entity type, or domain concept your agent works with, create a structured context file (YAML, JSON, or equivalent). Each entry should have:
    # Example field definition
    gross_revenue:
      label: "Gross Revenue"
      description: "Total revenue before refunds or discounts"
      ai_context: "Use only for closed-won opportunities. Does not include trial revenue."
      sample_queries:
        - "What was our gross revenue last quarter?"
        - "Show gross revenue by region"
      all_values: null  # continuous numeric, no enumeration needed
      permissions: ["analyst", "exec"]
For enumerated fields (status, category, region):
    deal_stage:
      label: "Deal Stage"
      ai_context: "Use to filter pipeline by sales stage"
      all_values: ["Prospecting", "Qualification", "Proposal", "Negotiation", "Closed Won", "Closed Lost"]
This context layer replaces the need to enumerate business logic in your system prompt on every call. The agent reads it when constructing queries; you maintain it alongside the data.
Step 3 — Invest in error messages
Go through every tool in your inner loop and upgrade its error handling. For each failure mode:
- Write an error message that states exactly what failed
- Add a “how to fix” hint the agent can act on in the next iteration
    # Before
    raise ValueError("Field not found")

    # After
    available = [f.name for f in topic.fields if "revenue" in f.name.lower()]
    raise ValueError(
        f"Field '{field_name}' not found in topic '{topic.name}'. "
        f"Available revenue-related fields: {available}. "
        f"Use the field_values tool to inspect any field before querying."
    )
Test your error messages by deliberately triggering them and reading what the agent does next. A good error message should let the agent self-correct without human intervention.
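A descriptive exception only helps if it reaches the agent as readable text instead of crashing the loop. A minimal sketch of a wrapper that does this follows; `as_agent_tool`, the result dictionary shape, and the `lookup_field` tool are hypothetical names used for illustration.

```python
from typing import Any, Callable

def as_agent_tool(fn: Callable[..., Any]) -> Callable[..., dict]:
    """Wrap a tool so failures come back as data the agent can read."""
    def wrapper(**kwargs: Any) -> dict:
        try:
            return {"ok": True, "result": fn(**kwargs)}
        except ValueError as exc:
            # The message text is what the agent sees on its next iteration.
            return {"ok": False, "tool": fn.__name__, "error": str(exc)}
    return wrapper

def lookup_field(field_name: str) -> str:
    """Hypothetical tool used to trigger the failure path on purpose."""
    known = {"gross_revenue", "net_revenue"}
    if field_name not in known:
        raise ValueError(
            f"Field '{field_name}' not found. Available fields: {sorted(known)}."
        )
    return field_name
```

Calling `as_agent_tool(lookup_field)(field_name="total_revenue")` returns a dictionary with `ok` set to `False` and the full message under `error`, which is a convenient way to trigger each failure deliberately and inspect exactly what the agent would read.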
Step 4 — Add checkpoint state to the outer loop
At the end of each outer-loop iteration, write a checkpoint containing:
    checkpoint = {
        "iteration": i,
        "goal": original_user_goal,
        "completed_steps": [...],
        "current_findings": {...},
        "next_intended_action": planned_action,
    }
    save_checkpoint(session_id, checkpoint)
On any failure, load the last checkpoint rather than restarting. Test this by intentionally killing a session mid-run and verifying recovery starts from the right state.
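The step above leaves `save_checkpoint` undefined. One simple file-based version is sketched below, assuming JSON-serializable checkpoints and a local temp directory as the store; a production system would more likely use a database or a durable-execution framework.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_DIR = Path(tempfile.gettempdir()) / "agent_checkpoints"  # assumed location

def save_checkpoint(session_id: str, checkpoint: dict) -> Path:
    """Write to a temp file, then swap it in, so a crash never leaves a partial file."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    target = CHECKPOINT_DIR / f"{session_id}.json"
    tmp = CHECKPOINT_DIR / f"{session_id}.json.tmp"
    tmp.write_text(json.dumps(checkpoint))
    os.replace(tmp, target)  # single-step swap when both paths share a filesystem
    return target

def load_checkpoint(session_id: str) -> Optional[dict]:
    """Return the last saved checkpoint, or None if the session never checkpointed."""
    target = CHECKPOINT_DIR / f"{session_id}.json"
    return json.loads(target.read_text()) if target.exists() else None

def resume_iteration(session_id: str) -> int:
    """Iteration to start from: one past the last checkpoint, or 0 for a new session."""
    checkpoint = load_checkpoint(session_id)
    return 0 if checkpoint is None else checkpoint["iteration"] + 1
```

The kill-and-recover test then reduces to stopping the process after a known iteration and checking that `resume_iteration` returns the next one.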
Step 5 — Build a minimal three-pillar eval suite
Pillar 1 — LLM-as-Judge: Write a secondary Claude call that receives a question, the agent’s final answer, and a rubric. Have it return a structured score:
    judge_prompt = """
    You are evaluating an analytics agent's response.
    Question: {question}
    Agent's answer: {answer}
    Expected behavior: {rubric}
    Score 0–3: 0=Wrong, 1=Partially correct, 2=Correct but incomplete, 3=Fully correct.
    Respond with JSON: {{"score": <int>, "reasoning": "<one sentence>"}}
    """
    # The braces around the JSON example are doubled so that
    # judge_prompt.format(question=..., answer=..., rubric=...) keeps them literal.
Pillar 2 — Benchmark questions: Add questions to your eval suite only if they pass at least five of the eight criteria: evaluability, coverage, realism, difficulty, non-redundancy, discriminative power, semantic clarity, data selection. Reject easy wins.
Pillar 3 — Trace triage: Once per week, pull the five lowest-scored sessions and read their raw traces. Write one sentence for each describing the root cause. Use those sentences as requirements for your next harness fix.
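Selecting the sessions to read can be automated even though the reading itself cannot. A small sketch, assuming each session record carries a `session_id` and a numeric judge `score` (both hypothetical field names):

```python
import heapq

def lowest_scored(sessions: list[dict], k: int = 5) -> list[dict]:
    """Return the k sessions with the lowest judge score, worst first."""
    return heapq.nsmallest(k, sessions, key=lambda s: s["score"])

def triage_template(sessions: list[dict]) -> str:
    """Produce a fill-in worksheet: one root-cause line per session to review."""
    return "\n".join(
        f"{s['session_id']} (score {s['score']}): root cause = " for s in sessions
    )
```

The worksheet deliberately leaves the root cause blank; the point of the exercise is that a person (or a separate analysis pass) reads the trace and writes that sentence.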
Step 6 — Format audit: are you fighting the model’s priors?
List every format the agent produces or consumes that is not a standard format (SQL, JSON, Markdown, Python). For each:
- Does the model struggle with it? (High iteration counts, frequent corrections?)
- Is there a well-known equivalent the model would handle natively?
If yes to both: switch to the standard format. The cost of maintaining a custom format is paid in reliability, not just development time.
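The “does the model struggle with it?” question can be answered from logs if each generation attempt is recorded. The sketch below assumes log records with a `format` label and an `attempts` count, and a threshold of two attempts on average; all three are illustrative choices, not figures from the talk.

```python
from collections import defaultdict
from statistics import mean

def attempts_by_format(records: list[dict]) -> dict[str, float]:
    """Average number of attempts the agent needed per output format."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for record in records:
        grouped[record["format"]].append(record["attempts"])
    return {fmt: mean(vals) for fmt, vals in grouped.items()}

def flag_formats(records: list[dict], threshold: float = 2.0) -> list[str]:
    """Formats whose average attempt count exceeds the threshold, worst first."""
    averages = attempts_by_format(records)
    flagged = [fmt for fmt, avg in averages.items() if avg > threshold]
    return sorted(flagged, key=lambda fmt: -averages[fmt])
```

A format that shows up in the flagged list and also has a well-known standard equivalent is the candidate for replacement described above.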
Where to go next
- Read Building Omni’s Architecture for Agentic Analytics — Omni’s engineering blog post covers the semantic layer design in depth.
- The companion lesson Agent Harness Engineering: Chasing Friction covers a different production harness story (AirOps) with complementary patterns around sub-agent isolation and compound tools.
- For the foundational vocabulary of workflows vs. agents, see Building Effective Agents.