Building AI-Native at Enterprise Scale: Lessons from monday.com, Doctolib, and Delivery Hero

1. Three bets, one common ambition

In May 2026, monday.com, Doctolib, and Delivery Hero sat down at an Anthropic panel to share what they had actually built: not roadmaps, but systems running in production. Three companies, three different industries, three different bets on Claude. The outcome: no single playbook for enterprise AI, but a set of converging patterns that every org building at scale will eventually face.

The meta-point from the panel: the companies making the most progress are those that picked a specific, high-leverage insertion point and built deeply, rather than spreading AI thinly across every workflow. Each company chose a different insertion point.

Delivery Hero bet on the software delivery pipeline itself — could an autonomous agent handle the end-to-end task of writing, testing, and merging code?
Doctolib bet on developer productivity at org scale — could AI tooling be standardized and governed well enough for a healthcare company to roll it out to every engineer?
monday.com bet on the product surface — could Claude be embedded inside monday.com so that non-technical end users could build software with plain English?

Three enterprise AI insertion points. Each company found a different place to embed AI deeply, rather than spreading it thin. The common thread: go deep on one high-leverage surface before expanding.

2. Delivery Hero: the autonomous agent that replaced 130 engineers’ output

Delivery Hero’s bet was the boldest structurally. Rather than giving engineers a coding assistant, they built an autonomous agent — Herogen — that receives tasks in natural language and handles the entire software delivery lifecycle: writing, testing, iterating, and merging.

The numbers as of April 2026: Herogen merges over 100 pull requests per day, representing about 9% of all PRs. Its success rate is 85%, meaning that in 85 out of 100 tasks Herogen delivered a correct implementation that was merged with zero or one human interaction. At a volume of 100+ PRs per day, that rate frees an estimated 250,000 engineering hours annually, equivalent to the output of 130 senior engineers.
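The figures above can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers quoted in this section; the derived values (total daily PR volume, hours per engineer-year) are implications of those numbers, not figures reported by Delivery Hero.

```python
# Figures quoted in the text (as of April 2026).
herogen_prs_per_day = 100      # "over 100", so treat this as a lower bound
herogen_share_of_prs = 0.09    # "about 9% of all PRs"
hours_freed_per_year = 250_000
equivalent_engineers = 130

# Implied total PR volume across the org (derived, not reported).
total_prs_per_day = herogen_prs_per_day / herogen_share_of_prs

# Implied working hours per engineer-year behind the "130 engineers" figure.
hours_per_engineer_year = hours_freed_per_year / equivalent_engineers

print(f"Implied total PRs/day: ~{total_prs_per_day:.0f}")
print(f"Implied hours per engineer-year: ~{hours_per_engineer_year:.0f}")
```

Roughly 1,900 hours per engineer-year is in the range of a full-time working year, so the two headline numbers are at least consistent with each other.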

The architecture. Herogen runs on Claude Opus 4.5 as its primary coding model, deployed via Google Cloud’s Vertex AI. Before any human review, a “council of agents” built on multiple LLMs from different providers, including both Claude and Gemini, reviews the code from different perspectives. The council approach addresses a structural weakness of single-model systems: any individual model has blind spots in its training data. Running the same code through models with different training reduces the chance that one model’s blind spot slips through unnoticed.

The human-in-the-loop design. The final review is still human: Herogen doesn’t merge without a human approval step. The division of labor is also deliberate. Herogen handles routine, well-scoped tasks end-to-end, while engineers use Claude Code for exploratory work (building new projects from scratch, experimenting with design approaches) where the back-and-forth between developer and model is part of the value.

3. Doctolib: governing AI across a healthcare engineering org

Doctolib is Europe’s leading healthcare technology platform, serving 420,000 health professionals and 90 million patients across France, Germany, and Italy. The compliance constraints alone make AI adoption harder than at most companies: any tooling used in the engineering org touches code that eventually runs clinical workflows.

The productivity challenge Doctolib faced was specific: administrative tasks — writing documentation, creating tests, reviewing PRs — were consuming a disproportionate share of engineering time. The fix couldn’t be chaotic individual adoption of AI tools; it had to be governed.

The centralized repository model. Doctolib’s platform team built and maintains a centralized repository of prompts, custom commands, and subagents — tested and approved workflows that every engineer pulls as part of their initial Claude Code setup. This means:

Every engineer starts with proven, reusable workflows on day one
Standard workflows include documentation, testing, code review, and debugging patterns
New hires onboard to unfamiliar codebases in days rather than weeks

The CI-driven documentation approach. One of the most operationally durable patterns Doctolib implemented: every code change triggers a CI job that automatically updates the relevant technical documentation. Documentation that lives outside the update loop goes stale; by making the doc update automatic and blocking on CI, Doctolib’s technical docs stay current without a dedicated documentation process.
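One piece of such a CI job is deciding which documents a given change affects. The sketch below shows that step only; the directory-to-doc mapping and the file names are hypothetical, and the source does not describe how Doctolib's job is actually built.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, Set

# Hypothetical mapping from source directories to the docs that describe them.
DOC_OWNERS: Dict[str, str] = {
    "services/booking": "docs/booking.md",
    "services/billing": "docs/billing.md",
}


def docs_to_refresh(changed_files: Iterable[str]) -> Set[str]:
    """Return the doc files whose source directories were touched."""
    stale: Set[str] = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        for source_dir, doc in DOC_OWNERS.items():
            # A file belongs to a directory if that directory is an ancestor.
            if PurePosixPath(source_dir) in path.parents:
                stale.add(doc)
    return stale


print(docs_to_refresh(["services/booking/api.py", "README.md"]))
```

In a real pipeline the changed-file list would typically come from the version control system (for example, the output of `git diff --name-only`), and the job would fail the build until the listed docs have been regenerated.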

What governance at healthcare scale requires. Unlike a typical software company, Doctolib cannot treat AI tooling as “move fast and see what happens.” Their centralized model means the platform team vets each workflow addition — they own the quality bar so individual engineers don’t have to rediscover it. The tradeoff: less bottom-up experimentation, more reliable baseline.

4. monday.com: shipping AI to users who’ve never written code

monday.com’s AI bet is the most structurally distinct from the other two: it is not an internal developer tool, it is a product feature shipped to paying customers who are primarily non-technical.

The flagship capability is monday vibe — a vibe coding environment embedded in the monday.com platform. Product managers, operations leads, and marketers can describe what they want in plain English and receive a working custom app inside monday.com. The audience has never written code and doesn’t need to.

This represents a different kind of enterprise AI challenge than Delivery Hero or Doctolib faced. It is not about making engineers more productive. It is about whether non-technical users trust AI enough to use it for real work, whether the results are reliable enough for enterprise data, and whether the platform can govern what users build.

The multi-model gateway. monday.com connects to Claude, ChatGPT, Copilot, and Gemini, giving enterprise customers a choice of model. The AI Platform Gateway matches the right model to the right task within a workflow. This is a significant architectural choice: rather than committing to one provider, monday.com built an abstraction layer that insulates users from model transitions — important given how rapidly model capabilities shift.
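The abstraction-layer idea can be illustrated with a small routing table. This is a sketch under stated assumptions: the `ModelGateway` class, the backend names, and the task types are all hypothetical, and the backends are plain functions rather than real provider clients.

```python
from typing import Callable, Dict

# A model backend is anything that maps a prompt to a completion.
ModelBackend = Callable[[str], str]


class ModelGateway:
    """Routes each task type to a named backend, with a default fallback."""

    def __init__(self, default: str) -> None:
        self._backends: Dict[str, ModelBackend] = {}
        self._routes: Dict[str, str] = {}
        self._default = default

    def register(self, name: str, backend: ModelBackend) -> None:
        self._backends[name] = backend

    def route(self, task_type: str, backend_name: str) -> None:
        self._routes[task_type] = backend_name

    def complete(self, task_type: str, prompt: str) -> str:
        # Callers name the task, never the model.
        name = self._routes.get(task_type, self._default)
        return self._backends[name](prompt)


gateway = ModelGateway(default="model_a")
gateway.register("model_a", lambda p: f"[model_a] {p}")
gateway.register("model_b", lambda p: f"[model_b] {p}")
gateway.route("summarize", "model_b")

print(gateway.complete("summarize", "Q3 board updates"))
print(gateway.complete("draft_email", "Follow up with vendor"))
```

Because callers only specify a task type, swapping the model behind "summarize" is a one-line routing change rather than an edit to every workflow that uses it.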

AI agents as first-class users. monday.com made a structural product decision: AI agents have full user status on the platform, with the same permissions, audit trails, and governance as human users. This is not cosmetic — it means AI agents can be assigned to boards, given tasks, and held accountable in the same way a human team member would be. Enterprise customers need that accountability structure before they trust an agent with production workflows.
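In code terms, "agents as first-class users" means humans and agents pass through the same permission check and land in the same audit log. The sketch below is illustrative only; `Principal`, `Board`, and the permission names are hypothetical and do not reflect monday.com's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Principal:
    name: str
    kind: str                # "human" or "agent"
    permissions: frozenset   # actions this principal may perform


@dataclass
class Board:
    # One shared audit log: (name, kind, action, allowed).
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def perform(self, who: Principal, action: str) -> bool:
        # The same check and the same log entry, whatever the principal kind.
        allowed = action in who.permissions
        self.audit_log.append((who.name, who.kind, action, allowed))
        return allowed


human = Principal("dana", "human", frozenset({"edit_item"}))
agent = Principal("triage-bot", "agent", frozenset({"edit_item"}))

board = Board()
board.perform(human, "edit_item")
board.perform(agent, "delete_board")

for entry in board.audit_log:
    print(entry)
```

The denied agent action is recorded next to the permitted human action, so an auditor reviews one trail instead of reconciling two systems.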

5. The shared scaling challenges

Despite their different insertion points, all three companies hit similar friction points at scale.

Model churn. Models improve every few months, and enterprise systems built on a specific model version need to be updated. Delivery Hero’s council-of-agents architecture and monday.com’s multi-model gateway are both partial answers to this: if your system isn’t tightly coupled to one model version, transitions are cheaper. Doctolib’s centralized repo approach means there’s one team responsible for validating that updated models still produce correct outputs on their standard workflows — rather than 500 engineers discovering it ad hoc.

Measuring ROI at enterprise scale. All three companies faced the same measurement challenge: how do you attribute output improvement to AI specifically when engineering teams are also improving their processes in other ways? Delivery Hero’s approach — PR merge rate, hours freed — is the most legible because Herogen’s output is directly countable. Doctolib measures onboarding time to first PR and documentation staleness. monday.com measures feature adoption rates by non-technical users. The lesson: choose a metric that is directly observable and that would not have improved without the AI capability specifically.

Governance before scale. The mistake several companies made before these three was rolling out AI tooling without governance infrastructure: no shared prompts, no policy on what data the model could access, no escalation path when the model was wrong. All three companies built governance before broad rollout. Delivery Hero’s human final-review gate, Doctolib’s centralized command repo, and monday.com’s agent-as-user model are all governance structures, not afterthoughts.

Staying ahead of the model. The hardest operational challenge is that the capability floor keeps rising. A workflow you built around a model’s limitations in January may be unnecessarily constrained by March. All three companies flagged this as an ongoing cost: you have to periodically re-evaluate your design decisions against the current model, not the model you first built on.

Check your understanding


1. Why did Delivery Hero build a "council of agents" with multiple LLM providers rather than relying on Claude alone?

Each LLM has blind spots — patterns or edge cases underrepresented in its training. Running code review through multiple models with different training data reduces the probability that any single blind spot makes it into a merged PR. Cost and vendor lock-in are secondary considerations.
2. What is the key operational benefit of Doctolib's centralized prompt repository approach?

The centralized repo means a new engineer doesn't spend days figuring out how to prompt Claude effectively for Doctolib's specific codebase and workflows. They start with what already works. The platform team owns quality so individual engineers don't have to. Monitoring and cost are separate concerns.
3. What structural decision did monday.com make about AI agents that distinguishes their approach from a simple chatbot integration?

Giving agents "user status" means they appear in boards, task assignments, and audit logs the same way a human colleague would. Enterprise customers need this accountability structure before they trust an agent with real workflows — a chatbot that operates outside the governance model doesn't give the same confidence.
4. Herogen achieves an 85% autonomous PR merge rate. What does this number specifically measure?

The 85% success rate is the share of Herogen-submitted PRs that are merged rather than rejected. A rejected PR means Herogen's implementation was wrong and a human had to intervene beyond a final approval. This metric is directly observable and cleanly attributable to Herogen's autonomous capability.
5. Which of the following is a shared challenge that all three companies encountered when scaling enterprise AI?

Model churn means that architectural constraints you designed around in January may be unnecessary by March. All three companies flagged this as an ongoing operational cost: you must re-evaluate your design decisions against the current model capability, not the version you originally built on.

6. What enterprise AI maturity actually looks like

Taken together, the three companies illustrate a maturity curve for enterprise AI adoption. It is not a linear ladder — a company might reach Stage 3 in one function while staying at Stage 1 in another. But the progression shows where each stage breaks down and what unlocks the next.

Stage 1 — Ad hoc individual use. Engineers discover AI tools on their own, use them inconsistently, and there is no shared infrastructure. Output quality varies by individual. This is where most large enterprises were in 2024-2025.

Stage 2 — Standardized tooling. The organization picks a set of tools, builds shared prompts and workflows, and provides a governed starting point for every engineer. Doctolib’s centralized repository model represents this stage operating at healthcare-grade governance. The bottleneck shifts from “can we use AI at all” to “can we maintain and update the standard workflows as models improve.”

Stage 3 — Embedded product AI. AI capabilities appear in the product itself, not just in the development process. monday.com’s monday vibe and agent-as-user model represent this stage. The challenge shifts from developer productivity to user trust, reliability at scale, and multi-model governance.

Stage 4 — Autonomous agentic systems. AI agents handle end-to-end tasks autonomously with humans in a supervisory role. Delivery Hero’s Herogen represents this stage for software delivery. The challenge shifts to escalation design, council architectures to catch blind spots, and ROI measurement on autonomous output.

Key takeaways

Pick one high-leverage insertion point and go deep: the companies making the most progress chose pipeline automation, developer org governance, or end-user product — not all three at once
A council-of-agents architecture — multiple models reviewing the same output — reduces the risk that any single model’s blind spots reach production
Centralized prompt repositories and governance infrastructure need to exist before broad rollout, not after; retrofitting governance onto ad hoc adoption is significantly harder
Giving AI agents the same user status, permissions, and audit trails as human users is a prerequisite for enterprise trust in agentic workflows
Model churn is an ongoing operational cost: design decisions made against one model’s limitations need periodic re-evaluation as capabilities improve

Build it yourself

Follow these exact steps to reproduce it yourself · estimated time: ~20 min

Prerequisites

Access to your organization's current engineering workflow documentation
A rough sense of where engineering time is most consistently wasted
Stakeholder alignment on one function to target first (delivery pipeline, developer org tooling, or end-user product)

Use this guide to design your enterprise AI adoption insertion point — deciding where to go deep before spreading thin.

Step 1 — Map your three candidate insertion points

Write down one concrete opportunity in each of the three categories the panel companies represent:

1. Pipeline / autonomous agents:
   What repetitive, well-defined task in our delivery pipeline
   could an agent handle end-to-end?
   (Examples: dependency bumps, migration scripts, test generation,
   changelog drafting, PR description writing)

2. Developer org governance:
   What AI workflow, if standardized and vetted, would most
   reduce the variance in how engineers currently use AI?
   (Examples: code review prompts, documentation generation,
   onboarding commands, debugging subagents)

3. End-user product:
   What capability, if AI-powered, would unlock value for
   users who currently can't access it because it requires
   technical skill?
   (Examples: custom report building, workflow automation,
   data transformation, natural language querying)

Step 2 — Score each opportunity on three dimensions

For each candidate, score 1-3 on:

Observability — can you measure success clearly? (Delivery Hero’s PR merge rate = 3; vague “productivity improvement” = 1)
Scope clarity — is the task well-defined enough for an agent or standard workflow to handle reliably? (dependency bumps = 3; “help engineers write better code” = 1)
Governance readiness — does your org have the infrastructure to govern this? Regulated industries score pipeline automation lower unless escalation paths exist.

Pick the opportunity with the highest combined score.
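The scoring step is simple enough to do on paper, but a short script keeps the comparison honest. The candidates and scores below are made-up examples, not recommendations.

```python
from typing import Dict

# Hypothetical candidates, each scored 1-3 on the three dimensions.
candidates: Dict[str, Dict[str, int]] = {
    "pipeline: dependency bumps": {
        "observability": 3, "scope_clarity": 3, "governance_readiness": 1,
    },
    "dev org: shared code review prompts": {
        "observability": 2, "scope_clarity": 3, "governance_readiness": 3,
    },
    "product: natural language reports": {
        "observability": 2, "scope_clarity": 1, "governance_readiness": 2,
    },
}


def combined_score(scores: Dict[str, int]) -> int:
    # Unweighted sum; weight the dimensions if one matters more in your org.
    return sum(scores.values())


best = max(candidates, key=lambda name: combined_score(candidates[name]))

for name, scores in candidates.items():
    print(f"{combined_score(scores)}  {name}")
print("Pick:", best)
```

Note that `max` returns the first candidate on a tie, so if two options score the same, treat that as a prompt for discussion rather than a decision.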

Step 3 — Define your success metric before you build

The mistake is building first and measuring later. Before writing a single prompt or line of code, write down:

Our success metric is: [specific, directly observable number]
This metric will be at least [X] after [Y] weeks of operation.
We will know the AI specifically caused the improvement because: [mechanism]

Herogen’s metric: “ratio of merged to rejected PRs on Herogen-submitted code.” It is directly observable, causally attributable, and would not improve without Herogen specifically.
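A metric like this is easy to compute once merged and rejected PRs are counted. The sketch below assumes the percentage form is merged / (merged + rejected); the source does not spell out the exact formula, and the counts are illustrative.

```python
def merge_rate(merged: int, rejected: int) -> float:
    """Share of resolved agent PRs that were merged."""
    total = merged + rejected
    if total == 0:
        raise ValueError("no resolved PRs yet")
    return merged / total


def meets_target(merged: int, rejected: int, target: float) -> bool:
    # Compare the observed rate with the target written down before building.
    return merge_rate(merged, rejected) >= target


# Illustrative counts: 85 merged, 15 rejected.
print(f"{merge_rate(85, 15):.0%}")
print(meets_target(85, 15, target=0.80))
```

Writing the target into the check before the pilot starts makes it harder to move the goalposts after the results come in.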

Step 4 — Design the governance structure before rollout

For each insertion type, the governance structure is different:

For autonomous agents (pipeline):

Define which tasks are in-scope for autonomous operation (explicit list, not “anything routine”)
Define the escalation trigger: what causes the agent to stop and ask a human?
Require a human final-approval step on every merge — not as a bottleneck, as a guard
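The three rules above can be encoded as a single decision function. This is a sketch with assumed values: the in-scope task names and the failed-attempt threshold are hypothetical placeholders to replace with your own.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ESCALATE = "escalate"
    AWAIT_APPROVAL = "await_approval"
    MERGE = "merge"


# Explicit in-scope list (hypothetical task names), not "anything routine".
IN_SCOPE_TASKS = frozenset({"dependency_bump", "changelog_draft", "test_generation"})

# Hypothetical escalation trigger: too many failed attempts on one task.
MAX_FAILED_ATTEMPTS = 2


@dataclass
class AgentTask:
    task_type: str
    failed_attempts: int
    human_approved: bool


def decide(task: AgentTask) -> Decision:
    if task.task_type not in IN_SCOPE_TASKS:
        return Decision.ESCALATE          # out of scope: hand to a human
    if task.failed_attempts > MAX_FAILED_ATTEMPTS:
        return Decision.ESCALATE          # agent is stuck: stop and ask
    if not task.human_approved:
        return Decision.AWAIT_APPROVAL    # never merge without sign-off
    return Decision.MERGE


print(decide(AgentTask("dependency_bump", 0, human_approved=True)))
print(decide(AgentTask("schema_redesign", 0, human_approved=True)))
```

Ordering matters here: scope and escalation checks run before the approval check, so an out-of-scope task is escalated even if someone has already approved it.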

For developer org tooling:

Designate a platform team or person responsible for the centralized prompt/command repository
Establish a validation process: before adding a workflow to the shared repo, it must be tested on at least N real tasks
Plan for model updates: who re-validates the shared workflows when Claude updates?
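The validation and re-validation rules can be tracked with a small registry. The sketch below is one possible shape, not a description of any company's tooling; the workflow names, model labels, and the value chosen for N are placeholders.

```python
from dataclasses import dataclass
from typing import List

# The "N" from the validation rule above; the value here is a placeholder.
MIN_VALIDATION_TASKS = 5


@dataclass
class Workflow:
    name: str
    validated_on_model: str   # model version the workflow was last tested on
    passed_tasks: int         # real tasks it passed during validation


def approved_for(workflow: Workflow, current_model: str) -> bool:
    # Approved only if tested enough AND tested on the model now in use.
    return (
        workflow.passed_tasks >= MIN_VALIDATION_TASKS
        and workflow.validated_on_model == current_model
    )


def needs_revalidation(workflows: List[Workflow], current_model: str) -> List[str]:
    return [w.name for w in workflows if not approved_for(w, current_model)]


shared_repo = [
    Workflow("code-review", validated_on_model="model-v2", passed_tasks=8),
    Workflow("doc-generation", validated_on_model="model-v1", passed_tasks=12),
    Workflow("debug-subagent", validated_on_model="model-v2", passed_tasks=3),
]

print(needs_revalidation(shared_repo, current_model="model-v2"))
```

Running this after every model update gives the platform owner a concrete to-do list instead of relying on engineers to notice regressions.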

For end-user product AI:

Define what data the model can and cannot access per user role
Give AI agents the same audit trail as human users — not a separate system
Set reliability expectations with users before launch: what happens when the AI is wrong?

Step 5 — Run a two-week pilot, then decide to expand or kill

A two-week pilot with a small team generates real data. At the end:

Did the metric improve? By how much?
What broke that you didn’t expect?
What would need to change before expanding to the full org?

The council-of-agents approach is worth piloting even at small scale: run your pilot outputs through two different models and compare. The differences surface your primary model’s blind spots before you deploy at volume.
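A small-scale version of that comparison needs nothing more than a line diff. The sketch below assumes you already have two text outputs for the same task; the sample outputs are invented, and how you obtain them from each model is left out.

```python
import difflib
from typing import List


def disagreements(output_a: str, output_b: str) -> List[str]:
    """Return lines present in only one of the two outputs."""
    diff = difflib.ndiff(output_a.splitlines(), output_b.splitlines())
    # ndiff prefixes: "- " only in A, "+ " only in B, "  " shared, "? " hints.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented outputs from two models given the same task.
model_a_output = "def div(a, b):\n    return a / b"
model_b_output = (
    "def div(a, b):\n"
    "    if b == 0:\n"
    "        raise ValueError\n"
    "    return a / b"
)

for line in disagreements(model_a_output, model_b_output):
    print(line)
```

In this example the second output adds a zero-division guard that the first omits. A textual diff only surfaces candidates, though; a human or a reviewing model still has to judge which version is right.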

Where to go next

AI-Native Engineering Org — Fiona Fung on how coding without the bottleneck changes every upstream and downstream process
Building Effective Agents — the core patterns for building reliable agentic systems, directly applicable to Herogen-style architectures
Agents That Remember — how to give autonomous agents the persistent memory they need to handle complex, multi-step tasks

Related lessons

Build AI Agents with Claude in Microsoft Azure AI Foundry

Integrating Claude Managed Agents with Enterprise Tools: The Asana AI Teammates Pattern

How Anthropic's GTM Engineering Team Uses Claude
