Building AI-Native at Enterprise Scale: Lessons from monday.com, Doctolib, and Delivery Hero
Three of Europe's fastest-scaling tech companies made three different bets on Claude. Discover how Delivery Hero built an autonomous agent merging 100+ PRs a day, Doctolib governs AI across its entire healthcare engineering org, and monday.com ships AI directly to users who've never written code.
This lesson is original educational writing based on this video by Claude (published May 20, 2026). All credit for the original content goes to the creators.
1. Three bets, one common ambition
In May 2026, monday.com, Doctolib, and Delivery Hero sat down at an Anthropic panel to share what they had actually built — not roadmaps, but systems running in production. Three European companies, three different industries, three different bets on Claude. The outcome: no single playbook for enterprise AI, but a set of converging patterns that every org building at scale will eventually face.
The meta-point from the panel: the companies making the most progress are those that picked a specific, high-leverage insertion point and built deeply, rather than spreading AI thinly across every workflow. Each company chose a different insertion point.
- Delivery Hero bet on the software delivery pipeline itself — could an autonomous agent handle the end-to-end task of writing, testing, and merging code?
- Doctolib bet on developer productivity at org scale — could AI tooling be standardized and governed well enough for a healthcare company to roll it out to every engineer?
- monday.com bet on the product surface — could Claude be embedded inside monday.com so that non-technical end users could build software with plain English?
2. Delivery Hero: the autonomous agent that replaced 130 engineers’ output
Delivery Hero’s bet was the boldest structurally. Rather than giving engineers a coding assistant, they built an autonomous agent — Herogen — that receives tasks in natural language and handles the entire software delivery lifecycle: writing, testing, iterating, and merging.
The numbers as of April 2026: Herogen merges over 100 pull requests per day, about 9% of all PRs, with an 85% success rate. That rate means that in 85 out of 100 tasks, Herogen autonomously merged a correct implementation with at most one human interaction. At a volume of 100+ PRs per day, this frees an estimated 250,000 engineering hours annually, equivalent to the output of 130 senior engineers.
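The quoted figures imply a few derived numbers that are worth sanity-checking. The sketch below is only back-of-the-envelope arithmetic on the numbers above; it assumes "100+" means exactly 100 and "about 9%" means 0.09, so every result is approximate.

```python
# Back-of-the-envelope check of the Herogen figures quoted above.
# Assumption: "100+ PRs/day" is treated as exactly 100 and "about 9%" as 0.09,
# so every derived value is approximate.

HEROGEN_PRS_PER_DAY = 100
HEROGEN_SHARE_OF_ALL_PRS = 0.09
SUCCESS_RATE = 0.85
HOURS_FREED_PER_YEAR = 250_000
EQUIVALENT_ENGINEERS = 130


def implied_total_prs_per_day() -> float:
    """Total daily PRs across the org if 100 PRs are 9% of them."""
    return HEROGEN_PRS_PER_DAY / HEROGEN_SHARE_OF_ALL_PRS


def implied_hours_per_engineer() -> float:
    """Annual hours attributed to each 'equivalent senior engineer'."""
    return HOURS_FREED_PER_YEAR / EQUIVALENT_ENGINEERS


def tasks_outside_success_definition(tasks: int) -> float:
    """Tasks in a batch that needed more than one human interaction or failed."""
    return tasks * (1 - SUCCESS_RATE)


if __name__ == "__main__":
    print(f"Implied total PRs/day: {implied_total_prs_per_day():.0f}")
    print(f"Implied hours per engineer per year: {implied_hours_per_engineer():.0f}")
    print(f"Tasks per 100 outside the 85%: {tasks_outside_success_definition(100):.0f}")
```

Under these assumptions the organization merges roughly 1,100 PRs per day in total, and the "130 senior engineers" equivalence works out to a little over 1,900 hours per engineer per year, which is close to a full working year.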
The architecture. Herogen runs on Claude Opus 4.5 as its primary coding model, deployed via Google Cloud’s Vertex AI. Before any human review, a “council of agents” — built on multiple LLMs from different providers including both Claude and Gemini — reviews the code from different perspectives. The council approach addresses a structural weakness of single-model systems: any individual model has blind spots in its training data. Running the same code through models with different training reduces the chance that a shared blind spot makes it through.
The human-in-the-loop design. The final review is still human. Herogen doesn’t merge without a human approval step. The design choice is deliberate: Herogen handles routine, well-scoped tasks end-to-end, while Claude Code handles the exploratory work — building new projects from scratch, experimenting with design approaches — where the back-and-forth between developer and model is part of the value.
3. Doctolib: governing AI across a healthcare engineering org
Doctolib is Europe’s leading healthcare technology platform, serving 420,000 health professionals and 90 million patients across France, Germany, and Italy. The compliance constraints alone make AI adoption harder than at most companies: any tooling used in the engineering org touches code that eventually runs clinical workflows.
The productivity challenge Doctolib faced was specific: administrative tasks — writing documentation, creating tests, reviewing PRs — were consuming a disproportionate share of engineering time. The fix couldn’t be chaotic individual adoption of AI tools; it had to be governed.
The centralized repository model. Doctolib’s platform team built and maintains a centralized repository of prompts, custom commands, and subagents — tested and approved workflows that every engineer pulls as part of their initial Claude Code setup. This means:
- Every engineer starts with proven, reusable workflows on day one
- Standard workflows include documentation, testing, code review, and debugging patterns
- New hires onboard to unfamiliar codebases in days rather than weeks
The CI-driven documentation approach. One of the most operationally durable patterns Doctolib implemented: every code change triggers a CI job that automatically updates the relevant technical documentation. Documentation that lives outside the update loop goes stale; by making the doc update automatic and blocking on CI, Doctolib’s technical docs stay current without a dedicated documentation process.
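One way to picture the blocking part of this pattern is a small CI helper that maps changed source files to the doc pages that must be refreshed in the same change. This is a sketch, not Doctolib's implementation: the `src/<module>/...` to `docs/<module>.md` convention and both function names are hypothetical, and the step that actually rewrites the docs is not shown.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set


def docs_to_regenerate(changed_files: Iterable[str]) -> Set[str]:
    """Map changed source files to the doc pages a CI job should refresh.

    Assumed convention (hypothetical): src/<module>/... is documented in
    docs/<module>.md.
    """
    targets = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        if len(parts) >= 3 and parts[0] == "src":
            targets.add(f"docs/{parts[1]}.md")
    return targets


def stale_docs(changed_files: Iterable[str]) -> Set[str]:
    """Doc pages whose source changed but which were not updated alongside it."""
    changed = set(changed_files)
    return docs_to_regenerate(changed) - changed


if __name__ == "__main__":
    # A CI job could fail the build whenever this set is non-empty.
    print(stale_docs(["src/billing/invoice.py", "README.md"]))
```

In a pipeline, the job would first run the doc update for every page returned by `docs_to_regenerate`, then fail if `stale_docs` still reports anything.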
What governance at healthcare scale requires. Unlike a typical software company, Doctolib cannot treat AI tooling as “move fast and see what happens.” Their centralized model means the platform team vets each workflow addition — they own the quality bar so individual engineers don’t have to rediscover it. The tradeoff: less bottom-up experimentation, more reliable baseline.
4. monday.com: shipping AI to users who’ve never written code
monday.com’s AI bet is the most structurally distinct from the other two: it is not an internal developer tool, it is a product feature shipped to paying customers who are primarily non-technical.
The flagship capability is monday vibe — a vibe coding environment embedded in the monday.com platform. Product managers, operations leads, and marketers can describe what they want in plain English and receive a working custom app inside monday.com. The audience has never written code and doesn’t need to.
This represents a different kind of enterprise AI challenge than Delivery Hero or Doctolib faced. It is not about making engineers more productive. It is about whether non-technical users trust AI enough to use it for real work, whether the results are reliable enough for enterprise data, and whether the platform can govern what users build.
The multi-model gateway. monday.com connects to Claude, ChatGPT, Copilot, and Gemini, giving enterprise customers a choice of model. The AI Platform Gateway matches the right model to the right task within a workflow. This is a significant architectural choice: rather than committing to one provider, monday.com built an abstraction layer that insulates users from model transitions — important given how rapidly model capabilities shift.
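The abstraction-layer idea can be illustrated with a small routing class: callers ask for a task type, and the gateway decides which backend serves it. This is a minimal sketch under stated assumptions, not monday.com's AI Platform Gateway; `ModelGateway`, its methods, and the backend names are all hypothetical, and the backends are plain callables rather than real provider SDK calls.

```python
from typing import Callable, Dict

# A backend is any callable from prompt to text. Real backends would call
# provider SDKs; plain functions keep this sketch self-contained.
Backend = Callable[[str], str]


class ModelGateway:
    def __init__(self, backends: Dict[str, Backend], routes: Dict[str, str], default: str):
        unknown = ({default} | set(routes.values())) - set(backends)
        if unknown:
            raise ValueError(f"unknown backends: {sorted(unknown)}")
        self._backends = dict(backends)
        self._routes = dict(routes)  # task type -> backend name
        self._default = default

    def backend_for(self, task_type: str) -> str:
        """Name of the backend that serves this task type."""
        return self._routes.get(task_type, self._default)

    def run(self, task_type: str, prompt: str) -> str:
        return self._backends[self.backend_for(task_type)](prompt)

    def reroute(self, task_type: str, backend: str) -> None:
        """Swap the model behind a task type without touching callers."""
        if backend not in self._backends:
            raise ValueError(f"unknown backend: {backend}")
        self._routes[task_type] = backend
```

The design point is that a model transition becomes a single `reroute` call: code that invokes `run("summarize", ...)` never names a provider, so it does not change when the model behind that task does.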
AI agents as first-class users. monday.com made a structural product decision: AI agents have full user status on the platform, with the same permissions, audit trails, and governance as human users. This is not cosmetic — it means AI agents can be assigned to boards, given tasks, and held accountable in the same way a human team member would be. Enterprise customers need that accountability structure before they trust an agent with production workflows.
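Treating agents as first-class users amounts to running humans and agents through the same permission check and the same audit log. The data model below is a sketch of that idea only; `Principal`, `AuditLog`, `perform`, and the permission strings are hypothetical names, not monday.com's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Principal:
    """A human or an AI agent; both go through the same checks."""
    name: str
    kind: str  # "human" or "agent"
    permissions: frozenset


@dataclass
class AuditLog:
    entries: List[dict] = field(default_factory=list)

    def record(self, principal: Principal, action: str, allowed: bool) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "who": principal.name,
            "kind": principal.kind,
            "action": action,
            "allowed": allowed,
        })


def perform(principal: Principal, action: str, log: AuditLog) -> bool:
    """Check permission and audit the attempt, identically for humans and agents."""
    allowed = action in principal.permissions
    log.record(principal, action, allowed)
    return allowed
```

There is deliberately no agent-specific code path: the only difference between a human and an agent is the value of `kind`, which is what lets one audit trail answer "who did this?" for both.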
5. The shared scaling challenges
Despite their different insertion points, all three companies hit similar friction points at scale.
Model churn. Models improve every few months, and enterprise systems built on a specific model version need to be updated. Delivery Hero’s council-of-agents architecture and monday.com’s multi-model gateway are both partial answers to this: if your system isn’t tightly coupled to one model version, transitions are cheaper. Doctolib’s centralized repo approach means there’s one team responsible for validating that updated models still produce correct outputs on their standard workflows — rather than 500 engineers discovering it ad hoc.
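Centralized re-validation can be as simple as replaying each standard workflow against a candidate model and checking the output. The sketch below assumes each workflow is stored as a prompt plus a pass/fail check; that storage format, the `revalidate` name, and the toy checks are illustrative assumptions, and the models are plain callables rather than real API clients.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # prompt -> output
Check = Callable[[str], bool]  # output -> does it still meet the bar?


def revalidate(workflows: Dict[str, Tuple[str, Check]], candidate: Model) -> List[str]:
    """Run every standard workflow prompt through a candidate model.

    Returns the names of workflows whose output fails its check, in the
    order the workflows were registered.
    """
    failures = []
    for name, (prompt, check) in workflows.items():
        if not check(candidate(prompt)):
            failures.append(name)
    return failures


if __name__ == "__main__":
    workflows = {
        "docstring": ("Write a docstring for f()", lambda out: '"""' in out),
        "unit_test": ("Write a test for f()", lambda out: "assert" in out),
    }
    # Stand-in for a new model version that has stopped emitting assertions.
    candidate = lambda prompt: '"""Does something."""'
    print(revalidate(workflows, candidate))
```

An empty list means the platform team can move the shared workflows to the new model version; a non-empty list names exactly which workflows need attention first.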
Measuring ROI at enterprise scale. All three companies faced the same measurement challenge: how do you attribute output improvement to AI specifically when engineering teams are also improving their processes in other ways? Delivery Hero’s approach — PR merge rate, hours freed — is the most legible because Herogen’s output is directly countable. Doctolib measures onboarding time to first PR and documentation staleness. monday.com measures feature adoption rates by non-technical users. The lesson: choose a metric that is directly observable and that would not have improved without the AI capability specifically.
Governance before scale. The mistake several companies made before these three was rolling out AI tooling without governance infrastructure: no shared prompts, no policy on what data the model could access, no escalation path when the model was wrong. All three companies built governance before broad rollout. Delivery Hero’s human final-review gate, Doctolib’s centralized command repo, and monday.com’s agent-as-user model are all governance structures, not afterthoughts.
Staying ahead of the model. The hardest operational challenge is that the capability floor keeps rising. A workflow you built around a model’s limitations in January may be unnecessarily constrained by March. All three companies flagged this as an ongoing cost: you have to periodically re-evaluate your design decisions against the current model, not the model you first built on.
Check your understanding
1. Why did Delivery Hero build a "council of agents" with multiple LLM providers rather than relying on Claude alone?
2. What is the key operational benefit of Doctolib's centralized prompt repository approach?
3. What structural decision did monday.com make about AI agents that distinguishes their approach from a simple chatbot integration?
4. Herogen achieves an 85% autonomous PR merge rate. What does this number specifically measure?
5. Which of the following is a shared challenge that all three companies encountered when scaling enterprise AI?
6. What enterprise AI maturity actually looks like
Taken together, the three companies illustrate a maturity curve for enterprise AI adoption. It is not a linear ladder — a company might reach Stage 3 in one function while staying at Stage 1 in another. But the progression shows where each stage breaks down and what unlocks the next.
Stage 1 — Ad hoc individual use. Engineers discover AI tools on their own, use them inconsistently, and there is no shared infrastructure. Output quality varies by individual. This is where most large enterprises were in 2024-2025.
Stage 2 — Standardized tooling. The organization picks a set of tools, builds shared prompts and workflows, and provides a governed starting point for every engineer. Doctolib’s centralized repository model represents this stage operating at healthcare-grade governance. The bottleneck shifts from “can we use AI at all” to “can we maintain and update the standard workflows as models improve.”
Stage 3 — Embedded product AI. AI capabilities appear in the product itself, not just in the development process. monday.com’s monday vibe and agent-as-user model represent this stage. The challenge shifts from developer productivity to user trust, reliability at scale, and multi-model governance.
Stage 4 — Autonomous agentic systems. AI agents handle end-to-end tasks autonomously with humans in a supervisory role. Delivery Hero’s Herogen represents this stage for software delivery. The challenge shifts to escalation design, council architectures to catch blind spots, and ROI measurement on autonomous output.
Build it yourself
Follow these exact steps to reproduce it yourself · estimated time: ~20 min
Prerequisites
- Access to your organization's current engineering workflow documentation
- A rough sense of where engineering time is most consistently wasted
- Stakeholder alignment on one function to target first (internal tooling, developer productivity, or end-user product)
Use this guide to design your enterprise AI adoption insertion point — deciding where to go deep before spreading thin.
Step 1 — Map your three candidate insertion points
Write down one concrete opportunity in each of the three categories the panel companies represent:
1. Pipeline / autonomous agents:
   What repetitive, well-defined task in our delivery pipeline could an agent handle end-to-end?
   (Examples: dependency bumps, migration scripts, test generation, changelog drafting, PR description writing)
2. Developer org governance:
   What AI workflow, if standardized and vetted, would most reduce the variance in how engineers currently use AI?
   (Examples: code review prompts, documentation generation, onboarding commands, debugging subagents)
3. End-user product:
   What capability, if AI-powered, would unlock value for users who currently can't access it because it requires technical skill?
   (Examples: custom report building, workflow automation, data transformation, natural language querying)
Step 2 — Score each opportunity on three dimensions
For each candidate, score 1-3 on:
- Observability — can you measure success clearly? (Delivery Hero’s PR merge rate = 3; vague “productivity improvement” = 1)
- Scope clarity — is the task well-defined enough for an agent or standard workflow to handle reliably? (dependency bumps = 3; “help engineers write better code” = 1)
- Governance readiness — does your org have the infrastructure to govern this? Regulated industries score pipeline automation lower unless escalation paths exist.
Pick the opportunity with the highest combined score.
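The scoring step is simple enough to write down as code, which also forces the scores to be explicit. This is a sketch: the `Candidate` class, the tie-breaking rule, and the example scores are illustrative assumptions, not values from the panel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    observability: int         # 1-3
    scope_clarity: int         # 1-3
    governance_readiness: int  # 1-3

    def __post_init__(self):
        for value in (self.observability, self.scope_clarity, self.governance_readiness):
            if value not in (1, 2, 3):
                raise ValueError("scores must be 1, 2, or 3")

    @property
    def total(self) -> int:
        return self.observability + self.scope_clarity + self.governance_readiness


def pick(candidates: List[Candidate]) -> Candidate:
    """Highest combined score wins; ties go to the earlier entry."""
    return max(candidates, key=lambda c: c.total)


if __name__ == "__main__":
    # Example scores are made up for illustration.
    options = [
        Candidate("agent for dependency bumps", 3, 3, 2),
        Candidate("shared code review prompts", 2, 2, 3),
        Candidate("natural language reports", 2, 1, 2),
    ]
    best = pick(options)
    print(best.name, best.total)
```

If two candidates tie, it is worth revisiting the governance readiness score first, since that is the dimension most likely to block a rollout later.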
Step 3 — Define your success metric before you build
The mistake is building first and measuring later. Before writing a single prompt or line of code, write down:
Our success metric is: [specific, directly observable number]
This metric will be at least [X] after [Y] weeks of operation.
We will know the AI specifically caused the improvement because: [mechanism]
Herogen’s metric: “ratio of merged to rejected PRs on Herogen-submitted code.” It is directly observable, causally attributable, and would not improve without Herogen specifically.
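A merged-to-rejected metric is easy to compute, but it helps to pin down the definition before the pilot starts, because a ratio and a rate give different numbers for the same data. The helpers below are a sketch with hypothetical names; the counts in the example are made up.

```python
def merge_ratio(merged: int, rejected: int) -> float:
    """Merged-to-rejected ratio on agent-submitted PRs (4.0 means 4 merged per rejection)."""
    if merged < 0 or rejected < 0:
        raise ValueError("counts must be non-negative")
    if rejected == 0:
        return float("inf")
    return merged / rejected


def acceptance_rate(merged: int, rejected: int) -> float:
    """Share of agent-submitted PRs that were merged, between 0 and 1."""
    total = merged + rejected
    if total <= 0 or merged < 0 or rejected < 0:
        raise ValueError("need at least one PR and non-negative counts")
    return merged / total


def target_met(observed: float, target: float) -> bool:
    """Compare the observed value with the target written down before building."""
    return observed >= target


if __name__ == "__main__":
    merged, rejected = 40, 10  # made-up pilot counts
    print(merge_ratio(merged, rejected), acceptance_rate(merged, rejected))
```

With 40 merged and 10 rejected PRs, the ratio is 4.0 while the acceptance rate is 0.8, so the target in the template should state which of the two it refers to.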
Step 4 — Design the governance structure before rollout
For each insertion type, the governance structure is different:
For autonomous agents (pipeline):
- Define which tasks are in-scope for autonomous operation (explicit list, not “anything routine”)
- Define the escalation trigger: what causes the agent to stop and ask a human?
- Require a human final-approval step on every merge — not as a bottleneck, as a guard
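The three rules for autonomous agents can be expressed as one small gate that every task passes through. This is a sketch under stated assumptions: the task names in the allowlist are illustrative, and the escalation trigger (too many failed test runs) and its threshold are hypothetical choices, not Herogen's actual rules.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"
    OUT_OF_SCOPE = "out_of_scope"


# Explicit allowlist rather than "anything routine"; entries are illustrative.
IN_SCOPE_TASKS = frozenset({"dependency_bump", "changelog_draft", "test_generation"})
MAX_FAILED_TEST_RUNS = 2  # hypothetical escalation threshold


@dataclass
class TaskState:
    task_type: str
    failed_test_runs: int = 0
    human_approved: bool = False


def next_step(state: TaskState) -> Decision:
    """Decide whether the agent may keep working on this task."""
    if state.task_type not in IN_SCOPE_TASKS:
        return Decision.OUT_OF_SCOPE
    if state.failed_test_runs > MAX_FAILED_TEST_RUNS:
        return Decision.ESCALATE
    return Decision.PROCEED


def may_merge(state: TaskState) -> bool:
    """Merging needs in-scope work, no pending escalation, and human approval."""
    return next_step(state) is Decision.PROCEED and state.human_approved
```

Keeping `may_merge` separate from `next_step` makes the human approval a hard precondition for merging rather than one more heuristic the agent can weigh.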
For developer org tooling:
- Designate a platform team or person responsible for the centralized prompt/command repository
- Establish a validation process: before adding a workflow to the shared repo, it must be tested on at least N real tasks
- Plan for model updates: who re-validates the shared workflows when Claude updates?
For end-user product AI:
- Define what data the model can and cannot access per user role
- Give AI agents the same audit trail as human users — not a separate system
- Set reliability expectations with users before launch: what happens when the AI is wrong?
Step 5 — Run a two-week pilot, then decide to expand or kill
A two-week pilot with a small team generates real data. At the end:
- Did the metric improve? By how much?
- What broke that you didn’t expect?
- What would need to change before expanding to the full org?
The council-of-agents approach is worth piloting even at small scale: run your pilot outputs through two different models and compare. The differences surface your primary model’s blind spots before you deploy at volume.
Where to go next
- AI-Native Engineering Org — Fiona Fung on how coding without the bottleneck changes every upstream and downstream process
- Building Effective Agents — the core patterns for building reliable agentic systems, directly applicable to Herogen-style architectures
- Agents That Remember — how to give autonomous agents the persistent memory they need to handle complex, multi-step tasks