Vibe Coding in Prod — Responsibly
Erik Schluntz merged a 22,000-line largely Claude-written change into a production RL codebase. This lesson extracts the discipline that makes that safe: you stop being the code writer and become the system designer and verifier.
This lesson is original educational writing based on this video by Anthropic (published May 22, 2025). All credit for the original content goes to the creators.
1. The 22,000-line merge
Erik Schluntz (Anthropic technical staff) tells the story plainly: his team merged a change of roughly 22,000 lines into a production reinforcement-learning codebase, with the vast majority written by Claude. Not a prototype, not a demo — production. The talk is about why that wasn’t reckless.
“Vibe coding” — letting the model write code while you steer at a higher level — has a bad reputation because people import the speed without the system. Erik’s framing of the system: you stop being the code writer and become the system designer and the verifier. Your output is no longer lines of code; it’s specifications, interfaces, constraints and verification. The model’s output is lines of code. The safety comes from never confusing the two jobs.
2. Where vibe coding is safe (and where it isn’t)
The talk is explicit that this works because they chose the right ground. Ask three questions about any code you’re tempted to vibe:
- Is the blast radius contained? Leaf code — tooling, visualization, data pipelines, experiment harnesses — fails loudly and locally. Core abstractions, auth, billing and migrations fail everywhere at once: keep human hands on those.
- Can correctness be checked from the outside? If behavior can be pinned by tests, types and observable outputs, you can verify without reading every line. If correctness lives in subtle invariants nobody wrote down, write them down first — or don’t vibe it.
- Will a human be able to change it later? Insist on clean interfaces and small modules so the next change (by human or model) doesn’t require archaeology.
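The three questions can be read as a small triage rule. The sketch below is one possible encoding, not something shown in the talk: the `Candidate` type, the `triage` function and the three verdict labels are hypothetical names chosen for illustration.

```typescript
// Hypothetical encoding of the three questions; all names here are illustrative.
interface Candidate {
  name: string;
  blastRadiusContained: boolean; // leaf code that fails loudly and locally
  externallyCheckable: boolean;  // behavior can be pinned by tests, types, outputs
  changeableLater: boolean;      // clean interfaces and small modules
}

type Verdict = "vibe" | "prepare-first" | "keep-human";

function triage(c: Candidate): Verdict {
  // Core abstractions, auth, billing, migrations: keep human hands on those.
  if (!c.blastRadiusContained) return "keep-human";
  // Unwritten invariants or tangled structure: fix that before vibing.
  if (!c.externallyCheckable || !c.changeableLater) return "prepare-first";
  return "vibe";
}

const candidates: Candidate[] = [
  { name: "plotting script", blastRadiusContained: true, externallyCheckable: true, changeableLater: true },
  { name: "billing migration", blastRadiusContained: false, externallyCheckable: true, changeableLater: true },
  { name: "pipeline with unwritten invariants", blastRadiusContained: true, externallyCheckable: false, changeableLater: true },
];

for (const c of candidates) {
  console.log(`${c.name}: ${triage(c)}`);
}
```

Checking blast radius first is a deliberate choice in this sketch: a wide blast radius rules a task out regardless of how well tested it is.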
3. Verification is the product
If you write less of the code, you must own more of the proof. The practices behind the big merge:
- Spec before prompt. The team wrote down what the system should do — interfaces, invariants, failure behavior — and prompted from the spec. A model amplifies whatever clarity (or vagueness) you give it.
- Tests the human trusts. AI can write tests too, but you review the tests with full attention even when you skim the implementation. Tests are the contract; the implementation is negotiable. Beware the failure mode of letting the model “fix” a failing test by weakening it.
- Mechanical gates that don’t get tired. Strict types, linters and CI catch the whole class of plausible-but-wrong code that human reviewers approve at line 1,800 of a big diff. Every gate you automate is review attention you get back.
- Review the shape, not every line. For large AI-written changes, review like an architect: are the interfaces right? Is behavior covered by tests you trust? Are there surprising dependencies? Line-by-line reading of 22,000 lines was never going to happen — and that’s fine if and only if the other layers exist.
- Let go gradually. The talk’s title says “responsibly” because trust is earned per-area: start with leaf code, watch the failure rate, expand scope as the verification net proves itself.
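"Let go gradually" can be made mechanical by tracking, per area, how often a defect escaped the verification net. The sketch below is an assumption about how a team might do that. The `AreaRecord` shape and both thresholds are hypothetical and are not taken from the talk.

```typescript
// Hypothetical per-area record of AI-written changes that went through the net.
interface AreaRecord {
  area: string;
  merged: number;         // AI-written changes merged in this area
  escapedDefects: number; // defects found after merge that the gates missed
}

// Assumed thresholds; tune them to your own risk tolerance.
const MIN_SAMPLES = 10;
const MAX_ESCAPE_RATE = 0.05;

// True when there is enough evidence, and a low enough escape rate,
// to consider widening the scope of vibe coding beyond this area.
function readyToExpand(r: AreaRecord): boolean {
  if (r.merged < MIN_SAMPLES) return false; // not enough evidence yet
  return r.escapedDefects / r.merged <= MAX_ESCAPE_RATE;
}

const history: AreaRecord[] = [
  { area: "experiment harness", merged: 20, escapedDefects: 0 },
  { area: "data pipeline", merged: 20, escapedDefects: 3 },
  { area: "visualization", merged: 5, escapedDefects: 0 },
];

for (const r of history) {
  console.log(`${r.area}: ${readyToExpand(r) ? "expand" : "hold"}`);
}
```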
Check your understanding
1. What is the fundamental role change the talk argues for?
2. Which area is the WORST candidate for vibe coding, per the lesson's three questions?
3. Why review AI-written tests more carefully than AI-written implementation?
4. What made merging a 22,000-line mostly-AI-written change defensible?
Build it yourself
Follow these exact steps to reproduce it yourself · estimated time: ~45 minutes
Prerequisites
- A project with CI (GitHub Actions or similar)
- Claude Code installed
Build the verification net once; vibe responsibly forever after. We’ll harden one project and then do a contained vibe-coded change through the net.
Step 1 — Map your blast radius
In your repo, start claude and give it this prompt:
    Classify the top-level modules of this codebase into: (a) leaf code: contained
    blast radius, externally checkable; (b) core: shared abstractions, auth,
    data integrity. Output a two-column table with one-line justifications.
Save the output as VIBE_MAP.md. This is your “where it’s allowed” policy.
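Once the map exists, it can be checked against the files a change touches. The following is a minimal sketch under stated assumptions: the table has been transcribed by hand into a TypeScript object, the paths are invented for illustration, and `zoneOf` and `needsHumanAuthor` are hypothetical helper names.

```typescript
type Zone = "leaf" | "core";

// Hypothetical transcription of VIBE_MAP.md; these paths are made up.
const vibeMap: Record<string, Zone> = {
  "tools/": "leaf",
  "viz/": "leaf",
  "src/auth/": "core",
  "src/db/migrations/": "core",
};

// Longest matching prefix wins. Paths the map does not cover are treated
// as core, so the policy fails closed rather than open.
function zoneOf(path: string, map: Record<string, Zone>): Zone {
  let best = "";
  for (const prefix of Object.keys(map)) {
    if (path.startsWith(prefix) && prefix.length > best.length) best = prefix;
  }
  return map[best] ?? "core";
}

// Files in a change that the policy says a human should author.
function needsHumanAuthor(changed: string[], map: Record<string, Zone>): string[] {
  return changed.filter((p) => zoneOf(p, map) === "core");
}

const changedFiles = ["tools/plot.ts", "src/auth/session.ts", "README.md"];
console.log(needsHumanAuthor(changedFiles, vibeMap));
```

Treating unmapped paths as core is a design choice in this sketch: a new directory has to be classified before it becomes vibe-coding territory.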
Step 2 — Install mechanical gates
For a TypeScript project (adapt for your stack):
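    # install the toolchain locally so npx resolves the real TypeScript compiler
    npm install -D typescript eslint typescript-eslint
    # strictness that catches plausible-but-wrong code
    npx tsc --init --strict
ESLint also needs a config file in the repo (eslint.config.js in current versions) before the lint step can pass. Add a CI workflow that runs types, lint and tests on every PR. Gates only count if they block: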
    # .github/workflows/ci.yml
    name: ci
    on: [pull_request]
    jobs:
      verify:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with: { node-version: 22 }
          - run: npm ci
          - run: npx tsc --noEmit
          - run: npx eslint .
          - run: npm test
Step 3 — Write the spec before the prompt
Pick a leaf-code task from your VIBE_MAP.md. Write SPEC.md yourself — 10 minutes, by hand:
    # Feature: <name>
    Interface: <function/endpoint signatures>
    Invariants: <what must always hold>
    Failure behavior: <what happens on bad input / errors>
    Out of scope: <explicitly not this>
Step 4 — Vibe the implementation, own the tests
    Read SPEC.md. First: write tests that pin the interface, invariants and
    failure behavior. Do not implement yet.
Review those tests with full attention; this is your real review. Then:
    Now implement until the tests pass. Do not modify the tests; if a test seems
    wrong, stop and tell me instead.
Step 5 — Review the shape
When it’s green, review like the talk says — interfaces and behavior, not lines:
    Summarize this change for a reviewer: public interface, new dependencies,
    invariants covered by tests, invariants NOT covered by tests.
Expected result: a merged change where you personally wrote ~0 lines of implementation but can answer, with evidence, “how do you know it’s correct?” The evidence is the spec, the trusted tests and the green gates. That answer is the whole discipline. Repeat, and widen scope only as the net keeps catching what it should.
Where to go next
- Watch Erik Schluntz’s original talk — the production war stories don’t fit in a lesson.
- The day-to-day mechanics live in Claude Code Best Practices.
- For the theory of when to hand work to a model at all, see Prompting for Agents.