Vibe Coding in Prod — Responsibly

1. The 22,000-line merge

Erik Schluntz (Anthropic technical staff) tells the story plainly: his team merged a change of roughly 22,000 lines into a production reinforcement-learning codebase, with the vast majority written by Claude. Not a prototype, not a demo — production. The talk is about why that wasn’t reckless.

“Vibe coding” — letting the model write code while you steer at a higher level — has a bad reputation because people import the speed without the system. Erik’s framing of the system: you stop being the code writer and become the system designer and the verifier. Your output is no longer lines of code; it’s specifications, interfaces, constraints and verification. The model’s output is lines of code. The safety comes from never confusing the two jobs.

2. Where vibe coding is safe (and where it isn’t)

The talk is explicit that this works because they chose the right ground. Ask three questions about any code you’re tempted to vibe:

Is the blast radius contained? Leaf code — tooling, visualization, data pipelines, experiment harnesses — fails loudly and locally. Core abstractions, auth, billing and migrations fail everywhere at once: keep human hands on those.
Can correctness be checked from the outside? If behavior can be pinned by tests, types and observable outputs, you can verify without reading every line. If correctness lives in subtle invariants nobody wrote down, write them down first — or don’t vibe it.
Will a human be able to change it later? Insist on clean interfaces and small modules so the next change (by human or model) doesn’t require archaeology.

The responsibility split: the human designs and verifies; the model writes. Verification layers — types, tests, CI, review — sit between AI-written code and production.
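
One way to make the three questions concrete is to write them down as a checklist that returns reasons rather than a bare yes or no. The sketch below is illustrative only: `ModuleAssessment`, `canVibe`, and the two sample modules are hypothetical names, not something from the talk, and the boolean answers still have to come from a human who knows the codebase.

```typescript
// Hypothetical checklist: one boolean per question from this section.
interface ModuleAssessment {
  name: string;
  blastRadiusContained: boolean; // does it fail loudly and locally?
  externallyCheckable: boolean;  // can tests, types and outputs pin its behavior?
  changeableLater: boolean;      // clean interfaces and small modules?
}

type Verdict = { ok: true } | { ok: false; reasons: string[] };

// Returns every failed question, so a "no" always comes with its justification.
function canVibe(m: ModuleAssessment): Verdict {
  const reasons: string[] = [];
  if (!m.blastRadiusContained) reasons.push("blast radius is not contained");
  if (!m.externallyCheckable) reasons.push("correctness cannot be checked from the outside");
  if (!m.changeableLater) reasons.push("a human could not easily change it later");
  return reasons.length === 0 ? { ok: true } : { ok: false, reasons };
}

// Illustrative inputs: a leaf tool and a core module.
const plottingTool: ModuleAssessment = {
  name: "tools/plot-rewards",
  blastRadiusContained: true,
  externallyCheckable: true,
  changeableLater: true,
};

const billing: ModuleAssessment = {
  name: "src/billing",
  blastRadiusContained: false,
  externallyCheckable: false,
  changeableLater: true,
};

console.log(plottingTool.name, canVibe(plottingTool));
console.log(billing.name, canVibe(billing));
```

A single failed question is enough to keep the module in human hands, which matches how the section treats core abstractions, auth, billing and migrations.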

3. Verification is the product

If you write less of the code, you must own more of the proof. The practices behind the big merge:

Spec before prompt. The team wrote down what the system should do — interfaces, invariants, failure behavior — and prompted from the spec. A model amplifies whatever clarity (or vagueness) you give it.
Tests the human trusts. AI can write tests too, but you review the tests with full attention even when you skim the implementation. Tests are the contract; the implementation is negotiable. Beware the failure mode of letting the model “fix” a failing test by weakening it.
Mechanical gates that don’t get tired. Strict types, linters and CI catch much of the plausible-but-wrong code that a human reviewer waves through at line 1,800 of a big diff. Every gate you automate is review attention you get back.
Review the shape, not every line. For large AI-written changes, review like an architect: are the interfaces right? Is behavior covered by tests you trust? Are there surprising dependencies? Line-by-line reading of 22,000 lines was never going to happen — and that’s fine if and only if the other layers exist.
Let go gradually. The talk’s title says “responsibly” because trust is earned per-area: start with leaf code, watch the failure rate, expand scope as the verification net proves itself.
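
The test-weakening failure mode can be partly guarded against mechanically. The sketch below is a crude text heuristic, not a parser and not something the talk describes: it assumes Jest/Vitest-style `expect(...)` or Node-style `assert(...)` calls, and every name in it is hypothetical. It flags a test file whose new version has fewer assertions or more skipped tests than the old one.

```typescript
// Assumption: assertions look like expect(...) or assert(...) / assert.equal(...).
const ASSERTION = /\b(?:expect|assert(?:\.\w+)?)\s*\(/g;
// Assumption: skipped tests look like it.skip(...), test.todo(...), describe.skip(...).
const SKIP = /\b(?:it|test|describe)\.(?:skip|todo)\s*\(/g;

function countMatches(source: string, pattern: RegExp): number {
  return (source.match(pattern) ?? []).length;
}

interface WeakeningReport {
  assertionsRemoved: number;
  skipsAdded: number;
  suspicious: boolean;
}

// Compares two versions of the same test file and reports signs of weakening.
function compareTestVersions(before: string, after: string): WeakeningReport {
  const assertionsRemoved = Math.max(
    0,
    countMatches(before, ASSERTION) - countMatches(after, ASSERTION),
  );
  const skipsAdded = Math.max(0, countMatches(after, SKIP) - countMatches(before, SKIP));
  return { assertionsRemoved, skipsAdded, suspicious: assertionsRemoved > 0 || skipsAdded > 0 };
}

// Illustrative input: the "fixed" version drops one assertion and skips the test.
const testBefore = [
  'test("parses hours", () => {',
  '  expect(parseDuration("2h")).toEqual({ ok: true, seconds: 7200 });',
  '  expect(parseDuration("").ok).toBe(false);',
  "});",
].join("\n");

const testAfter = [
  'test.skip("parses hours", () => {',
  '  expect(parseDuration("2h")).toBeDefined();',
  "});",
].join("\n");

console.log(compareTestVersions(testBefore, testAfter));
```

A check like this cannot see an assertion that was loosened in place (for example `toEqual` swapped for `toBeDefined`), so it supplements the human review of tests rather than replacing it.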

Check your understanding

1. What is the fundamental role change the talk argues for?

Your output becomes specs, interfaces, constraints and verification; the model's output is the code. Safety comes from keeping those jobs distinct.

2. Which area is the WORST candidate for vibe coding, per the lesson's three questions?

Billing has maximal blast radius, correctness that's hard to check from the outside, and everything depends on it. Leaf code with loud, local failures is where you start.

3. Why review AI-written tests more carefully than AI-written implementation?

The whole scheme rests on verification you trust. A model quietly weakening an assertion to make a test pass is the classic failure mode to watch for.

4. What made merging a 22,000-line mostly-AI-written change defensible?

No single layer carries it. Spec-driven prompting, tests reviewed with full attention, tireless mechanical gates and architect-style review together replace line-by-line reading.

Build it yourself

Follow these exact steps to reproduce it yourself · estimated time: ~45 minutes

Prerequisites

A project with CI (GitHub Actions or similar)
Claude Code installed

Build the verification net once; vibe responsibly forever after. We’ll harden one project and then do a contained vibe-coded change through the net.

Step 1 — Map your blast radius

In your repo, with claude:

Classify the top-level modules of this codebase into: (a) leaf code — contained
blast radius, externally checkable; (b) core — shared abstractions, auth,
data integrity. Output a two-column table with one-line justifications.

Save the output as VIBE_MAP.md. This is your “where it’s allowed” policy.
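
Once the map exists, its "core" column can be turned into a small check that flags changes straying outside the allowed area. This is a minimal sketch under stated assumptions: the path prefixes are made up, they would be copied by hand from your own VIBE_MAP.md, and `corePathsTouched` is a hypothetical helper name.

```typescript
// Hypothetical policy, transcribed by hand from the "core" column of VIBE_MAP.md.
const corePrefixes: string[] = ["src/auth/", "src/billing/", "db/migrations/"];

// Returns the changed files that fall under any core prefix.
function corePathsTouched(changedFiles: string[], prefixes: string[]): string[] {
  return changedFiles.filter((file) => prefixes.some((prefix) => file.startsWith(prefix)));
}

// Illustrative change set: one leaf file and one core file.
const changedFiles: string[] = ["tools/plot_rewards.ts", "src/auth/session.ts"];
const flagged = corePathsTouched(changedFiles, corePrefixes);

if (flagged.length > 0) {
  console.log(`Core code touched, review line by line: ${flagged.join(", ")}`);
} else {
  console.log("Only leaf code touched.");
}
```

In practice the list of changed files could come from `git diff --name-only` against your main branch; how you wire that in depends on your CI setup.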

Step 2 — Install mechanical gates

For a TypeScript project (adapt for your stack):

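# install the compiler and linter locally first, so npx runs the real tsc
npm install -D typescript eslint typescript-eslint
# strictness that catches plausible-but-wrong code
npx tsc --init --strict
# ESLint 9+ also needs an eslint.config.mjs before `npx eslint .` will run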

Add a CI workflow that runs types + lint + tests on every PR. Gates only count if they block, so on GitHub also mark this job as a required status check in your branch protection settings:

# .github/workflows/ci.yml
name: ci
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 22 }
      - run: npm ci
      - run: npx tsc --noEmit
      - run: npx eslint .
      - run: npm test
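
To see what the `strict` flag from this step buys you, here is a small illustration of plausible-but-wrong code it rejects. The `Run` type and `bestReward` function are hypothetical. With `strict` on, the commented-out line is a compile error because `reward` may be `undefined`; with it off, the line compiles and `Math.max` returns `NaN` at runtime as soon as one run has no reward.

```typescript
// Hypothetical experiment record: reward is missing for runs that crashed.
interface Run {
  id: string;
  reward?: number;
}

function bestReward(runs: Run[]): number | undefined {
  // Plausible but wrong, and rejected under strict null checks:
  // return Math.max(...runs.map((r) => r.reward));

  // The version the type checker forces: drop missing rewards explicitly
  // and decide what an empty result means.
  const rewards = runs
    .map((r) => r.reward)
    .filter((value): value is number => value !== undefined);
  return rewards.length > 0 ? Math.max(...rewards) : undefined;
}

console.log(bestReward([{ id: "a", reward: 1 }, { id: "b" }, { id: "c", reward: 3 }]));
console.log(bestReward([]));
```

The gate did not make the code correct by itself; it forced the question "what if there is no reward?" to be answered in the source instead of in production.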

Step 3 — Write the spec before the prompt

Pick a leaf-code task from your VIBE_MAP.md. Write SPEC.md yourself — 10 minutes, by hand:

# Feature: <name>
Interface: <function/endpoint signatures>
Invariants: <what must always hold>
Failure behavior: <what happens on bad input / errors>
Out of scope: <explicitly not this>
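
For a TypeScript project, parts of the spec can be written in a form the compiler and tests can check. The sketch below fills in the template for a hypothetical leaf feature, parsing "1h30m"-style durations into seconds. The feature, the type names and the helper functions are all assumptions made for illustration; note that it deliberately contains no implementation of the feature itself.

```typescript
// Feature (hypothetical): parse "1h30m"-style durations into seconds.

// Interface: the only public surface the implementation may expose.
type ParseResult =
  | { ok: true; seconds: number }
  | { ok: false; error: "empty" | "malformed" };

type ParseDuration = (input: string) => ParseResult;

// Invariants: on success, seconds is a non-negative integer.
// Written as a predicate so tests can assert it on any output.
function satisfiesInvariants(result: ParseResult): boolean {
  if (!result.ok) return true; // failures carry no numeric payload to check
  return Number.isInteger(result.seconds) && result.seconds >= 0;
}

// Failure behavior: bad input returns { ok: false }; the function never throws.
function neverThrows(parse: ParseDuration, inputs: string[]): boolean {
  try {
    inputs.forEach((input) => parse(input));
    return true;
  } catch {
    return false;
  }
}

// Out of scope: days, weeks, fractional units, localized formats.

console.log(satisfiesInvariants({ ok: true, seconds: 5400 }));
console.log(satisfiesInvariants({ ok: true, seconds: -1 }));
```

Writing invariants as predicates is one design choice among several; a prose SPEC.md works too, as long as each line in it can later be tied to a test.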

Step 4 — Vibe the implementation, own the tests

Read SPEC.md. First: write tests that pin the interface, invariants and
failure behavior. Do not implement yet.

Review those tests with full attention — this is your real review. Then:

Now implement until the tests pass. Do not modify the tests; if a test seems
wrong, stop and tell me instead.
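
As a rough picture of what "tests that pin the interface, invariants and failure behavior" can look like, here is a framework-free sketch for a hypothetical `parseDuration` feature. The function body is a stand-in for whatever the model writes, and the tiny table-driven runner is an assumption made to keep the example self-contained; in a real project these cases would live in whichever test runner `npm test` invokes.

```typescript
type ParseResult = { ok: true; seconds: number } | { ok: false; error: string };

// Stand-in for the model-written implementation (hypothetical).
function parseDuration(input: string): ParseResult {
  const match = /^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$/.exec(input.trim());
  if (!match || match[0] === "") return { ok: false, error: "malformed" };
  const [, h = "0", m = "0", s = "0"] = match;
  return { ok: true, seconds: Number(h) * 3600 + Number(m) * 60 + Number(s) };
}

// The contract: one named check per line of the spec.
const contractTests: Array<[string, () => boolean]> = [
  [
    "interface: parses hours and minutes",
    () => {
      const result = parseDuration("1h30m");
      return result.ok && result.seconds === 5400;
    },
  ],
  [
    "invariant: seconds is a non-negative integer",
    () =>
      ["45s", "2h", "1h1m1s"].every((input) => {
        const result = parseDuration(input);
        return result.ok && Number.isInteger(result.seconds) && result.seconds >= 0;
      }),
  ],
  [
    "failure: malformed input returns ok: false",
    () => ["", "abc", "90", "-5m"].every((input) => !parseDuration(input).ok),
  ],
];

// Run every check and report the names of the ones that fail.
const failedTests = contractTests.filter(([, run]) => !run()).map(([name]) => name);
console.log(failedTests.length === 0 ? "all contract tests pass" : failedTests);
```

Each test name starts with the spec line it pins, which makes the later question "which invariants are NOT covered?" answerable by reading test names alone.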

Step 5 — Review the shape

When it’s green, review like the talk says — interfaces and behavior, not lines:

Summarize this change for a reviewer: public interface, new dependencies,
invariants covered by tests, invariants NOT covered by tests.

Expected result: a merged change where you personally wrote ~0 lines of implementation but can answer, with evidence, “how do you know it’s correct?” — spec, trusted tests, green gates. That answer is the whole discipline. Repeat, and widen scope only as the net keeps catching what it should.

Where to go next

Watch Erik Schluntz’s original talk — the production war stories don’t fit in a lesson.
The day-to-day mechanics live in Claude Code Best Practices.
For the theory of when to hand work to a model at all, see Prompting for Agents.

Related lessons

Claude Code Best Practices: The Field Guide

Mastering Claude Code: The Agentic Coding Workflow

Running an AI-Native Engineering Org: What Changes When Coding Isn't the Bottleneck
