The Real Problem With AI Coding Isn’t the Model

Cheesecake Labs CEO Marcello Gracietti explains why AI coding productivity gains don’t translate into better team outcomes.
The Real Problem With AI Coding Isn’t the Model
|

Every engineering leader I've talked to in the last six months eventually arrives at the same complaint about AI agents:

  • The model still hallucinates
  • The output is inconsistent
  • The code it generated last week doesn't match the code it generated this week

The honest answer most leaders avoid is that the model isn't the variable. The team is, and specifically the way the team works with the model.

The pattern is almost identical every time. A team adopts Claude Code or Cursor. The first month is euphoric, with engineers reporting 30% to 50% productivity gains, sometimes more.

By month three the codebase starts looking strange in code review.

By month six the seniors are spending half their week cleaning up the inconsistencies the agent introduced, and the team blames the model.

The model is fine: it did exactly what it was prompted to do.

The problem is that ten engineers prompted it ten different ways, against ten different mental models of the codebase, with no shared conventions.

The agent shipped what each one asked for, and the aggregate is incoherent.

That is what "training your agents wrong" looks like, and it lets a team ship inconsistency faster than any individual engineer could.

Why AI Developer Productivity Gains Are Not Improving Teams

I want to put one statistic, from the 2025 Stack Overflow Developer Survey (over 49,000 respondents), in front of you before going any further.

Seventy percent of AI agent users report personal efficiency improvements.

Only 17% say agents have improved team collaboration. That is the lowest-rated impact in the survey by a wide margin.

That is the gap: 70% individual, 17% collective, a difference of 53 points.

The story those numbers tell is simple. Individual engineers are getting faster.

But teams are not getting better. In some cases, they are getting worse.

The 2025 DORA Accelerate State of AI-Assisted Software Development report backs this up from a different angle.

Their headline finding, quoted verbatim from the report landing page: "Successful AI adoption is a systems problem, not a tools problem."

The DORA team introduced an AI Capabilities Model with seven foundational practices, and the central argument is that AI amplifies what is already there.

Mature teams convert AI gains into delivery performance. Weak teams get amplified in the wrong direction, sometimes producing measurably worse outcomes after AI adoption than before.

Two different methodologies, same conclusion. The tool is not the problem. The system around the tool is.

How Prompt Drift Creates Inconsistent Code

I want to name the specific failure mode because most teams I see have not named it, and you cannot fix what you have not named.

The Comet team calls it prompt drift.

Their definition of prompt drift describes it as “inconsistent or evolving instruction templates across users or sessions.”

This is problematic in multi-agent systems where coordination depends on standardized communication patterns.

LLMs are conditional on their input, and small variations in how engineers describe the same intent produce wildly different outputs.

In a single engineer's workflow, prompt drift is a manageable problem. You develop your own habits.

Your prompts converge. Your outputs become more predictable over time. In a multi-engineer team, the opposite happens.

Each engineer's habits diverge. Each engineer's mental model of "how to talk to the agent" becomes idiosyncratic.

The agent has no memory across engineers. Each session starts from zero.

The symptoms in a real codebase look like this: Two endpoints in the same service use two different naming conventions because two engineers asked for them in different weeks.

The error handling on one module follows the team's old standard.

The error handling in another module follows whatever pattern was easiest to specify in a prompt at 4p.m. on a Tuesday.

The test scaffolding in the legacy modules uses one factory pattern.

The tests written this quarter use a different one because the engineer who wrote them did not know the factory pattern existed.

None of these are model failures. Every one of them is an operational failure that the model dutifully implemented.

This is the part that bites later. Because each commit is reasonable, the architecture violations only show up in aggregate.

They fail in code review six months later, when the senior who tries to refactor realizes there are now four conflicting patterns for the same problem, all of them introduced by the same agent answering questions the same engineers asked badly.

AI agents are highly sensitive to operational inconsistency.

Every session is a fresh context. Every prompt is the only signal. Without external scaffolding, the agent will optimize for the prompt in front of it, not for the codebase its editing.

That sensitivity is what makes the agent malleable enough to be useful.

It is also what makes the team's job harder when the operational practices are weak.

Stripe’s Operational Model for AI Coding

There is a counterexample to the chaos, and it is the cleanest one in the industry right now.

Stripe deployed Claude Code across 1,370 engineers through what they call a zero-configuration enterprise binary.

Two to three months of testing with Anthropic before rollout.

Pre-installed on every laptop. Rules, tokens, authentication, and conventions are baked in before any individual engineer writes their first prompt.

One team migrated 10,000 lines of Scala to Java in four days, work that was estimated at 10 engineer-weeks unaided.

The detail that matters is the zero-configuration part.

Stripe didn't give 1,370 engineers a Claude Code license and a Slack channel. They built the operational discipline first.

Then, they handed each engineer an agent already loaded with the company's conventions, the team's coding standards, the access controls, and the auth boundaries.

When a Stripe engineer asks the agent for a Scala migration, the agent already knows what "good" looks like in Stripe's codebase because the engineer isn't training the agent from scratch every session.

Anthropic does this for itself, too.

The PDF on how Anthropic teams use Claude Code documents the practice across legal, marketing, data science, infrastructure, and engineering teams.

The shared pattern is CLAUDE.md hierarchy at four levels (enterprise, user, project, directory), team skills versioned in Git, slash commands for repeated workflows, and explicit conventions for how to ask.

And all of this is operational discipline encoded in files instead of heads.

This is the substantive answer to the Stack Overflow 17% number. The teams that close that gap are the teams that did the unglamorous work of writing the conventions down.

Four Ways To Standardize AI Coding Workflows

In our practice at Cheesecake Labs, we have settled on four artifacts every engineering team should standardize before scaling AI usage beyond the early-adopter phase.

The order matters. Cheap and high-leverage first.

First, the CLAUDE.md hierarchy, which is the single highest-leverage move you can make in a week.

A project-level CLAUDE.md committed to the repo, listing the build commands, the directory structure, the testing conventions, the coding standards, the "always do X" and "never do Y" rules.

Loaded automatically into every session. If your codebase has conventions that an experienced engineer would need an hour to explain to a new hire, those conventions belong in CLAUDE.md.

Anthropic's own best-practices doc is explicit on this. Write commands, not suggestions. "Never use inline mocks. Use src/test/factories/*."

Second, a versioned shared skills directory. Skills are reusable instructions that the agent loads on demand.

Database migration patterns, A/B test setup, common refactoring routines, the team's specific way of writing tests.

Committed to the repo at .claude/skills/ or wherever your tool puts them. Reviewed in pull requests like code. Tagged and versioned.

This is the layer that makes one engineer's working pattern available to everyone else, without requiring everyone to learn it through trial and error.

Third, prompt templates for non-trivial work. If your team has a "how we ask the agent to add a new endpoint" template, every engineer asking for a new endpoint will get more consistent output.

We use slash commands for the most common patterns and a short "prompt-cookbook.md" for the rest.

This is the artifact that most directly closes the prompt-drift gap.

Fourth, required plan mode for anything non-trivial. Plan mode forces the agent to produce a written plan in Markdown before any code is written.

Engineers review the plan. The plan gets committed alongside the eventual PR.

Two things happen: The agent stops misunderstanding intent because the misunderstanding gets caught in the plan.

And the plan itself becomes a durable artifact of "what we asked for and why," which the next engineer reads instead of guessing.

These four are the cheapest leverage on the problem. None of them require building a custom harness, hiring a platform team, or paying for new tools.

All four can be in place in a sprint. We have rolled them out to clients in week one.

Why AI Coding Still Needs Human Review

The technical harness gets most of the attention. Lint, type-check, tests, completion gates, judge agents over executor agents.

We covered the technical harness in detail in a separate piece on harness engineering.

The piece that gets less attention is the human oversight layer; the rules that engineers agree to follow in how they use the agent.

The single most useful rule we have adopted is that the agent does not get to decide when the work is done. Not the executor agent, not even the judge agent.

The human always reviews the diff against the spec before the PR moves forward.

Plan mode catches misunderstood intent. The judge agent catches the premature victory.

The human catches the things the system cannot define yet.

The second rule: every non-trivial change is plan-mode-first. No exceptions for "this should be easy."

Some of the worst architectural damage we have seen has come from "easy" changes the engineer assumed would not need a plan.

The third rule: conventions belong in files, not in heads.

If a senior engineer is correcting the same agent mistake more than twice in a sprint, the convention they are correcting is missing from CLAUDE.md or from a skill.

The fix is not a better verbal review, it is a one-line commit.

Scaling AI Coding Across Engineering Teams

Four moves, in order, that I would put on the table this quarter.

First, measure the gap. Survey your engineers anonymously.

Two questions. "Has AI improved your personal productivity?" and "Has AI improved how your team collaborates and ships?"

If you are below the Stack Overflow 17% baseline, you are not bad; you are normal. But you have a problem to solve.

The number itself is the most useful artifact to put in front of your leadership.

Second, commit the conventions to CLAUDE.md this week. Even a rough first version.

The team's coding standards, the build commands, the testing conventions, the directory layout, and the "never do this" rules.

Anthropic's docs are clear on writing commands, not suggestions.

Get this in the repo before the end of the sprint. Subsequent edits are cheap, but starting from nothing is expensive.

Third, standardize on plan-mode for anything non-trivial. Make it a team rule. Enforce it for 30 days.

Track the rework rate before and after. We see consistent 20% to 35% reductions in token spend on feature-grade work, sometimes much more on complex features.

The bigger payoff is the architecture stability that compounds beyond month three.

Fourth, build the skills library, two skills at a time. Start with the two highest-leverage repeatable patterns in your codebase.

A database migration skill, a new-endpoint skill, or a test-scaffolding skill; whatever shows up in pull requests every week.

Commit them, review them like code, and add two more next month.

By quarter –end, you have 10 shared skills the team uses every day. That is the layer that makes individual engineer productivity translate into team productivity.

The Operational Lesson From AI Coding Teams

The discourse on AI for software development is converging fast.

DORA, Stack Overflow, METR, and most of the credible practitioners I read are landing in the same place.

The model is not the bottleneck. The model is fine. The system around the model is what determines whether AI adoption produces software you can ship or rework you have to absorb.

You are not training your agents wrong because the agents are bad.

You're training them wrong because the conventions you would write down for a new hire are not written down.

Your team and the agent are the new hires. The problem and the fix are the same for both.

Write the conventions down. Commit them. Treat them as code.

Otherwise, agents end up scaling the same inconsistencies that already exist across the team.

👍👎💗🤯
Latest AI News
Receive our NewsletterJoin over 70,000 B2B decision-makers growing their brands