Tool or System: Why Most "AI-First" Tops Out at 10x
[Image: a person holding the reins of a galloping horse made of machinery and circuitry. The human steers, the machine does the running.]
Silicon Valley 101 ran an episode recently where the host, Hongjun, sat down with the three founders of Creao to talk about how they rebuilt their company into an AI-led organization. I listened to it twice.
The hook is loud. A 25-person company whose CTO, Peter, says 99% of their code is written by AI. A feature gets written at 10am, A/B tested by noon, half of it killed by 3pm based on the data, and a better version rewritten by 5pm. In the old development flow, that cycle took six weeks.
But the part that stuck with me wasn't any of those numbers.
What stuck with me wasn't the 99%
Let me get this out of the way first. The 99%, the three-to-eight deploys a day, the 200,000 users — those are all self-reported by Creao. Nobody audited their codebase or their deploy logs. Treat them as marketing. Discount accordingly.
What actually stuck was a different line. The CEO, Kai Cheng, said the industry is stuck in two traps. One: if humans still operate AI tools step by step, productivity hits a ceiling. Two: if humans are still the only ones building the tools, the real AI revolution hasn't even started.
I went back and replayed that bit three times. Because it cut open something I'd felt was off for a while but couldn't name.
Most people today use AI as a faster tool. Autocomplete for code, a first draft for the doc, a few options for the image. Everyone's got an extra arm now, and yeah, output went up a bit.
But that mode has a ceiling, and the ceiling is lower than it looks.
Use AI as a tool, and you're racing 24 hours
The logic of a tool goes like this. A human kicks off a task, AI speeds the human up, the human finishes and kicks off the next one. The bottleneck is always the human doing the kicking off.
You have 24 hours a day. However fast your hands are, you can only watch over so many things at once. So even if you make every single task 10x faster, your total output is still capped by this one body of yours. The framing from Creao is that the speed ceiling for a tool-user is roughly 10x.
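One way to see where a ceiling like that could come from is Amdahl's law. This is my own framing, not something the episode states. Assume, purely for illustration, that 10% of each task is human work AI cannot speed up (kicking off, checking, deciding) and the other 90% gets accelerated by some factor. The `human_fraction` value of 0.10 is a hypothetical, picked so the arithmetic lands on 10x.

```python
def overall_speedup(human_fraction: float, ai_speedup: float) -> float:
    """Amdahl's law: total speedup when only the non-human part gets faster."""
    if not 0.0 < human_fraction <= 1.0:
        raise ValueError("human_fraction must be in (0, 1]")
    if ai_speedup <= 0:
        raise ValueError("ai_speedup must be positive")
    accelerated_fraction = 1.0 - human_fraction
    return 1.0 / (human_fraction + accelerated_fraction / ai_speedup)


def speedup_ceiling(human_fraction: float) -> float:
    """Limit of overall_speedup as ai_speedup grows without bound."""
    if not 0.0 < human_fraction <= 1.0:
        raise ValueError("human_fraction must be in (0, 1]")
    return 1.0 / human_fraction


if __name__ == "__main__":
    # human_fraction=0.10 is an assumed value for illustration only.
    for s in (2, 10, 100, 1000):
        print(f"AI part {s}x faster -> overall {overall_speedup(0.10, s):.2f}x")
    print(f"ceiling: {speedup_ceiling(0.10):.1f}x")
```

Under that assumption, making the AI part faster and faster only creeps toward 10x and never passes it. The only way to raise the ceiling is to shrink the human fraction itself.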
I've hit that line myself. Over two months I tore down my AI workflow and rebuilt it four times. At first I ran it bare, with the main thread reading files, editing files, and running tests, and anything bigger than a small task filled the context window. The problem wasn't that the tools were bad. It was that I was still sitting in the seat where everything had to pass through me.
The real unlock came when I stopped personally kicking off every task. With five agents working at once, I was no longer the one operating line by line — I became the one looking at results and setting direction. At that point output stops being a question of 10x.
That's Cheng's first trap, seen from inside and from outside. People in the trap are racing 24 hours. People out of it have stopped racing the clock at all.
The hard part was never the model
So how do you climb out of the trap? The episode keeps circling one word: harness. The piece of tack you strap onto a horse to steer and restrain it.
The word is hot in the Valley right now. But let me say something on behalf of the clear-eyed: it isn't that new. People have pointed out that test harness and eval harness are old terms, and above them sit middleware and platform engineering — all describing the same thing, building a sane engineering environment around a moving core. An engineer named Stuart Miller put it bluntly: the harness hype will pass, probably swapped for another word inside 18 months, but the craft underneath will stay.
I think he's right. And it doesn't change the fact that the thing underneath is real.
The hardest evidence comes from a LangChain experiment. Hold the model completely fixed, change only the system around it, and the same agent's score on Terminal-Bench went from 52.8% to 66.5%. Nearly fourteen points, with no smarter model — just better system design.
OpenAI published an engineering post of its own, literally titled harness engineering. Over five months, with zero hand-written code, they stood up an internal product — a million lines, fifteen hundred PRs — driven by just three engineers steering Codex. One line from it I wrote down: Humans steer, agents execute. And another, blunter one: the hard part isn't the agent, it's the layer of system wrapped around it.
That's what Cheng means by treating AI as a system. Anyone can call a model's API. The hard part is building a system around it that self-heals and iterates on its own. This kind of system is a different order of magnitude from the solo-dev, tool-layer harness I was fiddling with — that's fitting an engine to one person; this is turning a whole company into one engine.
The human steps back, and two jobs remain
Once the system is running, where does the human go?
This is the part of the episode worth pulling out and sitting with. The human's role shifts — from user of AI tools to reviewer of AI output, plus the one who supplies the high-level intent.
This isn't just Creao's idea. Karpathy laid it out most clearly in his YC talk last year. He said that even when AI spits out ten thousand lines instantly, he's still the bottleneck — because he has to confirm the thing introduced no bugs, has no security holes, and does the right thing. Then a line I think carries real weight: AI does the generation, humans do the verification, and it's in our interest to make that loop run as fast as possible.
So there are two loops here, generation and verification. The tool-user lives inside the generation loop, generating by hand, trapped by 24 hours. The system orchestrator sits at the exit of the verification loop: AI saturates generation, and the human only gives intent and collects results.
Karpathy has a more concrete version too: keep the AI on the leash, don't let it dump ten thousand lines at once, hold it to a pace you can actually look over. Ethan Mollick says the same thing from the other side — the defining skill is no longer prompting, it's delegation and orchestration: you define the task, set the constraints, hand over the material, and verify the output.
Peter's version is the bluntest. He has a physics PhD, and he says the most useful thing the PhD taught him wasn't writing code — it was questioning assumptions, stress-testing arguments, finding what's missing. So his verdict: the ability to criticize AI will be worth more than the ability to produce code.
This is the same thing I wrote about in Dao Rises, Skill Fades, carried one step further. That piece argued that craft depreciates and judgment appreciates. This one is about where exactly the judgment lands — on whether you can give good intent, and whether you can catch AI's mistakes. I've long thought the future human is managing AI rather than doing the work for it.
Anthropic's engineering writing gives this new position a very concrete anchor. They describe a failure mode of long-running agents called declaring victory early. The agent tells you it's done when it isn't. The human's most irreplaceable value in the system might be exactly that: being the one who isn't fooled by the false victory.
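In code, not being fooled mostly means never letting the agent's own "done" count as evidence. Here is a minimal sketch of that idea, assuming a hypothetical `AgentReport` and a set of independent checks written by the human. None of these names come from Anthropic's writing or from any real framework, and the lambdas are stand-ins for real test runs or probes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentReport:
    """What a hypothetical agent says about its own work."""
    task: str
    claims_done: bool


@dataclass
class Verdict:
    """Outcome of the gate: accepted or not, plus which checks failed."""
    accepted: bool
    failed_checks: List[str]


def verify(report: AgentReport, checks: Dict[str, Callable[[], bool]]) -> Verdict:
    """Accept only when the agent claims done AND every independent check passes."""
    failed = [name for name, check in checks.items() if not check()]
    return Verdict(accepted=report.claims_done and not failed, failed_checks=failed)


if __name__ == "__main__":
    report = AgentReport(task="add login rate limit", claims_done=True)
    checks = {
        "tests_pass": lambda: True,            # stand-in for running the test suite
        "rate_limit_enforced": lambda: False,  # stand-in for probing the real behavior
    }
    verdict = verify(report, checks)
    print(verdict.accepted, verdict.failed_checks)  # prints: False ['rate_limit_enforced']
```

The design choice that matters is that `claims_done` can only veto acceptance, never grant it. The human's judgment lives in deciding which checks belong in the dictionary.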
But the review gate is drowning
I have to stop here and talk about the cost. Skip the cost and this becomes one more piece of AI feel-good.
The human retreating to the review seat sounds great. The problem is the review gate itself is drowning.
Faros AI tracked more than ten thousand developers. Teams using AI heavily did finish 21% more tasks and merged nearly twice the PRs. But the cost: individual PR size grew 154%, median review time grew 91%, bugs rose 9%, and actual delivery speed didn't move. By the 2026 follow-up, review time had shot up to a 441% increase — and mature teams weren't spared either.
Generation got sped up several times over by AI. Verification is still moving at the speed of a human brain reading code. The water pours faster; the gate didn't get wider.
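A toy queue makes the mismatch concrete. This is a minimal sketch with made-up numbers, not a model of the Faros data: `review_backlog`, the 10 reviews per day, and the arrival rates are all hypothetical. The only thing borrowed from the figures above is the shape of the change, with incoming PRs roughly doubling while review capacity stays put.

```python
from typing import List


def review_backlog(arrivals_per_day: float, reviews_per_day: float, days: int) -> List[float]:
    """Unreviewed PRs left at the end of each day for a single review queue."""
    backlog = 0.0
    history = []
    for _ in range(days):
        # New PRs arrive, reviewers clear what they can; backlog never goes negative.
        backlog = max(0.0, backlog + arrivals_per_day - reviews_per_day)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Hypothetical team: reviewers can get through 10 PRs a day.
    print("before:", review_backlog(arrivals_per_day=10, reviews_per_day=10, days=5))
    print("after: ", review_backlog(arrivals_per_day=20, reviews_per_day=10, days=5))
```

In the first case the queue stays empty. In the second it grows by ten PRs every day and never recovers, which is the gate not getting wider. The sketch also ignores that the PRs themselves got bigger, which would cut the reviews-per-day figure further.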
Security is an even tougher nut to crack. Veracode ran 80 coding tasks across a hundred-plus models, and nearly half the code failed the security tests, with Java worst at a 72% failure rate. The most pointed conclusion: as models got bigger, the code got more syntactically correct, but the security barely improved. A stronger model can't save this gate. Only a human can, or another verification system the human designs.
Then the most embarrassing one. METR ran a randomized controlled trial last year with sixteen senior developers on real work. The ones using AI were actually 19% slower. Yet those same people predicted beforehand they'd be 24% faster, and afterward still believed they'd been about 20% faster.
Every self-reported 100x speedup — Creao's included — rests on this kind of unreliable gut feeling. That perception gap is the thing to really watch.
I'll be honest and add this: METR's own follow-up this year reversed direction — returning developers now show up as faster. So the conclusion that actually holds isn't "AI makes you slower." It's the perception gap, and the bottleneck moving from generation to verification. Don't treat any single benchmark as scripture, this one included.
Real rebuild vs. AI-first as a fig leaf
Last, the org layer — which is the real point of the episode.
Plenty of companies are shouting AI-first. But there are two kinds, and the gap between them is too big to lump together.
One kind is a real rebuild. Shopify's CEO sent an internal memo requiring teams to prove AI can't do the job before asking for headcount, and AI fluency went into performance reviews. But the company didn't lay people off — it actually expanded its intern program from 75 toward a target of 1,000, because interns are the most creative AI users. IBM used AI to automate 94% of routine HR work and cut roughly 200 roles, yet total headcount went up, and the $3.5 billion in productivity gains got reinvested into engineering and sales.
The other kind uses AI as a fig leaf. Klarna is the classic cautionary tale. Its support bot once bragged about doing the work of a few hundred agents and saving tens of millions of dollars — then two years later the CEO admitted cost had become the overwhelming evaluation factor, the result was lower quality, and they brought humans back. Forrester flat-out called it the poster child for bad AI deployment.
The difference between the two is exactly Cheng's difference between system and tool. Real AI-first redesigns the workflow and sends the freed-up labor toward higher-value work. Fake AI-first uses AI as the reason to cut, then discovers quality collapsed and pays to patch it.
That second kind is happening at scale. Gartner predicts that by 2027, half of the companies that did AI layoffs will rehire — often giving the new person a title that involves managing the AI. It's the same script as the era when everyone treated outsourcing as a cure-all and ended up paying more in rework.
So to judge whether a company is really AI-first, don't look at how loud it shouts. Look at whether the freed-up labor got reinvested into growth, or just dropped straight onto the profit line.
After all that, what I want to say is simple.
People who use AI as a tool are still racing 24 hours, and 10x is the ceiling. People who use AI as a system have rebuilt the whole workflow and moved themselves from operator to reviewer and intent-giver — and that's the door to 100x and up.
But the road isn't as smooth as the podcast makes it sound. The word "harness" will date, the self-reported numbers need discounting, and the verification gate is drowning under its own success.
And precisely because of all that, the human who steps back becomes harder to replace, not easier. When generation is nearly free, when even the strongest model can't be trusted to write secure code, when the agent smiles and tells you it's done and it isn't, the person who can give good intent, catch the mistake, and say "no, that's not right" in a room full of "I'm done" becomes the most expensive part of the whole system.
These days, when I write or build, less and less of my time goes into generation. More of it goes into those two things. Saying clearly what I actually want, and staring at what the AI hands back and asking, is this really what I asked for.
The tool era tested how fast your hands were. The system era tests whether you can still tell what's right.