Business, Organizations, and the Future of SaaS in the Post-Agent Era

The Math

What does bookkeeping actually cost a small business?

QuickBooks runs about $10K a year. But QuickBooks doesn't run itself -- you still need an accountant. That's $120K.

The tool costs $10K. The person operating the tool costs $120K. A 1:12 ratio.

Sequoia recently cited a broader version of this: for every $1 enterprises spend on software, they spend $6 on services. Across the economy, labor spending outweighs software spending several times over, and in cases like bookkeeping by a full order of magnitude.

The implication is significant. If AI can close the books directly -- not give you a better bookkeeping interface, but deliver a finished set of books -- the addressable market isn't QuickBooks' $10K. It's the accountant's $120K.

Copilot competes for the $1 tool market. Autopilot competes for the $6 services market.

These are not the same fight.

Three core arguments in this piece:

1. Work is changing. From "doing things" to "ensuring things get done." The boundary between intelligence and judgment moves at wildly different speeds across domains -- years for code, decades for consulting. The copilot phase isn't a transition period; it's the data collection phase for autopilot.

2. Organizations are changing. Hierarchy is fundamentally an information routing protocol, and AI offers the first viable alternative. But Block's case is extremely specific -- a two-sided transaction platform with high-density financial data. Most companies will settle in a hybrid state: AI handles 80% of information routing, humans retain exception handling and final accountability.

3. SaaS will fracture into four categories. Pure intelligence tools get replaced. Judgment-intensive tools become copilots. Infrastructure actually benefits. And the most important new category -- context infrastructure (authorization, evaluation, audit, context asset management) -- is the real bottleneck, and the real opportunity.

Two Articles, One Thesis

Sequoia published two long pieces in the same week. Julien Bek wrote about product form factors. Jack Dorsey and Roelof Botha wrote about organizational structure.

Different topics on the surface. The same argument underneath: when AI can deliver results, every commercial structure built around "tool + human" needs to be rewritten.

Bek's central framework splits all work into two layers: intelligence and judgment.

Intelligence is rule-governed work -- complex, tedious, but fundamentally deterministic. Accounting standards, tax codes, insurance rate calculations, contract templates, code generation.

Judgment requires experience, intuition, and context -- legal strategy, management consulting, product direction, hiring decisions.

AI eats the intelligence layer first. This is already happening.

But Bek's framework only tells you the order. It doesn't tell you the speed.

And the speed differences matter far more than the order.

1. Work Is Changing -- The Intelligence-Judgment Boundary Moves, But at Wildly Different Speeds

Bek makes a critical claim: today's judgment gradually becomes tomorrow's intelligence.

He's right. But the statement omits a more important question: the rate of that conversion differs by orders of magnitude across domains.

The speed depends on one thing: whether the domain has a verifiable notion of "correct."

Code is the fastest. Does the program compile? Do the tests pass? Does it meet the performance spec? All binary verdicts. AI writes code, runs it, and knows immediately whether it worked. Software engineering crossed the line first -- Sequoia's data shows developers have the highest AI tool adoption rate of any profession, above 50%. My own experience tracks exactly: 3 people doing the work of 15, not because we're smarter, but because agents took over massive amounts of intelligence-layer work. I wrote about this shift in You Are the Manager.

Accounting is also fast. GAAP is an explicit rule set. Debits must balance credits, depreciation has formulas, tax codes have statutes. Whether a set of books is correct can be checked by an auditor referencing the standards. ICD-10 medical coding works the same way -- every diagnosis maps to a deterministic set of codes. The more explicit the rules, the easier AI jumps from copilot to autopilot.

Insurance starts slowing down. Actuarial models are deterministic -- given risk factors, there's a formula for the rate. But underwriting isn't purely formulaic. A fire insurance policy for a beachfront restaurant: the actuarial model can produce a base rate, but the underwriter also evaluates the merchant's fire suppression setup, the local fire department's response time, whether this person has a suspicious claims history. Those judgments rely on commercial intuition and relationship networks, not rules. The boundary gets fuzzy.

Law is slower still. Law is adversarial -- the same set of facts can produce diametrically opposite conclusions depending on which side you're on. It's jurisdiction-specific: the same question has different answers in California and Texas. Even "standard" contracts have clause-level wording that depends on the specific commercial context of each deal. The ultimate measure of whether a legal strategy was good is whether the client won -- and that feedback loop takes months to years.

Management consulting is the slowest. Output quality depends entirely on understanding the client's specific situation. The same strategic recommendation can be correct at Company A and catastrophic at Company B. There's no universal validation standard. There isn't even reliable consensus on what "good advice" looks like.

Line them up:

Code --> Accounting/Medical coding --> Insurance --> Law --> Consulting

Moving left to right, the notion of a "correct answer" gets progressively fuzzier, feedback cycles get progressively longer, and the judgment-to-intelligence conversion gets progressively slower.

[Figure: the five domains arranged left to right, from a verifiable notion of a correct answer to a fuzzy one]

This isn't linear. The code domain could see most intelligence work automated within two to three years. Law might still be in copilot mode a decade from now. Consulting might still be human-dominated in twenty years.

A corollary for founders: if you're choosing a startup direction, look at this spectrum. The left side is already hypercompetitive -- there are hundreds of code copilot companies. The right side looks like a bigger TAM, but the deployment timeline is much longer because the evaluation infrastructure doesn't exist yet. The middle ground -- insurance, tax, medical billing -- is probably the best window right now.

1.1 The Copilot Phase Isn't a Transition -- It's a Data Collection Period

Bek's judgment-to-intelligence conversion thesis leaves one question unexplored: what drives the conversion?

Data. Specifically, labeled, high-quality judgment data.

Where does this data come from? From the copilot usage process itself.

Consider: a lawyer uses Harvey to draft a contract. AI generates a first draft. The lawyer spends twenty minutes editing it. Every edit -- a deleted clause, a reworded provision, an added qualifier -- is an extremely high-quality training signal. This is more precise than any hand-labeled dataset, because it reflects real judgment in a real business context.

Accounting works the same way. AI generates a report, the accountant spots a misclassification and corrects it. That correction is teaching the model: in this specific context, the right judgment call is X.

This reframes the copilot phase entirely. Copilot isn't just a transitional product form. The copilot phase is the training phase for autopilot.

Every copilot product is a stealth data collection pipeline. Users think they're using a tool. They're also teaching it.
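To make "every edit is a training signal" concrete, here is a minimal sketch of how a copilot could turn the gap between an AI draft and the professional's final version into correction records. It uses Python's standard difflib; the `Correction` type, the `extract_corrections` function, and the sample contract text are all hypothetical illustrations, not how Harvey or any real product works.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Correction:
    kind: str    # "replace", "delete", or "insert"
    before: str  # what the AI drafted
    after: str   # what the professional kept


def extract_corrections(ai_draft: str, human_final: str) -> list[Correction]:
    """Compare the draft and the edited version line by line; keep only the changes."""
    draft_lines = ai_draft.splitlines()
    final_lines = human_final.splitlines()
    matcher = difflib.SequenceMatcher(a=draft_lines, b=final_lines, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # untouched lines carry no judgment signal
        corrections.append(
            Correction(tag, "\n".join(draft_lines[i1:i2]), "\n".join(final_lines[j1:j2]))
        )
    return corrections


# Hypothetical draft and the lawyer's edited version.
draft = "Term: 12 months.\nLiability is unlimited.\nGoverning law: Texas."
final = (
    "Term: 12 months.\n"
    "Liability is capped at fees paid.\n"
    "Governing law: Texas.\n"
    "Either party may terminate on 30 days' notice."
)

for c in extract_corrections(draft, final):
    print(c.kind, "|", c.before, "->", c.after)
```

Each record pairs what the model proposed with what an expert accepted in a real deal context, which is exactly the kind of labeled judgment data described above. A production pipeline would also need consent and data-ownership handling, which this sketch ignores.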

[Figure: every human edit of an AI draft becomes judgment data flowing down a pipeline that grows autopilot]

This changes startup strategy. Starting with copilot isn't "settling for less." It's accumulating the most strategically valuable asset -- judgment data. Bek's outsourcing market entry strategy and a copilot-first strategy aren't in tension. They're sequential: enter the market with copilot, accumulate data, then use data to evolve toward autopilot.

But there's a reef under the surface.

Who owns the correction data? When a lawyer edits an AI-generated contract, does that data belong to the law firm or the AI company? Most SaaS terms of service include a clause like "we may use data generated in the course of service delivery to improve our products." Users click "agree" and move on.

In the AI copilot context, the implications of that clause are completely different. "Improve our products" used to mean optimizing the UI and fixing bugs. Now "improve our products" means training models -- your professional judgment is being fed into a system that could serve your competitors.

Almost no one is having this conversation seriously yet. But it will become a big one. Data rights, privacy terms, and the question of who owns "what AI learned from human judgment" -- this line will eventually be drawn, and where it falls will profoundly affect the speed of the copilot-to-autopilot transition.

1.2 Where the 6x Market Lives

If autopilot's TAM is six times copilot's, where specifically does that market live?

Bek offers a practical entry strategy: start by replacing work that's already outsourced.

Why? Three reasons. The budget already exists -- no need to convince the customer to spend new money. The buyer is already conditioned to pay for outcomes, not hours -- outsourcing contracts are inherently results-oriented. And quality standards are relatively clear -- the outsourcing contract itself functions as an SLA.

Specific markets: insurance brokerage ($140-200B), accounting and audit, medical billing, tax preparation, legal documentation, managed IT services, recruiting.

Every one follows the same pattern -- rules are complex but deterministic, labor costs are high, and the customer wants results, not process.

But there's an underappreciated transition state. The leap isn't directly from copilot to autopilot. There's an intermediate layer: AI executes autonomously within boundaries defined by business rules, and automatically escalates to a human when something falls outside those boundaries.

This isn't full self-driving. It's L3 -- autonomous under defined conditions, with human takeover when necessary.

In concrete product terms: AI handles 90% of routine insurance quotes, pushing anomalous cases to a human. AI generates 95% of standard contract clauses, flagging non-standard requirements for attorney review.
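The intermediate state can be sketched as a simple router: the AI handles a request only when every business rule is satisfied, and otherwise escalates with the reasons attached. Everything here is an assumption for illustration, including the `QuoteRequest` fields and the threshold values; real underwriting boundaries would be set by the carrier.

```python
from dataclasses import dataclass


@dataclass
class QuoteRequest:
    property_value_usd: float
    prior_claims: int
    coastal: bool


@dataclass
class Decision:
    handled_by: str  # "ai" or "human"
    reason: str


# Hypothetical boundaries defined by the business, not by the model.
MAX_PROPERTY_VALUE_USD = 2_000_000
MAX_PRIOR_CLAIMS = 1


def route(request: QuoteRequest) -> Decision:
    """Let the AI quote routine cases; escalate anything outside the boundaries."""
    reasons = []
    if request.property_value_usd > MAX_PROPERTY_VALUE_USD:
        reasons.append("property value above limit")
    if request.prior_claims > MAX_PRIOR_CLAIMS:
        reasons.append("claims history needs review")
    if request.coastal:
        reasons.append("coastal exposure")
    if reasons:
        return Decision("human", "; ".join(reasons))
    return Decision("ai", "within boundaries")


routine = QuoteRequest(property_value_usd=450_000, prior_claims=0, coastal=False)
beachfront = QuoteRequest(property_value_usd=900_000, prior_claims=2, coastal=True)

print(route(routine))     # handled by the AI
print(route(beachfront))  # escalated, with both reasons listed
```

The design choice worth noting is that the boundaries live outside the model. Widening them is a business decision that can be made one rule at a time as trust accumulates.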

Most companies over the next three to five years will probably settle at this state. Not because they don't want full automation, but because the trust infrastructure hasn't been built yet.

1.3 The Liability Chain: The Technology Can Run, the Law Can't Keep Up

Why hasn't trust infrastructure been built?

It's not a technology problem. It's a liability attribution problem.

In copilot mode, the legal relationships are clean. AI is a tool, no different from Word or Excel. A lawyer uses AI to draft a contract, the lawyer signs off. If something goes wrong, the client goes after the lawyer, the lawyer bears professional liability. The entire chain of responsibility is identical to the pre-AI world.

In autopilot mode, that chain breaks.

A concrete scenario: an AI autopilot system generates an insurance quote for a commercial property, pricing it too low -- the system underestimated a risk factor. The customer buys the policy. Six months later, a loss event occurs, and the payout far exceeds premium revenue.

Who pays?

Everyone points at someone else. The customer says "I trusted your system." The AI service company says "the model's output isn't fully within our control." The model provider (Anthropic, OpenAI) says "we provide general capabilities, we're not responsible for specific business decisions." And the insurer, if the insurer commissioned the AI to underwrite, says "the system made the decision, not our underwriter."

[Figure: when an autopilot misprices insurance and a loss hits, four parties (customer, AI company, model provider, insurer) point at each other; the liability chain snaps and no one owns it]

This liability chain is currently undefined. No legal precedent. No industry standard. Not even a regulatory framework for discussing the problem.

Look at one analogy to understand how hard this is.

Autonomous vehicles. The technology for L4 self-driving in geofenced areas has been ready for years: Waymo runs a commercial driverless service in San Francisco today. But what about the legal framework around autonomous driving?

Who's liable when there's an accident -- the owner, the manufacturer, the software company, the sensor supplier? How do you price insurance -- based on the driver's record or the algorithm's version? Every state has different laws. After each major accident, the regulatory direction can reverse.

From "the technology works" to "the legal framework is in place," autonomous driving has taken a decade and still isn't fully resolved.

AI autopilot faces the same impasse, except it's more complex. Autonomous driving at least involves the physical world -- cars, roads, people, physical evidence at the scene. AI does knowledge work -- a legal opinion, an insurance quote, a tax filing. When it's wrong, the error might not surface for months or years, by which time the system has been updated through several versions.

This isn't a problem that can be solved with technology. It requires legislation, industry associations, insurance product innovation, and a large body of case law.

These things develop at the speed of institutions, not Moore's Law.

1.4 Evaluation Infrastructure: The Overlooked Bottleneck

What's the root cause of the broken liability chain? We don't even have a credible way to evaluate whether AI got something right.

Saying "evaluation" is easy. Actually building it requires answering a chain of specific questions: who defines "correct"? By what standard? How often are the standards updated? Who audits compliance?

The maturity of evaluation infrastructure varies enormously by domain.

Accounting is the most mature. GAAP in the United States and IFRS in most other jurisdictions provide explicit, widely recognized rule sets. The Big Four provide audit capability. Whether a financial statement is correct has a clear standard, professional certification bodies, and legal consequences for getting it wrong. This infrastructure took the better part of a century to build. Precisely because it exists, AI automation in accounting is advancing fastest -- you can definitively say "these books are right" or "these books are wrong."
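As a small illustration of what a machine-checkable notion of "correct" looks like, the sketch below tests the most basic bookkeeping invariant: total debits must equal total credits. The journal format and function names are assumptions made for this example. A balanced ledger is only a necessary condition for correct books, not a sufficient one, but it shows the kind of binary verdict that law and consulting lack.

```python
from decimal import Decimal

# Each journal line is (account, debit, credit).
# Amounts are strings so Decimal parses them exactly, with no float rounding.
JournalLine = tuple[str, str, str]


def trial_balance(lines: list[JournalLine]) -> tuple[Decimal, Decimal]:
    """Sum the debit and credit columns separately."""
    total_debits = sum((Decimal(debit) for _, debit, _ in lines), Decimal("0"))
    total_credits = sum((Decimal(credit) for _, _, credit in lines), Decimal("0"))
    return total_debits, total_credits


def is_balanced(lines: list[JournalLine]) -> bool:
    """Binary verdict: the books either balance or they do not."""
    total_debits, total_credits = trial_balance(lines)
    return total_debits == total_credits


# Hypothetical entries: one clean sale, one with a $10 posting error.
good_books = [("Cash", "1000.00", "0"), ("Revenue", "0", "1000.00")]
bad_books = [("Cash", "1000.00", "0"), ("Revenue", "0", "990.00")]

print(is_balanced(good_books))
print(is_balanced(bad_books))
```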

Medical coding is similar. ICD-10 has tens of thousands of coding rules, each diagnosis mapping to a deterministic code set. Whether the coding is correct can be checked against the manual. So medical coding can be heavily automated. But diagnosis itself cannot -- the same symptoms may elicit different diagnoses from different physicians, and the "correctness" of a diagnosis often can't be determined until treatment outcomes emerge. Coding has mature evaluation infrastructure; diagnosis doesn't. Two types of work within the same healthcare system, on completely different automation timelines.

Law has almost no equivalent evaluation infrastructure. There is no "legal GAAP." Whether a contract is good or a legal strategy is sound depends on the jurisdiction, the judge, opposing counsel's skill, and the client's specific commercial objectives. The ultimate "evaluation" is whether the client won the case, and that feedback loop takes months to years. Until then, you're relying on peer review -- and peer review is inherently subjective.

Insurance underwriting falls in between. Actuarial pricing has standards -- given risk factors, rates can be calculated. But the quality of underwriting judgment is ultimately measured by portfolio performance -- whether the loss ratio stays within expected bounds. That feedback cycle is measured in years. Whether an underwriting decision was correct might not be knowable for three to five years.

See the pattern?

The more mature a domain's evaluation infrastructure, the faster the copilot-to-autopilot conversion. The weaker the evaluation infrastructure, the longer it stays stuck in copilot mode.

This implies a massive opportunity: whoever establishes credible evaluation standards in a given vertical first -- the equivalent of that domain's "GAAP" -- captures the gateway from copilot to autopilot.

This isn't a pure technology startup play. It requires industry buy-in, regulatory cooperation, and enough case volume to prove the standard's validity. It's built through time and trust, not through fundraising. But precisely because of that, once established, the first-mover advantage is extreme.

2. Organizations Are Changing -- Hierarchy Is a Two-Thousand-Year-Old Routing Protocol

Work itself is changing. What about the organizations built to carry out that work?

Dorsey and Botha's piece covers the organizational side. Their angle is more radical than Bek's.

The core argument: hierarchy's essence isn't a power structure -- it's an information routing protocol.

From Roman legions' centuries to the Prussian general staff, from American railroads to Taylor's scientific management, from the Manhattan Project to McKinsey's matrix org -- two thousand years of organizational innovation, all solving the same problem: one person's cognitive bandwidth is limited, information can't travel too far, so you need intermediate layers for routing.

A manager's core function isn't "managing people." It's translating strategy downward and aggregating status upward. It's being a router.

AI offers the first viable alternative.

Block (parent company of Square and Cash App) is the most aggressive experimenter right now. They've built a four-layer architecture:

Capability atoms -- minimal-granularity functional units
World model -- a real-time global view of company state
Intelligence layer -- automatically composes capability atoms in response to signals
Interface -- the interaction layer between humans and the system

A concrete example: a Square merchant's tax filing deadline is approaching, and simultaneously Cash App just approved that same person's loan application. Previously those two signals lived in separate business units and might never have intersected, unless some PM happened to think of building that feature. Now the intelligence layer automatically detects the signal and combines the tax tool with the lending capability, surfacing it to the merchant.

No PM made that decision. The system discovered the need on its own.

Organizational roles changed too. Block defined three types: IC (individual contributor), DRI (directly responsible individual, 90-day term), and player-coach (writes code and leads people simultaneously). No permanent middle management positions. The roadmap isn't driven by PM-authored annual plans -- it's driven by failure signals. Wherever the intelligence layer can't automatically compose a solution, that's the next product direction.
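A toy sketch can show how the four layers and the failure-driven roadmap fit together. This is not Block's implementation, and every name in it (`ATOMS`, `RECIPES`, `IntelligenceLayer`, the signal strings) is hypothetical. The world model is reduced to a dictionary of active signals per merchant, and the intelligence layer composes atoms when a known combination of signals appears.

```python
from dataclasses import dataclass, field
from typing import Callable

# Capability atoms: smallest functional units (hypothetical examples).
ATOMS: dict[str, Callable[[str], str]] = {
    "tax_filing_tool": lambda merchant: f"prepare tax filing for {merchant}",
    "loan_offer": lambda merchant: f"offer loan drawdown to {merchant}",
}

# Known compositions: a set of signals mapped to the atoms that answer it.
RECIPES: dict[frozenset[str], list[str]] = {
    frozenset({"tax_deadline_near", "loan_approved"}): ["tax_filing_tool", "loan_offer"],
}


@dataclass
class IntelligenceLayer:
    world_model: dict[str, set[str]]  # merchant -> currently active signals
    failures: list[tuple[str, frozenset[str]]] = field(default_factory=list)

    def respond(self, merchant: str) -> list[str]:
        """Compose atoms for the merchant's signals, or log a failure for the roadmap."""
        signals = frozenset(self.world_model.get(merchant, set()))
        for required, atom_names in RECIPES.items():
            if required <= signals and all(name in ATOMS for name in atom_names):
                return [ATOMS[name](merchant) for name in atom_names]
        # No composition exists: this gap is the product signal.
        self.failures.append((merchant, signals))
        return []


layer = IntelligenceLayer(
    world_model={
        "merchant-1": {"tax_deadline_near", "loan_approved"},
        "merchant-2": {"chargeback_spike"},
    }
)
print(layer.respond("merchant-1"))  # composed from two atoms
print(layer.respond("merchant-2"))  # nothing to compose
print(layer.failures)               # input to the next roadmap discussion
```

The `failures` list is the interesting part: it records exactly which signal combinations the system could not serve, which is what a failure-driven roadmap would consume.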

2.1 Block Is Special -- Don't Rush to Copy

The architecture sounds elegant. But Block's conditions are extremely specific.

Block's "world model" runs on payment data. Every transaction is a structured event -- amount, merchant, category, timestamp, buyer, seller. This is among the highest signal-to-noise data in all of commerce.

Compare that to a typical B2B SaaS company: its customer interaction data consists of click events, dwell times, feature usage counts. Try building a "world model" from that. What you'll get is a slightly smarter dashboard, not an intelligence layer that autonomously discovers needs.

The gap isn't in the model. It's in the structural quality and signal density of the data.

Building Mio taught me a deep lesson about this: context quality determines the ceiling of intelligence, not the model itself. The raw capability gap between GPT-4, Claude, and Gemini is narrowing. But whoever has richer, more precise, longer-accumulated context delivers more value.

Block's failure-driven roadmap also has a prerequisite: failures from the intelligence layer must be meaningful signals. When the model is strong enough and the data dense enough, a single failure can precisely pinpoint "this capability atom doesn't exist" or "this combination pathway hasn't been defined" -- that's a useful product signal.

But with a weak model and sparse data? Every failure's root cause is "not enough data" or "model isn't smart enough." That's not a product signal. That's a capability deficit. The failure-driven roadmap degenerates into "give me more data" on repeat.

So within Block's architecture, the transferable and non-transferable parts need to be separated.

Transferable: the three role definitions (IC / DRI / player-coach), the failure-driven roadmap philosophy, eliminating permanent middle management. These are organizational design choices that don't depend on data quality. Any company can adopt them.

Non-transferable: the world model, the intelligence layer's autonomous composition capability. These depend on Block's uniquely high-density structured payment data. Most companies don't have an equivalent data asset.

The corollary: most companies will settle in a hybrid state -- AI handles 80% of information routing, humans retain exception handling and final accountability. Organizations can get flatter, but they won't get flat. Middle layers will compress, but they won't disappear.

3. SaaS Will Fracture Into Four Categories

Work is changing. Organizations are changing. What about SaaS?

Back to the question in the title: do we still need SaaS?

The answer isn't "yes" or "no." It's "which kind."

Category one: pure intelligence tools -- replaced.

QuickBooks-style products. Rules are complex but deterministic, and AI can complete the entire task end-to-end. Users no longer need a "tool that helps you do bookkeeping" -- they want "books that are already done." Companies are already doing this: Crosby automates NDA generation, WithCoverage automates insurance quoting. These SaaS products will shift from selling licenses to selling outcomes, or get eaten by autopilot service providers.

Category two: judgment-intensive -- become copilots.

Legal strategy, management consulting, creative design. Harvey in legal is the prototype -- AI handles enormous amounts of supporting work, but final decisions require human judgment. These products survive, but change form -- from "feature-rich tools" to "assistants that augment human judgment." The professionals stay. The tools change shape.

Category three: infrastructure -- actually benefits.

Cloud computing (AWS/GCP/Azure), databases, payment rails (Stripe), API gateways. Autopilot needs to call more infrastructure, process more requests, store more context. More AI agents means more API calls, more compute demand, more data storage. Usage for these companies grows as autopilot adoption increases.

Category four: new category -- context infrastructure.

This is the most interesting, and currently the most vacant, category.

It spans four directions:

Authorization management -- who permits AI to do what? Not traditional RBAC (role-based access control), but "can this AI agent sign contracts under $50K on the company's behalf? Can it approve expense reports under $10K?" This is an entirely new permission system, far more granular than traditional IAM, and it needs to adjust dynamically.

Evaluation frameworks -- is AI's output correct? As analyzed above, the answer depends on the domain. Accounting has GAAP. But insurance quotes? Legal strategy? Hiring recommendations? These domains don't even have consensus on what constitutes a correct answer. Who builds these standards?

Audit trails -- when something goes wrong, how do you trace back? What data did the AI see, what reasoning did it perform, at which decision node did it commit, was there hallucination? Traditional auditing examines human operation logs. AI auditing needs to examine reasoning chains. This capability barely exists today.

Context asset management -- the domain knowledge AI accumulates about a customer, an industry -- how is it stored, migrated, and priced? If a customer switches service providers, does this context travel with them? Or does it become the provider's asset? This question is the same one raised earlier about copilot data rights.

These four directions together constitute the "trust infrastructure" of the AI autopilot era.
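Two of these directions are easy to sketch in miniature. The example below pairs an amount-limited grant check (the "contracts under $50K, expenses under $10K" case) with a hash-chained audit log that makes after-the-fact tampering detectable. The `Grant` and `AuditLog` types, the agent ID, and the log fields are all assumptions for illustration; no existing product or standard is implied.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    agent_id: str
    action: str
    max_amount_usd: float  # the agent may act strictly below this amount


# Hypothetical grants mirroring the thresholds in the text.
GRANTS = (
    Grant("agent-7", "sign_contract", 50_000),
    Grant("agent-7", "approve_expense", 10_000),
)


def is_authorized(agent_id: str, action: str, amount_usd: float) -> bool:
    """True only if some grant covers this agent, this action, and this amount."""
    return any(
        g.agent_id == agent_id and g.action == action and amount_usd < g.max_amount_usd
        for g in GRANTS
    )


class AuditLog:
    """Append-only log where each entry's hash covers the previous entry's hash."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def record(self, event: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": prev_hash, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = ""
        for entry in self.entries:
            payload = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
for amount in (42_000, 50_000):
    allowed = is_authorized("agent-7", "sign_contract", amount)
    log.record({"agent": "agent-7", "action": "sign_contract",
                "amount_usd": amount, "allowed": allowed})

print([e["event"]["allowed"] for e in log.entries])
print(log.verify())
```

A real audit trail for AI would also need to capture the inputs the model saw and its reasoning steps, which is the part the text above says barely exists today. This sketch covers only the tamper-evidence of the record itself.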

No mature product covers these today. The product category hasn't even been named.

3.1 Pricing and Moats Get Rewritten

If the shift is from selling tools to selling outcomes, the commercial foundations need to be redefined.

Pricing. Per-seat logic collapses. If AI does 90% of the work, charging based on "how many people use it" stops making sense. Outcome-based pricing -- per closed set of books, per processed insurance claim, per generated contract -- becomes the norm. But this is a massive shock to the SaaS financial model: revenue shifts from predictable subscriptions to volatile transactions. Wall Street doesn't know how to value this yet. The ARR narrative gets harder to tell.
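The predictability problem is easy to see with a few invented numbers. The sketch below compares six months of per-seat revenue with six months of per-outcome revenue for one customer. Seat count, prices, and monthly volumes are all made up for illustration, and volatility is measured as the coefficient of variation (standard deviation divided by mean).

```python
from statistics import mean, pstdev

# Hypothetical customer: 10 seats at $100 each, or $10 per completed outcome.
SEATS = 10
PRICE_PER_SEAT_USD = 100
PRICE_PER_OUTCOME_USD = 10

# Invented monthly outcome counts (e.g. processed claims).
monthly_outcomes = [120, 80, 200, 60, 150, 90]

per_seat_revenue = [SEATS * PRICE_PER_SEAT_USD for _ in monthly_outcomes]
per_outcome_revenue = [n * PRICE_PER_OUTCOME_USD for n in monthly_outcomes]


def volatility(revenue: list[int]) -> float:
    """Coefficient of variation: how much revenue swings relative to its average."""
    return pstdev(revenue) / mean(revenue)


print("per-seat total:", sum(per_seat_revenue), "volatility:", volatility(per_seat_revenue))
print("per-outcome total:", sum(per_outcome_revenue),
      "volatility:", round(volatility(per_outcome_revenue), 2))
```

In this made-up case the outcome model earns more in total but swings month to month, while the seat model is flat. That trade is what makes the ARR story harder to tell.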

Moats. Traditional SaaS moats are feature richness and switching costs. In the autopilot era, the moat is context assets -- how much domain knowledge and evaluation data you've accumulated for this customer, this industry. This aligns with the digital twin logic I discussed in The Agent Economy: the real moat isn't capability, it's understanding.

Features can be copied. Understanding cannot.

An AI autopilot that has served the insurance industry for three years, accumulating hundreds of thousands of underwriting decisions with feedback data -- which prices were accurate, which payouts exceeded expectations, which risk factors were underweighted -- builds a living, continuously calibrating body of industry knowledge. A latecomer with equivalent model capability still needs the same amount of time to accumulate the same knowledge.

This is a moat built on time, not technology.

Where I Stand

My assessment:

Sequoia's directional read is correct. The intelligence layer getting replaced by autopilot is a deterministic trend. The shift from selling tools to selling outcomes is irreversible. Hierarchy's information-routing function will be massively compressed by AI.

But the transition period is much longer than optimists think.

It's not a technology problem. It's an institutional problem.

Who sets the evaluation standards? How is quality calibrated? How often are standards updated? Who bears responsibility when things go wrong? Who owns what AI learned from human judgment? Each of these questions requires industry consensus, and the speed of consensus formation isn't something technology can accelerate.

This parallels the printing press story -- technology instantly changed the cost structure of information distribution, but the institutional frameworks around the new technology (copyright law, publishing regulation, academic peer review) took over a century to mature.

The realistic state of most enterprises over the next few years is a hybrid: AI executes within rule boundaries, humans handle exceptions and bear responsibility. Not L5 full autonomy. L3 conditional autonomy.

For founders, the opportunity breaks down into three tiers:

First, on the left side of the spectrum (code, accounting) -- build autopilot. Evaluation infrastructure already exists. Competitive, but deployable.

Second, in the middle ground (insurance, tax, medical billing) -- run the copilot-to-autopilot playbook. Use the copilot phase to accumulate judgment data while simultaneously pushing for the establishment of evaluation standards.

Third, build the context infrastructure itself -- authorization, evaluation, audit, context asset management. This is the hardest play, but also the deepest moat.

What these three tiers share: none of them are purely technical problems. All of them require understanding the industry, understanding the institutional fabric, understanding how trust gets built.

The winners of the tool era understood software. The winners of the outcomes era will understand work.