Prompt Cache Is Architecture, Not an Optimization


In Part 1, I argued that the real product is the runtime harness, not the model. In Part 2, I showed that the core of that harness is a stateful query loop. This final part zooms into one specific mechanism that quietly shapes everything else: prompt caching.

There was one part of the Claude Code source snapshot that made me stop and reread the file twice.

It wasn't the permission system. It wasn't the tool pipeline. It wasn't even the query loop.

It was the forking logic.

More specifically: the way forked workers are engineered so they can share a byte-identical request prefix with the parent and with each other.

That sounds niche. It isn't.

It's one of the clearest signs I've seen that serious agent systems eventually start treating prompt cache as a structural resource, not a nice bonus.

The Intuition Most People Start With

Most people think about prompt cache like they think about CDN cache.

Helpful when it hits. Nice for cost. Maybe good for latency. But not something you would shape your whole architecture around.

That intuition works if your agent mostly does short, isolated tasks.

It breaks the moment you have:

  • long sessions
  • repeated context reuse
  • background agents
  • subagents
  • multi-agent fan-out
  • expensive system prompts and tool schemas

At that point, cache stops being incidental.

If you lose it too often, you're not just wasting money. You're changing what kinds of workflows are economically viable.

The Weird Trick in the Snapshot

Here's the pattern that caught my attention.

When the runtime forks a worker from an assistant message that contains tool-use blocks, it does not just hand the child a fresh instruction and move on.

Instead, it does something much stranger:

  1. It keeps the full parent assistant message
  2. It constructs tool_result placeholders for every tool-use block
  3. It gives every placeholder the exact same text
  4. It appends the child-specific directive only at the end

Why?

Because the runtime wants all forked children to share the same request prefix.

Not approximately the same.

Not semantically the same.

Byte-identical.

That is a very different level of seriousness about cache behavior.
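The four steps above can be sketched as follows. This is an illustrative reconstruction, not the snapshot's actual code: names like `forkWorker` and `PLACEHOLDER_TEXT` are my own, and the block shapes are modeled loosely on the Anthropic Messages API.

```typescript
// Illustrative sketch of the fork pattern; identifiers are assumptions.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Message { role: "user" | "assistant"; content: Block[] }

// Step 3: every placeholder gets the exact same text, so every child's
// serialized request is byte-identical up to the final directive.
const PLACEHOLDER_TEXT = "Tool executed in forked context; result elided.";

function forkWorker(history: Message[], parent: Message, directive: string): Message[] {
  // Step 2: one synthetic tool_result per tool-use block in the parent.
  const placeholders: Block[] = parent.content
    .filter((b): b is Extract<Block, { type: "tool_use" }> => b.type === "tool_use")
    .map((b) => ({
      type: "tool_result" as const,
      tool_use_id: b.id,
      content: PLACEHOLDER_TEXT,
    }));
  return [
    ...history,
    parent, // Step 1: the full parent assistant message is kept
    // Step 4: only the child-specific directive varies, at the very end.
    { role: "user", content: [...placeholders, { type: "text", text: directive }] },
  ];
}
```

Because only the final text block differs, every child's request serializes to the same bytes right up to its directive.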

Why This Matters So Much

Imagine the parent has already paid to build a huge context prefix:

  • system prompt
  • tool schemas
  • conversation history
  • previous assistant output

Now you want to fan out three workers.

If each worker rebuilds that prefix in a slightly different shape, you pay for it three more times.

If the runtime can keep the cache-critical prefix identical and only vary the final instruction block, suddenly fan-out gets much cheaper.
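A back-of-envelope calculation makes the gap concrete. The token counts and prices below are invented for illustration; the only load-bearing assumption is that providers bill cache-hit prefix tokens at a fraction of the fresh rate.

```typescript
// Illustrative numbers only, not real billing figures.
const PREFIX_TOKENS = 50_000;   // system prompt + schemas + history
const SUFFIX_TOKENS = 500;      // the child-specific directive
const FRESH = 3.0 / 1_000_000;  // $ per fresh input token (assumed)
const CACHED = 0.3 / 1_000_000; // $ per cache-hit token (assumed 10% of fresh)

function fanOutCost(workers: number, prefixCached: boolean): number {
  const prefixRate = prefixCached ? CACHED : FRESH;
  // Each worker pays for the prefix (fresh or cached) plus its own suffix.
  return workers * (PREFIX_TOKENS * prefixRate + SUFFIX_TOKENS * FRESH);
}

console.log(fanOutCost(3, false).toFixed(4)); // 0.4545: every worker rebuilds the prefix
console.log(fanOutCost(3, true).toFixed(4));  // 0.0495: workers share a cached prefix
```

Roughly an order of magnitude, and the gap widens as the prefix grows relative to the suffix.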

That changes behavior.

It means the product can afford to:

  • spawn more sidecar agents
  • summarize background workers more often
  • extract memory continuously
  • compact using a forked summarizer
  • let long-lived workflows branch without exploding cost

So yes, this is about caching.

But really it's about what the product is allowed to do at all.

The Snapshot Builds Around Cache Identity Everywhere

Once I noticed the fork trick, I started seeing the same principle across the repo.

CacheSafeParams

The forking helper has an explicit type for the parameters that must stay identical:

  • system prompt
  • user context
  • system context
  • tool-use context
  • parent context messages

That alone is revealing.

The runtime doesn't just "reuse some context."

It formally models the cache-critical portion of the request.
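In spirit, that type looks something like the sketch below. The field names are guesses derived from the bullet list above, not the snapshot's actual definitions; the point is the shape of the idea, not the exact members.

```typescript
// Hypothetical reconstruction: field names are assumptions.
interface CacheSafeParams {
  systemPrompt: string;
  userContext: string;
  systemContext: string;
  toolUseContext: unknown;
  parentContextMessages: unknown[];
}

// Anything outside this type may vary between forks; everything inside it
// must be passed through verbatim so the serialized prefix stays identical.
function forkParams(parent: CacheSafeParams, directive: string) {
  return { ...parent, directive }; // cache-critical fields copied untouched
}
```

Giving the cache-critical subset its own type means the compiler, not a code review, enforces which parts of a request are allowed to vary.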

Compaction avoids breaking cache when it can

The compaction path has an entire branch for prompt-cache-sharing.

Even more telling, the code explicitly warns not to set max output tokens in the cache-sharing path, because doing so would alter the thinking configuration and break cache identity.

That kind of comment only exists when a team has been burned by subtle cache misses enough times to start designing around them proactively.
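A toy model shows why a "harmless" setting can do this. Assume, for illustration, that the provider keys its cache on a hash of the serialized request prefix; then adding any field at all produces a different key. The field names and hashing scheme here are mine, not the provider's.

```typescript
import { createHash } from "node:crypto";

// Toy model of provider-side prefix matching: cache key = hash of the
// serialized request. All field names below are illustrative.
function cacheKey(request: Record<string, unknown>): string {
  return createHash("sha256").update(JSON.stringify(request)).digest("hex");
}

const base = { system: "big shared prompt", tools: ["read", "edit"], messages: [] as unknown[] };

const shared = cacheKey(base);
const withMax = cacheKey({ ...base, max_tokens: 4096 }); // one extra field

console.log(shared === withMax); // false: the "harmless" setting forked the key
```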

Subagent context cloning has cache reasons too

Even the subagent context constructor is partially justified in terms of preserving identical replacement behavior for tool results, so later requests don't diverge in wire format.

Again: not a late optimization.

An architectural concern.

This Is Why "Just Spawn a Worker" Is Misleading

People often talk about multi-agent systems as if delegation were mostly a planning problem.

Break the work apart. Assign a worker. Collect the result.

But if every extra worker duplicates the most expensive part of the context, delegation becomes costly in a very literal sense.

Which means worker topology is downstream from cache strategy.

That is one of the deepest lessons in this codebase.

You can't separate:

  • orchestration design
  • context design
  • cache design

They're the same system.

The moment you add subagents, your cache policy starts shaping your product behavior.

A Better Way to Think About Agent Cost

People still talk about agent cost as if it were mostly "how many tokens did the model consume?"

That's not wrong.

It's just incomplete.

A better question is:

"How much repeated expensive prefix am I forcing the system to rebuild?"

That gets you closer to the real design problem.

Because a lot of agent work is not expensive due to the final answer.

It's expensive because of all the context you had to drag back into the room to get there.

This is why the cache-aware fork pattern is such a big deal.

It says:

Don't only optimize outputs. Optimize prefix reuse.

That's a much more powerful idea.

The Tradeoff: Cache Discipline Makes the Runtime More Fragile

There is a cost to this.

The more your runtime depends on cache identity, the more seemingly harmless refactors become expensive.

Change:

  • tool ordering
  • prompt rendering
  • model settings
  • replacement-state behavior
  • summary formatting

And you may quietly destroy the cache behavior that made the whole system economical.

That is the downside of building around cache.

You gain leverage.

You also gain a new class of subtle regressions.

In a sense, the runtime becomes more like a distributed system:

  • small mismatches matter
  • byte-level identity matters
  • "equivalent" is not equivalent enough

That is not free complexity.

It's just often worth it.
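One concrete instance of "equivalent is not equivalent enough": two objects with the same fields in a different declaration order are deep-equal as data, but serialize to different bytes, and therefore to different cache entries in any byte-keyed scheme.

```typescript
// Two semantically identical tool schemas, keys declared in different order.
const a = { name: "read", description: "Read a file" };
const b = { description: "Read a file", name: "read" };

const wireA = JSON.stringify(a);
const wireB = JSON.stringify(b);

// Deep-equal as data...
console.log(JSON.stringify(Object.entries(a).sort()) ===
            JSON.stringify(Object.entries(b).sort())); // true
// ...but not on the wire.
console.log(wireA === wireB); // false: different bytes, different cache entry
```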

Why This Pattern Will Matter More Over Time

I think this idea matters even more in the future than it does now.

Why?

Because the more successful agent products become, the more they will:

  • maintain long histories
  • branch work into multiple workers
  • attach more tools
  • integrate more deeply with real environments

All of that increases prefix cost.

And as prefix cost rises, cache-aware runtime design becomes more central.

Which means the best agent systems may start to look less like "prompt engineering products" and more like "cache-preserving execution engines with a model inside."

That sounds dramatic.

But the source snapshot points in exactly that direction.

The Broader Lesson

The deepest lesson here isn't even specifically about cache.

It's this:

Once a product becomes an agent runtime, cost, latency, and control flow stop being separable concerns.

You don't get to design:

  • orchestration first
  • performance later

You have to design them together.

The forked-placeholder pattern is a perfect example.

It's an orchestration trick whose entire purpose is performance economics.

And because it changes the economics, it changes what orchestration patterns the product can afford to use.

That's what architecture looks like.

So What Should Builders Copy?

If you're building your own agent system, I wouldn't copy the exact implementation blindly.

I would copy the mindset.

Specifically:

  1. Treat expensive prefixes as reusable assets
  2. Identify which parts of a request must remain identical for reuse
  3. Design worker fan-out so only the smallest suffix varies
  4. Assume cache behavior can be broken by seemingly harmless mutations
  5. Make cost structure visible to runtime design, not just to finance dashboards

That mindset is worth more than any one helper function.
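Point 4 in particular is cheap to enforce. One sketch of how, under assumed names (`renderPrefix` and the fingerprint scheme are mine): fingerprint the rendered cache-critical prefix and check it in, so any refactor that changes the bytes fails a test instead of silently raising your bill.

```typescript
import { createHash } from "node:crypto";

// Regression guard sketch: hash the rendered cache-critical prefix so
// "harmless" refactors that change its bytes fail loudly in CI.
function renderPrefix(systemPrompt: string, toolSchemas: string[]): string {
  return JSON.stringify({ system: systemPrompt, tools: toolSchemas });
}

function prefixFingerprint(systemPrompt: string, toolSchemas: string[]): string {
  return createHash("sha256").update(renderPrefix(systemPrompt, toolSchemas)).digest("hex");
}

// In CI you would compare this against a checked-in known-good value.
const current = prefixFingerprint("You are a coding agent.", ["read", "edit"]);
console.log(current.length); // 64-char hex digest
```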

The Takeaway

Prompt cache sounds like an optimization because most people encounter it from the outside.

From the inside of a serious runtime, it doesn't look like an optimization at all.

It looks like a budget for how much branching, memory maintenance, and background work the product is allowed to do.

That makes it architecture.

Not glamorous architecture.

But the kind that quietly determines whether the whole system scales from a neat demo to something you can actually afford to run.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0