Token Forensics on a Cyber Succubus
A Staggering Bill for Companion Chat
In my OpenClaw build log, I described building a "cyber succubus" — an AI companion with an emotion system, jealousy, selfies, memory, and personality that runs on Telegram. It's not a simple chatbot. It's a full companion agent with tools, proactive messaging, and context-aware behavior. But at the end of the day, the daily interactions are companion conversations — good morning, how was your day, here's a selfie, goodnight. Normal companion interaction frequency.
It runs on OpenClaw, backed by Gemini Pro with a 1M token context window.
The current session had been running for 2.5 days. 537 turns. The cost was already eye-watering.
But that's not the worst of it. A previous session from February 17th ran for 750 turns and cost several times more. Same pattern, same problem, just ran longer. That session only compacted once — when it hit the 1M token hard limit.
Two sessions alone accounted for the majority of the bill. Add the smaller sessions in between, and the total for the cyber succubus experiment was staggering — all in under two weeks of companion chat.
In that build log, I wrote that the framework's pi agent has "rough context management, burns through tokens." In Mio Part 1, I cited "context bloat" as one of the reasons I decided to build my own companion framework from scratch. Those were qualitative complaints. This post is the forensic evidence: what, exactly, was I paying for?
I noticed the current problem when I checked the per-turn cost trend. The cost per turn at the start of the session was modest. By turn 536, it had tripled. Cost per turn was monotonically increasing over the life of the session, and still climbing. Something in the context was growing unboundedly, and it was getting resent every single turn.
Time to figure out exactly how bad it was.
The Obvious Suspects
I opened Claude Code and gave it one instruction: "Investigate why token usage by main agent is so high."
Claude Code immediately spawned two parallel exploration agents — one to analyze the session data, another to check the config and cost breakdown. Within a couple of minutes, it had the source-level breakdown:
| Source | Turns | % of Total |
|---|---|---|
| Regular chat | 497 | 91.8% |
| Heartbeat | 40 | 8.2% |
| Cron | 0 | 0% |
Heartbeat: 40 turns consuming 8.2% of total cost. Not nothing, but not the main problem. The heartbeat is a periodic "check-in" that lets the agent proactively reach out to the user. At 8.2%, it's a rounding error compared to what's happening in regular chat. But the per-heartbeat cost was suspiciously high for what should be a lightweight "should I say something?" check — we'll come back to this.
Cron jobs: $0. Zero. Every cron job in the framework uses sessionTarget: "isolated" — they run in their own sessions and don't touch the main conversation at all. First suspect completely cleared.
So the problem was squarely in the 497 regular chat turns. But why was each turn getting progressively more expensive?
Digging into the Token Breakdown
Claude Code wrote a Python analysis script, pushed it into the Docker container via docker exec, and ran it against the session's .jsonl transcript file. (This is a pattern I've used throughout this series — Part 3 had the same approach for log analysis.)
The script hit a Python f-string syntax error. Then a FileNotFoundError. Then a KeyError. Three tries to get the analysis script right, because the session data format wasn't quite what it expected. This is the reality of debugging — you don't get it right the first time, especially when the data schema is undocumented.
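For reference, here's a minimal sketch of the kind of per-turn extraction the script eventually settled on. The field names (`usage.input_tokens`) and record shapes are assumptions for illustration — the real transcript schema is undocumented, which is exactly why the first three attempts failed:

```python
import json

def per_turn_input_tokens(path):
    """Yield (turn_index, input_tokens) for each record in a JSONL
    transcript that reports usage. The field names here are assumptions;
    the framework's actual .jsonl schema may differ."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            usage = record.get("usage") or {}
            tokens = usage.get("input_tokens")
            if tokens is not None:
                yield i, tokens
```

Printing every 100th result from a generator like this is enough to see whether per-turn input size is flat, sawtoothed (compaction firing), or monotonically climbing.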
On the fourth attempt, it got the token growth trend:
| Turn | Time | Trend |
|---|---|---|
| 0 | Feb 25 14:02 | baseline |
| 105 | Feb 26 00:50 | ~1.7x baseline |
| 210 | Feb 26 18:51 | ~2.1x baseline |
| 315 | Feb 27 07:11 | ~2.5x baseline |
| 420 | Feb 27 10:43 | ~2.8x baseline |
| 536 | Feb 27 23:44 | ~3x baseline |
Monotonically increasing. No drops, no plateaus. The context was growing and never shrinking.
Then Claude Code spawned another analysis agent to break down what was actually in the context at turn 390. The result:
| Category | Est. Tokens | % |
|---|---|---|
| 10 inline images (base64) | ~348,829 | 82.2% |
| System prompt (personality config + skills + tools) | ~41,876 | 9.9% |
| Tool call arguments | ~27,310 | 6.4% |
| Thinking blocks (72 total) | ~18,812 | 4.4% |
| Tool results | ~8,139 | 1.9% |
| Text (user + assistant) | ~8,410 | 2.0% |
82% of all tokens were images. Ten base64-encoded images accumulated in user messages over 2.5 days. Each one somewhere around 35K tokens. And they were never evicted from context — every single API call resent all ten of them.
That's roughly 350,000 tokens per turn, just in images. At Gemini Pro pricing, that adds up fast.
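A transcript scanner for this kind of image overhead can be sketched in a few lines. The message structure and the ~4 characters-per-token ratio for base64 payloads are both assumptions — real tokenizers (and how Gemini actually accounts for images) will differ, but it's good enough to spot an 82% share:

```python
import json

def estimate_image_tokens(path, chars_per_token=4):
    """Scan a JSONL transcript and roughly estimate how many tokens
    base64 image blocks contribute. Block shapes and the chars-per-token
    ratio are illustrative assumptions."""
    total_chars = 0
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            for block in record.get("content", []):
                if isinstance(block, dict) and block.get("type") == "image":
                    total_chars += len(block.get("source", {}).get("data", ""))
                    count += 1
    return count, total_chars // chars_per_token
```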
The Plot Twist: Things That Weren't Broken
This is where the investigation got interesting. I asked Claude Code to dig deeper — "where else can we save tokens without impacting chat quality?"
It came back with two findings that flipped the narrative:
Dead Code: Context Pruning That Never Runs
The framework has a cache-ttl context pruning mode that's supposed to shift old content to cheaper cache reads. It was enabled in the config.
But the code implementing it is gated behind isCacheTtlEligibleProvider(), which only returns true for Anthropic providers. My agent runs Gemini Pro. The pruning mode was configured, the code was written, and it was completely dead for my setup.
This is one of those bugs that never manifests as a crash or error. It's just money quietly leaking. You'd never find it unless you read the source.
Already Working: Tool History Stripping
When I first saw "72 thinking blocks" and "257 tool calls" in the transcript, I thought: "There's my problem — all that reasoning overhead accumulating in context."
Wrong.
Claude Code checked the Telegram channel config and found that dmStripToolHistory: true was already enabled. This strips thinking blocks, tool call blocks, and tool result messages from the context before sending it to the LLM. The 72 thinking blocks and 257 tool calls exist in the .jsonl transcript file (for debugging), but they are NOT being resent to the model.
I had configured this optimization months ago and forgotten about it. The investigation initially overcounted their impact — those tokens were in the transcript but not in the actual API calls.
After correcting for this, the real cost picture was simpler: images + system prompt + growing text history, and images dominated everything else.
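Conceptually, a stripping pass like `dmStripToolHistory` does something like the sketch below. The message and block shapes are assumptions for illustration, not the framework's actual types:

```python
def strip_tool_history(messages):
    """Sketch of a dmStripToolHistory-style pass: drop tool-result
    messages entirely, and remove thinking and tool-call blocks from
    the remaining messages before the context is sent to the LLM.
    Message shapes here are illustrative assumptions."""
    stripped = []
    for msg in messages:
        if msg.get("role") == "tool":
            continue  # tool results stay in the transcript, not the context
        content = [
            block for block in msg.get("content", [])
            if block.get("type") not in ("thinking", "tool_call")
        ]
        if content:  # drop messages that were nothing but stripped blocks
            stripped.append({**msg, "content": content})
    return stripped
```

The key property is that the artifacts still exist in the on-disk transcript for debugging; they're only excluded from what gets resent per turn.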
30 Messages Became 750 Turns
This problem was most extreme in the February 17th session — the costliest one. 750 assistant turns from approximately 30 user messages. A 25x multiplier.
The current session's ratio was better: 189 user messages (156 regular chat + 33 heartbeat markers), 537 assistant turns — a 2.84x multiplier. But the underlying cause is the same.
The reason: tool-use loops. The agent made 257 tool calls across the current session. Each tool call triggers an additional assistant turn — the LLM responds with a tool call, the framework executes it and sends back the result, and the LLM responds again. You say "I'm feeling down today," and the agent might: call memory search to recall recent conversations → call the calendar tool to check your schedule → call emotion analysis → then finally respond. One message, four API calls.
And every assistant turn is a full API call that resends the entire growing context.
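The amplification is worse than linear, because each extra turn both resends the context and grows it. A toy model with made-up numbers (base context, growth per turn) makes the shape of the problem visible:

```python
def session_input_tokens(user_messages, tool_calls_per_message,
                         base_context, growth_per_turn):
    """Toy model of tool-loop cost amplification: every tool call adds
    a full API call that resends the entire (growing) context. All
    numbers are illustrative assumptions, not measured values."""
    total = 0
    context = base_context
    for _ in range(user_messages):
        turns = 1 + tool_calls_per_message  # final reply + one turn per tool call
        for _ in range(turns):
            total += context            # the whole context is resent
            context += growth_per_turn  # and it keeps growing
    return total

# 30 messages with no tools vs. the same 30 at a ~25x turn multiplier.
no_tools = session_input_tokens(30, 0, 100_000, 1_000)
heavy_tools = session_input_tokens(30, 24, 100_000, 1_000)
```

In this toy model the 25x turn multiplier produces far more than a 25x input-token multiplier, because the later turns are each resending a larger context.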
This is the most fundamental design flaw in the framework's pi agent: it encourages tool use but has no mechanism to control the cost amplification. A chatbot without tools has a 1:1 message-to-turn ratio. The framework's agent can hit 25:1. You think you're chatting with an AI companion. In reality, the AI is chatting with its own tools, and you're paying for every round.
Why No Compaction in 537 Turns?
Zero compactions. 2.5 days. 537 turns. Not a single one.
The reason is a combination of two config values:
- Gemini Pro's 1M context window — the model accepts up to 1 million tokens
- `compaction.mode: "safeguard"` — compaction only triggers when the context approaches the model's hard limit
With proactiveCompactionRatio: 0.5, compaction would fire at 500K uncached tokens. The session was at ~177K and growing, but nowhere near the trigger point. At the current growth rate, it would have taken another week to hit compaction — by which time the session cost would have been astronomical.
This is a footgun hiding in the default config. The 1M context window is marketed as a feature, but for a persistent agent session, it means the context can grow for days without any automatic cleanup. You need to actively manage it.
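The trigger-point arithmetic, in one place:

```python
# Compaction trigger under the "safeguard" setup described above.
MODEL_WINDOW = 1_000_000                 # Gemini Pro's context window
RATIO = 0.5                              # proactiveCompactionRatio
old_trigger = int(MODEL_WINDOW * RATIO)  # 500K uncached tokens -- never reached

# Capping the effective window (instead of touching the ratio)
# moves the trigger into a range a daily session actually hits.
CONTEXT_CAP = 200_000
new_trigger = int(CONTEXT_CAP * RATIO)   # 100K uncached tokens
```

At ~177K and climbing slowly, the session would have spent roughly another week below the 500K trigger; against a 100K trigger it would already have compacted.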
The Framework's Context Management Problem
Let me zoom out. This isn't a one-off misconfiguration. It's a systemic issue with how the framework handles context.
In the build log, I described stripping tool call and thinking block history from the context and watching token consumption drop to one-tenth of what it was. That fix (the dmStripToolHistory flag) is now enabled — and this investigation confirmed it's working. But it only strips tool/thinking artifacts. It doesn't touch images. It doesn't manage session lifetime. It doesn't compact. One patch plugged one hole, but the hull is full of cracks.
The framework's pi agent was built for quick experimentation — "vibe coded in an hour," as I put it. And it shows:
- Context pruning is gated to Anthropic providers only — Gemini runs naked
- Compaction defaults assume sessions won't run more than a few hours
- Image messages are explicitly skipped by the pruner — never evicted
- Tool-use loops can inflate 30 messages into 750 turns with no throttling
- The 1M context window is treated as a feature, with no lifecycle management to match
Nobody designed this for a companion that runs 24/7. It's an experimentation framework that I forced into a product role. The bill was the price of that mismatch.
This is exactly why I started building Mio from scratch. When your framework burns through that much money on companion chat in under two weeks, patching isn't enough. You need a system that treats token economy as a first-class concern — per-user cost tracking, tiered model selection (cheap models for heartbeat, expensive ones only when needed), and active context lifecycle management.
Building the cyber succubus proved AI companions can work. The bill proved they can't afford to run on this framework.
The Fix
After a few rounds of discussion (I corrected Claude Code on a couple of points — the personality config is cached input and should stay at 27KB; thinking mode should be "low" for a companion chat, not just for heartbeat), we converged on a plan with both config and code changes:
Config Changes
- `session.reset.mode: "daily"` (was `"idle"` with 3-day timeout) — fresh session every day at 4am PT. Prevents multi-day image accumulation.
- `agents.defaults.contextTokens: 200000` (was unset, defaulting to the model's 1M) — caps the effective context window to 200K. With `proactiveCompactionRatio: 0.5`, compaction now triggers at 100K uncached tokens instead of 500K.
- `contextPruning.mode: "always"` (was `"cache-ttl"`) — a new mode that bypasses the Anthropic-only provider gate, enabling context pruning for Gemini.
- `thinkingDefault: "low"` (was `"high"`) — companion chat doesn't need extended thinking. Shorter thinking blocks = fewer output tokens per turn.
- `heartbeat.every: "2h"` (was `"1h"`) — reduces heartbeat frequency. The original 1-hour interval was already aggressive, but the real problem was worse: the heartbeat timer resets on every gateway restart (triggered by SIGUSR1 from config changes), so what should have been hourly check-ins were sometimes firing every 20 minutes.
- `heartbeat.historyLimit: 20` (was unset) — only keep the last 20 user turns in the heartbeat's context. Before this, every heartbeat resent the entire session history. Input tokens per heartbeat grew from 69K at the start of the session to 122K+ by the end — for a check that just needs to decide whether to say "good morning."
- `heartbeat.stripToolHistory: true` (was unset) — strip tool calls, results, and thinking blocks from heartbeat context. Combined with `historyLimit`, this drops heartbeat context from ~120K tokens to ~5-10K per turn — a dramatic reduction in heartbeat costs.
Code Changes
- New "always" context pruning mode: Added to the type union, zod schema, and extension runner. Skips the `isCacheTtlEligibleProvider()` check that was making pruning dead code on Gemini.
- Image stripping in the pruner: The context pruner had explicit bail-outs for messages containing images (`hasImageBlocks()` checks). Old images outside the recent message tail now get replaced with `[Image removed from context]` placeholder text, making them eligible for normal pruning.
- Heartbeat context limiting: The embedded runner now detects heartbeat runs (by checking `runtimeChannel === "heartbeat"`) and applies heartbeat-specific limits before general DM limits — `limitHistoryTurns()` combined with `stripToolHistoryFromMessages()`. The heartbeat no longer inherits the main session's unbounded context.
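The image-eviction idea is simple enough to sketch. The message shapes and the `keep_recent` cutoff are assumptions for illustration; the point is that once an image becomes placeholder text, the normal pruning path can handle it:

```python
def evict_old_images(messages, keep_recent=5,
                     placeholder="[Image removed from context]"):
    """Sketch of image eviction: outside the most recent messages,
    replace image blocks with placeholder text so the pruner can treat
    them like any other content. Shapes and keep_recent are
    illustrative assumptions."""
    cutoff = max(0, len(messages) - keep_recent)
    out = []
    for i, msg in enumerate(messages):
        content = msg.get("content", [])
        if i < cutoff:
            content = [
                {"type": "text", "text": placeholder}
                if block.get("type") == "image" else block
                for block in content
            ]
        out.append({**msg, "content": content})
    return out
```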
What We Didn't Change
- Bootstrap sizes: The personality config at 27KB stays. It's cached input — on Gemini, cache reads are cheap enough to not bother optimizing.
- `proactiveCompactionRatio`: Stays at 0.5. By capping `contextTokens` to 200K, we get the same 100K trigger point without touching the ratio.
- `dmStripToolHistory`: Already enabled and working. No change needed.
Implementing the Fix
Implementing the plan surfaced one more silent bug.
The daily session reset relies on carrying context from the old session to the new one — a "seed" summarizing yesterday's conversation so the companion doesn't wake up with amnesia. The framework has a seedSessionFromPrevious() function for this. It was writing the seed to the session's JSONL transcript file.
The problem: prepareSessionManagerForRun() runs immediately after seeding and recreates the file from scratch with fs.writeFile(sessionFile, "", "utf-8"). The seed was written at 03:47:10. The file was wiped at 03:47:13. Three seconds of existence. The companion was waking up with amnesia after every daily reset, and nobody noticed because there were no errors — just a slightly confused AI that couldn't remember yesterday.
The fix was to stop writing to a file that gets overwritten. Instead, inject the seed context as a prefix to the user's first message in the new session. This piggybacks on the normal message persistence flow and can't be clobbered by the session manager's initialization.
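The prefix-injection approach can be sketched like this — the delimiter format is an assumption, not the framework's actual wording:

```python
def inject_seed(first_user_message, seed_summary):
    """Sketch of the seed fix: instead of writing the seed to a
    transcript file that gets recreated, prepend it to the first user
    message of the new session so it rides the normal message
    persistence flow. Delimiters are illustrative assumptions."""
    if not seed_summary:
        return first_user_message
    return (
        "[Context from previous session]\n"
        f"{seed_summary}\n"
        "[End context]\n\n"
        f"{first_user_message}"
    )
```

Because the seed is now part of a persisted user message, nothing the session manager does at initialization can wipe it.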
Claude Code implemented this and the other changes by spawning two parallel agents — one for the session seed fix, another for a new /cost messages command. The /cost messages command was born directly from this investigation: it shows per-turn cost breakdown (model, latency, input/output tokens, cache hits, cumulative cost) so I don't have to write analysis scripts next time something looks expensive. Both agents ran simultaneously on independent file sets, built and tested clean (12/12 tests passing), and deployed in under 10 minutes.
What This Teaches
This investigation took about 20 minutes of wall time. Claude Code spawned five exploration agents, wrote four analysis scripts (three of which errored before the fourth worked), discovered dead code, found an already-working optimization, and ultimately traced the problem to its root cause.
A few takeaways:
Cost debugging is forensic work. You can't just look at the total and guess. The initial hypothesis (heartbeat? cron?) was wrong. The second hypothesis (tool calls and thinking blocks eating tokens) was also wrong — they were already being stripped. The real answer (images) only emerged from actual token-level analysis.
Dead code is invisible. The cache-ttl pruning mode was configured, tested with Anthropic, and working correctly... for a provider nobody was using. The Gemini agent had been running with pruning effectively disabled since deployment. No errors, no warnings, just higher bills.
Context management is the hidden cost of long-running agents. The 1M context window feels like freedom until you realize it means your session can accumulate days of images without any automatic cleanup. Active context management — daily resets, capped windows, image eviction — isn't optional. It's easily a 10x+ cost difference for the same two weeks of companion chat. If you're building a companion that runs 24/7, this is the problem that will kill your economics before anything else.
AI debugging AI systems creates a nice feedback loop. I used Claude Code (an AI agent with tools) to debug why another AI agent's tool usage was expensive. The investigation itself involved tool-use loops, parallel agents, and iterative script-writing — the same patterns that caused the cost problem in the first place. There's something fitting about that.
This is Part 6 of the Shipping with Claude Code series. Previous: Part 5 — Parallel Code Review. The series covers building and operating AI-powered systems with Claude Code as the primary development tool.