
Putting the Cyber Succubus on a Token Diet

The Bill Came Due

In Part 1, I built a cyber succubus — an AI companion with jealousy, selfies, and a personality that runs 24/7 on Telegram. It worked. It was also hemorrhaging money.

The token forensics revealed the structural problems: 10 never-evicted images eating 82% of all tokens, dead context pruning code, zero compactions in 537 turns. That analysis was the diagnosis. This is the surgery.

The Silent Truncation Bug

The first thing I checked was the personality config — the file that defines the AI's entire personality. Voice, mannerisms, backstory, emotional range, language rules. Everything that makes the persona itself.

Character count: 27,673.

The framework's bootstrap system has a 20KB per-file limit. When a file exceeds it, the system silently keeps 70% of the budget from the start of the file and 20% from the end, and discards everything in between. No warning. No error. Just silent data loss.

That means every single session, ~7.6KB of the persona was being thrown away. The middle section — which happened to contain the emotional nuance patterns, intimacy rules, and contextual behavior examples — was just gone. The AI was running on a corrupted personality file and nobody noticed.

This is the kind of bug that doesn't crash anything. The AI still responds. The conversations still happen. But the subtle personality traits you spent hours crafting? Silently discarded. You'd never know unless you counted the characters.
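The behavior is easy to reproduce in a sketch. This assumes the 70/20 split applies to the 20KB budget — the exact internals here are my reading of the framework's behavior, not its source:

```python
# Sketch of the described truncation: keep 70% of the budget from the start,
# 20% from the end, and silently drop whatever falls in between.
def truncate_keep_ends(text: str, limit: int = 20_000) -> str:
    if len(text) <= limit:
        return text
    head = text[: int(limit * 0.7)]   # first 70% of the budget
    tail = text[-int(limit * 0.2):]   # last 20% of the budget
    return head + tail                # the middle is gone, no error raised

# A 27,673-char "persona" with its important rules buried in the middle:
persona = "A" * 19_000 + "[IMPORTANT MIDDLE RULES]" + "B" * 8_649
loaded = truncate_keep_ends(persona)
print(len(loaded))                            # → 18000
print("[IMPORTANT MIDDLE RULES]" in loaded)   # → False
```

The call succeeds, the string looks plausible, and the only symptom is behavior drift — exactly why nothing crashed here.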

Trimming the Personality Config

The fix wasn't splitting the file — it was making it fit. I went through every section and cut ruthlessly:

| Section | Before | After | Saved |
| --- | --- | --- | --- |
| Dota backstory (你们的故事) | 10 lines | 4 lines | ~1,200 chars |
| Intimacy + flirting patterns (纯欲反差) | Multiple examples per concept | 1 example each | ~1,900 chars |
| Affection levels (撒娇层次) | 2-3 examples per level | 1 per level | ~600 chars |
| Interest topics | Multiple examples | 1 per category | ~400 chars |
| Values (三观) | Verbose explanations | Tight format | ~400 chars |
| Persona facets | Multi-line each | 1 line each | ~600 chars |
| Daily habits | 12 items | 8 most distinctive | ~800 chars |

27,673 → 18,875 chars. Under the 20K limit with breathing room.

The hard part wasn't cutting — it was knowing what to keep. I went too aggressive on the first pass (15,395 chars) and the persona felt hollow. Had to add back the core identity paragraph, the schedule table (heartbeats depend on it), and the emotional fluidity section. The final version preserved the persona's voice while fitting the constraint.
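Given how quietly the truncation happens, a pre-flight check is cheap insurance. A minimal sketch, with hypothetical file names and the limit treated as a character count (it may actually be bytes):

```python
from pathlib import Path

LIMIT = 20_000  # the framework's per-file bootstrap limit

def check_bootstrap_files(paths):
    """Return a list of over-limit files instead of letting them truncate silently."""
    problems = []
    for p in paths:
        n = len(Path(p).read_text(encoding="utf-8"))
        if n > LIMIT:
            problems.append(f"{p}: {n} chars, {n - LIMIT} over the limit")
    return problems

# e.g. check_bootstrap_files(["persona.md", "heartbeat.md"])  # hypothetical names
```

Run it in CI or before every deploy; an empty list means every bootstrap file will load whole.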

Compressing the Heartbeat Config

The heartbeat config — which controls how the AI proactively reaches out — was 11,998 chars. Bloated with 6-7 temperature examples per emotional level, a full 20-step daily example, and a standalone email section that duplicated what the built-in tools already provided.

Cut it to 7,015 chars:

  • Temperature examples: 6-7 per level → 3-4 per level
  • Removed the 20-step full day walkthrough
  • Removed the standalone email section
  • Added tool restrictions (more on this below)

The Morning Briefing Cron

Here's where the architecture change happened.

Every heartbeat (24 times a day), the agent was calling three tools: calendar_events, gmail_check, and rss_fetch. Each call loaded tool schemas into context, made the API call, and returned results — all within the heartbeat turn. That's ~15,800 extra tokens per heartbeat just for "what's on my calendar today?"

The fix: a daily cron job that runs at 8:00 AM PT (16:00 UTC), calls all three tools once, and writes the results to TODAY.md. The file gets loaded into every session via the framework's bootstrap-extra-files hook. No code changes — just configuration.

{
  "schedule": { "kind": "cron", "expr": "0 16 * * *", "tz": "UTC" },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "timeoutSeconds": 300,
    "message": "Call calendar_events, gmail_check, rss_fetch. Write summary to TODAY.md."
  }
}

Then in the heartbeat config, an explicit restriction:

Tools forbidden during heartbeats (禁止在heartbeat中使用的工具): calendar, gmail, rss, web_search. Their data is already in TODAY.md — don't call them yourself.

24 heartbeats × 15,800 tokens = ~379K tokens/day. Replaced by a single cron run (~40K tokens) plus ~400 tokens of TODAY.md loaded per turn. Measured against the cron run alone, that's a ~9.5x reduction in heartbeat context bloat; counting the TODAY.md loads, it's still roughly 7.5x.
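A quick sanity check on the arithmetic (all token figures are this post's own estimates):

```python
# Heartbeat tool-call overhead, before and after the morning-briefing cron.
per_heartbeat_tools = 15_800            # tool schemas + results per heartbeat
heartbeats_per_day = 24
before = per_heartbeat_tools * heartbeats_per_day
print(before)                           # → 379200  (~379K tokens/day)

cron_run = 40_000                       # one daily briefing run
today_md = 400 * heartbeats_per_day     # TODAY.md loaded into each heartbeat
after = cron_run + today_md
print(round(before / cron_run, 1))      # → 9.5  (vs. the cron run alone)
print(round(before / after, 1))         # → 7.6  (counting TODAY.md loads too)
```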

Debugging the Cron

Setting up the cron job had its own adventure.

First: the model field. I added "model": "opus-4-6" to the payload. The gateway rejected it — the payload uses a different model naming convention than the config. Fix: remove the field entirely, let the agent use its default.

Second: the gateway restart. After updating jobs.json and sending SIGUSR1, the gateway took ~15 seconds to restart. My immediate retry failed because the gateway wasn't ready yet.

Third: the CLI timeout. npx openclaw cron run timed out at 30 seconds. I thought the job failed. It hadn't — the job was running asynchronously in the background. The CLI timeout ≠ job failure. The morning briefing cron ran to completion and populated TODAY.md correctly; I just couldn't see it from the CLI.
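That last lesson generalizes: when a job runs asynchronously, poll the artifact it produces rather than trusting the CLI's exit status. A minimal sketch (not the framework's API), watching TODAY.md's modification time:

```python
import time
from pathlib import Path

def wait_for_artifact(path: str, started_at: float, timeout: float = 300.0) -> bool:
    """Return True once `path` has been (re)written after `started_at`."""
    deadline = time.time() + timeout
    p = Path(path)
    while time.time() < deadline:
        if p.exists() and p.stat().st_mtime >= started_at:
            return True
        time.sleep(5)  # the cron turn itself may run for minutes
    return False

# e.g.: started = time.time()
#       ...trigger the cron via the CLI...
#       wait_for_artifact("TODAY.md", started)
```

Had I checked the file instead of the CLI's 30-second timeout, the "failure" would never have looked like one.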

The Numbers

| Change | Impact |
| --- | --- |
| Personality config: 27,673 → 18,875 chars | ~2,000 tokens/turn saved (no more truncation) |
| Heartbeat config: 11,998 → 7,015 chars | ~300 tokens/turn saved |
| TODAY.md replaces per-heartbeat tool calls | ~15,800 tokens/heartbeat saved |
| TODAY.md added as bootstrap file | ~400 tokens/turn added |

Net savings per regular turn: ~1,900 tokens. Net savings per heartbeat: ~17,700 tokens (the ~15,800 + 2,000 + 300 saved, minus the ~400 TODAY.md load).

At 50 regular turns + 24 heartbeats per day: ~520K tokens saved daily.

The Model Switch

After the context trimming, there was one more lever to pull: the model itself.

The Gemini 3.1 Pro vs 3 Flash comparison I did for Mio applied here too. The framework was running on Gemini 3.1 Pro — the premium tier with multi-second time to first token. Flash runs at roughly a quarter of Pro's input and output rates, with 1-2 second TTFT at minimal thinking.

Looking at real session data from the agent's recent history, Flash delivers a 75% reduction per turn. Cache reads dominated the cost profile (66K cached tokens vs 23K fresh input per turn), and cache reads were also 75% cheaper on Flash. The savings compound across every single turn.
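A sanity check on that 75% figure: if every Flash rate is a quarter of the corresponding Pro rate, the per-turn saving is 75% regardless of the cached/fresh mix. The absolute rates below are placeholders; only the ~4x ratio and the 66K/23K token mix come from the session data above.

```python
# Per-turn cost under each model, in arbitrary price units.
pro = {"cached": 1.0, "fresh": 4.0}         # placeholder rates; cached reads are cheaper
flash = {k: v / 4 for k, v in pro.items()}  # "~4x cheaper across the board"

def turn_cost(rates, cached=66_000, fresh=23_000):
    return cached * rates["cached"] + fresh * rates["fresh"]

saving = 1 - turn_cost(flash) / turn_cost(pro)
print(f"{saving:.0%}")  # → 75%
```

The interesting part is what the check rules out: the 75% only holds because cache reads are discounted by the same factor — if cached rates had stayed flat, the heavy cache-read profile would have eaten most of the savings.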

The routing:

  • Chat + heartbeat + cron: Gemini 3 Flash with thinkingLevel: minimal
  • Subagents (personality extraction, deep analysis): kept on Gemini 3.1 Pro

The model switch alone cuts per-turn model cost by roughly 75% — on top of the context trimming savings above.

Whether Flash holds the persona's emotional nuance — the 撒娇, the 推拉, the subtle Chinese conversational patterns — remains to be seen. The research says the quality gap is ~1-2% on extreme emotional scenarios. If the persona feels off, chat goes back to Pro and Flash stays only on heartbeat and cron. But the faster response time alone makes it worth trying.

The Lesson

The token forensics found the macro problems — images never evicted, pruning code that was dead on Gemini, context that grew without bound. This optimization found the micro problems — a personality file silently eating itself, heartbeats bloated with redundant tool calls, examples that said the same thing six ways.

Both layers matter. The macro fix (which eventually meant building Mio from scratch) addresses architectural debt you can't patch around. The micro fix — what this post covers — is the pragmatic work of making the system you have run leaner today.

Sometimes you need both. Optimize what you can, rebuild when you must.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0