
v0.0.6: 4x Cheaper by Knowing Which Model to Use

The Latency Problem

After v0.0.5 shipped media support, Mio could see images, hear voice messages, and display emoji. Feature-complete for a companion chat. But there was a problem I'd been ignoring: response latency.

Gemini 3 Pro Preview — the model powering all of Mio's chat — took 8-10 seconds for a first token when thinking was enabled. In a companion conversation, that's an eternity. You send "hey, how was your day?" and then you sit there watching a typing indicator. It kills the conversational illusion.

I knew the latency was bad. I'd been living with it during development because I was focused on features. But now that the features were done, the latency was the most obvious remaining problem.

Research First, Then Code

Instead of guessing, I did a proper comparison: Gemini 3.1 Pro vs Gemini 3 Flash, specifically for emotional companion AI.

The numbers told a clear story:

| Metric | 3.1 Pro | 3 Flash |
|---|---|---|
| Input cost | ~4x more expensive | baseline |
| Output cost | ~4x more expensive | baseline |
| Output speed | 90.8 t/s | 214 t/s |
| Time to first token (thinking) | 8-10s | 1-2s |
| Time to first token (minimal thinking) | N/A | 1-2s |

The cost difference was 4x. The latency difference was the one that mattered: 1-2 seconds vs 8-10 seconds.

But what about quality? For companion AI, emotional nuance is the product. If Flash couldn't hold a character, the cost savings would be meaningless.

The community consensus was surprisingly clear: Flash was a "side-grade to 2.5 Pro, not a downgrade." The quality gap was about 1-2% on extreme emotional nuance — expert writers could spot the difference, most users couldn't. Flash was actually better at narrative initiative and character commitment.

One critical finding: thinking mode actively degrades creative writing quality. Multiple developers reported that Gemini's reasoning mode makes emotional and creative output worse, not better. The recommendation was thinkingLevel: minimal for Flash — which, conveniently, also gave the 1-2 second TTFT.

The Routing Architecture

Not all tasks are equal. Daily chat needs speed. Personality extraction needs depth. The solution was obvious: route by task type.

Chat (90% of API calls): Gemini 3 Flash with thinkingLevel: minimal

  • 1-2s TTFT
  • 214 tokens/second throughput
  • ~4x cheaper than Pro per million tokens
  • Good enough emotional nuance for conversational flow

Premium tasks (10% of API calls): Gemini 3.1 Pro with thinkingLevel: low

  • Personality extraction from onboarding conversations
  • Memory summarization and emotional pattern analysis
  • Tasks where you pay once and cache the result
  • Deeper emotional nuance justified by low frequency

This split meant ~4x cost reduction on 90% of traffic, while actually improving quality on the 10% that matters most — because we upgraded premium tasks from 3 Pro to 3.1 Pro.
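The routing rule above is simple enough to sketch in a few lines. This is an illustrative shape, not the actual Mio code — the model IDs, the `TaskType` names, and the `RouteConfig` interface are assumptions:

```typescript
// Task-based model routing: fast/cheap for chat, deep/expensive for
// low-frequency premium tasks. Names and model IDs are illustrative.
type TaskType = "chat" | "personality_extraction" | "memory_summary";

interface RouteConfig {
  model: string;
  thinkingLevel: "minimal" | "low";
}

function pickModel(task: TaskType): RouteConfig {
  if (task === "chat") {
    // ~90% of calls: optimize for TTFT and cost.
    return { model: "gemini-3-flash", thinkingLevel: "minimal" };
  }
  // Premium tasks: pay once, cache the result, keep the nuance.
  return { model: "gemini-3.1-pro", thinkingLevel: "low" };
}
```

As a rough blended-cost check (assuming 3.1 Pro is priced similarly to 3 Pro): 0.9 × 0.25 + 0.1 × 1 ≈ 0.33 of the all-Pro baseline, i.e. about a 3x overall reduction even before caching the premium results.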

Vision That Actually Sees

v0.0.5 added image uploads. The AI could "see" images, but the descriptions were generic. Send a screenshot of a game and you'd get "I see a colorful game screen." Not useful for a companion who's supposed to care about what you're doing.

v0.0.6 rewrote the vision prompts to be specific: identify games by name, recognize brands and logos, describe locations with context. The thinking level for vision went from MINIMAL to LOW — a small latency cost for significantly better understanding.
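The shape of that change can be sketched as a version-keyed config. The prompt wording and the config structure here are hypothetical stand-ins for the real prompts:

```typescript
// Sketch of the v0.0.6 vision change: a specific-identification prompt
// plus a bumped thinking level. Prompt text is an assumption.
interface VisionConfig {
  prompt: string;
  thinkingLevel: "minimal" | "low";
}

function visionConfig(version: "v0.0.5" | "v0.0.6"): VisionConfig {
  if (version === "v0.0.5") {
    // Generic description, minimal thinking: fast but vague.
    return { prompt: "Describe what you see in this image.", thinkingLevel: "minimal" };
  }
  return {
    // Ask for concrete identification, not generic description.
    prompt:
      "Identify specifics: game titles, brands and logos, locations with context. " +
      "Name what you recognize instead of describing it generically.",
    thinkingLevel: "low", // small latency cost for better understanding
  };
}
```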

The difference: "I see a game" became "Looks like you're playing Genshin Impact — how's the new region?" That's the kind of response that makes a companion feel present.

Chinese Transcription Fix

Voice messages were coming through with English transcription artifacts. A user sends a voice clip in Mandarin, and the transcription would hallucinate English words or miss colloquial expressions.

Two fixes:

  1. OpenAI transcriber: Added explicit language: 'zh' parameter. Without it, the model was auto-detecting language per-segment and occasionally getting confused by code-switching or background noise.
  2. Gemini fallback prompt: Enhanced for colloquial Chinese expressions and proper nouns. When OpenAI's transcription wasn't confident enough, the Gemini fallback now handles slang, internet-speak, and Chinese-specific proper nouns correctly.

Small change, big impact. Voice is the most intimate input modality — getting the transcription wrong breaks immersion harder than anything else.
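The two fixes can be sketched as a pinned-language request plus a confidence gate for the fallback. The `-0.6` threshold and the fallback wiring are hypothetical; the `language` parameter on OpenAI's transcription endpoint is real:

```typescript
// Build transcription params with the language pinned to Mandarin,
// instead of per-segment auto-detection (which gets confused by
// code-switching and background noise).
function buildTranscriptionParams(file: Blob) {
  return {
    file,
    model: "whisper-1",
    language: "zh",
  };
}

// Route to the Gemini fallback (with its colloquial-Chinese prompt) when
// the primary transcription isn't confident enough. The threshold on the
// segment's average log-probability is an illustrative value.
function shouldFallback(avgLogprob: number, threshold = -0.6): boolean {
  return avgLogprob < threshold;
}
```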

DRY Refactor

A code quality issue had been nagging since v0.0.5: the POST /chat and POST /chat/stream handlers shared nearly identical logic for media resolution and context preparation. About 80 lines of duplicated code.

Extracted two shared helpers:

  • resolveMedia() — fetches pending uploads by mediaId, validates, returns processed media
  • prepareChatContext() — builds the conversation context with history, persona, and media

Both handlers now call the same functions. ~80 lines eliminated. The audit from v0.0.5 had flagged this as a MEDIUM issue — it was the right time to fix it.
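A minimal sketch of the shared-helper shape, with illustrative types — the real `resolveMedia()` and `prepareChatContext()` do more (upload fetching, validation, persona loading):

```typescript
// Shared helpers extracted from the duplicated POST /chat and
// POST /chat/stream logic. Types are simplified for illustration.
interface Media { mediaId: string; mimeType: string }
interface ChatContext { history: string[]; persona: string; media: Media[] }

// Fetch pending uploads by id; ids with no matching upload are skipped.
function resolveMedia(mediaIds: string[], pending: Map<string, Media>): Media[] {
  return mediaIds
    .map((id) => pending.get(id))
    .filter((m): m is Media => m !== undefined);
}

// Build the conversation context from history, persona, and resolved media.
function prepareChatContext(history: string[], persona: string, media: Media[]): ChatContext {
  return { history, persona, media };
}

// Both handlers now reduce to the same two calls:
//   const media = resolveMedia(req.mediaIds, pendingUploads);
//   const ctx = prepareChatContext(history, persona, media);
```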

Platform Risk

One finding from the research deserves its own section: Google explicitly does not want Gemini used for emotional companion AI. Their team lead said it publicly — Gemini is positioned as "a super tool, not an emotional companion."

Three concrete risks:

  1. An external safety filter operates independently and can delete responses mid-sentence, even with BLOCK_NONE
  2. Google has banned developers using the API for roleplay-adjacent use cases
  3. Future model versions may further restrict emotional/creative output

This doesn't change the v0.0.6 decision — Flash is still the best option today. But it does change the architecture requirement: Mio needs to be model-agnostic enough to swap providers without rewriting the application. The model routing layer we just built makes that easier, not harder.
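One way to keep that routing layer provider-agnostic is a registry behind a common interface, so swapping vendors means changing a map entry, not the application. The interface and names here are hypothetical, not Mio's actual abstraction:

```typescript
// Minimal provider registry: the app talks to ChatProvider, never to a
// vendor SDK directly. Swapping Gemini for another vendor means
// registering a different implementation under the same interface.
interface ChatProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

const providers = new Map<string, ChatProvider>();

function registerProvider(p: ChatProvider): void {
  providers.set(p.name, p);
}

function getProvider(name: string): ChatProvider {
  const p = providers.get(name);
  if (!p) throw new Error(`unknown provider: ${name}`);
  return p;
}
```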

The Numbers

v0.0.6 was six commits over a few hours:

| Change | Impact |
|---|---|
| Chat model: 3 Pro → 3 Flash (minimal thinking) | 4x cost reduction, 1-2 second TTFT |
| Premium tasks: 3 Pro → 3.1 Pro (low thinking) | Better emotional nuance where it counts |
| Vision prompts + thinking level | Specific identification instead of generic descriptions |
| Chinese transcription | Accurate Mandarin voice recognition |
| DRY refactor | −80 lines of duplicated code |
| Deployment docs | Correct Artifact Registry commands |

The most impactful release yet, measured by user experience per line of code changed.

Data-Driven, Not Vibes-Driven

Part 0 of this series was a forensic analysis of why the previous companion framework had an unsustainable early burn rate. The lesson was: understand your costs before they understand you.

v0.0.6 applies the same principle to model selection. Not "Pro sounds better than Flash" — but actual latency benchmarks, cost calculations, quality comparisons, and community experience. The research said Flash with minimal thinking was the right call. The deployment confirmed it.

Mio now responds in 1-2 seconds instead of 8-10. That single change makes the conversation feel natural instead of stilted.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0