# v0.0.6: 4x Cheaper by Knowing Which Model to Use

## The Latency Problem
After v0.0.5 shipped media support, Mio could see images, hear voice messages, and display emoji. Feature-complete for a companion chat. But there was a problem I'd been ignoring: response latency.
Gemini 3 Pro Preview — the model powering all of Mio's chat — took 8-10 seconds to produce a first token when thinking was enabled. In a companion conversation, that's an eternity. You send "hey, how was your day?" and then you sit there watching a typing indicator. It kills the conversational illusion.
I knew the latency was bad. I'd been living with it during development because I was focused on features. But now that the features were done, the latency was the most obvious remaining problem.
## Research First, Then Code
Instead of guessing, I did a proper comparison: Gemini 3.1 Pro vs Gemini 3 Flash, specifically for emotional companion AI.
The numbers told a clear story:
| Metric | 3.1 Pro | 3 Flash |
|---|---|---|
| Input cost | ~4x more expensive | baseline |
| Output cost | ~4x more expensive | baseline |
| Output speed | 90.8 t/s | 214 t/s |
| Time to first token (thinking) | 8-10s | 1-2s |
| Time to first token (minimal thinking) | N/A | 1-2s |
The cost difference was 4x. The latency difference was the one that mattered: 1-2 seconds versus 8-10.
But what about quality? For companion AI, emotional nuance is the product. If Flash couldn't hold a character, the cost savings would be meaningless.
The community consensus was surprisingly clear: Flash was a "side-grade to 2.5 Pro, not a downgrade." The quality gap was about 1-2% on extreme emotional nuance — expert writers could spot the difference, most users couldn't. Flash was actually better at narrative initiative and character commitment.
One critical finding: thinking mode actively degrades creative writing quality. Multiple developers reported that Gemini's reasoning mode makes emotional and creative output worse, not better. The recommendation was `thinkingLevel: minimal` for Flash — which, conveniently, also gave the 1-2 second TTFT.
## The Routing Architecture
Not all tasks are equal. Daily chat needs speed. Personality extraction needs depth. The solution was obvious: route by task type.
**Chat (90% of API calls):** Gemini 3 Flash with `thinkingLevel: minimal`
- 1-2s TTFT
- 214 tokens/second throughput
- ~4x cheaper than Pro per million tokens
- Good enough emotional nuance for conversational flow
**Premium tasks (10% of API calls):** Gemini 3.1 Pro with `thinkingLevel: low`
- Personality extraction from onboarding conversations
- Memory summarization and emotional pattern analysis
- Tasks where you pay once and cache the result
- Deeper emotional nuance justified by low frequency
This split meant ~4x cost reduction on 90% of traffic, while actually improving quality on the 10% that matters most — because we upgraded premium tasks from 3 Pro to 3.1 Pro.
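The split above is simple enough to express as a small routing table. The model names and `thinkingLevel` values mirror the ones in this post, but the helper and its shape are an illustrative sketch, not Mio's actual code:

```typescript
// Hypothetical task-type router: chat goes to Flash, premium analysis to Pro.
type TaskType = "chat" | "personality_extraction" | "memory_summary";

interface ModelRoute {
  model: string;
  thinkingLevel: "minimal" | "low";
}

const ROUTES: Record<TaskType, ModelRoute> = {
  // ~90% of calls: speed and cost win.
  chat: { model: "gemini-3-flash", thinkingLevel: "minimal" },
  // ~10% of calls: pay once, cache the result, so depth wins.
  personality_extraction: { model: "gemini-3.1-pro", thinkingLevel: "low" },
  memory_summary: { model: "gemini-3.1-pro", thinkingLevel: "low" },
};

function routeModel(task: TaskType): ModelRoute {
  return ROUTES[task];
}
```

Centralizing the choice in one lookup also means a future provider swap touches one file instead of every call site.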
## Vision That Actually Sees
v0.0.5 added image uploads. The AI could "see" images, but the descriptions were generic. Send a screenshot of a game and you'd get "I see a colorful game screen." Not useful for a companion who's supposed to care about what you're doing.
v0.0.6 rewrote the vision prompts to be specific: identify games by name, recognize brands and logos, describe locations with context. The thinking level for vision went from `MINIMAL` to `LOW` — a small latency cost for significantly better understanding.
The difference: "I see a game" became "Looks like you're playing Genshin Impact — how's the new region?" That's the kind of response that makes a companion feel present.
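The rewritten prompt itself isn't in this post, but its shape might look like this sketch. The wording and constant names are illustrative assumptions; only the "specific, not generic" instructions and the `low` thinking level come from the release:

```typescript
// Illustrative vision prompt: push the model toward concrete identification
// instead of generic scene descriptions.
const VISION_PROMPT = [
  "Describe the image specifically, not generically:",
  "- If it's a game, identify it by name.",
  "- Recognize brands and logos when visible.",
  "- Describe locations with enough context to react to.",
].join("\n");

// Was "minimal" before v0.0.6; "low" trades a little latency for grounding.
const VISION_THINKING_LEVEL = "low";
```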
## Chinese Transcription Fix
Voice messages were coming through with English transcription artifacts. A user sends a voice clip in Mandarin, and the transcription would hallucinate English words or miss colloquial expressions.
Two fixes:
- **OpenAI transcriber:** Added an explicit `language: 'zh'` parameter. Without it, the model was auto-detecting the language per segment and occasionally getting confused by code-switching or background noise.
- **Gemini fallback prompt:** Enhanced for colloquial Chinese expressions and proper nouns. When OpenAI's transcription wasn't confident enough, the Gemini fallback now handles slang, internet-speak, and Chinese-specific proper nouns correctly.
Small change, big impact. Voice is the most intimate input modality — getting the transcription wrong breaks immersion harder than anything else.
## DRY Refactor
A code quality issue had been nagging since v0.0.5: the `POST /chat` and `POST /chat/stream` handlers shared nearly identical logic for media resolution and context preparation. About 80 lines of duplicated code.
Extracted two shared helpers:
- `resolveMedia()` — fetches pending uploads by mediaId, validates them, and returns processed media
- `prepareChatContext()` — builds the conversation context with history, persona, and media
Both handlers now call the same functions. ~80 lines eliminated. The audit from v0.0.5 had flagged this as a MEDIUM issue — it was the right time to fix it.
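The helper names come from this release, but their signatures aren't shown in the post, so the sketch below assumes plausible ones (the pending-upload store and all types are hypothetical):

```typescript
// Minimal types for the sketch; the real ones are richer.
interface Media { id: string; url: string }
interface ChatContext { history: string[]; persona: string; media: Media[] }

// Hypothetical store of uploads waiting to be attached to a message.
const pendingUploads = new Map<string, Media>();

function resolveMedia(mediaIds: string[]): Media[] {
  // Fetch pending uploads by id; drop anything that never finished uploading.
  return mediaIds
    .map((id) => pendingUploads.get(id))
    .filter((m): m is Media => m !== undefined);
}

function prepareChatContext(
  history: string[],
  persona: string,
  mediaIds: string[],
): ChatContext {
  // Both /chat and /chat/stream call this instead of duplicating the logic.
  return { history, persona, media: resolveMedia(mediaIds) };
}
```

With both handlers reduced to one call each, the previously duplicated ~80 lines live in exactly one place.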
## Platform Risk
One finding from the research deserves its own section: Google explicitly does not want Gemini used for emotional companion AI. Their team lead said it publicly — Gemini is positioned as "a super tool, not an emotional companion."
Three concrete risks:
- An external safety filter operates independently and can delete responses mid-sentence, even with `BLOCK_NONE`
- Google has banned developers using the API for roleplay-adjacent use cases
- Future model versions may further restrict emotional/creative output
This doesn't change the v0.0.6 decision — Flash is still the best option today. But it does change the architecture requirement: Mio needs to be model-agnostic enough to swap providers without rewriting the application. The model routing layer we just built makes that easier, not harder.
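"Model-agnostic enough to swap providers" usually means call sites depend on an interface, not an SDK. This is a minimal sketch of that idea; the interface, stub providers, and selection helper are all illustrative, not Mio's actual abstraction:

```typescript
// Call sites depend only on this interface, never on a vendor SDK directly.
interface CompletionProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Stubs standing in for real SDK adapters.
const geminiProvider: CompletionProvider = {
  name: "gemini",
  complete: async (prompt) => `gemini: ${prompt}`,
};

const fallbackProvider: CompletionProvider = {
  name: "fallback",
  complete: async (prompt) => `fallback: ${prompt}`,
};

// A config value picks the provider; swapping vendors means writing one new
// adapter and changing one string, not rewriting the application.
function pickProvider(
  preferred: string,
  providers: CompletionProvider[],
): CompletionProvider {
  return providers.find((p) => p.name === preferred) ?? providers[0];
}
```

The task-routing layer from earlier in this release slots naturally behind this interface: the router picks a model name, the provider adapter decides how to call it.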
## The Numbers
v0.0.6 was six commits over a few hours:
| Change | Impact |
|---|---|
| Chat model: 3 Pro → 3 Flash (minimal thinking) | 4x cost reduction, 1-2s TTFT |
| Premium tasks: 3 Pro → 3.1 Pro (low thinking) | Better emotional nuance where it counts |
| Vision prompts + thinking level | Specific identification instead of generic descriptions |
| Chinese transcription | Accurate Mandarin voice recognition |
| DRY refactor | -80 lines duplicated code |
| Deployment docs | Correct Artifact Registry commands |
The most impactful release yet, measured by user experience per line of code changed.
## Data-Driven, Not Vibes-Driven
Part 0 of this series was a forensic analysis of why the previous companion framework had an unsustainable early burn rate. The lesson was: understand your costs before they understand you.
v0.0.6 applies the same principle to model selection. Not "Pro sounds better than Flash" — but actual latency benchmarks, cost calculations, quality comparisons, and community experience. The research said Flash with minimal thinking was the right call. The deployment confirmed it.
Mio now responds in 1-2 seconds instead of 8-10. That single change makes the conversation feel natural instead of stilted.