
v0.0.6: 4x Cheaper by Knowing Which Model to Use

The Latency Problem

After v0.0.5 shipped media support, Mio could see images, hear voice messages, and display emoji. Feature-complete for a companion chat. But there was a problem I'd been ignoring: response latency.

Gemini 3 Pro Preview — the model powering all of Mio's chat — took 8-10 seconds for a first token when thinking was enabled. In a companion conversation, that's an eternity. You send "hey, how was your day?" and then you sit there watching a typing indicator. It kills the conversational illusion.

I knew the latency was bad. I'd been living with it during development because I was focused on features. But now that the features were done, the latency was the most obvious remaining problem.

Research First, Then Code

Instead of guessing, I did a proper comparison: Gemini 3.1 Pro vs Gemini 3 Flash, specifically for emotional companion AI.

The numbers told a clear story:

| Metric | 3.1 Pro | 3 Flash |
|---|---|---|
| Input cost | ~4x more expensive | baseline |
| Output cost | ~4x more expensive | baseline |
| Output speed | 90.8 t/s | 214 t/s |
| Time to first token (thinking) | 8-10s | 1-2s |
| Time to first token (minimal thinking) | N/A | 1-2s |

The cost difference was 4x. The latency difference was the one that mattered: 1-2 seconds vs 8-10 seconds.

But what about quality? For companion AI, emotional nuance is the product. If Flash couldn't hold a character, the cost savings would be meaningless.

The community consensus was surprisingly clear: Flash was a "side-grade to 2.5 Pro, not a downgrade." The quality gap was about 1-2% on extreme emotional nuance — expert writers could spot the difference, most users couldn't. Flash was actually better at narrative initiative and character commitment.

One critical finding: thinking mode actively degrades creative writing quality. Multiple developers reported that Gemini's reasoning mode makes emotional and creative output worse, not better. The recommendation was thinkingLevel: minimal for Flash — which, conveniently, also gave the 1-2 second TTFT.

The Routing Architecture

Not all tasks are equal. Daily chat needs speed. Personality extraction needs depth. The solution was obvious: route by task type.

Chat (90% of API calls): Gemini 3 Flash with thinkingLevel: minimal

  • 1-2s TTFT
  • 214 tokens/second throughput
  • ~4x cheaper than Pro per million tokens
  • Good enough emotional nuance for conversational flow

Premium tasks (10% of API calls): Gemini 3.1 Pro with thinkingLevel: low

  • Personality extraction from onboarding conversations
  • Memory summarization and emotional pattern analysis
  • Tasks where you pay once and cache the result
  • Deeper emotional nuance justified by low frequency

This split meant ~4x cost reduction on 90% of traffic, while actually improving quality on the 10% that matters most — because we upgraded premium tasks from 3 Pro to 3.1 Pro.
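The routing rule above is simple enough to sketch in a few lines. This is an illustrative shape, not the actual Mio code — the model IDs, the `TaskType` names, and the `RouteConfig` interface are assumptions:

```typescript
// Task-based model routing: fast/cheap for chat, deep/expensive for
// low-frequency premium tasks. Names and model IDs are illustrative.
type TaskType = "chat" | "personality_extraction" | "memory_summary";

interface RouteConfig {
  model: string;
  thinkingLevel: "minimal" | "low";
}

function pickModel(task: TaskType): RouteConfig {
  if (task === "chat") {
    // ~90% of calls: optimize for TTFT and cost.
    return { model: "gemini-3-flash", thinkingLevel: "minimal" };
  }
  // Premium tasks: pay once, cache the result, keep the nuance.
  return { model: "gemini-3.1-pro", thinkingLevel: "low" };
}
```

As a rough blended-cost check (assuming 3.1 Pro is priced similarly to 3 Pro): 0.9 × 0.25 + 0.1 × 1 ≈ 0.33 of the all-Pro baseline, i.e. about a 3x overall reduction even before caching the premium results.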

Vision That Actually Sees

v0.0.5 added image uploads. The AI could "see" images, but the descriptions were generic. Send a screenshot of a game and you'd get "I see a colorful game screen." Not useful for a companion who's supposed to care about what you're doing.

v0.0.6 rewrote the vision prompts to be specific: identify games by name, recognize brands and logos, describe locations with context. The thinking level for vision went from MINIMAL to LOW — a small latency cost for significantly better understanding.
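The shape of that change can be sketched as a version-keyed config. The prompt wording and the config structure here are hypothetical stand-ins for the real prompts:

```typescript
// Sketch of the v0.0.6 vision change: a specific-identification prompt
// plus a bumped thinking level. Prompt text is an assumption.
interface VisionConfig {
  prompt: string;
  thinkingLevel: "minimal" | "low";
}

function visionConfig(version: "v0.0.5" | "v0.0.6"): VisionConfig {
  if (version === "v0.0.5") {
    // Generic description, minimal thinking: fast but vague.
    return { prompt: "Describe what you see in this image.", thinkingLevel: "minimal" };
  }
  return {
    // Ask for concrete identification, not generic description.
    prompt:
      "Identify specifics: game titles, brands and logos, locations with context. " +
      "Name what you recognize instead of describing it generically.",
    thinkingLevel: "low", // small latency cost for better understanding
  };
}
```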

The difference: "I see a game" became "Looks like you're playing Genshin Impact — how's the new region?" That's the kind of response that makes a companion feel present.

Chinese Transcription Fix

Voice messages were coming through with English transcription artifacts. A user sends a voice clip in Mandarin, and the transcription would hallucinate English words or miss colloquial expressions.

Two fixes:

  1. OpenAI transcriber: Added explicit language: 'zh' parameter. Without it, the model was auto-detecting language per-segment and occasionally getting confused by code-switching or background noise.
  2. Gemini fallback prompt: Enhanced for colloquial Chinese expressions and proper nouns. When OpenAI's transcription wasn't confident enough, the Gemini fallback now handles slang, internet-speak, and Chinese-specific proper nouns correctly.

Small change, big impact. Voice is the most intimate input modality — getting the transcription wrong breaks immersion harder than anything else.
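The two fixes can be sketched as a pinned-language request plus a confidence gate for the fallback. The `-0.6` threshold and the fallback wiring are hypothetical; the `language` parameter on OpenAI's transcription endpoint is real:

```typescript
// Build transcription params with the language pinned to Mandarin,
// instead of per-segment auto-detection (which gets confused by
// code-switching and background noise).
function buildTranscriptionParams(file: Blob) {
  return {
    file,
    model: "whisper-1",
    language: "zh",
  };
}

// Route to the Gemini fallback (with its colloquial-Chinese prompt) when
// the primary transcription isn't confident enough. The threshold on the
// segment's average log-probability is an illustrative value.
function shouldFallback(avgLogprob: number, threshold = -0.6): boolean {
  return avgLogprob < threshold;
}
```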

DRY Refactor

A code quality issue had been nagging since v0.0.5: the POST /chat and POST /chat/stream handlers shared nearly identical logic for media resolution and context preparation. About 80 lines of duplicated code.

Extracted two shared helpers:

  • resolveMedia() — fetches pending uploads by mediaId, validates, returns processed media
  • prepareChatContext() — builds the conversation context with history, persona, and media

Both handlers now call the same functions. ~80 lines eliminated. The audit from v0.0.5 had flagged this as a MEDIUM issue — it was the right time to fix it.
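A minimal sketch of the shared-helper shape, with illustrative types — the real `resolveMedia()` and `prepareChatContext()` do more (upload fetching, validation, persona loading):

```typescript
// Shared helpers extracted from the duplicated POST /chat and
// POST /chat/stream logic. Types are simplified for illustration.
interface Media { mediaId: string; mimeType: string }
interface ChatContext { history: string[]; persona: string; media: Media[] }

// Fetch pending uploads by id; ids with no matching upload are skipped.
function resolveMedia(mediaIds: string[], pending: Map<string, Media>): Media[] {
  return mediaIds
    .map((id) => pending.get(id))
    .filter((m): m is Media => m !== undefined);
}

// Build the conversation context from history, persona, and resolved media.
function prepareChatContext(history: string[], persona: string, media: Media[]): ChatContext {
  return { history, persona, media };
}

// Both handlers now reduce to the same two calls:
//   const media = resolveMedia(req.mediaIds, pendingUploads);
//   const ctx = prepareChatContext(history, persona, media);
```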

Platform Risk

One finding from the research deserves its own section: Google explicitly does not want Gemini used for emotional companion AI. Their team lead said it publicly — Gemini is positioned as "a super tool, not an emotional companion."

Three concrete risks:

  1. An external safety filter operates independently and can delete responses mid-sentence, even with BLOCK_NONE
  2. Google has banned developers using the API for roleplay-adjacent use cases
  3. Future model versions may further restrict emotional/creative output

This doesn't change the v0.0.6 decision — Flash is still the best option today. But it does change the architecture requirement: Mio needs to be model-agnostic enough to swap providers without rewriting the application. The model routing layer we just built makes that easier, not harder.
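One way to keep that routing layer provider-agnostic is a registry behind a common interface, so swapping vendors means changing a map entry, not the application. The interface and names here are hypothetical, not Mio's actual abstraction:

```typescript
// Minimal provider registry: the app talks to ChatProvider, never to a
// vendor SDK directly. Swapping Gemini for another vendor means
// registering a different implementation under the same interface.
interface ChatProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

const providers = new Map<string, ChatProvider>();

function registerProvider(p: ChatProvider): void {
  providers.set(p.name, p);
}

function getProvider(name: string): ChatProvider {
  const p = providers.get(name);
  if (!p) throw new Error(`unknown provider: ${name}`);
  return p;
}
```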

The Numbers

v0.0.6 was six commits over a few hours:

| Change | Impact |
|---|---|
| Chat model: 3 Pro → 3 Flash (minimal thinking) | 4x cost reduction, 1-2 second TTFT |
| Premium tasks: 3 Pro → 3.1 Pro (low thinking) | Better emotional nuance where it counts |
| Vision prompts + thinking level | Specific identification instead of generic descriptions |
| Chinese transcription | Accurate Mandarin voice recognition |
| DRY refactor | −80 lines of duplicated code |
| Deployment docs | Correct Artifact Registry commands |

The most impactful release yet, measured by user experience per line of code changed.

Data-Driven, Not Vibes-Driven

Part 0 of this series was a forensic analysis of why the previous companion framework had an unsustainable early burn rate. The lesson was: understand your costs before they understand you.

v0.0.6 applies the same principle to model selection. Not "Pro sounds better than Flash" — but actual latency benchmarks, cost calculations, quality comparisons, and community experience. The research said Flash with minimal thinking was the right call. The deployment confirmed it.

Mio now responds in 1-2 seconds instead of 8-10. That single change makes the conversation feel natural instead of stilted.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0