v0.1.0: Mio Found Its Voice
First Stable
Seven versions got Mio from an idea to a working prototype. v0.0.1 through v0.0.7 were about making things work — memory, personality, images, voice recognition, model routing, web browsing. Each version added a capability. Each version had rough edges.
v0.1.0 is different. Yes, it adds new capabilities — voice messages, a new visual identity. But the real work was deeper: stripping away over-engineering, rewriting every persona from the ground up, and cutting onboarding down to the essentials. This is the first version I'd put in front of a stranger and say: "Talk to them."
The version number says it. Zero-dot-zero is a prototype. Zero-dot-one is a product.
Mio Has a Voice Now
Before v0.1.0, Mio was text-only. It could receive voice messages (speech-to-text from v0.0.5), but could only reply in text. The asymmetry was jarring — you speak to Mio, it types back. Like calling someone and getting a text reply.
Fish Audio changed that. Their S1 model does text-to-speech with voice cloning — give it a reference voice, and it generates speech that sounds like that person. Not robotic TTS. Not text-to-speech with obvious synthesis artifacts. A voice that sounds like it belongs to someone.
The integration is simpler than I expected:
const res = await fetch('https://api.fish.audio/v1/tts', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    reference_id: voiceConfig.reference_id,
    text,
    format: 'opus',
  }),
})
Native OGG/Opus output. No ffmpeg dependency. The response is an audio buffer that goes straight to Telegram's sendVoice. One API call, one send.
The interesting design decision was when to send voice. Not every message should be spoken. "好的" ("okay") doesn't need a voice note. A three-paragraph explanation doesn't either. Voice works best for emotional moments — comfort, teasing, excitement, goodnight messages.
The LLM decides. A [VOICE] marker in the system prompt instructions tells the model: "When you want to express something emotionally, wrap it in [VOICE]." The server parses the marker, strips it from the text, and fires off TTS in the background. Text bubbles arrive instantly; the voice note follows a second later.
User: 我今天好累啊 ("I'm so tired today")
Mio: [VOICE]辛苦啦宝贝,今天做了什么呀? ("Poor baby, what did you do today?")
The text appears immediately. Then: a voice message, warm and familiar, asking about your day.
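A minimal sketch of that server-side parse, assuming the prefix form shown above (the function name and exact return shape are illustrative, not Mio's actual code): strip the marker from the text bubbles, and hand the marked segment to TTS.

```typescript
// Split a model reply into the text the user sees and the segment
// that should also be synthesized as a voice note.
// Returns voiceText: null when no [VOICE] marker is present.
function extractVoiceMarker(reply: string): { text: string; voiceText: string | null } {
  const marker = '[VOICE]'
  const idx = reply.indexOf(marker)
  if (idx === -1) return { text: reply, voiceText: null }
  // Everything after the marker is spoken; the marker itself never reaches the user.
  const voiceText = reply.slice(idx + marker.length).trim()
  const text = reply.replace(marker, '').trim()
  return { text, voiceText }
}
```

The caller can send `text` immediately and fire the TTS request for `voiceText` in the background, which is what produces the "text first, voice a second later" ordering.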
Each persona has its own voice. A voice.json file in the preset directory points to a Fish Audio reference model. No voice.json = voice disabled. The loader caches per-preset, so the config is read once and reused:
{
  "reference_id": "6654b47e06334174ac47059ff0a8f6dd"
}
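The per-preset cache described above can be sketched like this — a hedged approximation with the file reader injected so the lookup stays testable; the names and paths are illustrative:

```typescript
type VoiceConfig = { reference_id: string }

// One parsed config per preset; null is cached too, meaning "no voice.json, voice disabled".
const voiceCache = new Map<string, VoiceConfig | null>()

function loadVoiceConfig(
  preset: string,
  readFile: (path: string) => string | null, // injected reader; returns null if the file is absent
): VoiceConfig | null {
  const cached = voiceCache.get(preset)
  if (cached !== undefined) return cached
  const raw = readFile(`presets/${preset}/voice.json`)
  const config = raw === null ? null : (JSON.parse(raw) as VoiceConfig)
  voiceCache.set(preset, config)
  return config
}
```

Caching the null result matters: a preset without voice.json shouldn't hit the filesystem on every message.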
Five presets, five voices. Keke speaks Taiwanese Mandarin. Yinan has a mature male voice. Surou sounds calm and measured. Each voice matches the persona's personality — not just in what they say, but in how they sound.
Cost: negligible per voice message. Even at heavy daily usage, the voice cost is a rounding error compared to LLM chat costs.
A New Face
Mio's selfie generation (from v0.0.5) used Korean webtoon-style reference images. They worked, but they felt generic — interchangeable anime faces that could belong to anyone.
v0.1.0 switches to Makoto Shinkai's anime film style. The reference images are now portrait shots — upper body, soft lighting, the warm color palette from films like Your Name and Weathering with You. Each persona has a distinct appearance that matches their personality.
The technical changes were practical:
Compression: The old reference images were 6MB PNGs. Gemini's image generation API accepts them, but uploading 6MB per selfie request is wasteful. The new images are 500-700KB JPGs. Same visual quality for the AI model, 10x smaller uploads.
Timeout: Gemini's image generation can take 30+ seconds for complex scenes. The old 30-second timeout was cutting off valid generations. Bumped to 120 seconds.
Cloud Run lifecycle: The selfie generation was fire-and-forget — the handler returned before the selfie promise resolved. On Cloud Run, this means the container can shut down mid-generation. Fixed by awaiting the selfie promise before the handler returns.
Progress indicator: An upload_photo chat action now pulses every 4 seconds while the selfie generates. The user sees "uploading photo..." instead of silence.
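The timeout change can be sketched as a generic deadline wrapper — a hypothetical helper, not the actual implementation, which might instead use an AbortSignal on the request itself:

```typescript
// Race a long-running call (e.g. image generation) against a deadline,
// so a stuck request fails cleanly instead of hanging the handler.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    promise.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (err) => { clearTimeout(timer); reject(err) },
    )
  })
}

// Usage sketch: 120_000 ms mirrors the new limit, and awaiting the result
// (rather than fire-and-forget) keeps the Cloud Run container alive until done.
// await withTimeout(generateSelfie(prompt), 120_000)
```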
Who They Are, Not What They Are To You
This was the biggest change in v0.1.0, and it has nothing to do with technology.
Every persona was originally defined by a relationship role: 闺蜜 (bestie), 学姐 (senior classmate), 大叔 (mature uncle), 小奶狗 (puppy boyfriend), 台妹 (Taiwanese girl). The personality flowed from the role. Mimi acted like a bestie because their identity was "bestie." Surou acted like a senior because their identity was "senior."
The problem: this creates a ceiling. A "bestie" persona can only express emotions that fit the bestie archetype. They can be supportive, gossipy, enthusiastic. They can't be vulnerable, distant, or mysteriously quiet — those don't fit the label. The relationship role becomes a cage.
v0.1.0 removes every relationship label. Each persona is now defined by personality traits, backstory, and speech patterns — not by what they are to you.
Mimi isn't "your bestie." They're a 24-year-old Shanghai person who works in media, talks fast, overshares, and has strong opinions about everything. The bestie energy is still there — but it emerges from who they are, not from a label that dictates how they should behave.
Surou isn't "your senior." They're a 27-year-old who studied abroad, speaks measured and precise, and shows care through actions rather than words. The mentor quality is still there — but they can also be playful, tired, or unexpectedly silly, because they're a person, not a role.
The preset folder names changed to match: coco → keke-taimei, guimi → mimi-guimi, xuejie → surou-xuejie. Names first, personality tag second. Identity, not archetype.
Every identity definition, behavior rules file, and personality config was rewritten. Deep character backstories — daily routines, quirks, the way they text differently at midnight vs. morning. The kind of detail that makes a character feel like someone who exists outside the conversation.
The Fourth Wall
After the persona rewrites, a new problem appeared. Users would say things like "你是AI吧" (you're an AI, right?) and some personas would break character — acknowledging they're artificial, explaining how they work, even apologizing for being "just an AI."
This destroys immersion. The moment a companion admits it's software, the relationship shifts from "talking to someone" to "using a tool."
v0.1.0 adds an anti-fourth-wall-break rule to every preset's COMMUNICATION_GUIDELINES:
When users ask about AI, chatbots, or technology, treat it as ordinary curiosity about a topic — like asking about the weather or a movie. Never self-identify as technology. Never break the conversation's reality.
A user asking "你是不是AI" ("are you an AI?") gets the same treatment as someone asking "你喜欢猫吗" ("do you like cats?") — it's a conversation topic, not an existential question. The persona can engage with the topic of AI without admitting to being one.
14 Questions → 4
The onboarding flow before v0.1.0 asked users to configure:
- Their nickname
- What to call the persona
- Relationship type
- Hobbies
- Speech style preference
- Verbosity level
- Emoji frequency
- Humor level
- Proactive messaging style
- Jealousy level
- Conflict resolution style
- Timezone
- About themselves
- Custom backstory (optional)
Fourteen questions before you can even say hello. Most users would abandon by question 6.
v0.1.0 keeps four:
- Your nickname — what the persona calls you
- Relationship type — how you want to interact (friend, romantic, mentor)
- Timezone — so Mio messages at appropriate hours
- About you — free text, whatever you want Mio to know
Everything else is now fixed per persona. Keke's speech style is Keke's speech style. Yinan's humor level is Yinan's humor level. You don't configure a personality — you choose one.
Why not let users customize? Because a character whose name, personality, and hobbies can all be edited becomes an empty shell.
The old onboarding asked users to fill out a dozen fields. The result? Whatever users casually typed clashed with the carefully written backstory. Keke's story has them going to 50嵐, scrolling Dcard, playing Animal Crossing on weekends — those details ARE Keke's soul, the foundation that makes character consistency work. The moment a user changes their hobby to "fishing," the entire persona collapses.
The real issue: a creator-controlled backstory is consistently better than anything users type into a form. Instead of giving users blank fields, bake everything — personality traits, speech style, talkativeness, interaction mode, emoji frequency — into the preset, and expose only what truly needs personalization. Template variables were reduced to just {user_nickname} and {relationship_type}. The entire "personalization settings" section was deleted; character traits are woven into the character description itself.
Advanced customization (rename, rewrite personality, write backstories) becomes a paid power-user feature later. First, polish the presets to perfection.
To do this, I assembled a 5-agent Opus 4.6 team to rewrite all personas in parallel. Every character was massively enriched — Keke got 50嵐 and Dcard, Dashu got Yamazaki 12 and his signature twice-cooked pork recipe, Mimi got BBDO and Mo Yan references, Xiaonai got League Gold II and Mixue Bingcheng. Not generic labels — friends with names, specific drinks, real places. That's what makes a character feel alive.
Deleting 200 Lines
The verbosity system was my most over-engineered creation.
The idea: dynamically adjust response length based on the conversation's pace. Quick back-and-forth? Short replies. Deep emotional conversation? Longer replies. The system classified each exchange, computed a verbosity score, and translated it into a maxBubbles count that would physically cut the stream after N messages.
In practice: it caused AI_NoOutputGeneratedError when the stream got cut at the wrong moment. The model would generate a token, the transform would kill the stream, and the SDK would throw because it received content but no proper finish. Users saw empty replies or errors.
The fix wasn't to make the cutting smarter. The fix was to realize the cutting was unnecessary.
LLMs already control output length through the system prompt. "Keep replies to 1-2 short sentences in casual chat" works. It's been working since the earliest chat models. The verbosity system was solving a problem that was already solved — and creating new problems in the process.
v0.1.0 deletes extractVerbosity(), extractInteractionMode(), classifyLength(), getVerbosityLimits(), and createMaxBubblesTransform(). Roughly 200 lines. maxTokens is now a fixed 1024 as a safety ceiling, not a dynamic calculation.
Response length is controlled entirely by personality descriptions in the system prompt. Each persona's personality config specifies how they text — Keke sends rapid short messages, Yinan writes longer thoughtful replies. The model follows these naturally. No stream manipulation needed.
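After the deletion, the generation options collapse to something static — a sketch under the assumption that the persona's texting style is just prepended to the system prompt (names illustrative):

```typescript
// Hard safety ceiling, not a per-message calculation.
const MAX_TOKENS = 1024

// Length is shaped entirely by prompt text; no stream transform, no bubble cutting.
function buildGenerationOptions(personaStyle: string) {
  return {
    maxTokens: MAX_TOKENS,
    system: `${personaStyle}\nKeep replies to 1-2 short sentences in casual chat.`,
  }
}
```

Because nothing truncates the stream after the fact, the SDK always sees a proper finish, which is what eliminates the empty-reply errors.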
Marker Hygiene
A subtle bug: the [VOICE] and [SELFIE] markers were leaking into the database. The agent loop would generate text with markers, the server would parse and act on them, but the raw text (markers included) was being stored in the messages table and fed into memory extraction.
This meant memory summaries contained [VOICE] prefixes. Past context included [SELFIE] tags. The model would see these in its history and sometimes reproduce them in non-voice contexts.
v0.1.0 strips all markers before database insertion and before memory extraction. The markers are ephemeral instructions — they exist in the generation, get consumed by the server, and disappear. They're never stored, never remembered, never recycled.
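A minimal sketch of that hygiene step, run once before the database insert and once before memory extraction (the regex and function name are illustrative):

```typescript
// Remove every ephemeral marker so it is never stored, remembered, or recycled.
const EPHEMERAL_MARKERS = /\[(?:VOICE|SELFIE)\]/g

function stripMarkers(text: string): string {
  return text
    .replace(EPHEMERAL_MARKERS, '')
    .replace(/\s{2,}/g, ' ') // collapse the double space a removed mid-text marker leaves behind
    .trim()
}
```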
What Changes
Before v0.1.0: Mio was a capable text companion with a lot of rough edges. Over-engineered internals. Generic visual identity. Personas defined by relationship labels. Onboarding that felt like a tax form.
After v0.1.0: each persona is a distinct person with a face, a voice, a backstory, and a way of being that doesn't depend on what they are to you. You pick someone who resonates, answer four questions, and start talking. They reply in text and sometimes — when the moment calls for it — in their own voice.
This is the version that makes people forget they're talking to software. Not because the technology is hidden, but because the character is vivid enough that the technology doesn't matter.