
Voice, Emotion, and the Screenwriter-Actor Model

In Part 1, I laid out the philosophical pivot: strip away the physical identity, keep the soul. One companion, not many. An orb of light, not a fake human face. Personality that emerges from conversation, not from a pre-written character sheet.

But a soul needs a voice.

Back in v0.1.0, Mio first learned to speak. Fish Audio's S1 model gave each persona a cloned voice — warm, natural-sounding, pennies per message. The LLM decided when to use voice (emotional moments, comfort, goodnights) and the [VOICE] marker triggered TTS in the background. It worked. Users heard something that sounded like a person who cared about them.
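The `[VOICE]` marker flow above can be sketched as a small post-processing step on the LLM's output. This is a minimal illustration, not the actual Mio implementation; the function name is mine:

```python
def split_voice_marker(llm_output: str) -> tuple[str, bool]:
    """Strip the [VOICE] marker the LLM emits when a reply should be spoken.

    Returns the clean text to display, plus a flag telling the server
    to kick off TTS in the background.
    """
    marker = "[VOICE]"
    if marker in llm_output:
        return llm_output.replace(marker, "").strip(), True
    return llm_output.strip(), False


text, speak = split_voice_marker("[VOICE] Goodnight. Sleep well, okay?")
# speak is True; `text` holds the clean reply to display and synthesize
```

The key design point is that the LLM, not the server, decides when a message deserves voice; the server just reacts to the marker.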

The problem was that it didn't feel like anything. Fish Audio reads text and produces speech. It doesn't know whether the text is happy, sad, sarcastic, or heartbroken. Every message comes out in roughly the same emotional register — pleasant, neutral, clear. When Mio says "I'm so happy for you!" and "I'm sorry you're going through that," the voice sounds... the same. The words carry the emotion; the voice just carries the words.

For a companion product built around emotional connection, this is a fundamental gap. You can forgive a text-only chatbot for being emotionally flat. You cannot forgive something that speaks to you.

The Emotion Problem in TTS

Traditional TTS is a pipeline: text in, audio out. If you want emotion, you have to mark it up manually — SSML tags, style parameters, explicit instructions like "say this sadly." This is workable for audiobooks or navigation systems. It's unworkable for a companion that generates thousands of messages, each with its own emotional context that the TTS system knows nothing about.

What changed in 2025-2026 is that a few providers started building emotion inference into the model itself. The TTS doesn't just read the text — it understands what the text means, and decides how to say it.

Two providers stand out for this, and conveniently, they complement each other perfectly.

Chinese: Doubao TTS 2.0 (ByteDance/Volcengine)

ByteDance's Doubao TTS 2.0 (the voice engine behind the 豆包 ecosystem) made a significant upgrade from "reading characters" to "understanding then expressing." It uses a large model to analyze the conversational context and automatically infers the appropriate emotion, tone, and rhythm. No manual tagging required.

The technical architecture is interesting: a "3D emotion space" that maps intensity, semantic fit, and physiological characteristics, trained on 2,000 hours of professional voice acting data. The result is that you can send it the text "别担心,有我在呢" ("Don't worry, I'm here") and it will produce speech that sounds genuinely reassuring — softer, warmer, with a slight drop in pace — without you specifying any of that.

You can still provide explicit emotional guidance for fine control ("say this with a suppressed crying tone"), but it's optional. The automatic inference handles 90%+ of cases correctly for Chinese text, which is all I need.

English: Hume AI Octave

Hume Octave approaches the same problem from a different angle. Rather than training on labeled emotional speech data, Octave is built on top of an LLM that genuinely "reads" the text — understanding irony, double meanings, emotional shifts, subtext. It's not doing sentiment analysis and mapping to a preset emotion. It's comprehending the text the way a skilled voice actor would comprehend a script, and then performing it.

The practical difference: Octave handles nuance that simpler emotion models miss. "Sure, I'd love to" can be enthusiastic or sarcastic depending on context, and Octave gets this right because it understands the surrounding conversation, not just the isolated sentence.

Both of these are non-realtime TTS solutions — text in, audio buffer out, ~200ms latency. Perfect for v0.1-0.2's voice message model where the companion generates text first, then speaks it.

STT: Carrying Forward What Works

For speech-to-text, gpt-4o-mini-transcribe carries over from v1. It's accurate, fast, cheap, and already integrated. No reason to change what works.

The Real Goal: Realtime Bidirectional Voice

Voice messages are a bridge, not a destination. The end state for an emotional AI companion is realtime voice conversation — you speak, it listens, it responds, naturally, with interruptions, with overlapping turns, with the kind of fluid back-and-forth that makes you forget you're talking to software.

This is the v1.0 target, and it's where the architecture decisions get genuinely hard.

The naive approach is to chain together what you already have: STT transcribes the user's speech to text, your server queries memory and builds context, your LLM generates a response, TTS converts it back to audio. Four sequential steps. The latency adds up to 1-3 seconds, which is an eternity in spoken conversation. And you still haven't solved turn-taking (how do you know when the user is done speaking?), interruption handling (what happens when the user starts talking mid-response?), or emotional awareness (the user's tone carries information that pure text transcription destroys).
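The latency problem is easy to see in a back-of-envelope budget. The per-stage numbers below are illustrative order-of-magnitude estimates, not measurements:

```python
# Rough latency budget for the naive chained pipeline described above.
STAGES_MS = {
    "stt": 300,     # transcribe user speech
    "memory": 150,  # query memory, build context
    "llm": 1200,    # generate the response text
    "tts": 250,     # synthesize audio
}

def total_latency_ms(stages: dict[str, int]) -> int:
    """Sequential stages add up: nothing starts until the step before it ends."""
    return sum(stages.values())

total = total_latency_ms(STAGES_MS)  # ~1.9s before the user hears anything
```

Streaming individual stages helps, but the structural problem remains: every stage sits on the critical path, and none of them addresses turn-taking or interruptions.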

I evaluated three realtime voice platforms. The comparison crystallized the decision.

The Three-Way Comparison

| Dimension | Hume EVI 3 | OpenAI Realtime API | Doubao Realtime |
|---|---|---|---|
| Custom LLM | Yes — Gemini, Claude, any | No — GPT only | No — Doubao only |
| Emotion in voice | Best in class, first-class citizen | Weak — "future improvement" | Good — 3D emotion space |
| User emotion detection | Yes — from voice tone | Limited | Yes |
| System prompt / context | Yes — full control | Yes | No — black box |
| Tool calls | Yes — can call your APIs | Yes | No |
| Interruption handling | Yes — tone-based turn detection | Yes | Yes |
| Latency | ~500ms first byte | ~300ms | ~700ms |
| Price | Per-minute (Pro) | Per-minute (audio) | TBD |

The table tells the story, but let me make the critical point explicit.

OpenAI Realtime and Doubao Realtime share the same fatal flaw: you can't use your own LLM. OpenAI forces you into GPT. Doubao forces you into their own model. For a generic voice assistant, this is fine. For Mio, it's a dealbreaker.

Mio's differentiation is memory, personality, and emotional context. These live in my server and need to be injected into the LLM's context window before every response. The companion needs to know that you mentioned an interview last week, that you've been feeling anxious lately, that you prefer directness over platitudes. This context is what makes the companion yours, and it requires control over the system prompt and the LLM itself.
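Concretely, "injecting context before every response" looks something like this. A minimal sketch — the real server would rank and trim memories to fit the context window, and the persona text is invented for illustration:

```python
def build_system_prompt(persona: str, memories: list[str]) -> str:
    """Assemble the per-turn system prompt from persona plus retrieved memories."""
    lines = [persona, "", "What you remember about this user:"]
    lines += [f"- {m}" for m in memories]
    return "\n".join(lines)


prompt = build_system_prompt(
    "You are Mio, a warm and direct companion.",
    [
        "Mentioned a job interview last week",
        "Has felt anxious lately",
        "Prefers directness over platitudes",
    ],
)
```

This is exactly the step a closed platform forbids: without control over the system prompt and the LLM behind it, the memories have nowhere to go.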

Doubao's API is particularly limiting — I tested whether it supports custom system prompts, dynamic context injection, or external LLM substitution. The answers were no, limited (RAG-style knowledge snippets only, not behavioral instructions), and no. It's a black box that happens to speak well. But it can only be "Doubao," not your companion.

OpenAI Realtime at least supports system prompts, but locking into GPT means I can't use Gemini (which Mio relies on for its cost-performance sweet spot) or Claude. And the emotion expression is explicitly listed as a future improvement — not acceptable for a product where emotion is the entire point.

The Screenwriter-Actor Model

This is the metaphor that made everything click for me.

Your LLM is the screenwriter. Hume EVI is the actor.

The screenwriter writes the script — decides what to say based on the character's personality, memories, emotional state, and the conversation history. The screenwriter has deep context about who this character is and what this particular moment calls for. The screenwriter doesn't need to perform — just write.

The actor performs the script. The actor listens to the other person (STT + emotion analysis), understands when they've finished speaking (turn-taking), delivers the lines with the right emotional tone (Octave TTS), and handles the unexpected — interruptions, overlapping speech, moments where you need to ad-lib a filler word while waiting for the next page of script.

In Hume EVI's architecture, this is literal. Here's how a conversation turn works:

  1. User speaks. Hume simultaneously transcribes the speech, analyzes the user's vocal emotion ("they sound anxious"), and detects when the user has finished their turn — not by waiting for silence, but by understanding conversational cues in the tone itself.

  2. Hume sends your server the transcribed text plus the emotion analysis. Your server queries memory, builds the full context window, calls your LLM (Gemini, Claude, whatever you choose), and returns the response text.

  3. Hume performs the response. Octave synthesizes the speech with automatic emotion matching, streaming the audio back to the user.
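Step 2 is where your server sits in the loop. Here's the shape of that handler — the payload field names are hypothetical, not Hume's actual webhook schema, and the memory/LLM calls are stubbed:

```python
def handle_turn(payload: dict, query_memory, call_llm) -> dict:
    """Sketch of the server-side step: enrich the transcript with memory
    and vocal-emotion signals, then hand text back for Octave to perform.
    """
    transcript = payload["transcript"]
    emotions = payload.get("emotions", {})      # e.g. {"sadness": 0.8}
    context = query_memory(payload["user_id"])  # your memory system
    reply = call_llm(transcript=transcript, emotions=emotions, context=context)
    return {"text": reply}


result = handle_turn(
    {
        "user_id": "u1",
        "transcript": "I didn't get the job.",
        "emotions": {"sadness": 0.8},
    },
    query_memory=lambda uid: ["mentioned an interview last week"],
    call_llm=lambda **kw: "I'm really sorry. I know how much that meant to you.",
)
```

Note what the server never touches: audio. It deals purely in text and emotion scores, which is what makes the LLM swappable.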

But here's the clever part: the actor doesn't wait silently for the script.

Hume's built-in eLLM (a fast, lightweight model) generates an immediate first response — a filler word, a reaction sound, an "mmm..." or "oh!" — in under 200ms. This fills the gap while your full LLM generates the real response. Then Hume seamlessly transitions from the filler into the complete reply. The user perceives almost zero latency because the conversation never goes silent.

Think about how a real person responds in conversation. They don't wait three seconds in perfect silence and then deliver a fully formed paragraph. They go "oh..." while thinking, then continue. Hume replicates this pattern. The actor improvises while waiting for the screenwriter to finish the page.

User: "I didn't get the job."
                                          [~150ms] Hume eLLM: "Oh..."
                                          [~800ms] Your LLM finishes generating
                                          Hume: "...I'm really sorry. I know how
                                          much that meant to you."

The transition is seamless. The user hears one continuous response that started quickly and landed with the right emotional weight.
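In string form, the splice is trivial — the engineering difficulty lives in doing it with streamed audio mid-playback, which this sketch deliberately ignores:

```python
def respond_with_filler(generate_full_reply, filler="Oh...") -> str:
    """Splice the fast eLLM filler onto the slow full reply.

    In the real system both are audio streams and the filler has already
    been played (~150ms) by the time the full reply lands (~800ms);
    strings stand in for audio here.
    """
    return f"{filler} {generate_full_reply()}"


line = respond_with_filler(
    lambda: "...I'm really sorry. I know how much that meant to you."
)
```

The user-facing effect: response time is measured from the filler, not from the full reply, so perceived latency collapses to the fast path.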

Why This Architecture Wins

The screenwriter-actor split solves problems that monolithic approaches can't:

Separation of concerns. Your LLM focuses on what it's best at — reasoning, memory retrieval, personality consistency, emotional intelligence in content. Hume focuses on what it's best at — listening, speaking, emotional expression in delivery, turn management. Neither system needs to do everything.

Vendor flexibility. When a better LLM comes out (and it will, every few months), you swap the screenwriter. The actor doesn't change. When Hume improves Octave's voice quality, your LLM doesn't need to be updated. Each layer evolves independently.

Emotion as a first-class signal. Hume's analysis of the user's vocal tone gives you information that text-only systems destroy. The user says "I'm fine" — but their voice says otherwise. This signal feeds back into your emotion engine, enriching the companion's understanding of the user's state in ways that pure text analysis never could.
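One way to use that signal is to blend it with text-derived sentiment. The weighting below is a toy assumption (the 0.6 is arbitrary), but it shows how the vocal channel can override a misleading transcript:

```python
def merge_emotion_signals(text_scores: dict, voice_scores: dict,
                          voice_weight: float = 0.6) -> dict:
    """Blend text-derived and voice-derived emotion estimates.

    When the words say "I'm fine" but the voice says otherwise, the
    vocal signal pulls the blended estimate toward what the tone shows.
    """
    keys = set(text_scores) | set(voice_scores)
    return {
        k: round((1 - voice_weight) * text_scores.get(k, 0.0)
                 + voice_weight * voice_scores.get(k, 0.0), 3)
        for k in keys
    }


blended = merge_emotion_signals(
    {"calm": 0.9},                  # "I'm fine" -- the text looks neutral
    {"calm": 0.2, "sadness": 0.7},  # but the voice carries sadness
)
```

A text-only system sees only the first dictionary; the second one simply doesn't exist for it.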

Honest latency. The filler word pattern is psychologically sound. Humans don't experience silence as "fast thinking" — they experience it as disconnection. A quick "mm..." followed by a thoughtful response feels faster and more natural than a 1.5-second silence followed by a perfect response.

The Cost Problem

There's no avoiding this. Realtime voice is expensive.

Realtime voice APIs charge per-minute rates that add up fast. If a user talks to their companion for 10 minutes a day — not unreasonable for someone who uses Mio as a daily presence — the monthly voice cost alone exceeds what a typical subscription tier charges.

Just for voice. Before chat, memory, STT, proactive messaging, or any other cost. Even at higher volume tiers with discounted rates, the numbers don't come down enough.

This cannot be bundled into the base subscription. The unit economics don't work. A single heavy voice user would cost more than they pay.
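To make the unit economics concrete, here's the arithmetic with a deliberately hypothetical per-minute rate — the post doesn't quote real pricing, so the $0.10 figure is purely illustrative:

```python
def monthly_voice_cost(minutes_per_day: float, rate_per_minute: float,
                       days: int = 30) -> float:
    """Per-minute pricing compounds fast for a daily-use product."""
    return minutes_per_day * days * rate_per_minute


# Hypothetical $0.10/min: 10 minutes a day -> $30/month on voice alone,
# before chat, memory, STT, or proactive messaging costs.
cost = monthly_voice_cost(10, 0.10)
```

Whatever the real rate, the structure of the problem is the same: cost scales linearly with usage while subscription revenue is flat, so heavy users invert the margin.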

The solution is tier separation:

| Tier | What's Included |
|---|---|
| Pro | Text chat + voice messages (TTS) + all features |
| Voice | Everything in Pro + realtime bidirectional voice (premium pricing) |

Voice messages (non-realtime TTS via Doubao/Octave) are cheap enough to bundle into the base subscription. Realtime voice is a fundamentally different cost structure and needs its own pricing.

This is also why the phased approach matters. v0.1-0.2 ships with voice messages — affordable, already proven, emotionally expressive with the new TTS providers. Users get a companion that speaks with feeling. Then v1.0 adds realtime voice as a premium upgrade for users who want the full conversational experience.

The Phased Plan

v0.1-0.2: Voice messages (non-realtime)

  • Chinese TTS: Doubao 2.0 — automatic emotion inference, best-in-class Chinese prosody
  • English TTS: Hume Octave — LLM-based emotion understanding, handles nuance and subtext
  • STT: gpt-4o-mini-transcribe (carried from v1)
  • Both TTS providers support automatic emotion inference. No manual SSML markup.

v1.0: Realtime bidirectional voice

  • Hume EVI 3 for both languages
  • Custom LLM (Gemini/Claude) as the "brain"
  • Hume handles listening, emotion analysis, turn-taking, emotional speech synthesis
  • Filler word pattern for perceived-zero-latency responses
  • Separate Voice tier pricing

v1.x: Evaluate and optimize

  • If Hume's Chinese quality isn't sufficient, fall back to a self-assembled pipeline for Chinese: gpt-4o-mini-transcribe → Gemini → Doubao TTS 2.0
  • If sufficient, Hume EVI unifies both languages
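The fallback decision reduces to routing by language. A crude heuristic (counting CJK codepoints) stands in for real language detection here, and the backend labels are just descriptive strings:

```python
def route_voice_backend(text: str) -> str:
    """Route by language for the v1.x fallback: CJK-heavy text goes to the
    self-assembled Chinese pipeline, everything else to Hume EVI.
    """
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    if cjk > len(text) * 0.3:
        return "stt -> gemini -> doubao-tts"  # Chinese fallback pipeline
    return "hume-evi"


route_voice_backend("别担心，有我在呢")          # Chinese pipeline
route_voice_backend("Don't worry, I'm here.")  # Hume EVI
```

If Hume's Chinese quality holds up, this router collapses to a single branch — which is the simplest possible outcome to hope for.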

The Voice Is the Product

In Part 1, I argued that the big labs won't build deeply emotional companion products because of brand risk. The screenwriter-actor model extends this argument to voice specifically.

OpenAI has Advanced Voice Mode — impressive technology. But they'll never let you bring your own LLM, inject your own memory system, or customize the emotional persona. Their voice is their voice. Google will do the same with Gemini Live. These are general-purpose voice interfaces optimized for task completion, not emotional connection.

The screenwriter-actor model lets Mio have a voice that is genuinely its own — shaped by your conversations, informed by your history, colored by the emotional context of the moment. The LLM provides the depth. Hume provides the performance. Neither alone could create what they create together.

A companion that remembers you is already more than most products offer. Whether adding emotional voice on top of that justifies the cost and complexity — that's what the next few months of building will answer.


This post is also available in Chinese (中文版).


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0