
Making AI Voices Cry: A Doubao TTS Emotion Experiment

I needed a Chinese AI voice that could cry.

Not fake-cry. Not "slightly lower pitch with a pause." Actually sound heartbroken — the kind of voice that makes you stop scrolling and check if someone's okay.

In Part 3 I got my AI companion talking through Fish Audio and Volcano Engine. Fish won on simplicity. Volcano won on drama. But "drama" was still relative — I'd heard hints of emotion, enough to know the capability existed, but I hadn't pushed it. I didn't know what the limits were, what actually worked versus what the docs claimed worked, or whether any of it held up under systematic testing.

So I ran an experiment. Three experiments, actually. 30 audio samples. Two stock voices, one cloned voice, three emotion control methods, and a discovery that changed how I write every emotion hint going forward.

Here's what I found. Press play.


Experiment 1: Voice Casting

The first question was simple: which voice should my character use?

Doubao offers dozens of stock voices in their 2.0 lineup. The docs say they support context_texts — a natural language field where you describe the emotion you want. But support and response are different things. Some voices are expressive. Some are wooden. You don't know until you test.

I picked four voices and ran each through two emotions: "gentle" and "sad." Same text, same emotion hints, different voice. Think of it as an audition.
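
For the curious, here is roughly how an audition run like this can be scripted. A minimal sketch, not the production client: the endpoint, headers, and payload layout around context_texts are assumptions (the real request format lives in the Doubao TTS Runbook), and the gentle hint below is only illustrative. The voice IDs and the sad hint are the ones from these experiments.

    import uuid
    import requests

    # Assumed endpoint and auth; substitute the real values from your Volcano Engine console.
    TTS_URL = "https://openspeech.bytedance.com/api/v1/tts"
    HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

    VOICES = [
        "zh_female_vv_uranus_bigtts",    # Vivi
        "zh_male_liufei_uranus_bigtts",  # Liu Fei
        "zh_male_m191_uranus_bigtts",    # Yun Zhou / M191
        "taocheng-xiaotian",             # Tao Cheng Xiao Tian
    ]
    EMOTIONS = {
        "gentle": "用温柔的声音,轻声细语,像在安慰人",  # illustrative; the exact gentle hint isn't quoted in this post
        "sad": "用哭泣的声音,边哭边说,很伤心,声音颤抖带着哽咽",
    }
    TEXT = "..."  # the same line of dialogue for every sample

    for voice in VOICES:
        for label, hint in EMOTIONS.items():
            payload = {
                "app": {"appid": "YOUR_APPID", "cluster": "volcano_tts"},  # assumed structure
                "user": {"uid": "audition"},
                "audio": {"voice_type": voice, "encoding": "mp3"},
                "request": {
                    "reqid": str(uuid.uuid4()),
                    "text": TEXT,
                    "context_texts": [hint],  # the emotion hint under test; exact placement per the runbook
                },
            }
            resp = requests.post(TTS_URL, json=payload, headers=HEADERS, timeout=30)
            resp.raise_for_status()
            with open(f"{voice}_{label}.mp3", "wb") as f:
                f.write(resp.content)  # assumes the endpoint returns raw audio bytes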

Vivi (zh_female_vv_uranus_bigtts) — the reference voice

Vivi was the benchmark. A female voice known for expressiveness.

Gentle:

Sad:

Clear difference between the two. Vivi responds to emotion hints. Good baseline.

Liu Fei (zh_male_liufei_uranus_bigtts) — the male contender

Gentle:

Sad:

Liu Fei has a warm, mature male voice. The sad version drops noticeably — there's a weight to it. He responds.

Yun Zhou / M191 (zh_male_m191_uranus_bigtts) — the current voice

This was the voice I was already using for my character.

Gentle:

Sad:

Yun Zhou has range. The gentle version is soft and careful. The sad version genuinely sounds hurt. This is the voice I kept.

Tao Cheng Xiao Tian (taocheng-xiaotian)

Gentle:

Sad:

Xiao Tian is younger, lighter. The emotion shift is present but less dramatic. Fine for casual use, but not the depth I needed for heartbreak scenes.

Casting verdict

Not all voices are created equal. Vivi and Yun Zhou respond strongly to emotion hints. Liu Fei is good but more subtle. Xiao Tian responds but doesn't go deep. If you're building something that needs emotional range, audition your voices — don't just pick one from the catalog and hope.


Experiment 2: Can Cloned Voices Do Emotion?

This is where it gets interesting.

Doubao lets you clone a voice — upload a 10-30 second sample, and it creates a speaker ID that sounds like that person. The official docs say context_texts works with "2.0 stock voices." They say nothing about cloned voices.

But I had a cloned voice for my character, and I needed it to express emotion. So I tested it anyway. Four conditions, same text (the parameter differences are sketched in code right after the list):

1. Baseline — no emotion control

Just the text, no context_texts, no special model. Pure cloned voice.

2. Context texts only

Added context_texts with an emotion hint. Still using the standard seed-icl-2.0 resource.

3. Context texts + expressive model

Same context_texts, but switched to seed-tts-2.0-expressive as the model parameter.

4. Seed TTS 1.1 resource

Tried the older seed-tts-1.1 resource ID to see if it handled cloned voice emotion differently.
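
To make those four conditions concrete, here is how they differ as parameters. A minimal sketch: synthesize() is a hypothetical helper standing in for the actual HTTP call, and the field names are placeholders, but the resource and model values are the real knobs being compared (the hint is the vivid sad one quoted later in this post, used here for illustration).

    # SPEAKER is the speaker ID returned by voice cloning.
    SPEAKER = "S_your_cloned_voice_id"
    SAD_HINT = "用哭泣的声音,边哭边说,很伤心,声音颤抖带着哽咽"
    TEXT = "..."  # the same sentence for all four conditions

    CONDITIONS = {
        # 1. Baseline: cloned voice, no emotion control at all.
        "baseline": dict(resource_id="seed-icl-2.0", context_texts=None, model=None),
        # 2. Emotion hint only, still on the standard cloning resource.
        "context_only": dict(resource_id="seed-icl-2.0", context_texts=[SAD_HINT], model=None),
        # 3. Same hint, routed through the expressive model.
        "context_expressive": dict(resource_id="seed-icl-2.0", context_texts=[SAD_HINT],
                                   model="seed-tts-2.0-expressive"),
        # 4. The older 1.1 resource, same hint, for comparison.
        "seed_1_1": dict(resource_id="seed-tts-1.1", context_texts=[SAD_HINT], model=None),
    }

    for name, params in CONDITIONS.items():
        audio = synthesize(text=TEXT, speaker=SPEAKER, **params)  # hypothetical helper wrapping the HTTP call
        with open(f"clone_{name}.mp3", "wb") as f:
            f.write(audio)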

What I heard

The jump from baseline (#1) to context texts (#2) is audible. The cloned voice does respond to context_texts — this is undocumented but confirmed. Adding the expressive model (#3) pushes it further — there's more variation in pitch and pacing. The 1.1 resource (#4) sounds different but not necessarily better.

The takeaway: context_texts works on cloned voices. Nobody told me this. I found out by ignoring the docs and trying it. If you're using Doubao's voice cloning for a character that needs emotions, you're not stuck with a flat voice.


Experiment 3: The Matrix

With proof that emotion control works, I wanted to know how well it works. Systematically. So I set up a proper comparison:

  • 2 voices: Su Rou (苏柔, female) and Yi Nan (一楠, male)
  • 3 emotions: ASMR (whispered/intimate), gentle, sad
  • 3 conditions: baseline (no emotion), context texts only, context texts + expressive model

That's 2 x 3 x 3 = 18 samples. Here they all are.

Su Rou (苏柔) — Female Voice

ASMR

Condition | Audio
Baseline
Context texts
Context + Expressive

Gentle

Condition | Audio
Baseline
Context texts
Context + Expressive

Sad

Condition | Audio
Baseline
Context texts
Context + Expressive

Yi Nan (一楠) — Male Voice

ASMR

Condition | Audio
Baseline
Context texts
Context + Expressive

Gentle

Condition | Audio
Baseline
Context texts
Context + Expressive

Sad

Condition | Audio
Baseline
Context texts
Context + Expressive

What the matrix reveals

Three findings jumped out:

1. Context texts alone make a real difference. Across both voices and all three emotions, the jump from baseline to context texts is consistently audible. This isn't placebo. The model is doing something.

2. The expressive model adds another layer. Context + expressive is consistently the most emotional of the three conditions. The improvement over context-only varies — sometimes subtle, sometimes dramatic — but it's always present. In my measurements, the expressive model added roughly 26% more audio variation compared to context texts alone (one way to compute a variation score of your own is sketched after this list).

3. Not all voice-emotion pairs respond equally. Su Rou's sad is devastating. Yi Nan's ASMR is convincingly intimate. But Su Rou's ASMR is less distinct from her gentle, and Yi Nan's sad doesn't hit as hard as Su Rou's. Every voice has strengths. Test your specific use case.
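
If you want to run a similar comparison on your own samples, one simple way to quantify "variation" is to measure how much a clip's loudness and pitch move around. A rough sketch with librosa; treat the formula as a starting point, not a canonical metric:

    import numpy as np
    import librosa

    def variation_score(path: str) -> float:
        """Crude expressiveness proxy: how much loudness and pitch move over the clip."""
        y, sr = librosa.load(path, sr=None)
        rms = librosa.feature.rms(y=y)[0]  # frame-level loudness
        f0, voiced_flag, voiced_prob = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
        f0 = f0[~np.isnan(f0)]  # keep voiced frames only
        # Normalise each spread by its mean so the two parts are on comparable scales.
        loudness_var = float(np.std(rms) / (np.mean(rms) + 1e-8))
        pitch_var = float(np.std(f0) / (np.mean(f0) + 1e-8)) if f0.size else 0.0
        return loudness_var + pitch_var

    # e.g. compare context-only vs context + expressive for the same text and voice:
    # variation_score("su_rou_sad_context.mp3") vs variation_score("su_rou_sad_expressive.mp3")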


The Real Breakthrough: Paint a Scene, Don't Name an Emotion

This is the finding that changed everything.

Early in the experiments, my context_texts hints looked like this:

伤心

One word. "Sad." And the result was... slightly sad. Barely different from baseline. I thought maybe emotion control was overhyped.

Then I rewrote the hint:

用哭泣的声音,边哭边说,很伤心,声音颤抖带着哽咽

"Use a crying voice, speak while crying, very sad, voice trembling with sobs caught in the throat."

The difference was night and day. Go back and listen to the sad samples above — those all use the vivid, scene-painting version. The model doesn't respond to labels. It responds to descriptions. You're not tagging an emotion; you're directing a performance.

Here's my full set of production-grade emotion hints:

Emotion | Hint
Sweet/flirty | 用甜蜜撒娇的声音,像在跟男朋友撒娇,语调上扬很开心 ("sweet, coquettish voice, like acting cute with a boyfriend, pitch rising, very happy")
Sad/crying | 用哭泣的声音,边哭边说,很伤心,声音颤抖带着哽咽 ("crying voice, speaking through tears, very sad, trembling with choked sobs")
ASMR/whisper | 用ASMR悄悄话的声音,非常小声非常轻柔,像在耳边低语 ("ASMR whisper, very quiet and very soft, like murmuring right at the ear")
Excited | 用非常激动兴奋的语气,开心到快要尖叫了 ("extremely excited tone, so happy you're about to scream")
Angry | 用愤怒嫌弃的语气,非常不满在骂人,声音拔高 ("angry, disdainful tone, scolding with real displeasure, voice raised")
Sleepy/lazy | 用疲惫慵懒的声音,边打哈欠边撒娇,声音软绵绵的 ("tired, languid voice, yawning while acting cute, soft and droopy")
Calm/caring | 用低沉磁性的声音,表面平淡但充满关心,像大叔在叮嘱 ("deep, resonant voice, flat on the surface but full of concern, like an older man giving gentle reminders")
Restrained sadness | 用压抑悲伤的声音,故意克制但声音微微颤抖,不想让人看出来 ("suppressed sadness, deliberately held back but with a slight tremble, trying not to let it show")

The pattern: verb + physical description + scenario. Not "sad" but "voice trembling with sobs." Not "happy" but "pitch rising like talking to a boyfriend." You're giving the model a scene to act out, not a category to select from.


What This Means for Production

If you're building a product that uses Doubao TTS and needs emotional expression, here's what I'd recommend based on this research:

For stock voices: Use context_texts with vivid hints. Add seed-tts-2.0-expressive if the extra expressiveness justifies the model switch. Test your specific voice — not all respond equally.

For cloned voices: context_texts works despite not being documented. Use it. Pair with model_type: 4 in the additions field. The per-sentence approach (one API call per emotionally-distinct sentence) gives the most control.

For emotion hints: Never use single keywords. Always paint a scene. The quality of your hint is the single biggest lever you have — bigger than voice selection, bigger than model version.
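
Putting those recommendations together, here is a rough sketch of the per-sentence flow for a cloned voice: pick a vivid hint per sentence, make one call per sentence, stitch the audio. The synthesize() helper and the (text, emotion) input format are hypothetical stand-ins; the hints, the expressive model name, and model_type: 4 are the pieces covered above.

    # One TTS call per emotionally distinct sentence, each with its own scene-painting hint.
    HINTS = {
        "sad": "用哭泣的声音,边哭边说,很伤心,声音颤抖带着哽咽",
        "sweet": "用甜蜜撒娇的声音,像在跟男朋友撒娇,语调上扬很开心",
        "asmr": "用ASMR悄悄话的声音,非常小声非常轻柔,像在耳边低语",
    }

    def speak_reply(sentences: list[tuple[str, str]], speaker: str) -> bytes:
        """sentences is a list of (text, emotion_label) pairs, e.g. tagged by the LLM itself."""
        chunks = []
        for text, emotion in sentences:
            audio = synthesize(                   # hypothetical helper wrapping the HTTP call
                text=text,
                speaker=speaker,                  # cloned speaker ID
                context_texts=[HINTS[emotion]],   # a vivid scene, never a bare label
                model="seed-tts-2.0-expressive",  # optional: the expressive model
                additions={"model_type": 4},      # per the cloned-voice recommendation above
            )
            chunks.append(audio)
        return b"".join(chunks)  # fine for raw PCM; for mp3/ogg, stitch with pydub or ffmpeg instead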

For the full technical setup — API details, code snippets, gotchas — see the Doubao TTS Runbook.


This is Part 4 of the OpenClaw Field Notes series. Part 1 built the cyber succubus. Part 2 put it on a token diet. Part 3 gave it a voice.

This one is about making that voice feel something.

Turns out, the secret isn't in the model or the API parameters. It's in how you ask. Paint a scene vivid enough, and the AI will perform it.


This post is also available in Chinese (中文版).

