
The Chattiness Problem: Teaching an LLM When to Shut Up

LLMs are trained to be helpful. Helpful means thorough. Thorough means verbose. That's fine for a chatbot answering questions about your insurance policy. It's a disaster for a companion app where the goal is to feel like texting a real person.


The Ratio Problem

I measured response-to-input length ratios across Mio's 5 personas. The numbers were ugly.

Before any controls, the average ratio in realistic mode was 5.02x — the LLM's reply was five times longer than the user's input. For short pings (single-character messages like "嗯"), the ratio ballooned to 6.60x. When you text a friend "嗯" (a one-character acknowledgment), they might reply "幹嘛" (what's up) — 2 characters. An LLM replies with a paragraph asking how your day went, sharing what it's been thinking about, and ending with a thoughtful question.

That's not a conversation. That's a monologue.

The problem is baked into the training. LLMs are rewarded for comprehensive, well-structured responses. Every RLHF rater who gave a thumbs-up to a detailed answer reinforced the same behavior: more is better. For a companion app where the baseline interaction is casual texting, this training signal is actively harmful.


What Didn't Work

Hard max tokens. The obvious first thought: just cap the output at N tokens. I rejected this quickly. Mid-sentence cutoffs are worse than being verbose. You can't truncate "I was thinking about what you said yesterday and—" and call it natural. The user sees a thought that literally got cut off. It breaks immersion harder than any wall of text.

Prompt-only instructions. "Keep replies short and natural" in the system prompt. This improved things somewhat but was wildly inconsistent. LLMs treat length instructions as suggestions, not constraints. Chatty personas (like 小柒, the puppy-like college boy) would routinely blow past any prompted limit. Quiet personas (like 苏柔, the reserved editor) were already brief, making the prompt redundant. The instruction helped the middle but couldn't tame the extremes.


It's Not One Problem, It's a Matrix

The breakthrough was realizing response length should vary across three independent dimensions:

Mode. Realistic (texting-like brevity) vs companion (more expressive, allowed to be warmer and longer).

Relationship closeness. Distant (刚认识 / just met) to neutral (普通朋友 / friends) to close (情侣 / partners). How you reply to an acquaintance versus a partner is fundamentally different — even to the same message.

Persona verbosity profile. Quiet (苏柔, 陈哥) vs neutral (可可, 蜜蜜) vs chatty (小柒). Each persona has an intrinsic talkativeness that should be respected.

A quiet persona who just met you should reply with 2-4 characters to a short ping. A chatty persona in a close relationship can afford 12-18. Treating these as the same problem guarantees you solve neither. The relationship dimension connects directly to the closeness system I built in v0.1.4 — the same state that drives emotional behavior now drives response length.
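The three dimensions can be encoded as a small lookup plus a scale factor. This is a minimal sketch with hypothetical names and placeholder numbers (only the realistic distant/neutral figures match the example budgets later in the post), not the production table:

```typescript
// Hypothetical sketch of the length matrix. Type names, table values,
// and the scale factors are illustrative assumptions.
type Mode = "realistic" | "companion";
type Closeness = "distant" | "neutral" | "close";
type Verbosity = "quiet" | "neutral" | "chatty";

interface LengthBudget {
  maxChars: number;
  maxBubbles: number;
}

// Base budgets keyed by mode + closeness. Companion mode is looser
// across the board.
const BASE_BUDGETS: Record<Mode, Record<Closeness, LengthBudget>> = {
  realistic: {
    distant: { maxChars: 4, maxBubbles: 1 },
    neutral: { maxChars: 8, maxBubbles: 1 },
    close: { maxChars: 18, maxBubbles: 2 },
  },
  companion: {
    distant: { maxChars: 12, maxBubbles: 1 },
    neutral: { maxChars: 24, maxBubbles: 2 },
    close: { maxChars: 40, maxBubbles: 3 },
  },
};

// Persona verbosity scales the base budget rather than replacing it,
// so the mode/closeness shape is preserved for every persona.
const VERBOSITY_SCALE: Record<Verbosity, number> = {
  quiet: 0.8,
  neutral: 1.0,
  chatty: 1.3, // mid-range of the chatty bonus described below
};

function computeBudget(
  mode: Mode,
  closeness: Closeness,
  verbosity: Verbosity,
): LengthBudget {
  const base = BASE_BUDGETS[mode][closeness];
  return {
    maxChars: Math.round(base.maxChars * VERBOSITY_SCALE[verbosity]),
    maxBubbles: base.maxBubbles,
  };
}
```

The multiplicative design choice matters: verbosity scales the budget instead of overriding it, so a chatty persona who just met you is still terser than a chatty persona in a close relationship.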


The Hybrid Two-Layer System

The solution is two layers working together: a prompt-level contract that guides generation, and a deterministic post-processor that enforces limits.

Layer 1: Per-Turn Length Contract

Injected into every system prompt, this contract specifies a character limit per response (computed from mode + relationship + persona profile), a bubble count limit (max number of message bubbles), and a universal principle: short and warm, leave space for the user to respond.

Example budgets in realistic mode:

  • Distant + short ping: 4 chars / 1 bubble
  • Neutral + casual: 8 chars / 1 bubble
  • Close + deep topic: 84 chars / 3 bubbles
  • Chatty persona bonus: +15-40% on all limits

The contract isn't just "be short." It's a specific numerical target the LLM can aim for. "Reply in under 8 characters" is actionable in a way "keep it brief" is not.
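The injection itself is trivial once the budget is numeric. A sketch, with the wording and function name assumed rather than taken from Mio's actual prompt:

```typescript
// Hypothetical rendering of the per-turn contract into the system
// prompt. The exact phrasing is an assumption; the key property is
// that the limits are concrete numbers computed for this turn.
interface LengthBudget {
  maxChars: number;
  maxBubbles: number;
}

function renderLengthContract(budget: LengthBudget): string {
  return [
    `Reply in at most ${budget.maxChars} characters total.`,
    `Use at most ${budget.maxBubbles} message bubble(s).`,
    `Be short and warm. Leave space for the user to respond.`,
  ].join("\n");
}
```

The contract string would then be appended to the persona's system prompt on every turn, so the target moves as mode, closeness, and persona change.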

Layer 2: Deterministic Post-Processing

This is where the real enforcement happens. A pure code function — normalizeAssistantReplyLength() — runs on every response before it reaches the user. No extra LLM call. Just code.

The normalizer splits the response into sentences using Chinese/English sentence boundary detection, trims to fit the character and bubble budget, and preserves complete sentences so it never cuts mid-thought. The trimmed result is what gets persisted to the database and sent to the user — they see the same text.
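The core of that normalizer can be sketched in a few lines. This is a simplified reconstruction of the described behavior, not the real `normalizeAssistantReplyLength()`, and it assumes one sentence per bubble:

```typescript
// Simplified sketch of the deterministic normalizer: split on Chinese
// and English sentence-ending punctuation, then keep whole sentences
// until the character budget is spent. Never cuts mid-sentence.
function normalizeReplyLength(
  text: string,
  maxChars: number,
  maxBubbles: number,
): string[] {
  // Each match is a sentence with its terminal punctuation attached.
  const sentences = text.match(/[^。！？!?.]+[。！？!?.]?/g) ?? [text];

  const bubbles: string[] = [];
  let used = 0;
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (!sentence) continue;
    // Stop before exceeding the budget -- but always keep at least one
    // sentence, even if it alone is over, so the reply is never empty.
    if (used + sentence.length > maxChars && bubbles.length > 0) break;
    bubbles.push(sentence);
    used += sentence.length;
    if (bubbles.length >= maxBubbles) break;
  }
  return bubbles;
}
```

A production version would also need to handle quotes, ellipses, and emoji-terminated sentences; the regex above is the minimal case.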

Why not a second LLM call to "please shorten this response"? Cost and latency. Every message already has a per-call cost (see the unit economics breakdown in the Mio Manifesto). Adding a shortening pass doubles that and adds 500ms+ of latency. For a chat app where responsiveness is part of the experience, that's unacceptable. A deterministic function runs in under 1ms.


The Over-Correction Arc

The first pass worked too well. Realistic + distant replies became brutally terse. When a user said "在干嘛" (what are you doing), the LLM replied "嗯。" (uh-huh). Technically short. Also technically useless — it's a non-answer to a direct question.

The fix was activity question detection. A deterministic guard checks if the user asked a "what are you doing" type question (在干嘛, 在忙什么, 你在做什么, etc.) and sets a floor: the response must actually answer the question, even if it's brief. "在忙。" (busy) is 2 characters AND answers the question. "嗯。" doesn't. The floor doesn't override the ceiling — it just ensures the response has semantic content, not just filler.
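The guard is a plain pattern check. A sketch under assumptions (the pattern list and filler definition are illustrative, and surely narrower than the real detector):

```typescript
// Hypothetical activity-question guard. Patterns and the filler
// character set are illustrative, not the production lists.
const ACTIVITY_PATTERNS: RegExp[] = [
  /在干嘛/, /在幹嘛/, /在忙什么/, /在忙什麼/, /你在做什么/, /干啥呢/,
];

function isActivityQuestion(userMessage: string): boolean {
  return ACTIVITY_PATTERNS.some((p) => p.test(userMessage));
}

// A reply made only of acknowledgment particles and punctuation has no
// semantic content -- it fails the "must actually answer" floor.
const FILLER_ONLY = /^[嗯哦喔呵哈嘿。.!！?？~～\s]+$/;

function violatesAnswerFloor(userMessage: string, reply: string): boolean {
  return isActivityQuestion(userMessage) && FILLER_ONLY.test(reply);
}
```

When the floor is violated, the system can re-prompt or fall back to a minimal real answer; the point is that the check itself is deterministic code, not another LLM call.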

This pattern kept recurring. Every time I tightened limits, I'd find a category of input where brevity crossed from natural into broken. Each one needed a specific guard. Questions need answers. Greetings need reciprocation. Emotional messages need acknowledgment. The system isn't just about length — it's about minimum viable response quality at any given length.


Short AND Warm

The key insight was that brevity and warmth are not opposites. Real people are brief but not cold. The final tuning added a universal engagement principle to the prompt contract: be short, be warm, leave space.

Here's what this looks like in practice. For the input "嗯" in realistic mode with the 可可 persona:

  • Distant (刚认识): "嗯。" — 2 chars. Polite acknowledgment between strangers. Correct.
  • Neutral (普通朋友): "幹嘛啦。" — 4 chars. Casual "what's up" between friends. Correct.
  • Close (情侣): "幹嘛,句點我喔?" — 8 chars. Playful teasing between partners. Correct.

Same character, same persona, completely different energy based on relationship closeness. That's not something you get from a flat "keep it short" instruction. It requires the full matrix.


Results

The measured improvements after deploying the hybrid system:

  Metric                    Before   After   Change
  Overall realistic ratio    5.02x    2.78x    -45%
  Short-ping realistic       6.60x    5.00x    -24%
  Casual realistic           1.95x    0.70x    -64%
  Deep topic realistic       6.49x    2.63x    -59%
  Distant avg ratio          3.83x    1.83x    -52%
  Neutral avg ratio          3.97x    2.28x    -43%

Short-ping ratios are still higher than ideal — when the input is literally one character, even a 3-character response is a 3x ratio. But the absolute lengths are now in the right range. A 5-character reply to "嗯" feels natural. A 33-character reply does not.


What This Means for AI Companions

The chattiness problem is underrated in companion AI. Most products optimize for engagement metrics, and longer responses correlate with longer sessions. More words, more reading time, more "engagement." But real relationships have silence. Real people don't narrate their thoughts. The person who texts you a novel every time you say "hey" isn't attentive — they're exhausting.

The hardest part of this work wasn't making the LLM shorter. It was making it shorter in the right way, so brevity reads as comfortable silence rather than cold dismissal. A 2-character response from a close partner that carries warmth and playfulness is worth more than a 200-character response that reads like a therapist's check-in.

The hybrid approach — prompt contracts for guidance, deterministic code for enforcement — turns out to be a general pattern for controlling LLM behavior in production. Prompts are good at nuance but bad at hard limits. Code is good at hard limits but bad at nuance. Use both.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0