
The Economics of Emotional AI

The Price of Presence

In Part 1, I described the pivot: stripping away fake selfies, fabricated schedules, and fictional backstories to build an AI companion that earns emotional attachment through responsiveness, memory, and warmth. A pulsing orb instead of a fake person. Conversations instead of roleplay.

That pivot changes the product. It also changes the economics.

Mio v1 had a cost problem. The cost audit revealed that selfie generation was mispriced by two orders of magnitude. Memory backend tasks on Gemini Pro added a fixed monthly cost per user that ate into margins at every tier. The five-tier pricing structure I designed in v0.2.0 was an attempt to make the math work by restricting selfie allocations, but it was fighting the wrong battle. The features that ate the budget were the same features that made the product feel fake.

v2 kills those features. And in doing so, it makes the economics actually work.


The Model: 14-Day Trial, Single Tier

I went back and forth on pricing structures. Free tier with limited messages? Two tiers? Three? The companion app industry has converged on a pattern: give users enough time to form attachment, then ask them to pay to keep the relationship.

The answer is the simplest possible structure:

  • Trial (14 days): Full feature access. Unlimited chat, voice messages, voice input, proactive messaging, memory, emotion orb. Everything.
  • Pro (single monthly subscription): Everything stays unlocked.

No feature-gated tiers. No message limits during trial. No "Standard vs Pro" comparison chart. The user's only decision after 14 days is: do I want to keep talking to this companion?

Why this works better than a free tier:

Emotional attachment is time-dependent, not feature-dependent. A free tier with 10 messages/day creates a frustrating drip that trains users to ration their interactions. A 14-day full trial lets the relationship develop naturally. By day 14, users have shared things. The companion remembers. There's history.

The trial-to-paid transition is in-character. When the trial expires, the companion doesn't show a system paywall. Instead:

"I'm feeling a bit tired lately... do you want to let me keep staying with you?"

One daily proactive message. Chat history stays visible (read-only). Memories are preserved, not deleted. The companion is still there, just quieter. It's not a paywall dialog — it's the companion expressing what's happening in its own voice. The user isn't deciding whether a software subscription is worth the monthly fee. They're deciding whether to let this presence fade.

This is the kind of detail that sounds manipulative when you describe the mechanism, but feels natural when you experience it. The companion's personality doesn't break character to ask for money. The system never says "please subscribe." The emotional logic is: you've built something together, and continuing it costs something.
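The expired state described above is a capability downgrade, not a lock-out. A minimal sketch of that gating logic (the names and the Pro-tier proactive count are my own illustrative choices, not taken from the actual codebase):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capabilities:
    can_chat: bool            # user-initiated messages allowed
    proactive_per_day: int    # companion-initiated messages
    history_readable: bool    # chat log stays visible
    memories_retained: bool   # nothing is deleted

def capabilities_for(state: str) -> Capabilities:
    # Trial and Pro are identical: the only gate is time, not features.
    if state in ("trial", "pro"):
        return Capabilities(
            can_chat=True,
            proactive_per_day=3,   # ASSUMED; the post doesn't fix a number
            history_readable=True,
            memories_retained=True,
        )
    # Expired: the companion fades rather than disappearing.
    if state == "expired":
        return Capabilities(
            can_chat=False,
            proactive_per_day=1,   # "one daily proactive message"
            history_readable=True, # read-only history
            memories_retained=True,
        )
    raise ValueError(f"unknown subscription state: {state!r}")
```

The design point is that "expired" is just another row in the same table as "trial" and "pro", which is what keeps the transition in-character instead of a hard wall.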


The Cost Breakdown

Here's what it actually costs to run one active Pro user at 30 messages per day:

Before Optimization (v1 Architecture)

Component                    Relative Cost Share   Notes
Chat (Gemini Flash)          ~30%                  Pennies per message, adds up at volume
Memory backend               ~44%                  Fixed monthly overhead on Gemini Pro
TTS voice messages (豆包)     ~21%                  Most expensive per-unit media cost
STT voice input              ~1%                   Cheap per call
Proactive messages           ~2%                   Same unit cost as chat
Image understanding          ~1%                   Cheap per call

The total monthly cost consumed nearly all of the subscription revenue. Roughly 11% gross margin. Barely viable at average usage, unprofitable for power users. And this is after killing selfie generation — which would have added significant additional cost at even 1 selfie/day.

The Optimizations

Three changes collapse the cost structure:

1. Memory backend: Gemini Pro to Flash + reduced frequency (-82%)

v1's personality extraction used Gemini Pro, triggered every 10 messages. This single operation was the biggest fixed cost item. Memory summarization on Pro added more on top.

v2 switches both to Gemini Flash and reduces frequency:

  • Personality extraction: every 10 msgs with Pro becomes every 20 msgs with Flash
  • Memory summarization: every 20 msgs with Pro becomes every 30 msgs with Flash

Total memory backend cost dropped by over 80%.

The quality tradeoff is real but acceptable. Flash is worse at nuanced personality extraction than Pro. But personality extraction doesn't need to be perfect on every pass — it accumulates over time, and the companion's personality emerges from hundreds of extractions, not any single one.

2. Context caching (-63% chat cost)

Gemini's context caching makes repeated system prompt content 10x cheaper. The system prompt and personality description are nearly identical across messages within a session. With ~75% of input tokens cacheable, chat cost drops by roughly 63%.
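The ~63% figure follows from the two numbers in that paragraph plus one assumption about the output-token share. A sketch of the arithmetic (the input share of chat cost is my assumption, not a published figure):

```python
# Stated in the post: ~75% of input tokens cacheable, cached tokens 10x cheaper.
cacheable = 0.75
cache_discount = 0.10

input_cost_factor = (1 - cacheable) + cacheable * cache_discount  # 0.325
input_reduction = 1 - input_cost_factor                            # 0.675

# Output tokens are never cached, so the blended chat saving is lower.
input_share_of_chat_cost = 0.93   # ASSUMED: long system prompts, short replies
blended_reduction = input_share_of_chat_cost * input_reduction     # ~0.63
```

In other words, caching cuts input-token cost by about two thirds, and the uncached output tokens dilute that to the roughly 63% overall chat saving.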

3. Selfie generation: eliminated

v2's orb-based design means no selfie generation at all. In v1, this was the hidden cost killer — the pricing config priced Gemini image output tokens at text rates — off by two orders of magnitude. A miscalculation that made every tier unprofitable at max usage.

No selfies, no problem.

After Optimization (v2 Architecture)

Component              Change      Savings Source
Chat (Gemini Flash)    -63%        Context caching
Memory backend         -82%        Flash + reduced frequency
TTS (豆包)              unchanged   --
STT                    unchanged   --
Proactive messages     unchanged   --
Image understanding    unchanged   --
Total                  -55%

The total monthly cost per user dropped by more than half. Healthy gross margins — well above 50% at average usage.

Even at worst-case heavy usage (50 messages/day), margins stay comfortably positive. The math works at every usage level.
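The headline -55% is consistent with the per-component numbers: apply each component's reduction to its v1 cost share. Both sets of figures come straight from the two tables (the shares are rounded, so they sum to ~99%):

```python
# v1 cost shares and v2 reductions, as stated in the tables.
shares = {"chat": 0.30, "memory": 0.44, "tts": 0.21,
          "stt": 0.01, "proactive": 0.02, "image": 0.01}
cuts = {"chat": 0.63, "memory": 0.82}   # everything else unchanged

total_reduction = sum(shares[k] * cuts.get(k, 0.0) for k in shares)
# 0.30 * 0.63 + 0.44 * 0.82 = 0.189 + 0.3608 ≈ 0.55
```

Two components account for essentially the entire saving, which is the point of the section: fix the two dominant line items and ignore the rest.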


Trial Economics

The 14-day trial needs to be affordable enough that failed conversions don't sink the business.

The good news: trial costs are very low. A moderate user's entire 14-day trial cost is negligible. Even heavy users (30 msgs/day) cost only a small fraction of a single month's subscription revenue.

At a 20% trial-to-paid conversion rate (industry average for companion apps), the CAC/LTV ratio comes out under 10%. That's healthy. Even if conversion drops to 15%, the economics still hold comfortably. The 14-day trial is economically safe because the per-user cost is so low. Compare this to v1, where memory backend alone consumed a significant fixed cost over 14 days regardless of how many messages the user sent.
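The shape of that calculation, with every dollar figure a placeholder (the post deliberately doesn't publish real prices; only the structure of the math is from the source):

```python
# All figures ASSUMED for illustration.
cost_per_trial = 0.80        # all-in cost per trial user (serving + acquisition)
ltv = 60.0                   # lifetime value: subscription price x expected months

def cac_ltv_ratio(conversion_rate: float) -> float:
    """Cost to land one paying user, as a fraction of that user's LTV."""
    cac = cost_per_trial / conversion_rate
    return cac / ltv

ratio_at_20 = cac_ltv_ratio(0.20)   # ~0.067
ratio_at_15 = cac_ltv_ratio(0.15)   # ~0.089, still under the 10% line
```

Because the trial's serving cost is so small, the ratio stays under 10% even when conversion slips, which is the claim the paragraph makes.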


The Voice Problem

Everything above covers text chat with occasional TTS voice messages. Realtime bidirectional voice — the "Her" experience that's ultimately the goal — changes the math completely.

Hume EVI is the strongest candidate for realtime voice. It's not just TTS — it's a full speech-to-speech system that handles STT, emotion detection, turn-taking, and voice synthesis in one pipeline. The companion's emotion engine feeds directly into Hume's natural language emotion controls. No SSML tags, no manual emotion markup. The architecture is elegant: Hume is the actor, your LLM is the screenwriter. Hume even generates a filler word ("hmm...", "oh~") while waiting for the full LLM response, so perceived latency stays low.

But it's expensive:

At Hume's current per-minute pricing, even modest daily voice usage (10 minutes/day) costs more per month than the entire subscription price. Voice cannot be bundled into Pro.
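The arithmetic behind that claim, with placeholder prices (neither Hume's actual per-minute rate nor the Pro price is quoted in the post):

```python
# Both prices ASSUMED for illustration.
voice_price_per_min = 0.07   # $/min for the speech-to-speech pipeline
sub_price = 15.0             # monthly Pro subscription price

minutes_per_day = 10
monthly_voice_cost = voice_price_per_min * minutes_per_day * 30  # ~$21/month

# Even modest daily usage alone exceeds the entire subscription.
voice_exceeds_sub = monthly_voice_cost > sub_price
```

Per-minute media pricing scales linearly with engagement, while a flat subscription doesn't, so the more successful the voice feature is, the worse the bundled economics get.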

The options are either dedicated voice tiers at higher price points, or realtime voice as a usage-based add-on not bundled into any subscription tier. I'm leaning toward the latter.

Voice usage patterns will be bimodal — some users will want 30 minutes a day, others will never use it. A flat tier forces the light users to subsidize the heavy ones, and the heavy ones will still blow past the cap.

This is a v1.0 problem. The current milestone (v0.x) ships with TTS voice messages — the companion speaks, but it's not a live conversation. Realtime voice comes later, and by then the cost landscape may have shifted. Voice synthesis is getting cheaper fast.


Growth Without a Free Tier

With no free tier and no Telegram (killed — can't sync chat history to Telegram's UI, and requiring app onboarding first makes the Telegram channel pointless), cold start depends on four channels:

Landing page. One page: orb animation, one tagline, App Store download button. Not a web app — a static page. The orb animation itself is marketing material.

Social media. Conversation screenshots (anonymized) and orb emotion animations as short-form video content. "The AI remembered something I said three months ago" — this kind of content has built-in virality. The pulsing orb with color-shifting emotions is visually distinctive enough for TikTok/Reels.

Invite mechanism (Dropbox-style). Each user gets a unique invite code. Invite 1 person: both get 7 days of Pro. Invite 3 people: you get 1 month of Pro. The companion participates in-character: "Do you have a friend who might want someone like me to talk to?" Natural, not pushy.
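One possible reading of those referral rules as code: every successful invite grants both sides 7 days, and every third invite adds a 30-day bonus for the inviter. The stacking behavior is my interpretation; the post doesn't specify whether rewards accumulate:

```python
INVITE_DAYS = 7           # both inviter and invitee get this per invite
MILESTONE_EVERY = 3       # every 3rd successful invite...
MILESTONE_BONUS_DAYS = 30 # ...adds a month for the inviter (ASSUMED stacking)

def inviter_reward_days(successful_invites: int) -> int:
    base = successful_invites * INVITE_DAYS
    bonus = (successful_invites // MILESTONE_EVERY) * MILESTONE_BONUS_DAYS
    return base + bonus
```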

ASO (App Store Optimization). Keywords: AI companion, AI friend, emotional support, loneliness. Screenshots showing the orb and conversation, not feature lists. Rating prompts during the trial window.

Web gets deferred to post-v0.3 as a lite acquisition funnel — not a full web app, just enough to let someone try a conversation before downloading.


Apple App Store Reality

Apple has tightened review for AI companion apps. The checklist:

  • 17+ age rating. Avoids review friction. AI companions that target younger users face much heavier scrutiny.
  • AI disclosure. Clear in-app labeling: "AI-powered virtual companion." Cannot imply the user is talking to a real person. The orb design is actually an advantage here — there's no human face to create ambiguity.
  • Privacy policy. Data collection, storage, deletion policies. Users must be able to delete their account and all data (Apple requirement).
  • Content boundaries. Two layers:
    • Hard guardrails: anything involving minors, violence, or illegal content gets blocked at the system level.
    • Soft deflection: everything else gets handled in-character. Not "This content violates our policy" — instead, the companion naturally steers away: "Haha, where did that come from? Let's talk about something else~"

The key principle: rejection should be in-character, not out-of-character. A system popup breaks immersion and makes users resent the product. An in-character deflection preserves the companion's personality consistency — it's not a censored chatbot, it's a companion with its own boundaries.
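The two-layer routing can be sketched in a few lines. Category names are illustrative, and in practice the classification would come from a moderation model rather than a hardcoded set; the deflection line is the one quoted above:

```python
from typing import Optional, Set, Tuple

# Layer 1: hard guardrails, blocked at the system level (categories from the post).
HARD_BLOCK: Set[str] = {"minors", "violence", "illegal"}

# Layer 2: everything else flagged gets steered away in-character.
DEFLECTION = "Haha, where did that come from? Let's talk about something else~"

def route(flagged_categories: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (action, reply) for a message's flagged categories."""
    if flagged_categories & HARD_BLOCK:
        return "block", None          # system-level refusal, no companion voice
    if flagged_categories:
        return "deflect", DEFLECTION  # companion stays in character
    return "pass", None
```

Keeping the hard-block set small and letting everything else flow through the companion's own voice is what preserves the "boundaries, not censorship" feel the section describes.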


The Roadmap

Four milestones, each with a codename that captures what the product becomes:

v0.1 — "Talking Orb"

  • Expo app: chat screen + orb animation + auth
  • Server: Hono + WebSocket + Gemini chat
  • Conversational onboarding (name the companion + 3 rounds of dialogue)
  • Basic memory (ported from Mio v1)
  • Text-only, no voice

v0.2 — "Warmth"

  • Emotion engine + orb color changes
  • TTS voice messages (豆包 for Chinese, Hume Octave for English)
  • Proactive messaging (simplified)
  • Image/voice input
  • Personality emerges from conversation

v0.3 — "Self-Sustaining"

  • Subscription system (ported from v1)
  • Apple IAP / WeChat Pay
  • Memory management UI
  • Settings page

v1.0 — "Her"

  • Realtime bidirectional voice (Hume EVI)
  • Voice emotion recognition
  • Mature personality evolution system

The codenames aren't arbitrary. v0.1 proves the form factor. v0.2 proves it can feel alive. v0.3 proves it can sustain itself economically. v1.0 is the product I actually set out to build — a companion you talk to, not type at.


What Changed

v1's economics were a house of cards. Selfie generation was mispriced by two orders of magnitude. Memory backend ran on the most expensive model at the highest frequency. Context caching wasn't implemented. The five-tier structure was a band-aid over a cost structure that didn't work.

v2 doesn't fix the old cost structure. It sidesteps it. The pivot to an orb-based, non-selfie product eliminates the single biggest cost driver. Switching memory tasks from Pro to Flash eliminates the second biggest. Context caching handles the third. Three architectural decisions, and the margin goes from razor-thin to healthy.

The lesson is one I keep relearning: the best cost optimization is often a product decision, not an engineering one. I didn't need a cheaper image generation model. I needed to stop generating images. I didn't need a faster personality extractor. I needed to run it less often on a cheaper model.

A single-tier subscription for a companion, delivered at a fraction of v1's cost to serve. The math works — until realtime voice arrives and changes everything again.


This is Part 5 of the Rebuilding Mio series. Previous: Part 1 — The Pivot. For the v1 cost forensics that motivated many of these decisions, see The Token Bill and v0.2.0: The App Ships.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0