
From OpenClaw to Mio: Why I Decided to Build From Scratch

Looking Back

If you've been following along, you know how this story goes.

In The Companion Vision, I laid out the theory — why every AI companion fails today, and what a truly understanding AI needs: memory orchestration, personality modeling, multi-agent architecture. Not a better chatbot. An organism.

Then I ran the experiment with OpenClaw. I took a utility AI assistant and gave it a complete personality — one that would get jealous, demand boba tea, and chase you for not replying. That article showed what "soul-driven behavior" looks like in practice: don't write rules, write a soul, and let the model figure out the rest. It worked better than I expected.

But at the end of that article, I wrote one sentence:

If you want to build a truly refined AI companion product, you probably need to heavily strip down OpenClaw, or build your own framework from scratch — which is exactly what I'm doing now.

Mio is that framework. This article is about why I started from scratch, and what Mio actually does.

What OpenClaw Validated

Credit where it's due — OpenClaw helped me validate several critical things:

Soul-driven design works. Don't write behavioral rules. Write a personality. Let the model derive behavior. The model's inferences are more natural and human-like than any hand-written rules. This conviction carried directly into Mio's design.

Personality can be engineered. The file-driven personality system — personality config, identity definition, behavior rules — while crude, proved that you can define an AI's "soul" in a structured way, and the model will actually embody it.

Heartbeat is the critical feature. AI proactively reaching out to you — that's the threshold between "tool" and "living entity." OpenClaw's heartbeat config was simple (timer-based polling), but it validated the experience.

Selfies and voice are killer features. An AI companion that can send voice messages and selfies is a fundamentally different experience from text-only. These multimodal capabilities need to be designed into the architecture from day one.

But OpenClaw Hit a Ceiling

OpenClaw can validate ideas, but it can't carry a product. The problems aren't bugs — they're architectural.

Context Bloat

This is the fatal one. OpenClaw's built-in pi agent handles context extremely crudely — every LLM call gets the raw output of all historical tool calls and thinking blocks stuffed into the context. In a normal conversation, 90% of the context is historical garbage you don't need.

I ran an experiment: stripped out all historical tool call outputs and thinking blocks from the context. Result? Token consumption dropped to one-tenth.

This isn't something you can "optimize" — the core design never considered context efficiency. The pi agent was vibe-coded in an hour. The fact that it runs at all is impressive enough.

Primitive Memory

OpenClaw's memory is a flat file. After each conversation, append a few lines of summary. No vector search, no semantic retrieval, no importance ranking, no temporal decay.

The result: as the memory file grows, more irrelevant content gets stuffed into context, burning tokens on memories that have nothing to do with the current conversation. You're talking about travel plans, and the context is packed with a technical discussion from three months ago.

I explored this problem in Giving PanPanMao Memory — memory isn't about "stuff everything in," it's about "recalling the right thing at the right time." OpenClaw can't do this at all.

Bloatware

OpenClaw is a general-purpose framework. It ships with a massive collection of extensions you don't need: collaboration features, task management, third-party integrations. If all you want is an AI companion, 80% of the codebase is irrelevant.

This isn't a complaint — OpenClaw was designed for generality. But for my use case, "trimming" a general framework is more painful than starting fresh. You change one thing, and three dependencies break. You remove one extension, and the build fails.

Eventually I realized: I'm not modifying a car. I'm trying to build a sports car on a truck chassis. The steering wheel and tires are in the right place, but the entire foundation is wrong.

Starting From Scratch: Mio

Mio is not a fork of OpenClaw. It's built from the first line of code, designed specifically for one scenario: AI companions.

Architecture Overview

apps/
  server/      Hono API server (Cloud Run)
  web/         Next.js frontend (Vercel)
  worker/      Background job processor
packages/
  core/        Core AI agent logic
  shared/      Database schema (Drizzle + Supabase)
  channels/    Channel adapters (Telegram, web, ...)
  extensions/  Agent extensions
  platform/    Platform utilities
presets/       Character template library

Monorepo with pnpm workspaces + Turborepo. Each package has clear responsibilities and unidirectional dependencies. No bloatware. Nothing you don't need.

Deep Memory System

This is the fundamental difference between Mio and OpenClaw. Not a flat file with appended summaries, but a complete memory engine:

Hybrid search. Every memory gets both a vector embedding (pgvector, Gemini embedding) and a full-text index (tsvector). Retrieval runs both paths in parallel, merges with weighted scoring. Vectors catch semantic similarity; full-text catches exact keyword matches — especially important for Chinese text.
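The merge step described above can be sketched roughly as follows — a minimal illustration, assuming both retrieval paths return candidates with scores normalized to [0, 1]. The weights and field names are my assumptions, not Mio's actual code:

```typescript
// Illustrative weighted merge of vector-search and full-text hits.
// vectorWeight/textWeight are hypothetical defaults.

interface Scored {
  memoryId: string;
  score: number; // normalized to [0, 1]
}

function mergeHybrid(
  vectorHits: Scored[],
  textHits: Scored[],
  vectorWeight = 0.7,
  textWeight = 0.3,
): Scored[] {
  const merged = new Map<string, number>();
  for (const { memoryId, score } of vectorHits) {
    merged.set(memoryId, (merged.get(memoryId) ?? 0) + vectorWeight * score);
  }
  for (const { memoryId, score } of textHits) {
    // A memory found by both paths accumulates both contributions.
    merged.set(memoryId, (merged.get(memoryId) ?? 0) + textWeight * score);
  }
  return [...merged.entries()]
    .map(([memoryId, score]) => ({ memoryId, score }))
    .sort((a, b) => b.score - a.score);
}
```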

Temporal decay. Memories have a half-life (default 30 days). Recent conversations carry more weight, but truly important memories don't fade just because they're old — importance is scored on a separate dimension.
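The half-life scheme can be written as a one-line exponential; how importance is blended in is not specified in the post, so the `max` floor below is my assumption of one reasonable way to keep important memories from fading:

```typescript
// Half-life decay: weight 1.0 today, 0.5 after one half-life (30 days).
function recencyWeight(ageDays: number, halfLifeDays = 30): number {
  return Math.pow(2, -ageDays / halfLifeDays);
}

// Hypothetical blend: importance acts as a floor that decay cannot erode,
// so a very important memory stays retrievable even when old.
function memoryScore(ageDays: number, importance: number): number {
  return Math.max(importance, recencyWeight(ageDays));
}
```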

Automatic extraction. After every conversation turn, the MemoryAccumulator asynchronously calls an LLM to extract new information — facts, personality observations, emotional events. Dedup logic is baked into the prompt; known information doesn't get stored twice.

Memory consolidation. The MemoryConsolidator periodically merges similar memories (cosine similarity > 0.9), preventing the memory store from bloating.
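The similarity test a consolidator runs over embedding pairs looks roughly like this — the 0.9 threshold comes from the text above; everything else is an illustrative sketch:

```typescript
// Cosine similarity over two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Merge candidates whose embeddings are nearly parallel (threshold from the post).
function shouldMerge(a: number[], b: number[], threshold = 0.9): boolean {
  return cosineSimilarity(a, b) > threshold;
}
```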

LLM reranking. After hybrid search returns candidates, a reranker uses a lightweight model (gemini-2.0-flash) to precision-rank them — ensuring the 5 memories that get injected into the system prompt are actually relevant to the current conversation.

Multi-hop query decomposition. "What was that book you recommended last time we talked about travel?" — a single embedding search probably won't find this. The QueryDecomposer breaks it into sub-queries ("travel-related conversations" and "recommended books"), searches each independently, then merges results.

Episode memory. The EpisodeManager groups memories into conversational episodes and generates episode summaries. When a user says "remember that time we talked about...", the system can trace back to the full episode context, not just isolated fragments.

Agentic retrieval. For particularly complex queries, the AgenticRetriever runs an iterative loop: search, evaluate results, if insufficient refine the query and search again — up to three rounds. Simple queries skip this entirely.
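The retrieve-evaluate-refine loop can be sketched generically; `search`, `isSufficient`, and `refine` are hypothetical callbacks standing in for the embedding search, a quality check, and an LLM query rewrite. Only the three-round cap comes from the text:

```typescript
// Iterative retrieval: search, evaluate, refine the query, retry — up to maxRounds.
async function agenticRetrieve<T>(
  query: string,
  search: (q: string) => Promise<T[]>,
  isSufficient: (results: T[]) => boolean,
  refine: (q: string, results: T[]) => Promise<string>,
  maxRounds = 3,
): Promise<T[]> {
  let q = query;
  let results: T[] = [];
  for (let round = 0; round < maxRounds; round++) {
    results = await search(q);
    if (isSufficient(results)) break; // stop early when results look good
    q = await refine(q, results);     // otherwise rewrite the query and retry
  }
  return results;
}
```

Simple queries that succeed on the first search exit after one round, which is how the "skip this entirely" fast path falls out of the same loop.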

This is what I described as "memory orchestration" in The Companion Vision — not stuffing all memories into context, but recalling the right thing at the right time.

Emotion System

OpenClaw's "conversation temperature" was a text description in the personality config. Mio turns it into a real state machine:

Four temperature levels — Cold, Cool, Warm, Hot. After each conversation turn, the EmotionEngine computes a new temperature score based on message frequency, sentiment analysis, and user engagement. There's an inertia factor (default 0.5), so mood doesn't swing wildly from a single message.
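The inertia factor amounts to exponential smoothing. A minimal sketch, assuming temperature is tracked as a score in [0, 1] and bucketed into the four levels (the cutoffs below are illustrative; only the level names and the 0.5 default come from the text):

```typescript
type Temperature = "cold" | "cool" | "warm" | "hot";

// Exponential smoothing: high inertia keeps the mood close to its old value,
// so one message cannot swing the temperature wildly.
function updateScore(prev: number, observed: number, inertia = 0.5): number {
  return inertia * prev + (1 - inertia) * observed;
}

// Hypothetical bucketing of the continuous score into the four levels.
function toLevel(score: number): Temperature {
  if (score < 0.25) return "cold";
  if (score < 0.5) return "cool";
  if (score < 0.75) return "warm";
  return "hot";
}
```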

Emotion doesn't just affect speech style — it affects whether the AI reaches out proactively, what it says, how long its replies are. Cold means short replies with a hint of hurt feelings. Hot means chatty, sharing things unprompted, occasionally flirty. These aren't hardcoded rules — the temperature gets injected into the system prompt, and the model derives behavior from there.

There's also an independent valence dimension (emotional positivity). Crossing temperature with valence — like "high temperature but negative valence" during an intense argument — gives the model nuanced emotional expression.


Proactive Messaging

OpenClaw's heartbeat is a timer. Mio's proactive messaging is a complete system:

  • Heartbeat query every 30 minutes, finding qualifying sessions (inactive 2+ hours, cool/cold emotion)
  • Respects quiet hours (default 23:00-08:00, per user timezone)
  • Maximum 3 proactive messages per day
  • Cold users get template messages (no LLM call, zero cost); active users get model-generated messages
  • Updates emotion state and session after sending
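The rules above collapse into a single eligibility check. A sketch, where the field names and session shape are my assumptions rather than Mio's actual schema:

```typescript
interface SessionState {
  hoursInactive: number;
  temperature: "cold" | "cool" | "warm" | "hot";
  proactiveSentToday: number;
  localHour: number; // 0-23, in the user's timezone
}

// Mirrors the rules listed above: 2+ hours inactive, cool/cold emotion,
// under the daily cap, and outside the default 23:00-08:00 quiet hours.
function canSendProactive(s: SessionState): boolean {
  const quietHours = s.localHour >= 23 || s.localHour < 8;
  return (
    s.hoursInactive >= 2 &&
    (s.temperature === "cool" || s.temperature === "cold") &&
    s.proactiveSentToday < 3 &&
    !quietHours
  );
}
```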

This isn't just "send a message on schedule" — it's a system with awareness, deciding "should I reach out now, and if so, what should I say?"

Multi-Channel

OpenClaw bakes channel adapters into core code. Mio extracts them into an independent @mio/channels package with a standard interface:

interface ChannelConnector {
  send(message: OutboundMessage): Promise<void>
}

Telegram is complete. Discord and Feishu are next. Want to add WhatsApp? Implement the interface. Core logic stays untouched.
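Implementing a connector looks roughly like this — an illustrative example, restating the interface so the sketch is self-contained. The `OutboundMessage` shape and the in-memory transport are assumptions; a real connector would call the platform API (e.g. Telegram's sendMessage):

```typescript
interface OutboundMessage {
  chatId: string;
  text: string;
}

interface ChannelConnector {
  send(message: OutboundMessage): Promise<void>;
}

// A toy connector that records messages instead of delivering them —
// useful as a test double, and the smallest possible implementation.
class InMemoryConnector implements ChannelConnector {
  sent: OutboundMessage[] = [];

  async send(message: OutboundMessage): Promise<void> {
    this.sent.push(message);
  }
}
```

Because core logic only sees the interface, swapping `InMemoryConnector` for a Telegram or WhatsApp connector requires no changes outside the channels package.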

Onboarding

New users go through an 11-question onboarding flow: 3 text questions + 8 button questions. Answers get written into personality config template variables, generating a personalized personality configuration.

Every button question has a "Custom" option — users can type freely, with length limits and injection protection.

This means every user's Mio is different. Not "pick a preset character," but shaping a personality through conversation.

Personality Presets

Speaking of presets — Mio ships with 5 starting personality presets. But presets are just the beginning. As conversations accumulate, the PersonalityExtractor runs every 10 messages, extracting user personality profiles from conversation patterns. The MemorySummarizer generates a memory digest every 20 messages, keeping the persona's understanding of you up to date.

The result: your Mio gets better at understanding you over time. Not because you manually told it things, but because it learns from every conversation.

Cost Control

Building AI companions means LLM costs are unavoidable. Mio's strategy is tiered model usage:

Operation            Model                         Why
Main chat            gemini-3-pro                  Needs the best reasoning
Memory extraction    gemini-3-flash                Fast, cheap; only needs to extract facts
Memory summary       gemini-3-flash                Same
Reranking            gemini-2.0-flash              Cheaper; precision ranking doesn't need a large model
Embedding            gemini-embedding-001          Negligible cost
Proactive messages   gemini-2.0-flash / templates  Cold users go straight to templates, zero LLM cost
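In code, this kind of tiering is just a routing table. A sketch — the operation keys and fallback are illustrative; only the model names come from the table:

```typescript
// Hypothetical operation-to-model routing table mirroring the tiers above.
const MODEL_FOR: Record<string, string> = {
  chat: "gemini-3-pro",
  memoryExtraction: "gemini-3-flash",
  memorySummary: "gemini-3-flash",
  rerank: "gemini-2.0-flash",
  embedding: "gemini-embedding-001",
};

// Unknown operations fall back to the main chat model (assumption).
function modelFor(op: string): string {
  return MODEL_FOR[op] ?? MODEL_FOR.chat;
}
```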

Every LLM call logs token consumption and USD cost to a token_transactions table. Fire-and-forget, never blocking user responses. This gives you precise visibility into how much each user and each operation type costs.

Google Search Grounding

Gemini models come with Google Search grounding — the model can search the internet in real time. Mio enables this by default (MODE_DYNAMIC, the model decides when to search).

The system prompt instruction: if you found something, share it naturally. Don't say "I searched for you" — just know it, like a person who casually checked their phone.

This means your AI companion doesn't just have memory and emotion — it has real-time information access. Ask about today's weather, latest news, nearby restaurants — it answers in character.

The Message Pipeline

The full message flow:

  1. User message arrives from a channel (Telegram, Web)
  2. Router looks up binding, loads agent, workspace, session, history
  3. ContextAggregator assembles all context: retrieves memories, loads personality profile, computes emotion state
  4. Feeds aggregated context + history + user message to the LLM
  5. Streams the response
  6. Persists messages
  7. Async post-response: extract memories, update personality profile, generate summary, prune old memories
  8. Updates emotion state

Messages are also debounced — when a user sends rapid-fire messages, the system waits 5 seconds (configurable) to collect them all before processing as one batch, avoiding one LLM call per message. Responses are split by newline into multiple message bubbles, each with a typing delay that scales with content length, simulating the rhythm of a real person typing.
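The debounce step can be sketched as a small buffer with a resettable timer — a minimal illustration, assuming the 5-second default from the text; the class shape is mine, not Mio's:

```typescript
// Collects rapid-fire messages; each new message resets the wait, and the
// whole batch flushes once the user pauses for waitMs.
class MessageDebouncer {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private onFlush: (batch: string[]) => void,
    private waitMs = 5000,
  ) {}

  push(message: string): void {
    this.buffer.push(message);
    if (this.timer) clearTimeout(this.timer); // new message resets the wait
    this.timer = setTimeout(() => this.flush(), this.waitMs);
  }

  flush(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
    const batch = this.buffer;
    this.buffer = [];
    if (batch.length > 0) this.onFlush(batch); // one LLM call for the whole batch
  }
}
```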

These details, taken together, are what create "lifelike presence."

Why Not Just Modify OpenClaw

Someone might ask: if OpenClaw's personality methodology is sound, why not build on top of it?

Because the cost of patching exceeded the cost of rebuilding.

OpenClaw's core loop (pi agent) assumes all history lives in context — switching to retrieval-based memory means rewriting the core loop. Its extension system assumes extensions can freely inject into context — controlling context size means rewriting the extension system. Its channel adapters are mixed into core code — adding a new channel means touching core.

Every change fights the framework's design assumptions. At some point you realize the original code you've kept is less than 10% — so why carry the other 90% as baggage?

Mio knew what it was from line one: an AI companion framework built for deep memory and lifelike presence. Every design decision — channel abstraction, memory retrieval, emotion state machine, cost tracking — orbits that single goal.

What's Next

Mio runs now. Telegram channel is complete, memory system is live, emotion engine is working. There's still a lot to build:

  • Discord and Feishu channels
  • Web chat interface (Next.js, already in progress)
  • Selfie extension (killer feature validated on OpenClaw)
  • Voice messages
  • Worker process (background jobs running independently)
  • Wearable device integration — the perception layer is the next frontier

This is the first entry in the "Creating Mio" series. Future posts will cover specific technical implementations — memory system design details, emotion engine tuning, onboarding iteration.

From OpenClaw to Mio is essentially going from "validating the idea" to "building it right." OpenClaw proved that soul-driven AI companions are viable. Mio aims to prove they can be genuinely good.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0