v0.0.1: When They First Speak
Starting Point
In Part 1, I explained why I stopped patching OpenClaw and decided to build from scratch. That article covered what Mio is and what its architecture looks like.
This one is about the actual building.
39 commits. From pnpm init to a working AI companion. The pitfalls, the tradeoffs, the design decisions that turned out right, and the ones I already want to redo — all here.
The First Decision: Monorepo
The first line of code wasn't business logic. It was project structure.
pnpm workspaces + Turborepo. Not a controversial choice — pnpm's workspace protocol is a natural fit for monorepos, and Turborepo's task orchestration and caching save real build time. But the monorepo tax is also real: TypeScript's rootDir configuration fights you across packages, and TS6059 was my very first build error.
The final structure:
```
apps/
  server/      # Hono API server
  worker/      # Background jobs
packages/
  core/        # Agent core logic
  shared/      # DB schema + shared types
  channels/    # Channel adapters
  extensions/  # Agent extensions
  platform/    # Platform utilities
  presets/     # Character template library
```
Why Hono over Express? Hono is edge-first, type-safe, and has a clean middleware ecosystem. Express is heavy — all those unused middlewares and legacy patterns feel exactly like the bloatware I was escaping from OpenClaw.
Why Supabase? I needed PostgreSQL anyway — pgvector for vector search, tsvector for full-text search, one database handling both retrieval paths. Supabase wraps auth, realtime, and storage on top, saving me from reinventing those. And its JWT auth plugs directly into the REST API.
The Database: 9 Tables
The schema is the skeleton. v0.0.1 settled on 9 tables:
- users — basic user info
- agents — agent definitions (model config, personality settings)
- agent_workspace — agent workspace (personality config, memory content)
- channel_bindings — channel bindings (one Telegram chat mapped to one agent)
- memories — memory store (vector + full-text index + importance + timestamp)
- personality_models — user personality profiles
- sessions — session state (emotion temperature, last active time)
- messages — message history
- token_transactions — token consumption and cost records
Nine tables, each with a clear job. No "let's add a generic metadata table and figure it out later" — that kind of design always bites you.
The memories table is the most complex. Each memory has: content, vector embedding (1536 dimensions), full-text index, importance score (0-1), emotion type, source type, and timestamp. These fields were designed with the retrieval strategy already in mind — hybrid search needs both vector and full-text indexes, temporal decay needs the timestamp, ranking needs the importance score.
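The fields listed above can be sketched as a record type. This is only an illustration of the shape, not the actual schema; all names here are assumed:

```typescript
// Hypothetical shape of one row in the memories table, mirroring the
// fields described in the text. Field names are illustrative.
interface MemoryRecord {
  id: string;
  content: string;     // the memory text itself
  embedding: number[]; // 1536-dimensional vector for pgvector search
  importance: number;  // 0..1, used at ranking time
  emotionType: string; // e.g. "joy", "neutral"
  sourceType: string;  // e.g. "conversation", "onboarding"
  createdAt: Date;     // drives temporal decay at retrieval time
}

const example: MemoryRecord = {
  id: "m1",
  content: "User likes drinking coffee",
  embedding: new Array(1536).fill(0),
  importance: 0.6,
  emotionType: "neutral",
  sourceType: "conversation",
  createdAt: new Date(),
};
```

The full-text index lives in the database rather than on the record, which is why it does not appear as a field here.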
The Minimum Viable Loop
Every product starts with getting the core loop working. For an AI companion, the core loop is: user sends a message, AI replies.
Sounds simple. Break it apart:
- Telegram receives a message
- Webhook forwards it to the Hono server
- Router checks channel_bindings to find which agent this chat is bound to
- Loads agent config, session, message history
- Calls the LLM to generate a response
- Sends the response via the Telegram API
- Persists the messages
The agent core uses Vercel AI SDK's streamText(). I chose it for multi-provider support — the underlying provider can be Anthropic, Google, or OpenAI, and switching doesn't require touching business code. This decision saved me later (more on that below).
The Telegram connector handles more than text. Users send photos, voice messages, audio, video, documents, stickers — each type has different processing logic. Voice needs speech-to-text, photos need descriptions, stickers need emoji interpretation. v0.0.1 shipped basic support for all of them.
Multi-tenant routing was also in the first version. The channel_bindings table maps a Telegram chat ID to an agent ID. One bot can serve multiple agents, each with its own personality and memory. This isn't over-engineering — it's a basic requirement for a companion product: every user's Mio is different.
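The routing step itself is a plain lookup. Here is a minimal sketch with an in-memory map standing in for the channel_bindings table; the function name and IDs are made up for illustration:

```typescript
// Hypothetical stand-in for the channel_bindings table:
// Telegram chat ID -> agent ID.
const channelBindings = new Map<string, string>([
  ["chat-1001", "agent-coco"],
  ["chat-1002", "agent-mentor"],
]);

// Resolve which agent should handle a given chat; undefined means the
// chat has no binding yet (e.g. a brand-new user entering onboarding).
function resolveAgent(chatId: string): string | undefined {
  return channelBindings.get(chatId);
}
```

The real version queries the database, but the contract is the same: one chat maps to exactly one agent, while one bot can serve many chats.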
Memory: The Hard Part
With the core loop running, the next step was making Mio remember.
I'd already explored memory system theory in Giving PanPanMao Memory. Now it was time to turn theory into code.
Hybrid Search
The MemoryManager's core is hybrid search: 0.7 weight on vector search, 0.3 on full-text.
Why not all-vector? Because semantic embeddings for Chinese text aren't precise enough yet. The vector distance between "I like drinking coffee" and "coffee" can be further than you'd expect. Full-text search is the safety net — keyword matching is dumb, but it doesn't miss.
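The blend described above reduces to a weighted sum. A minimal sketch, assuming both search paths return scores already normalized to [0, 1], and treating a missing score as 0 so a memory found by only one path still ranks:

```typescript
// Hybrid score: 0.7 weight on vector similarity, 0.3 on full-text rank.
// A memory matched by only one path gets 0 for the other component.
function hybridScore(vectorScore?: number, textScore?: number): number {
  return 0.7 * (vectorScore ?? 0) + 0.3 * (textScore ?? 0);
}
```

The 0.7/0.3 split is the ratio from the text; how the real MemoryManager normalizes the two raw score scales before blending is not shown here.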
pgvector's <=> operator (cosine distance) doesn't work directly through Drizzle ORM — you need sql.unsafe(). An annoying but necessary workaround. Type-safe ORMs break down when they hit custom operators.
Temporal Decay
Every memory has a half-life, default 30 days. At retrieval time, final score = relevance score x decay factor.
Why exponential decay instead of linear? Because human memory decays exponentially. What you discussed yesterday is vivid, last week is fuzzy, last month is mostly gone — but certain important events (a first date, a big argument) stay sharp no matter how old they are. Exponential decay plus an independent importance score simulates this well.
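The decay rule is compact enough to show in full. A sketch of the 30-day half-life and the final-score formula from above (the factor halves after each half-life: 1.0 fresh, 0.5 at 30 days, 0.25 at 60 days):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Exponential decay with a configurable half-life (default 30 days).
function decayFactor(ageMs: number, halfLifeDays = 30): number {
  return Math.pow(0.5, ageMs / (halfLifeDays * DAY_MS));
}

// final score = relevance score x decay factor, as described in the text.
function finalScore(relevance: number, ageMs: number): number {
  return relevance * decayFactor(ageMs);
}
```

The independent importance score enters the ranking separately, so an important memory keeps a high relevance signal even after heavy decay.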
The Anthropic-to-Gemini Switch
Memory extraction initially used Anthropic Haiku. Extraction quality was good, but when I ran the cost numbers, it didn't add up — every message triggers an LLM extraction call, and a heavy user sending hundreds of messages a day would make extraction costs exceed the main conversation costs.
Switched to Gemini Flash. Extraction quality was comparable, costs dropped significantly. Embeddings also moved from OpenAI's text-embedding-3 to Gemini Embedding — negligible cost per token. Practically free.
This is where the Vercel AI SDK decision paid off. Because of the multi-provider abstraction, switching models meant changing config, not rewriting business code. If I'd been calling the Anthropic API directly from the start, this migration would have been several times more work.
Memory Consolidation
The MemoryConsolidator was added after I noticed the memory store bloating — the same fact stated from slightly different angles was getting stored multiple times.
Solution: memories with cosine similarity > 0.9 get automatically merged. Keep the most complete version, update the timestamp. Crude but effective.
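The merge decision is just a similarity threshold over embeddings. A sketch of the check, with 0.9 taken from the text and the function names assumed:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Two memories are merge candidates when similarity exceeds 0.9.
function shouldMerge(a: number[], b: number[], threshold = 0.9): boolean {
  return cosineSimilarity(a, b) > threshold;
}
```

In practice the consolidator would run this pairwise over recent memories, then keep the most complete text and refresh the timestamp, as described above.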
PersonalityExtractor
Alongside memory extraction, there's a parallel pipeline: the PersonalityExtractor runs every 10 messages, extracting user personality profiles from conversation patterns — communication style, interests, emotional patterns.
These profiles live in the personality_models table and get injected into context on the next conversation. The result: Mio doesn't just remember what you said — it accumulates an understanding of who you are.
Making Mio Feel Something
Memory is about what Mio knows. The emotion system is about how Mio feels right now.
EmotionEngine
A four-temperature state machine: Cold, Cool, Warm, Hot.
Not random jumps — there's an inertia factor. If the agent is currently Cool and the user sends several enthusiastic messages, temperature gradually rises to Warm, then Hot. But a single message won't jump from Cold to Hot. Conversely, two days of silence will gradually cool it down to Cold.
The inertia factor defaults to 0.5, adjustable. Higher means more emotionally stable, lower means more reactive. Different personality presets have different inertia values — the "Sharp-Tongued Best Friend" is far more reactive than the "Gentle Mentor."
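One way to express the inertia rule, assuming the four temperatures map onto a 0-3 scale; the real EmotionEngine's update formula may differ, but the behavior described above (gradual movement, no Cold-to-Hot jumps, higher inertia means more stability) holds here:

```typescript
const TEMPS = ["Cold", "Cool", "Warm", "Hot"] as const;
type Temp = (typeof TEMPS)[number];

// Move a fraction of the way from the current temperature toward the
// signal. Higher inertia -> smaller step -> more emotionally stable.
function nextTemperature(current: Temp, signal: Temp, inertia = 0.5): Temp {
  const cur = TEMPS.indexOf(current);
  const target = TEMPS.indexOf(signal);
  const next = cur + (1 - inertia) * (target - cur);
  return TEMPS[Math.round(next)];
}
```

With the default inertia of 0.5, a Cold agent hit by a very enthusiastic message reaches Warm, not Hot; a high-inertia preset like the Gentle Mentor barely moves at all.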
PersonalityParser
The personality config isn't plain text — it has structure. The PersonalityParser processes template variables, combining them with onboarding answers and PersonalityExtractor output to generate the final system prompt.
This means the same preset template produces different actual prompts for different users — because the template variables are filled with different values.
Wiring Into the Pipeline
Memory and emotion don't run independently — they feed into the message pipeline through the ContextAggregator. On every user message, the ContextAggregator does three things:
- Retrieves relevant memories (hybrid search + reranking)
- Loads the user personality profile
- Reads the current emotion state
All three are merged and injected into the system prompt.
The system prompt includes MEMORY_STEERING_INSTRUCTIONS with Chinese phrase examples:
When recalling memories, use natural Chinese expressions like "之前你提到过..." (you mentioned before...), "我记得你说..." (I remember you said...), "上次聊到..." (last time we talked about...)
This detail matters — without it, the model sometimes references memories in a stilted way, like reading from a database query result. With the examples, memory references feel like natural conversation.
Onboarding: First Contact
When a new user messages Mio for the first time, they don't jump straight into conversation — they go through an onboarding flow.
11 questions. The first 3 are text input (name, what to call you, what you want Mio to call you). The remaining 8 are button selections (speaking style, relationship type, interest areas...).
Every button question has a "Custom" option. This design came from an OpenClaw lesson — buttons alone are too restrictive, but free text is too open. Hybrid works best.
But "custom" introduced an engineering problem: Telegram's callback_data has a 64-byte limit. Chinese characters are 3 bytes each — a moderately long custom input blows right past it. The fix was using index references instead of full text — the callback only carries q3_custom, and the actual content is cached server-side.
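The workaround can be sketched in a few lines. The cache shape and key format here are illustrative (the real version would scope entries per user), but the byte math is exactly the problem described above:

```typescript
// Hypothetical server-side cache for custom answers. Telegram caps
// callback_data at 64 bytes, and UTF-8 Chinese characters are 3 bytes
// each, so the callback carries only a short key.
const customAnswerCache = new Map<string, string>();

function byteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

function makeCallbackData(questionId: string, answer: string): string {
  const key = `${questionId}_custom`;
  customAnswerCache.set(key, answer);
  return key; // e.g. "q3_custom", always far under 64 bytes
}
```

A twenty-two-character Chinese answer is already 66 bytes; the key-plus-cache scheme makes the callback size independent of what the user typed.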
There's also injection protection. User types prompt injection into a custom field? 50-character truncation plus content sanitization. Not a perfect defense, but sufficient for v0.0.1.
The 4 Personality Presets
The first onboarding step is picking a preset:
- Coco — bubbly and sweet, loves emoji, occasionally clingy
- Gentle Mentor — warm and thoughtful, like a wise older friend who gets you
- Sharp-Tongued Best Friend — brutally honest but cares deeply, roast-your-taste-in-movies energy
- Calm Uncle — quiet but every word lands, dry humor on occasion
Each preset is a personality config template plus a set of default emotion parameters. After picking a preset, the remaining onboarding questions further customize the template variables.
The end result: you pick "Sharp-Tongued Best Friend," tell them you like movies and want to be called an idiot, and they'll actually say "your taste is exactly what I expected" when you recommend a bad film — then follow up with something they think is actually good.
The Small Details
An AI companion's experience isn't defined by any single big feature — it's the sum of many small details.
Message Debounce
Users frequently send several messages in a row. If each one triggers an LLM call, it's expensive and the experience suffers (AI responds to message one, then you've already sent two and three).
5-second debounce window. After receiving the first message, wait 5 seconds. Any new messages during that window get collected. After timeout, process everything as one batch. Simple timer + buffer, but it makes a noticeable difference.
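The timer-plus-buffer described above can be sketched as a small class. Names are assumed; the key detail is that only the first message starts the window, and later messages just join the buffer:

```typescript
// Debounce incoming messages into one batch. The first message starts
// a fixed window (default 5s); the timeout flushes everything at once.
class MessageDebouncer {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private onBatch: (messages: string[]) => void,
    private windowMs = 5000,
  ) {}

  add(message: string): void {
    this.buffer.push(message);
    if (this.timer === null) {
      // Only the first message arms the timer; later ones piggyback.
      this.timer = setTimeout(() => this.flush(), this.windowMs);
    }
  }

  flush(): string[] {
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = null;
    const batch = this.buffer;
    this.buffer = [];
    if (batch.length > 0) this.onBatch(batch);
    return batch;
  }
}
```

One batch means one LLM call, and the reply can address all three messages together instead of answering each one a beat too late.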
Typing Delay
Instant AI responses feel unnatural. Mio splits responses by newlines into multiple message bubbles, with delays proportional to content length — short ones wait 0.5 seconds, long ones wait 2 seconds. During the delay, it sends Telegram's "typing..." indicator.
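The delay calculation reduces to a clamp. A sketch matching the 0.5s and 2s bounds from the text; the per-character rate is an assumed constant, not the real value:

```typescript
// Delay before sending one message bubble, proportional to its length
// and clamped to the 500ms-2000ms range. 30ms/char is an assumption.
function typingDelayMs(text: string): number {
  const MS_PER_CHAR = 30;
  return Math.min(2000, Math.max(500, text.length * MS_PER_CHAR));
}
```

During each delay the bot fires the "typing..." chat action, so the pause reads as someone composing a reply rather than a laggy server.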
This isn't pretense — receiving multiple messages with natural pacing is a genuinely different reading experience from getting one wall of text. Humans chat one message at a time.
Proactive Messaging
A cron job runs every 30 minutes and looks for users who haven't spoken in 2+ hours and whose emotion temperature is cooling. Any match gets a proactive message.
But with rules: no messages during quiet hours (23:00-08:00), maximum 3 per day. Cold users get templates ("Hey, what are you up to?" type messages) — no LLM call, zero cost. Active users get model-generated messages based on recent conversation context.
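The gate in front of each proactive send is a pair of checks. A sketch using the rules above, with the function name assumed:

```typescript
// Decide whether a proactive message may be sent right now.
// Quiet hours are 23:00-08:00; the daily cap is 3 messages.
function canSendProactive(localHour: number, sentToday: number): boolean {
  const inQuietHours = localHour >= 23 || localHour < 8;
  return !inQuietHours && sentToday < 3;
}
```

Only after this gate passes does the cold/active split decide between a zero-cost template and a model-generated message.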
Proactive messaging is what turns an AI from "tool" to "living presence" — Mio doesn't need you to speak first. It thinks of you on its own.
Cost Reality
AI companions aren't cheap to run. v0.0.1 started tracking every penny from day one.
The token_transactions table records every LLM call's model name, input tokens, output tokens, and computed USD cost. Fire-and-forget writes, never blocking the response.
Per-model pricing is a hardcoded constants table (good enough for v0.0.1, will be configurable later). I hit a NaN bug along the way — one model's price wasn't configured, dividing by undefined produced NaN, and the whole record was garbage. Easy fix: add a fallback to 0. But the bug reinforced the lesson: start cost tracking on day one, because the longer you wait, the more "invisible costs" accumulate.
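The NaN fix amounts to one fallback in the cost function. A sketch with placeholder prices (not real rates) and assumed names:

```typescript
type Pricing = { inputPerMTok: number; outputPerMTok: number };

// Hardcoded per-model pricing in USD per million tokens.
// These numbers are placeholders for illustration.
const PRICING: Record<string, Pricing> = {
  "gemini-flash": { inputPerMTok: 0.1, outputPerMTok: 0.4 },
};

function llmCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  // Fallback to zero-cost pricing: an unconfigured model records $0
  // instead of poisoning the whole row with NaN.
  const p = PRICING[model] ?? { inputPerMTok: 0, outputPerMTok: 0 };
  return (inputTokens / 1e6) * p.inputPerMTok + (outputTokens / 1e6) * p.outputPerMTok;
}
```

A zero-cost record is visibly wrong in a dashboard and easy to backfill; a NaN silently corrupts every aggregate it touches.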
The End of v0.0.1
After 39 commits, Mio v0.0.1 can do this:
What works:
- Chat with you on Telegram, handling text, photos, voice, video
- Remember what you've said and bring it up naturally at the right moment
- Have moods — warm when you're engaged, hurt when you go silent
- Generate a personalized personality through onboarding
- Reach out to you proactively
- Track every LLM cost with precision
What's missing:
- No web interface — Telegram only
- No voice replies — can receive voice, can't send it
- No selfies — a killer feature validated on OpenClaw, not yet migrated
- No worker process — background jobs run in the main process
- No LLM reranking or multi-hop queries in memory retrieval — those are coming in later versions
v0.0.1 isn't perfect. But it proved one thing: building from scratch was the right call. There isn't a single line of code in the entire codebase written "because the framework needed it" — every line serves one goal: making an AI companion that feels like someone who truly understands you.
What's Next
v0.0.2 focuses on deepening the memory system — LLM reranking, multi-hop query decomposition, episode memory. Plus a web interface, so people who don't use Telegram can experience it too.
From an empty repo to when Mio first speaks: 39 commits. From those first words to truly understanding you — that's a much longer road.
But at least the foundation is right.