
Two Repos, One Soul


Clawd Soul · Part 4 of 5


Part 3 covered making AI useless. This final post covers what holds it all together: two repos, two npm dependencies, 2,000 lines of code.


Three core claims:

  1. Separating body and soul into two repos was a day-one decision: zero AI logic in the Electron shell, with the soul engine running as an independent HTTP service, which enables multi-platform clients and soul portability
  2. Two npm dependencies (better-sqlite3 + sqlite-vec) isn't laziness. It's a deliberate choice: every abstraction layer is a place where the character can leak through as "AI behavior"
  3. Seven different AI coding agents connect to one pet simultaneously, each with a different integration pattern, from command hooks to in-process plugins

1. Zero Lines of AI Code in the Body

The entire system is two repositories. One manages the body. One manages the brain.

clawd-on-desk (Electron) = the body:

  • 12 animated states: idle, thinking, typing, building, juggling, conducting, error, happy, notification, sweeping, carrying, sleeping
  • Click, drag, speech bubbles, chat window
  • Permission bubbles: in-app Allow/Deny, not terminal prompts
  • Mini mode: hides at screen edge, peeks on hover
  • Eye tracking: 20fps cursor following, 3px max offset, quantized to a 0.5px grid
  • Sleep sequence: yawning → dozing → collapsing → sleeping, triggered after 60 seconds of inactivity
  • Theme system: pixel crab Clawd + calico cat Calico

Zero lines of AI logic. Not "minimal AI logic." Zero.

The eye tracking deserves a mention because it's the feature that makes people say "wait, it's looking at me." The pet's eyes follow your cursor at 20 frames per second. The maximum offset is 3 pixels: enough to be noticeable, small enough to be subtle. The position is quantized to a 0.5px grid, which sounds like an irrelevant detail until you realize that without quantization, the eyes jitter on sub-pixel movements. The grid gives the tracking a slightly mechanical quality that reads as "small creature carefully watching" rather than "software interpolating coordinates." These are pixel-art-scale decisions. At this resolution, half a pixel matters.
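
The clamp-then-quantize idea can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names and the direction-based formula are assumptions, and only the 3px limit and the 0.5px grid come from the description above.

```javascript
const MAX_OFFSET_PX = 3; // from the post: 3px max offset
const GRID_PX = 0.5;     // from the post: 0.5px quantization grid

// Snap a value to the nearest multiple of GRID_PX.
function quantize(value) {
  return Math.round(value / GRID_PX) * GRID_PX;
}

// Returns how far the pupil should shift from its rest position.
// `eye` and `cursor` are {x, y} points in the same coordinate space.
function eyeOffset(eye, cursor) {
  const dx = cursor.x - eye.x;
  const dy = cursor.y - eye.y;
  const distance = Math.hypot(dx, dy);
  if (distance === 0) return { x: 0, y: 0 };
  // Keep the direction toward the cursor, but cap the length at MAX_OFFSET_PX.
  const length = Math.min(distance, MAX_OFFSET_PX);
  return {
    x: quantize((dx / distance) * length),
    y: quantize((dy / distance) * length),
  };
}
```

At 20fps a function like this would run roughly every 50ms. Because the result only changes when the cursor crosses a half-pixel boundary, tiny mouse movements produce no visible change.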

The sleep sequence is another body-only behavior. After 60 seconds of no mouse or keyboard activity, the pet starts yawning. Then it dozes, head drooping, eyes half-closed. Then it collapses onto its side. Then it's fully asleep. Four distinct animation states for a single transition. You come back from a coffee break and the pet is sprawled out asleep on your taskbar. Drag it and it wakes up startled. None of this involves the soul. It's pure animation logic responding to system idle time.
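
One way to express that sequence is a pure function from idle time to animation stage. This is a sketch under assumptions: the 60-second threshold is from the post, but the per-stage durations and the "awake" label are made up for illustration.

```javascript
const IDLE_THRESHOLD_MS = 60_000; // from the post: 60 seconds of inactivity

// Hypothetical stage lengths; the post does not give per-stage timings.
const STAGES = [
  { name: "yawning", durationMs: 3_000 },
  { name: "dozing", durationMs: 10_000 },
  { name: "collapsing", durationMs: 2_000 },
];

// Maps elapsed idle time to the animation the body should play.
function sleepStage(idleMs) {
  if (idleMs < IDLE_THRESHOLD_MS) return "awake";
  let elapsed = idleMs - IDLE_THRESHOLD_MS;
  for (const stage of STAGES) {
    if (elapsed < stage.durationMs) return stage.name;
    elapsed -= stage.durationMs;
  }
  return "sleeping"; // every transitional stage has finished
}
```

In an Electron app the idle time could plausibly come from `powerMonitor.getSystemIdleTime()`, which reports seconds. Any input resets the idle time to zero, so the function returns "awake" again without extra wake-up logic.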

clawd-soul (HTTP on port 23456) = the brain:

  • Screen reading via vision API, 1920x1080 JPEG at quality 85
  • Personality system: prose character files, five archetypes
  • Memory: SQLite + vector search, three tiers
  • Conversation storage: JSONL with auto-compaction at ~500K tokens
  • Mood and trust engine
  • Daily diary at 23:00, memory consolidation ("dreaming") at 23:30
  • 11 source files, roughly 2,000 lines of code

This split isn't an organizational preference. It's a product decision.

Look at the body's feature list again. Eye tracking, sleep sequences, permission bubbles, mini mode. That's a lot of functionality. But none of it requires knowing what a language model is. The body receives animation commands over HTTP ("play typing," "play sleeping," "show speech bubble with this text") and renders them. It doesn't know why it's typing. It doesn't know what the speech bubble means. It's a puppet. The soul pulls the strings.
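
To make the "puppet" idea concrete, here is a minimal sketch of what the body's side of that contract could look like. The command shape (`type`, `state`, `text`) is hypothetical; the post does not document the real wire format. Only the list of 12 states is taken from above.

```javascript
// The 12 animation states listed in the post.
const STATES = new Set([
  "idle", "thinking", "typing", "building", "juggling", "conducting",
  "error", "happy", "notification", "sweeping", "carrying", "sleeping",
]);

// Hypothetical command shape: { type: "play", state } or { type: "bubble", text }.
// Returns a render instruction, or null if the command is not understood.
function handleCommand(rawBody) {
  let command;
  try {
    command = JSON.parse(rawBody);
  } catch {
    return null; // not JSON: ignore it
  }
  if (command === null || typeof command !== "object") return null;
  if (command.type === "play" && STATES.has(command.state)) {
    return { render: "animation", state: command.state };
  }
  if (command.type === "bubble" && typeof command.text === "string") {
    // The text passes through untouched: no interpreting, reformatting, or truncating.
    return { render: "bubble", text: command.text };
  }
  return null;
}
```

The handler validates and forwards, and that is all it does. Nothing in it needs to know where the command came from or why.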

This matters because the moment you put AI logic in the rendering layer, the rendering layer starts making decisions about the character. It starts interpreting responses, formatting them, maybe truncating them for display. Each of those decisions is a place where the character can break, where the pet stops feeling like a creature and starts feeling like a UI.


2. The Soul Outlives the Body

Why two repos instead of one? Three reasons.

Soul portability. Export a save file from one machine, import it on another. Your pet still recognizes you. Memory, personality drift, trust level: everything lives in the soul. The body is a shell. This is the digital equivalent of moving to a new apartment with your cat. The cat doesn't care about the apartment. The cat cares about the relationship.

Multi-platform without friction. The soul engine is a plain HTTP service. It runs on anything with Node.js. Today the client is Electron on desktop. Tomorrow it could be an iOS app, an Android widget, a browser tab. They all connect to the same soul on localhost:23456. Bodies are disposable. The soul persists.

Think about what this means in practice. Someone builds a web-based pet viewer. Someone else builds a mobile widget that shows the pet's mood. Someone writes a CLI that lets you chat with your pet from the terminal. None of these people need to understand the memory system, the personality engine, or the prompt architecture. They just need to know how to call an HTTP API. The soul is a black box that takes context in and returns character out.

Independent evolution. The body gets new animations, new themes, new interaction patterns, and none of that touches AI logic. The soul gets better memory, refined prompts, new personality dimensions, and none of that touches UI code. The two repos have completely independent release cadences. A contributor can add a new animation state to the body without understanding anything about how the soul works, and vice versa.

The deeper reason: bodies go obsolete. Electron might get replaced. Desktop pets might go out of fashion. But the soul (who your pet is, what it remembers about you, how your relationship evolved over months) shouldn't die with a UI framework. Binding body and brain into one repo is a bet that there will only ever be one form factor. That's a bet I'd lose.


3. Two Dependencies Is Not Laziness

The soul engine's package.json has two entries under dependencies:

Dependency        Purpose
better-sqlite3    SQLite bindings
sqlite-vec        Vector search extension for SQLite

Everything else is Node.js built-ins:

Capability          Module
HTTP server         node:http
AI provider calls   node:https (raw HTTPS requests, no SDKs)
File I/O            node:fs
Path handling       node:path
Cryptography        node:crypto

Four AI providers (Azure OpenAI, OpenAI, Google Gemini, Anthropic Claude), all called via raw HTTPS. No openai package. No @anthropic-ai/sdk. No @google/generative-ai. Just HTTP requests.

Why?

Every abstraction layer is a place where the character can leak through as "AI behavior." SDKs bring their own retry patterns, error formatting, response structures, timeout handling. These are invisible to most developers, but users feel them: the conversation subtly starts to feel like "a system talking to you" rather than "a pet talking to you." SDK error messages surface through the character. Retry backoff patterns create unnatural pauses. Structured response parsing imposes a rhythm that doesn't match the personality.

The thinner the stack between the personality layer and the user, the more the character comes through unfiltered.

There's a practical benefit too. Two dependencies means two things that can break. Two things that can have security advisories. Two things that can ship breaking changes. The entire soul engine is auditable, every line of it, in a single afternoon.

I know what this sounds like. "Not invented here syndrome." "Reinventing the wheel." Except I'm not reinventing anything. The OpenAI API is an HTTPS POST with a JSON body. The response is a JSON object with a choices array. Writing that call directly is about 40 lines of code. The openai npm package is 15,000+ lines. Those extra 14,960 lines do useful things, but none of them are useful for a pet that needs to feel like a living creature, not a software integration.
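
For a sense of scale, a direct call might look like the sketch below. It is not the soul engine's actual code: the function names, the 30-second timeout, and the choice to resolve with null on every failure are assumptions for illustration. The endpoint, the Bearer header, and the `choices[0].message.content` response path are OpenAI's documented chat completions API.

```javascript
const https = require("node:https");

// Build the request options and JSON body for the chat completions endpoint.
function buildChatRequest(apiKey, model, messages) {
  const body = JSON.stringify({ model, messages });
  const options = {
    method: "POST",
    hostname: "api.openai.com",
    path: "/v1/chat/completions",
    headers: {
      "Content-Type": "application/json",
      "Content-Length": Buffer.byteLength(body),
      Authorization: `Bearer ${apiKey}`,
    },
  };
  return { options, body };
}

// Pull the reply text out of a parsed response; null if the shape is unexpected.
function extractReply(json) {
  const content = json?.choices?.[0]?.message?.content;
  return typeof content === "string" ? content : null;
}

// Send the request. Resolves with the reply text, or null on any failure,
// so the caller decides what a failure should look like to the user.
function chat(apiKey, model, messages, timeoutMs = 30_000) {
  const { options, body } = buildChatRequest(apiKey, model, messages);
  return new Promise((resolve) => {
    const req = https.request(options, (res) => {
      const chunks = [];
      res.on("data", (chunk) => chunks.push(chunk));
      res.on("error", () => resolve(null));
      res.on("end", () => {
        if (res.statusCode !== 200) return resolve(null);
        try {
          resolve(extractReply(JSON.parse(Buffer.concat(chunks).toString("utf8"))));
        } catch {
          resolve(null); // body was not valid JSON
        }
      });
    });
    req.on("error", () => resolve(null));
    req.setTimeout(timeoutMs, () => req.destroy()); // destroy() triggers "error"
    req.end(body);
  });
}
```

Splitting the request builder and the reply extractor out of the network call keeps both testable without touching the network.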

There's a subtler point here about error handling. When an SDK encounters a rate limit or a network timeout, it has opinions about what to do: exponential backoff, retry with jitter, structured error objects. These are sane defaults for a productivity tool. For a pet, they're wrong. If the AI provider is down, the pet should just... go quiet. Maybe look confused. Maybe fall asleep. The error state should feel like the creature is having a moment, not like the system encountered a 429. Raw HTTP gives you full control over how failures feel to the user.

Four providers also means automatic fallback. If Azure is slow, try OpenAI direct. If OpenAI is down, try Gemini. The pet doesn't care which model is generating its thoughts. It cares about staying in character. Multi-provider support with raw HTTP is trivial. With SDKs, you'd need four separate packages, each with its own initialization, authentication pattern, and response format. The abstraction that's supposed to simplify things actually makes the multi-provider story harder.
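
A fallback chain with a quiet failure mode could be sketched like this. The provider shape (`name`, `ask`) and the `fallbackState` field are hypothetical. The idea from the text is only that providers are tried in order and that total failure turns into a character state instead of an error message.

```javascript
// Each provider is a hypothetical { name, ask } pair, where ask(prompt)
// returns a promise for the reply text and rejects on failure.
async function askWithFallback(providers, prompt) {
  for (const provider of providers) {
    try {
      const text = await provider.ask(prompt);
      if (typeof text === "string" && text.length > 0) {
        return { text, provider: provider.name, fallbackState: null };
      }
    } catch {
      // Swallowed on purpose: no status codes or stack traces reach the character.
    }
  }
  // Every provider failed: the pet goes quiet instead of reporting an error.
  return { text: null, provider: null, fallbackState: "sleeping" };
}
```

There is no retry or backoff here by design. A slow or failing provider is skipped, and the caller sees either a reply or a state to animate.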


4. The Dual-Window Hack

This section is a collection of engineering war stories. They're not glamorous. They don't involve clever algorithms. They involve fighting the operating system until it cooperates.

Desktop pets have an unusual rendering challenge. The pet needs to float above all windows, have a transparent background, and be click-through everywhere except the pet sprite itself. Standard stuff for a game overlay. Nightmarish in Electron.

Clawd uses two independent top-level windows:

  • Render window: Large transparent surface, permanently click-through via setIgnoreMouseEvents(true). Handles SVG animation rendering and eye tracking only. You can click straight through it to whatever's behind.
  • Input window: Small opaque rectangle positioned exactly over the pet's hitbox. Focusable. Receives all pointer events: clicks, drags, hover.

The obvious approach is one window. We tried that first. It worked on macOS. On Windows, it didn't.

The bug: WS_EX_NOACTIVATE combined with a layered window and Chromium's child HWND creates a dead activation path after z-order changes. You click on the pet and nothing happens. The click hits a phantom area: the window manager thinks the region is interactive, but the activation chain is broken, so the event never reaches the renderer. Hours of debugging.

The fix: separate input into its own focusable window. The render window stays permanently click-through. The input window is small, always on top, positioned by the animation system to track the pet's current location. Two windows, synchronized frame-by-frame, pretending to be one entity.
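
The synchronization step boils down to translating the pet's hitbox into screen coordinates on every frame. Below is a minimal sketch of that translation; the function name and the hitbox shape are hypothetical, and the Electron calls appear only in comments so the snippet runs without Electron.

```javascript
// Convert the pet's hitbox (relative to the render window's top-left corner)
// into screen coordinates for the input window.
function inputWindowBounds(renderBounds, hitbox) {
  return {
    // Electron rectangles must use integers, so round to whole pixels.
    x: Math.round(renderBounds.x + hitbox.x),
    y: Math.round(renderBounds.y + hitbox.y),
    width: Math.round(hitbox.width),
    height: Math.round(hitbox.height),
  };
}

// In the Electron main process, usage might look like this (not executed here):
//   renderWin.setIgnoreMouseEvents(true);            // permanently click-through
//   inputWin.setBounds(inputWindowBounds(renderWin.getBounds(), hitbox));
```

Calling `setBounds` only when the computed rectangle actually changes would avoid redundant window moves between animation frames.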

Tradeoff: the input window steals focus when you click the pet. We accepted it. The alternative, drag being completely broken on Windows, was worse. Engineering is full of these moments: you trade one imperfection for another and pick the one your users will notice less.

This wasn't the only platform war story. Windows has a foreground window lock that prevents applications from stealing focus. The chat window needs to come to the foreground when the user clicks the pet. Solution: an ALT key trick combined with a koffi FFI call to AllowSetForegroundWindow, delegated to a PowerShell helper process. It works. It's ugly. It ships.

Then there was the language submenu truncation bug. Three hours of investigation led to a definitive conclusion: it's an incompatibility between Electron and Windows DWM. Unfixable. The codebase has a comment that says "DO NOT TOUCH" next to the relevant code. Sometimes the right engineering decision is to document the problem and walk away.

Anyone who's shipped a desktop app on Windows knows: half the bugs aren't engineering problems. They're archaeology problems.


5. Seven Agents, One Pet

The desktop pet tracks seven AI coding agents simultaneously:

Agent          Integration Pattern                   Latency
Claude Code    Command hook → HTTP POST              ~0ms
Codex CLI      JSONL log file polling                ~1.5s
Copilot CLI    Command hook (camelCase convention)   ~0ms
Gemini CLI     Session JSON polling                  ~1.5s + 4s completion window
Cursor Agent   stdin/stdout JSON hook                ~0ms
Kiro CLI       Agent config injection                ~0ms
opencode       In-process Bun plugin                 ~0ms

Every agent has a different integration mechanism. Claude Code uses a command hook: a zero-dependency Node script that fires an HTTP POST to the pet on every command event. Codex uses incremental JSONL log polling with event deduplication. Gemini CLI requires polling a session JSON file, plus a 4-second completion window because its log format doesn't have explicit "done" signals.
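
Incremental JSONL polling with deduplication can be sketched as a small stateful reader. This illustrates the general technique, not the Codex integration itself: the `id` field used as the deduplication key is hypothetical, and a real log may need a different key.

```javascript
// Incremental JSONL reader. Feed it whatever text was appended to the log
// since the last poll; it returns only complete events it has not seen yet.
function createJsonlTailer() {
  let partial = "";       // trailing text that has no newline yet
  const seen = new Set(); // ids of events already delivered

  return function feed(chunk) {
    const lines = (partial + chunk).split("\n");
    partial = lines.pop(); // the last piece is incomplete (or empty)
    const events = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      let event;
      try {
        event = JSON.parse(line);
      } catch {
        continue; // skip a malformed line rather than crash the poller
      }
      if (event === null || typeof event !== "object") continue;
      if (event.id !== undefined) {
        if (seen.has(event.id)) continue; // duplicate: already delivered
        seen.add(event.id);
      }
      events.push(event);
    }
    return events;
  };
}
```

A poller would call `feed` with the newly appended text on each tick (about every 1.5 seconds, going by the latency table). Holding back the partial line matters because a poll can land in the middle of a write.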

All seven can run at the same time. The pet tracks each session independently. Animation mapping follows a simple escalation: 1 active session = typing animation. 2 sessions = juggling. 3 or more = building. For sub-agents: 1 sub-agent = juggling, 2 or more = conducting.
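
Written as code, the escalation is a pair of small lookups. How the session rule and the sub-agent rule combine is not spelled out above, so the `pickAnimation` precedence below (sub-agents win when present) is an assumption.

```javascript
// 1 active session = typing, 2 = juggling, 3 or more = building.
function animationForSessions(activeSessions) {
  if (activeSessions <= 0) return "idle";
  if (activeSessions === 1) return "typing";
  if (activeSessions === 2) return "juggling";
  return "building";
}

// 1 sub-agent = juggling, 2 or more = conducting.
function animationForSubAgents(subAgents) {
  if (subAgents <= 0) return null; // no sub-agents: defer to the session rule
  return subAgents === 1 ? "juggling" : "conducting";
}

// Assumption: the sub-agent animation takes precedence when any sub-agent exists.
function pickAnimation(activeSessions, subAgents) {
  return animationForSubAgents(subAgents) ?? animationForSessions(activeSessions);
}
```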

State priority is a numeric system: error(8) > notification(7) > sweeping(6) > attention(5) > carrying(4) > working(3) > thinking(2) > idle(1) > sleeping(0). Higher-priority states preempt lower ones. If the pet is in a typing animation and an error comes in, it switches to the error state immediately.
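
A minimal sketch of that preemption rule, using the priority numbers listed above. The function name and the fallback to idle when nothing is requested are assumptions.

```javascript
// Priority values as listed in the post.
const PRIORITY = new Map([
  ["error", 8], ["notification", 7], ["sweeping", 6], ["attention", 5],
  ["carrying", 4], ["working", 3], ["thinking", 2], ["idle", 1], ["sleeping", 0],
]);

// Pick the highest-priority state among those currently requested.
function resolveState(requested) {
  let winner = null;
  for (const state of requested) {
    if (!PRIORITY.has(state)) continue; // unknown state: ignore it
    if (winner === null || PRIORITY.get(state) > PRIORITY.get(winner)) {
      winner = state;
    }
  }
  return winner ?? "idle"; // assumption: fall back to idle when nothing is requested
}
```

With this shape, an incoming error preempts a working session simply by being in the requested set. When the error clears, the next-highest state wins again.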

The opencode integration is the most complex. opencode is a TUI β€” it doesn't expose an HTTP API. The solution: an in-process Bun plugin that starts a random-port HTTP bridge inside opencode's process. Authentication uses randomBytes(32) for the token and timingSafeEqual for validation. The pet discovers the bridge by reading a lockfile that the plugin writes on startup.
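
The token handling can be sketched with the two node:crypto calls named above. The hex encoding and the function names are assumptions; the post does not show how the token is serialized or where it is checked.

```javascript
const { randomBytes, timingSafeEqual } = require("node:crypto");

// Generate the bridge token once at plugin startup.
function createToken() {
  return randomBytes(32).toString("hex"); // 32 random bytes, 64 hex characters
}

// Compare a presented token against the expected one in constant time.
function isValidToken(presented, expected) {
  if (typeof presented !== "string") return false;
  const a = Buffer.from(presented, "utf8");
  const b = Buffer.from(expected, "utf8");
  // timingSafeEqual throws if the lengths differ, so check that first.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```

The early length check does reveal the token's length, but the length is fixed and not secret. What the constant-time comparison protects is the token's content.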

The design philosophy behind all of this: programmers already run three or four AI agents simultaneously. That's just how people work now. A desktop pet that only knows about one agent is blind to most of what's happening on your machine. The pet should see all your tools and express what's going on through its animation state: you glance at the corner of your screen and know, without checking any terminal, that three agents are active and one just hit an error.

There's also an emergent behavior nobody planned for. When multiple agents are active, the pet's animation is more energetic: juggling, building, conducting. When they all finish, the pet settles back to idle. Over a workday, the pet's animation state becomes a rough EKG of your productivity. You can tell, just from peripheral vision, whether you're in a deep work session or a lull. The pet becomes an ambient display of your own work patterns, not because we designed that feature, but because the state priority system and the animation mapping interact in a way that naturally reflects the rhythm of a coding session.


6. MIT, Local, Hackable

MIT license. Data stays 100% local. Screenshots are analyzed in memory and immediately discarded, never written to disk. The save file is yours. Export it anytime. No cloud. No accounts. No telemetry.

This is a non-negotiable design constraint, not a marketing bullet point. A pet that watches your screen knows what you're working on, what websites you visit, what messages you read. That data is intimate. It can't leave your machine. There is no universe in which "upload your screen captures to our servers for a better experience" is an acceptable product decision for a companion that's supposed to earn your trust.

The design ethos: hackable, personal, privacy-first.

Want a new theme? One SVG spritesheet plus animation definitions. A new personality archetype? Write a prose character file. A new agent integration? One HTTP endpoint. The barrier to contribution is low by design.

The hackability isn't accidental. It follows directly from the architecture. Because the body and soul are separate repos communicating over HTTP, any piece can be replaced independently. Don't like Electron? Write a native macOS client. Don't like the personality system? Swap in your own soul engine. The HTTP contract is the only thing that matters.

Twenty contributors so far. The pixel art comes from clawd-tank by @marciogranzotto. The calico cat theme is by 鹿鹿.


Looking Back Across Five Posts

This series covered five aspects of building an AI pet.

This post covered the engineering: two repos, two dependencies, 2,000 lines, seven agent integrations, and a collection of platform bugs that no amount of good architecture can prevent.

If I had to distill one lesson from building this, it's that the technology choices should be invisible. The user doesn't know or care that the body and soul are separate repos. They don't know there are only two npm dependencies. They don't know that seven agents feed into one animation state machine. They just see a small creature on their screen that seems to be paying attention.

Every architectural decision was in service of that illusion. Separate the repos so the character can survive across platforms. Minimize dependencies so the character isn't filtered through abstraction layers. Track all agents so the character reflects what's actually happening. The architecture isn't the product. The relationship is the product. The architecture just needs to stay out of the way.

Together, these five posts describe something simple: a tiny creature that sits on your screen, watches what you do, remembers who you are, and doesn't try to help.

The tech stack is small. Two npm dependencies. Eleven source files. About two thousand lines of code. This might be the lowest line count at which software starts to feel like it has a soul.

All code is open source:


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0