
Your Apple Watch Is Already a Voice Transcription Device


There's a whole category of "AI recording devices" now. Here's what they cost:

| Device | Hardware | Subscription | Annual Cost |
|---|---|---|---|
| Plaud Note | $159 | $100–240/yr | $259–399 |
| Plaud NotePin S | $179 | $100–240/yr | $279–419 |
| Bee AI | $50 | $12/mo | $194 |
| Otter.ai (software only) | — | $100–200/yr | $100–200 |
| Limitless Pendant | $199 | discontinued | dead product |

The pitch is "effortless capture." The reality is: another device to charge, another subscription to manage, another thing to forget in a drawer. And here's the part nobody talks about — your recordings live on someone else's server. Every conversation, every meeting, every private thought, uploaded to a third-party cloud for transcription.

Some devices let you export audio without a subscription. But then you're manually uploading files, pasting into a transcription tool, copying output into your notes app. Every single time. Nobody keeps that up for more than a week.

Meanwhile, I'm already wearing a device with a microphone, a dedicated Action Button, and automatic cloud sync to my own machine. It's called an Apple Watch.

With my setup: zero hardware cost (you already own the watch), zero subscription, and no vendor cloud warehousing your recordings. The audio stays on your Mac, a single API call handles transcription, and the result goes wherever you want. You own every byte.

The entire "AI recording gadget" category is a hardware solution to a software problem.


The Architecture Trap

My first instinct was wrong. I wanted a custom watchOS app that would record audio, hit a Vercel serverless function, transcribe with Whisper, summarize with Claude, and push structured notes back. I sketched the architecture. It was clean. It was also completely unnecessary.

Before I wrote any code, I talked to a friend who had actually built a custom watchOS recording app. His experience was brutal:

  • watchOS networking is unreliable. Battery management aggressively kills background connections. Your upload might complete, or it might silently fail.
  • CloudKit as an intermediary creates a painful four-hop pipeline: watch records, CloudKit syncs, server polls CloudKit, server processes. Each hop is a failure point.
  • 30-second chunking (watchOS's recording constraint for background audio) generates hundreds of small files. CloudKit doesn't handle bulk small-file sync gracefully.

His conclusion after weeks of development: "The watch records fine. Automated workflow for getting the audio off the watch? Haven't found a good approach."

That last sentence was the most valuable technical insight I received on this project. Not because it was novel — but because it reframed the problem. The hard part isn't recording. The hard part isn't transcription. The hard part is getting audio off the watch reliably. And Apple already solved it.


The Insight That Killed the Architecture

Voice Memos syncs via iCloud. Not through CloudKit's developer API — through Apple's own iCloud infrastructure, the same sync engine that handles Photos, Notes, and every other first-party app. It's battle-tested, handles large files, works in the background, and has been reliable for years.

One test confirmed it. I pressed the Action Button, recorded a 30-second memo, stopped it. Opened Finder on my Mac. The file appeared in ~/Library/Group Containers/group.com.apple.VoiceMemos.shared/Recordings/ within seconds.
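That sanity check takes a few lines of Python. The directory below is the Voice Memos container path from the test above; `newest_memos` is a helper name invented here for illustration:

```python
from pathlib import Path

# Voice Memos' group container on macOS, where synced recordings land.
VOICE_MEMOS_DIR = (Path.home() /
    "Library/Group Containers/group.com.apple.VoiceMemos.shared/Recordings")

def newest_memos(directory: Path, n: int = 5) -> list[str]:
    """Return the n most recently modified .m4a filenames, oldest first."""
    files = sorted(directory.glob("*.m4a"), key=lambda p: p.stat().st_mtime)
    return [p.name for p in files[-n:]]

if __name__ == "__main__":
    print(newest_memos(VOICE_MEMOS_DIR))
```

Run it, record a memo on the watch, run it again: the new file shows up within seconds.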

That was the moment the entire custom-app architecture collapsed. I didn't need a watchOS app. I didn't need CloudKit. I didn't need a server. I needed a file watcher on my Mac.

The best code is the code you don't write. Every line of the custom watchOS app I didn't build is a line that can't break, can't drain battery, can't fail silently during a network transition. The total infrastructure cost of Voice Memos sync is zero — it's built into the OS.


The Pipeline

What remained was simple enough to build in thirty minutes with Claude Code.

File watching: macOS launchd has a WatchPaths directive — stable since 2005, two decades of production use. Point it at the Voice Memos directory. When a new .m4a appears, it triggers a Python script. No polling. No cron. No third-party file watcher. The OS does it natively.
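A launchd agent for this looks roughly like the following sketch. The label, script path, and interpreter are placeholders to adapt; the `WatchPaths` entry is the Voice Memos container directory mentioned earlier:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Hypothetical label and script path; adjust to your setup -->
    <key>Label</key>
    <string>com.example.watch-transcriber</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/Users/you/watch-transcriber/process.py</string>
    </array>
    <!-- launchd runs the job whenever anything in this directory changes -->
    <key>WatchPaths</key>
    <array>
        <string>/Users/you/Library/Group Containers/group.com.apple.VoiceMemos.shared/Recordings</string>
    </array>
</dict>
</plist>
```

Drop it in `~/Library/LaunchAgents/` and load it with `launchctl`. Note that `WatchPaths` fires on any change, so the script should track which files it has already processed.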

Transcription + analysis in one call: This is where multimodal APIs change the game. The traditional pipeline is two steps: speech-to-text (Whisper, Deepgram, etc.), then text-to-LLM for summarization. Two API calls, an intermediate text format to manage, and all the audio nuance (tone, emphasis, pauses) gets lost in the text serialization.

Gemini Flash accepts audio directly. One API call. Send the .m4a, get back structured JSON — transcription, bilingual summary (EN + ZH), key points, and action items. The model hears the audio and reasons about it simultaneously. No intermediate representation. No information loss.

The prompt specifies the output schema. The response comes back as structured JSON that maps directly to the delivery format. One call. Done.
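A sketch of that single call with the `google-genai` SDK. The model name, prompt wording, and output keys here are illustrative assumptions, not the project's actual schema:

```python
import json

# Hypothetical prompt; the real project's schema differs in detail.
PROMPT = """Listen to this voice memo. Reply with JSON only, using keys:
"transcription", "summary_en", "summary_zh", "key_points", "action_items"."""

def parse_result(raw: str) -> dict:
    """Models sometimes wrap JSON in a ```json fence; strip it before parsing."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

def transcribe_and_analyze(audio_path: str) -> dict:
    from google import genai  # pip install google-genai
    client = genai.Client()   # reads GEMINI_API_KEY from the environment
    audio = client.files.upload(file=audio_path)  # push the .m4a
    resp = client.models.generate_content(
        model="gemini-2.0-flash",    # assumed model name
        contents=[PROMPT, audio],    # prompt and audio in the same request
    )
    return parse_result(resp.text)
```

The prompt and the audio travel in one request; the structured reply comes back in one response.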

Delivery targets: The script has a pluggable target system — Apple Notes, Feishu docs, Obsidian vault, or anything else. Each target is a Python function that takes the structured JSON and writes it somewhere. Adding a new target takes minutes.
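One way to structure such a target system is a registry keyed by name. The function and target names below are invented for illustration:

```python
from typing import Callable

# Registry mapping target names to delivery functions.
TARGETS: dict[str, Callable[[dict], None]] = {}

def target(name: str):
    """Decorator that registers a delivery function under a name."""
    def register(fn: Callable[[dict], None]):
        TARGETS[name] = fn
        return fn
    return register

@target("stdout")
def to_stdout(note: dict) -> None:
    """Trivial example target: print the English summary."""
    print(note.get("summary_en", ""))

def deliver(note: dict, names: list[str]) -> None:
    """Fan the structured note out to every configured target."""
    for name in names:
        TARGETS[name](note)
```

Adding a target is then one decorated function; the core pipeline only ever calls `deliver`.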

But the most interesting target is agent.py. It shells out to claude -p with a templated prompt containing the transcription and structured notes. This means any Claude Code skill becomes a delivery target without writing integration code. Want to post the summary to Slack? Claude knows how. Want to create a Jira ticket from the action items? Claude knows how. Want to draft a follow-up email? Claude knows how. The AI becomes the glue layer — instead of writing API integrations for each destination, you describe the destination in natural language.
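The agent target can be sketched in a few lines. The template and function names are hypothetical; only the `claude -p` invocation comes from the setup described above:

```python
import json
import subprocess

# Hypothetical template; the real agent.py prompt differs.
TEMPLATE = """Here is a transcribed voice memo as structured JSON:

{note}

{instruction}"""

def build_prompt(note: dict, instruction: str) -> str:
    """Render the structured note plus a natural-language delivery instruction."""
    return TEMPLATE.format(note=json.dumps(note, ensure_ascii=False, indent=2),
                           instruction=instruction)

def deliver_via_agent(note: dict, instruction: str) -> str:
    """Shell out to the Claude Code CLI in non-interactive (-p) mode."""
    prompt = build_prompt(note, instruction)
    result = subprocess.run(["claude", "-p", prompt],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The instruction string is the whole integration: "Create a Jira ticket from the action items" is all the Slack-, Jira-, or email-specific code you write.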

Pluggable delivery took maybe ten extra minutes to design. It has already saved hours by letting me add new targets without touching the core pipeline.


What I Actually Built

watch-transcriber: ~850 lines across 12 files. Ten commits over thirty minutes of pairing with Claude Code. Zero external dependencies beyond google-genai.

The workflow:

  1. Press Action Button on Apple Watch Ultra
  2. Record a voice memo (any length)
  3. Stop recording
  4. File syncs to Mac via iCloud (seconds)
  5. launchd detects new file, triggers processing
  6. Gemini Flash transcribes + analyzes in one multimodal call
  7. Structured output delivered to configured targets

No app to install on the watch. No server to maintain. No subscription. No charging cable for a separate device.


The Lessons

This was a thirty-minute build, but the architectural decisions behind it generalize.

Talk to people who've built what you want to build. My friend's watchOS experience saved me weeks. Not days — weeks. I would have hit every single one of those failure modes myself, because nothing in the documentation warns you about battery management killing your network calls. The internet tells you to build a watchOS app. Someone who actually tried tells you not to.

Use platform primitives. Voice Memos, iCloud sync, and launchd WatchPaths are boring, invisible, maintenance-free infrastructure. They don't show up in architecture diagrams. They don't have version numbers to track. They just work. The most reliable system is the one that delegates hard problems to layers that have been solving them for decades.

Multimodal APIs collapse pipelines. The shift from "transcribe then analyze" to "send audio, get analysis" isn't just a convenience improvement. It eliminates an entire class of bugs (encoding issues, lost context, format mismatches) and cuts latency by dropping one of the two API round-trips. When a model can consume multiple modalities natively, every intermediate serialization step is technical debt waiting to happen.

AI as integration glue. The agent delegation pattern — "here's structured data, deliver it to X" via Claude Code — inverts the traditional integration model. Instead of writing N API clients for N destinations, you write one prompt template and let the AI figure out the API. The marginal cost of adding a new delivery target dropped from "write and debug an API integration" to "describe what you want in English."


The Meta-Lesson

I almost spent weeks building a custom watchOS app, a CloudKit sync pipeline, a Vercel backend, and a Whisper integration. Instead I spent thirty minutes with Claude Code wiring together things that already exist.

The gap between those two timelines was one conversation with someone who had tried the hard way.

In an era of abundant AI tooling, the highest-leverage skill isn't building — it's recognizing when not to. The recording hardware exists. The sync infrastructure exists. The multimodal intelligence exists. The only thing missing was someone to connect the dots.

Sometimes the best architecture decision is realizing you don't need one.


© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0