
I Recommended Eight Open-Source Meeting Recorders. My Friend Returned All of Them.


After last week's post, Zoom's Native AI Sucks, my friend went and tried every single one of the eight open-source meeting recorders I'd recommended.

His verdict was blunt.

"They all suck."

Three failure modes, every project. Recognition quality is bad — Chinese is wrong, accents are wrong, domain terms are wrong. Summaries are bad — the notes don't read like prose, half the key points are missing. UI/UX is dire — crashes, hangs, setup that feels like configuring a 1998 Linux desktop.

I was thrown off for a minute. Eight projects, from 11.6k-star Meetily down to 3-star toys, three orders of magnitude in social proof — they couldn't all be uniformly broken.

Then I spent another night on it. This time I didn't read READMEs. I read issue trackers, independent benchmark papers, and the maintainers' own blogs.

What I found goes deeper than the previous post.


The Layer I Missed: Whisper Has Been Outpaced for a Year

Data first.

AISHELL-1 (the standard Mandarin ASR benchmark) character error rate, independently measured:

Model                              CER
Whisper-Large-v3 (OpenAI)          5.14%
SenseVoice-Large (Alibaba DAMO)    2.09%
FireRedASR2-LLM (Xiaohongshu)      2.89% (avg-Mandarin)
Fun-ASR (Alibaba Tongyi)           1.22%

WenetSpeech test_meeting (specifically multi-speaker meeting audio):

Model                              CER
Whisper-Large-v3                   18.39%
Fun-ASR                            6.49%

Common Voice Cantonese:

Model                              CER
Whisper-Large-v3                   10.41%
SenseVoice-Large                   6.78%

Every project I recommended last week — Meetily, Anarlog, pasrom/meeting-transcriber, OpenWhispr, Recap, Oatmeal, Parrot, Notes4Me — runs on Whisper underneath (Whisper Large v3, Whisper Turbo, WhisperKit, whisper.cpp; all the same lineage). And on Chinese, Whisper's character error rate is 2.5–4× that of its successors.

It's not the apps that are broken. It's the foundation.

Whisper also has a known, unfixed problem: hallucination on silence. Thirty seconds of dead air during a meeting and Whisper invents a sentence. A front-page complaint about this lands on r/LocalLLaMA practically every week. Parakeet (NVIDIA) was trained on 36k hours of noise and non-speech data; it doesn't hallucinate this way. SenseVoice ships built-in audio-event detection (music / applause / laughter), which acts as implicit VAD. FireRedASR2 ships FireRedVAD as a built-in module.

The successors are fixing Whisper's holes. The open-source frontends are still importing Whisper.

The Maintainers Themselves Already Admitted It

To stress-test that claim, I went and read the issue trackers. The pattern jumped out immediately — the maintainers have been writing this story themselves; I just hadn't been reading it.

Hyprnote / Anarlog (fastrepl team). Maintainer yujonglee wrote in their official blog:

"HyperLLM-V1 summarization is fine-tuned exclusively for English."

Issue 2444 was a user asking for Chinese support. Maintainer ComputelessComputer closed it with:

"closing this as we support custom endpoints"

Translation: Chinese isn't on the roadmap; bring your own.

Issue 4881 is a small revolt. Anarlog quietly swapped Parakeet V3 for V3 TDT in a release. User dandaka:

"still unusable for my case, quality of transcript is awful. any plans/ETA for bringing the prev model back? Issue closed, but nothing actually changed :( No plans to get back to a working state?"

28 comments, 4 reactions. No fix.

The Hyprnote Launch HN thread, user ljosa:

"the inability to tell who said what is a show stopper... You'll need to add speaker diarization for this to be useful for more than 1:1s."

Meetily. Issue 171, "Quality of meeting minutes." User daviddecorso:

"Yeah my transcriptions just became complete gibberish after some recent update. It's just random words interspersed with [MUSIC PLAYING] or [INAUDIBLE]."

Issue 228 — pick the Chinese model and the app crashes:

"Select the downloaded model, and then click 'Start Recording'. The application will crash."

Issue 233, "Multi-Language Support for Summaries," 12 reactions, 11 comments, open eight months. No PR.

Recap (the 703-star project I called "right idea, right architecture, just not stable"). Issue 15:

"The license of this software is not open-source. Please change it to a real open-source license or stop claiming this software is open-source."

The license is proprietary. The author was playing word games. I missed this last week — that's on me.

The pain points aren't implementation glitches. They're architectural — wrong base model, multilingual not on the roadmap, simplistic diarization, and in one case the license isn't even OSS.

Last week I read READMEs. READMEs are marketing. Issues are reality.

The Hack Everyone Overlooked: Two-Track Recording

Back to the user side.

Speaker diarization is hard for one specific reason: a single mixed audio stream forces an algorithm to figure out who said what from voice features alone. Every OSS project competes on this — pyannote, Sortformer, silence-gap heuristics — and none of them is good enough for messy meetings.

But think about the actual setup on a Mac. The microphone captures me. The system audio captures everyone else. These two signals are physically separate before they get mixed.

If you record without mixing (left channel = mic, right channel = system audio) and output a stereo m4a, diarization collapses into a labeling problem. "Channel L is me. On channel R, cluster speakers by voice." That's it.

Accuracy: 100% on the me-vs-everyone split. The hard algorithmic version of the problem disappears.
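
Concretely, once you have such a file, the split is one ffmpeg filter away. A minimal sketch in Python (file names are illustrative; assumes ffmpeg is on PATH):

```python
# Split a dual-track recording into per-speaker files.
# Assumes meeting.m4a was recorded with L = my mic, R = system audio.
import subprocess

def split_channels(stereo_path: str) -> tuple[str, str]:
    """Write me.wav (left channel) and others.wav (right channel)."""
    me, others = "me.wav", "others.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", stereo_path,
         "-filter_complex", "[0:a]channelsplit=channel_layout=stereo[L][R]",
         "-map", "[L]", me,
         "-map", "[R]", others],
        check=True,
    )
    return me, others

# Everything in me.wav is me by construction; diarization only has to
# cluster the remaining speakers inside others.wav.
me_track, others_track = split_channels("meeting.m4a")
```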

This isn't my idea. Ilia Zadyabin wrote a Medium post about an OBS dual-track setup, sending two tracks to WhisperX. But he hit a telling wall — WhisperX downmixes stereo to mono before transcribing. Whisper-era design philosophy: single input, single output, no stereo channel awareness.

Gemini 3 Pro multimodal doesn't downmix. You can prompt it directly: "Channel L is me, Channel R is everyone else. Transcribe verbatim and label speakers." One call, transcript + diarization + summary.
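
In code, that's one call. A hedged sketch with the google-genai SDK — the prompt wording is mine; the model name is the one from the test below:

```python
# One call: upload the stereo recording, ask for transcript + diarization
# + summary in the same prompt. Assumes GEMINI_API_KEY is set.
from google import genai

client = genai.Client()

audio = client.files.upload(file="meeting.m4a")  # stereo: L = me, R = others

resp = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents=[
        "Channel L is me; channel R is everyone else. Transcribe verbatim, "
        "label speakers (cluster distinct voices on channel R), then append "
        "a summary and action items.",
        audio,
    ],
)
print(resp.text)
```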

Simon Willison ran exactly this in November. A 3-hour-33-minute meeting recording, single prompt to gemini-3-pro-preview, got back transcript + speaker identification + summary + action items.

Total cost: $1.42.

That number is the entire argument. Three-hour meeting, one-buck-fortyish, end-to-end.

SOTA Backends Already Exist. No Polished Frontend Can Use Them.

Now the technology side.

Open-source ASR has been moving fast. The last 6–12 months alone:

  • Voxtral 2 (Mistral, 2026-02). Apache-2.0, 13 languages including Chinese, native speaker diarization plus word-level timestamps. Drop-in Whisper replacement.
  • FireRedASR2-LLM (Xiaohongshu AI, 2025-Q4). Apache-2.0, independent paper measures 2.89% avg-Mandarin CER, beating Doubao-ASR (3.69%) and Qwen3-ASR-1.7B (3.76%).
  • NVIDIA Streaming Sortformer v2. Mandarin DER 9.2%, 214× real-time, CC-BY-4.0.
  • pyannote-precision-2. AMI DER 12.9%, DIHARD-3 DER 14.7%; a 25–30% DER reduction over pyannote 3.x.

The problem: no polished frontend can use any of these.

The eight projects from last week all hardcode their ASR backend. Meetily supports a BYO LLM summarizer, but Whisper is welded in. Anarlog's whole sales pitch is BYO LLM, but the ASR is sealed. pasrom/meeting-transcriber lets you pick from "three engines," but only the three the maintainer packaged; you can't hot-swap to an arbitrary endpoint.

I dispatched a sub-agent specifically to audit the source code of every viable frontend. Verdict was unambiguous:

"Zero polished frontend exposes a 'set OpenAI-compatible STT base URL → done' knob. ASR is bundled-with-the-binary in every mature option."

The only OSS project that wraps an arbitrary SOTA OSS model into an OpenAI-compatible /v1/audio/transcriptions endpoint is LocalAI. It even ships a dedicated /v1/audio/diarization endpoint and natively supports the Voxtral backend. But making Meetily talk to LocalAI requires patching issue 431 first — Meetily's OpenAI-compat transcription path returns gibberish without a specific header tweak.

"We are getting errors on our openai server that suggests we need to add a couple of specific headers to the api call."

The real opportunity isn't another meeting recorder — it's the middleware layer. A standard ASR proxy that wraps any model in OpenAI-compatible protocol so that frontends become backend-agnostic. The LLM ecosystem solved this years ago (OpenAI-compatible chat is now table-stakes). The audio ecosystem is ten steps behind.
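
To make the shape concrete, here is a sketch of what that middleware's surface could look like. FastAPI is my choice, run_asr is a placeholder for whichever backend you wrap, and the endpoint follows the OpenAI transcriptions contract:

```python
# An OpenAI-compatible ASR shim: any frontend that can speak
# /v1/audio/transcriptions can now use any backend you plug in here.
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

def run_asr(audio: bytes, language: str | None) -> str:
    """Placeholder: call SenseVoice, Voxtral 2, FireRedASR2, ..."""
    raise NotImplementedError

@app.post("/v1/audio/transcriptions")
async def transcriptions(
    file: UploadFile = File(...),
    model: str = Form("default"),        # accepted for protocol shape only
    language: str | None = Form(None),
):
    text = run_asr(await file.read(), language)
    return {"text": text}                # minimal OpenAI-style response
```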

This doesn't contradict last week's thesis — open source still eats this category. But it has to eat the middleware before the frontends can keep up.

Four Projects I Missed Last Week

Spent the night on GitHub trending, HN, awesome-lists. Four projects worth flagging:

Vexa (2k★, last commit 2026-05-03). Completely different paradigm — not a local-recording app. A Docker bot that actually joins the meeting. Meet, Teams, Zoom. Real-time WebSocket transcripts, MCP server, multi-tenant API. Self-hosting is a single make all. This is the open-source self-hosted Otter clone that none of the local-recording apps wanted to be. Downside: needs a GPU box.

Muesli (192★, last commit 2026-05-08). Mac-native Swift implementation, multi-backend. You can pick Parakeet v3, Cohere Transcribe 2B, or Qwen3-ASR with 52 languages including Chinese. Diarization runs pyannote via FluidAudio on the Apple Neural Engine. If your spec is "Mac native + Chinese + real diarization," try this first. Thirty minutes to install and run a real meeting through it.

Handy (21.3k★). Multi-backend dictation tool — Whisper / Parakeet V3 / Moonshine V2 / GigaAM v3 / SenseVoice / Breeze ASR / custom. Tauri cross-platform. But it's a dictation app, not a meeting app — no diarization, no meeting summary. Excellent for daily voice typing.

Voxtral 2 model. The model itself, not an app. Apache-2.0, 13 languages including Chinese, native diarization, word-level timestamps. This is exactly the "drop-in Whisper replacement" that Hyprnote issue 1354 has been begging for since 2025. Just download the weights.

Honourable mention: parakeet-diarized, a tight FastAPI proxy — NeMo Parakeet + pyannote, accepts a diarize param, returns verbose_json. The closest off-the-shelf SOTA proxy template.
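
The call shape, as I read that project's description — the host, port, and per-segment speaker field below are my assumptions:

```python
# Hit a parakeet-diarized-style proxy with the diarize flag and ask for
# verbose_json; the segment fields below are assumed, not verified.
import requests

with open("meeting.wav", "rb") as f:
    r = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": f},
        data={"response_format": "verbose_json", "diarize": "true"},
    )
r.raise_for_status()
for seg in r.json()["segments"]:
    print(seg.get("speaker", "?"), seg["text"])
```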

Three Paths, Pick by Time Budget

A. The 90-minute path (strongly recommended first)

Reuse 80% of the watch-transcriber plumbing, extend it to multi-speaker meetings.

audiotee (CATapDescription, no driver) + ffmpeg mic
  → ffmpeg amerge into 2-channel m4a (L = me, R = others)
  → Hammerspoon Cmd-Opt-R toggle
  → launchd WatchPaths auto-fires
  → Gemini 3 Pro one prompt (transcribe + diarize + summarize)
  → claude -p delivers to notes

audiotee is the key piece — wraps macOS 14.4+ CATapDescription, no driver, no BlackHole, no virtual device. Filters by PID so you can record only Zoom.
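
For the amerge step itself, a sketch assuming the mic and system captures have already landed as two mono files (I'm omitting the audiotee invocation; its flags are version-dependent):

```python
# Merge two mono captures into one stereo m4a: mic on the left channel,
# system audio on the right. File names are illustrative.
import subprocess

subprocess.run(
    ["ffmpeg", "-y",
     "-i", "mic.wav",       # me
     "-i", "system.wav",    # everyone else, captured by audiotee
     "-filter_complex", "[0:a][1:a]amerge=inputs=2[a]",
     "-map", "[a]", "-c:a", "aac", "meeting.m4a"],
    check=True,
)
```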

Cost: ~$0.50 per meeting. Simon Willison's 3-hour-33-minute test came in at $1.42.

B. The 2-3 day path (if fully offline is non-negotiable)

[Meetily fork] → [LocalAI on remote GPU] → [Voxtral 2 + Sortformer]
   frontend UI    OpenAI-compat /v1/audio/transcriptions   SOTA backend

LocalAI is the only OSS proxy that exposes both /v1/audio/transcriptions and /v1/audio/diarization with native Voxtral backend support. The Meetily fork: point WHISPER_HOST at LocalAI, plus a PR to fix issue 431.

100% local, free forever. Requires a GPU box.
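
A quick smoke test for that backend before touching the Meetily fork: point the official openai client at LocalAI (the base URL and model name below are placeholders for your install):

```python
# Verify LocalAI's OpenAI-compatible transcription route end to end.
from openai import OpenAI

client = OpenAI(base_url="http://gpu-box:8080/v1", api_key="not-needed")

with open("meeting.m4a", "rb") as f:
    result = client.audio.transcriptions.create(model="voxtral-2", file=f)
print(result.text)
```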

C. Chinese-heavy hybrid

If Chinese is the dominant case: dual-track recording → local SenseVoice via sherpa-onnx (mature CoreML on Mac, Mandarin / Cantonese / EN / JA / KR) → Gemini 3 Pro for summarization and cross-channel speaker clustering. Highest quality ceiling, heaviest engineering.
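
For the local SenseVoice leg, sherpa-onnx has a dedicated constructor. A sketch with placeholder model paths, fed the others.wav channel from the dual-track split:

```python
# Offline SenseVoice transcription via sherpa-onnx; model/token paths
# point at wherever you unpacked the SenseVoice ONNX release.
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
    model="sense-voice/model.onnx",
    tokens="sense-voice/tokens.txt",
    language="auto",   # Mandarin / Cantonese / EN / JA / KR autodetect
    use_itn=True,      # inverse text normalization: numbers, punctuation
)

samples, sample_rate = sf.read("others.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```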

The Closed-Source Window Is Even Shorter Than I Said Last Week

Back to the thesis.

Last week's read: the Granola and Krisp subscriptions are a friction tax; open source will eat this.

This week's update: the open-source side has its own friction tax — a different one.

Not the user-facing "we installed BlackHole for you," but the developer-facing "the middleware isn't there yet, so frontends still hardcode Whisper."

Backend (SOTA OSS models) is iterating fast. Voxtral 2 — February 2026. FireRedASR2 — Q4 2025. Streaming Sortformer v2 — Q3 2025. Whisper's last large-scale update was November 2023 — three full model generations ago.

The gap will keep widening. But OSS frontends will take another product cycle or two to catch up — until OpenAI-compatible audio APIs become as standard as OpenAI-compatible chat APIs, until the cost of swapping ASR backends drops.

Until then, dual-track + one Gemini call is the smartest stopgap — it bypasses aging Whisper, bypasses immature middleware, bypasses the ASR hardcoding in polished frontends.

Last week's Zoom's Native AI Sucks was wrong because I read stars and READMEs. The lesson here is the meta-version: when surveying a tooling space, READMEs are marketing, issues are reality. The next time I write one of these I'll start from issue trackers.

Meeting recording was always supposed to be simple. Open source eating it is going to take longer than I said.

In the meantime, DIY beats picking a tool.

