My Transcriber Heard Five People in a Two-Person Conversation

I have a small tool that's been running for months: it turns my voice memos into structured notes automatically. Record something, and a few minutes later the cleaned-up transcript is sitting there.

After half a year of use, transcription itself stopped being the problem long ago. The half I never got right is the other one — who is speaking.

Turning sound into words is close to solved; most current ASR models do it well. But take a two-person conversation and label which line is A and which is B, and you're doing speaker diarization, and that's the genuinely hard half. How hard? I have a 28-minute recording with exactly two people in it from start to finish. It came back labeled with five speakers.

V1: Gemini chunking, and the seams fall apart

The first approach was Gemini. Long audio has a trap: hand it a whole file and once you pass ~15 minutes it starts skipping, looping, and inventing timestamps. So you chunk — split at silence boundaries into ~12-minute pieces and transcribe them in parallel, keeping each chunk inside the model's coherent range.
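A minimal sketch of the chunk-planning step, not the tool's actual code. It assumes some silence detector has already produced a sorted list of silence midpoints in seconds; the function name and the 90-second snap tolerance are illustrative choices, and only the ~12-minute target comes from the text above.

```python
from bisect import bisect_left


def choose_cut_points(silences, duration, target=720.0, tolerance=90.0):
    """Return cut times in seconds, roughly every `target` seconds.

    silences: sorted silence midpoints in seconds (assumed input).
    Each cut snaps to the nearest silence within `tolerance` of the ideal
    position; if there is none, it falls back to a hard cut at the ideal.
    """
    cuts = []
    last = 0.0
    # Stop once the remainder fits in one chunk of at most target + tolerance.
    while duration - last > target + tolerance:
        ideal = last + target
        i = bisect_left(silences, ideal)
        # Only the neighbours on either side of `ideal` can be the nearest.
        candidates = silences[max(i - 1, 0):i + 1]
        near = [s for s in candidates if abs(s - ideal) <= tolerance]
        cut = min(near, key=lambda s: abs(s - ideal)) if near else ideal
        cuts.append(cut)
        last = cut
    return cuts


# A 28-minute file (1680 s) with four detected silences.
print(choose_cut_points([300, 700, 1000, 1450], 1680))  # [700, 1450]
```

With these defaults no chunk exceeds 13.5 minutes, which stays under the ~15-minute point where the skipping and looping started.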

The transcription quality was fine. But chunking created a new mess: every chunk numbers its own speakers from zero. Chunk one's "speaker 0" might be chunk two's "speaker 1."

I built a patch for it — a 60-second overlap between adjacent chunks, then a vote based on who-said-what in the overlap region to align the numbering. It worked most of the time. But the moment one seam voted wrong, every speaker label after it shifted, and the count crept upward. The more seams, the worse the drift.
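The overlap vote can be sketched as follows. This is an illustration under assumptions, not the original patch: segments are taken to be `(start, end, label)` tuples on a shared timeline, already restricted to the 60-second overlap window, and the greedy assignment is just one simple way to run the vote.

```python
from collections import defaultdict


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align_labels(prev_segs, next_segs):
    """Map the next chunk's speaker labels onto the previous chunk's labels.

    Every pair of labels collects votes equal to the seconds during which
    both chunks have that pair speaking. Pairs are then matched greedily,
    strongest vote first, one-to-one.
    """
    votes = defaultdict(float)
    for p_start, p_end, p_label in prev_segs:
        for n_start, n_end, n_label in next_segs:
            votes[(n_label, p_label)] += overlap_seconds(
                p_start, p_end, n_start, n_end
            )

    mapping, used = {}, set()
    for (n_label, p_label), secs in sorted(votes.items(), key=lambda kv: -kv[1]):
        if secs > 0 and n_label not in mapping and p_label not in used:
            mapping[n_label] = p_label
            used.add(p_label)
    return mapping


# Same 60 s of audio as labeled by two adjacent chunks; the numbering is swapped.
prev_chunk = [(0, 20, 0), (20, 45, 1), (45, 60, 0)]
next_chunk = [(0, 19, 1), (19, 46, 0), (46, 60, 1)]
print(align_labels(prev_chunk, next_chunk))  # {1: 0, 0: 1}
```

A label in the next chunk that wins no vote has nothing to map to and would need a fresh number, which is one plausible way the speaker count creeps upward from seam to seam.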

V2: global diarization, still splitting backchannels into people

If stitching across chunks isn't reliable, stop stitching and run one global diarization pass over the whole file. I wired in Senko, which runs on CoreML on Apple Silicon and is tens of times faster than the pyannote stack. It was steadier than stitching: at least the numbering stayed consistent end to end.

But it had a habit I couldn't train out: it treats short interjections as new people. Real conversations are full of "mm," "yeah," "right" — one person talking while the other chimes in. Senko routinely scored those one-word backchannels as a third or fourth speaker. On the 28-minute clip it managed to get 2, but on a few longer recordings it inflated immediately.
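One way to see this failure in the output is to look at per-speaker statistics: a real participant has at least one long turn, while a phantom speaker made of backchannels only ever has very short ones. The sketch below is a diagnostic under assumptions, not something Senko provides; the `(start, end, label)` segment shape and the 1.5-second threshold are illustrative.

```python
from collections import defaultdict


def speaker_stats(segments):
    """Total talk time, longest turn, and turn count for each speaker label."""
    stats = defaultdict(lambda: {"total": 0.0, "longest": 0.0, "turns": 0})
    for start, end, label in segments:
        length = end - start
        entry = stats[label]
        entry["total"] += length
        entry["longest"] = max(entry["longest"], length)
        entry["turns"] += 1
    return dict(stats)


def suspected_backchannel_speakers(segments, max_longest=1.5):
    """Labels whose longest turn never exceeds `max_longest` seconds."""
    stats = speaker_stats(segments)
    return sorted(
        label for label, s in stats.items() if s["longest"] <= max_longest
    )


segments = [
    (0.0, 30.0, "A"),
    (30.0, 30.6, "C"),   # a quick "mm"
    (31.0, 55.0, "B"),
    (55.0, 55.4, "D"),   # a quick "right"
    (56.0, 80.0, "A"),
]
print(suspected_backchannel_speakers(segments))  # ['C', 'D']
```

This only flags suspicious labels; it cannot say which real speaker each interjection belongs to.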

A detour: Doubao — better words, worse speakers

Along the way I tried Volcano Engine's Doubao ASR to see whether a domestic model handled Chinese-English code-switching more smoothly. The transcription quality was a genuine surprise — clean Chinese, clean mixed Chinese-English, proper punctuation, natural spoken rhythm. That was my original pain point, and it solved it better than the Gemini pipeline did.

But the speaker count was worse. Doubao's fast edition scored 5 people on that 28-minute clip; the 2.0 edition scored 4. Worse than Senko.

That's when it clicked: diarization done as a side effect by a general-purpose ASR model tends to run aggressive. It would rather over-split than merge. It can't tell whether "mm" is a backchannel or a new voice, so it calls everything a new voice.

Gotcha: Volcano has two auth systems — don't mix them up

The thing that actually cost me half a day testing Doubao was an auth trap, worth writing down for the next person.

Volcano's speech services have two account systems. The old console gives you an AppID plus an AccessToken; the new console gives you a single x-api-key. The catch: the two are bound to different sets of resources. I started by calling the 2.0 endpoint with the old AppID and kept getting "resource not granted": the quota was clearly there, it just wouldn't run. It took a while to realize that 2.0, and the 妙记 model I'll get to next, both require the new console's x-api-key. Swapped the header, and it went through immediately.

The docs bury this. If you hit "resource not granted" on a Volcano speech endpoint, check whether you're using the wrong auth method first.
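A small guard makes the mix-up fail loudly on the client side instead of surfacing as a confusing server error. This is a sketch under assumptions: only the `x-api-key` header name comes from the text above. The two legacy header names are placeholders, not Volcano's real ones, and the function is hypothetical.

```python
# Placeholder names for the old-console credentials. These are NOT taken
# from Volcano's documentation; substitute the real header names.
LEGACY_APP_ID_HEADER = "X-Legacy-App-Id"
LEGACY_TOKEN_HEADER = "X-Legacy-Access-Token"


def build_auth_headers(endpoint_needs_api_key, *, api_key=None,
                       app_id=None, access_token=None):
    """Return auth headers for one request, refusing mismatched credentials."""
    if endpoint_needs_api_key:
        if not api_key:
            raise ValueError(
                "this endpoint needs the new console's x-api-key; "
                "an AppID + AccessToken pair is bound to different resources"
            )
        return {"x-api-key": api_key}
    if not (app_id and access_token):
        raise ValueError("this endpoint needs the old console's AppID and AccessToken")
    return {LEGACY_APP_ID_HEADER: app_id, LEGACY_TOKEN_HEADER: access_token}


print(build_auth_headers(True, api_key="demo-key"))  # {'x-api-key': 'demo-key'}
```

Which endpoints need which scheme still has to come from the console and the docs; per the text above, 2.0 and 妙记 both sit on the x-api-key side.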

The landing: give diarization to a model that does it on purpose

What actually solved it was Volcano's 妙记 (Lark Minutes ASR). It's different from the others: diarization is done server-side and returned in the same call as the transcript, so there's nothing to stitch on my end.

I ran five real recordings through a comparison; four of them are two-person conversations. Each cell is the number of speakers the engine reported:

Recording	Doubao fast edition	Doubao 2.0	妙记
28min	5	4	2
38min	3	3	2
57min	3	3	2
68min	4	3	2
3.45h (drama)	no result	10	3

Four two-person conversations, and 妙记 hit exactly 2 on every one. Both Doubao editions ran high on all four. On the 3.45-hour recording, the fast edition couldn't take the file and 2.0 returned 10 speakers; 妙记 swallowed it in one pass and scored 3, which for a multi-character drama track is far closer than 10.

The lesson landed cleanly. Rather than letting a general model split speakers as an afterthought, hand the whole job to a model that treats it as a first-class task. Diarization shouldn't be a byproduct of transcription. It deserves to be its own job, taken seriously.

One more trap: 妙记 needs a public URL, and the cross-border upload was brutal

妙记 has a hard requirement: it won't take an uploaded file, only a publicly fetchable URL. So local audio has to go to object storage first, then you hand it the link.

I used Volcano's own TOS (object storage), with the bucket in Shanghai. 妙记 fetching the file from Shanghai is in-region and fast. The slow leg is the other one — me uploading from the US to Shanghai. Single-stream, it ran at 34 KB/s; an 8 MB file took nearly four minutes. Unusable.

Cross-border links have a quirk: the throttle is per-connection, not on your total bandwidth. So the fix is direct — parallel multipart upload, splitting the file into pieces sent at once. After switching, the same file finished in 5 seconds. Worth noting: when a cross-border upload crawls, don't blame the bandwidth first. Open more connections; often it just goes.
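The shape of a parallel multipart upload, sketched with the storage call left abstract. `upload_part` is a hypothetical callable standing in for whatever the object store's SDK provides for uploading one numbered part; the 1 MiB part size and 8 workers are assumptions, and real stores often enforce a minimum part size, so check before copying the numbers.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def part_ranges(size, part_size):
    """Split `size` bytes into (part_number, offset, length) tuples, numbered from 1."""
    return [
        (number + 1, offset, min(part_size, size - offset))
        for number, offset in enumerate(range(0, size, part_size))
    ]


def upload_in_parallel(path, upload_part, part_size=1024 * 1024, workers=8):
    """Upload a file as independent parts, each on its own connection.

    upload_part(part_number, data) is supplied by the caller and should
    return whatever the store needs to finish the upload (e.g. an ETag).
    """
    def send(part):
        number, offset, length = part
        # Each worker opens its own handle so seeks do not interfere.
        with open(path, "rb") as handle:
            handle.seek(offset)
            return number, upload_part(number, handle.read(length))

    parts = part_ranges(os.path.getsize(path), part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in part order even if parts finish out of order.
        return list(pool.map(send, parts))
```

Because the throttle applies per connection, throughput should scale roughly with `workers` until some other limit takes over. The multipart session's create and complete calls are omitted here since they are specific to the SDK.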

With all of that swapped in, the transcriber and its companion skill now default to 妙记. Audio goes in, and the transcript comes back with the right speaker labels.

From Gemini chunk-and-stitch, to Senko's global pass, to Doubao, to 妙记 — it was a long loop. Looking back, transcription was never the bottleneck. What blocked me for half a year was the who-said-it half — and the answer wasn't in a smarter stitching algorithm. It was in a model willing to treat diarization as a real job.

If you're building out a meeting-recording pipeline, I also wrote a survey of the open-source options worth reading alongside this.