
OpenClaw TTS Runbook: Voice Setup for Every Provider

About this series: Each post in this series is a complete technical runbook — not a story, but instructions for an agent to execute directly. Copy the runbook, replace the placeholders (YOUR_FISH_API_KEY, YOUR_VOLC_APP_ID, etc.), hand it to Claude Code, and let it run. The story behind this one is at OpenClaw Field Notes.


This document covers text-to-speech setup for OpenClaw across all five supported providers: Fish Audio, Volcano Engine v2 (with emotion control), ElevenLabs, OpenAI, and Edge TTS. Each section is self-contained — jump to the provider you need.


Table of Contents

  1. Prerequisites
  2. Fish Audio Setup
  3. Volcano Engine v2 Setup (Emotion Control)
  4. ElevenLabs Setup
  5. OpenAI TTS Setup
  6. Edge TTS Setup (No API Key)
  7. Fallback Configuration
  8. Slash Commands Reference
  9. Troubleshooting Checklist

1. Prerequisites

Before configuring any TTS provider:

  • The framework is installed and running.
  • You have access to openclaw.json (the gateway configuration file).
  • You know how to restart the gateway or hot-reload with kill -USR1.

TTS config lives under messages.tts in openclaw.json. All provider sections are nested under this path.


2. Fish Audio Setup

Fish Audio outputs OGG/Opus — ideal for Telegram voice bubbles without transcoding.

Step 1. Get an API key

Create an account at fish.audio and get an API key from the dashboard.

Step 2. Pick a voice model

Browse the voice library or clone your own. Copy the reference ID.

Step 3. Add to openclaw.json

{
  messages: {
    tts: {
      auto: "always", // or "tagged" (LLM decides when) / "inbound" (voice replies only)
      provider: "fishaudio",
      fishaudio: {
        apiKey: "YOUR_FISH_API_KEY",       // or set env FISH_API_KEY
        referenceId: "YOUR_VOICE_REFERENCE_ID", // from fish.audio voice library
      },
    },
  },
}

Step 4. Restart the gateway

kill -USR1 $(pgrep -f openclaw)  # hot reload

Step 5. Verify

/tts status

Expected output: Provider: fishaudio (configured).

Test with:

/tts audio Hello from Fish Audio

Config fields

Field       | Env fallback | Description
apiKey      | FISH_API_KEY | Fish Audio API key
referenceId |              | Voice model ID from fish.audio (leave empty for default)

API details

POST https://api.fish.audio/v1/tts
Headers: Authorization: Bearer <apiKey>, model: s1
Body: { "text": "...", "reference_id": "...", "format": "opus", "opus_bitrate": 64 }

Returns raw Opus audio buffer. Output file is .ogg — Telegram sends it as a voice bubble.
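For a quick sanity check before touching openclaw.json, the request above can be assembled locally. This Python sketch only builds the headers and JSON body described here and does not send anything; the function name is illustrative, not part of OpenClaw.

```python
import json

def fish_tts_request(text, reference_id="", api_key="YOUR_FISH_API_KEY"):
    """Build (but do not send) the Fish Audio /v1/tts request described above."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "model": "s1",
        "Content-Type": "application/json",
    }
    body = {"text": text, "format": "opus", "opus_bitrate": 64}
    if reference_id:  # an empty referenceId falls back to Fish Audio's default voice
        body["reference_id"] = reference_id
    return headers, json.dumps(body)
```

Feed the result to any HTTP client to confirm your key and reference ID before editing the gateway config.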

Gotchas

  • Fish Audio does not support emotion markers. [brackets] will be spoken literally.
  • Output is OGG/Opus — Telegram shows it as a voice bubble automatically.
  • If referenceId is empty, Fish Audio uses its default voice.

3. Volcano Engine v2 Setup (Emotion Control)

Volcano Engine v2 uses the seed-tts-2.0 voice cloning model with LLM-driven emotion control via the context_texts API parameter. The LLM prepends [emotion] markers before each sentence, the framework parses them, calls the API per-sentence with individual emotion instructions, then concatenates the MP3 buffers into a single voice message.

Step 1. Create a Volcengine account

Go to console.volcengine.com.

Step 2. Enable the TTS service

Enable "语音合成" (Speech Synthesis). Get your App ID and Access Token from the console.

Step 3. Clone a voice

Clone a voice in the Volcano console. Note the speaker ID — it starts with S_ (e.g. S_EVeoGUVU1). Without a cloned voice, use a built-in speaker like zh_female_linzhiling_mars_bigtts.

Step 4. Add to openclaw.json

{
  messages: {
    tts: {
      auto: "tagged", // "tagged" = LLM decides when to voice; "always" = every reply
      provider: "volcano",
      volcano: {
        appId: "YOUR_VOLC_APP_ID",          // or set env VOLC_TTS_APP_ID
        accessKey: "YOUR_VOLC_ACCESS_KEY",   // or set env VOLC_TTS_ACCESS_TOKEN
        version: "v2",                        // REQUIRED for emotion control
        speaker: "YOUR_CLONED_VOICE_ID",     // e.g. "S_EVeoGUVU1"
        // resourceId auto-defaults to "volc.seedicl.default" for v2
      },
    },
  },
}

Step 5. Restart the gateway

kill -USR1 $(pgrep -f openclaw)  # hot reload

Step 6. Verify

/tts status

Expected output: Provider: volcano v2 (configured).

Test with:

/tts audio [开心]你好呀!

Config fields

Field      | Env fallback          | Default                           | Description
appId      | VOLC_TTS_APP_ID       |                                   | Volcano application ID
accessKey  | VOLC_TTS_ACCESS_TOKEN |                                   | Volcano access token
version    |                       | "v1"                              | Must be "v2" for emotion control
resourceId |                       | "volc.seedicl.default"            | Model resource. Auto-detected from ID patterns if not set explicitly
speaker    |                       | "zh_female_linzhiling_mars_bigtts" | Speaker or cloned voice ID. Cloned voices use the S_ prefix

Emotion marker syntax

The LLM uses three styles (all passed to context_texts):

Emotion labels (short keywords):

[开心]你好呀!今天天气真好!        ([happy] Hi there! The weather is lovely today!)
[伤心]可是我的猫生病了。          ([sad] But my cat is sick.)
[愤怒]这太过分了!                ([angry] This is outrageous!)

Voice commands (descriptive instructions):

[用温柔甜蜜的声音]晚安,好梦。      ([in a gentle, sweet voice] Good night, sweet dreams.)
[用冷淡不耐烦的语气]随便你吧。      ([in a cold, impatient tone] Whatever you say.)
[用激动兴奋的声音]我们赢了!        ([in an excited voice] We won!)

Context descriptions (scene narration):

[她正在生气地质问对方]你到底去哪儿了?    ([she is angrily confronting them] Where on earth did you go?)
[他刚收到好消息非常开心]太好了,我通过了!  ([he just got great news and is overjoyed] Great, I passed!)
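All three styles are parsed the same way: each [bracketed] marker opens a new segment. As an illustrative sketch (not the framework's actual parser), the split can be written as:

```python
import re

# Any [bracketed] run with no nested brackets counts as a marker.
MARKER = re.compile(r"\[([^\[\]]+)\]")

def split_emotion_segments(text):
    """Split '[emotion]sentence' runs into per-segment dicts for per-sentence TTS calls."""
    segments = []
    parts = MARKER.split(text)  # e.g. ["", "开心", "你好呀!", "伤心", "我好难过。"]
    if parts[0].strip():
        # Text before the first marker gets no emotion context.
        segments.append({"emotion": None, "text": parts[0].strip()})
    for i in range(1, len(parts) - 1, 2):
        body = parts[i + 1].strip()
        if body:
            segments.append({"emotion": parts[i], "text": body})
    return segments
```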

Data flow (end to end)

LLM generates: "[开心]你好呀![伤心]我好难过。"
       |
       v
Step 1: System prompt hint — told LLM to use [brackets]
       |
       v
Step 2: Payload processing — strips [markers] from visible
       |                      text, keeps them for TTS input
       v
Step 3: TTS dispatch — detects v2, enters v2 path
       |
       v
Step 4: Emotion parsing — splits into segments
       |    segment 1: { emotion: "开心", text: "你好呀!" }
       |    segment 2: { emotion: "伤心", text: "我好难过。" }
       v
Step 5: Per-segment API calls with emotion context
       |    POST .../api/v3/tts/unidirectional
       |    req_params.additions = {"context_texts":["开心"]}
       v
Step 6: Buffer concatenation — merges MP3 buffers into one file
       |
       v
Single voice bubble on Telegram (audioAsVoice: true)
User sees: "你好呀!我好难过。" (no brackets)

API details

Endpoint (same for v1 and v2):

POST https://openspeech.bytedance.com/api/v3/tts/unidirectional

Request body per segment:

{
  "user": { "uid": "tts-client" },
  "req_params": {
    "text": "你好呀!今天天气真好!",
    "speaker": "S_EVeoGUVU1",
    "audio_params": { "format": "mp3", "sample_rate": 24000 },
    "additions": "{\"context_texts\":[\"开心\"]}"
  }
}

Response: streaming binary MP3 chunks (parsed from JSON-framed response, base64 data field).
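Note that additions is a JSON-encoded string, not a nested object; that is easy to get wrong when building the body by hand. An illustrative Python sketch of one per-segment body (the helper name is not part of OpenClaw):

```python
import json

def volcano_segment_body(text, emotion, speaker):
    """Build one per-segment v2 request body as documented above."""
    body = {
        "user": {"uid": "tts-client"},
        "req_params": {
            "text": text,
            "speaker": speaker,
            "audio_params": {"format": "mp3", "sample_rate": 24000},
        },
    }
    if emotion:
        # additions must be a JSON *string* embedding context_texts.
        body["req_params"]["additions"] = json.dumps(
            {"context_texts": [emotion]}, ensure_ascii=False
        )
    return body
```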

Differences from v1

                      | v1                             | v2
Resource ID           | seed-tts-1.0                   | volc.seedicl.default
Emotion control       | None                           | context_texts per sentence
Output                | Single audio, no emotion       | Concatenated segments with emotion
Telegram voice bubble | No (MP3, not voice-compatible) | Yes (voiceCompatible: true)
Marker stripping      | [brackets] stripped            | Stripped from display, preserved for TTS

Gotchas

  • version: "v2" is required. Without it, you get v1 behavior (no emotion, no voice bubble).
  • context_texts only affects the first sentence per API call. This is why the framework splits into per-sentence calls.
  • [bracket] markers must NOT be stripped before TTS. They are only stripped from the visible text shown to the user.
  • MP3 output is voice-compatible on Telegram. v2 sets voiceCompatible: true for the round voice bubble.
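The stripping rule in the last two gotchas amounts to: remove markers from the display text, keep the original string for TTS. A minimal sketch of the display-side strip (illustrative, not OpenClaw's code):

```python
import re

def strip_emotion_markers(text):
    """What the user should see: the reply with [emotion] markers removed."""
    return re.sub(r"\[[^\[\]]+\]", "", text)
```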

4. ElevenLabs Setup

Step 1. Get an API key

Sign up at elevenlabs.io and get your API key from the dashboard.

Step 2. Pick a voice

Browse the voice library or clone your own. Copy the voice ID.

Step 3. Add to openclaw.json

{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      elevenlabs: {
        apiKey: "YOUR_ELEVENLABS_API_KEY",   // or set env ELEVENLABS_API_KEY
        voiceId: "YOUR_ELEVENLABS_VOICE_ID",
        modelId: "eleven_multilingual_v2",
        voiceSettings: {
          stability: 0.5,
          similarityBoost: 0.75,
          style: 0.0,
          useSpeakerBoost: true,
          speed: 1.0,
        },
      },
    },
  },
}

Step 4. Restart the gateway

kill -USR1 $(pgrep -f openclaw)  # hot reload

Step 5. Verify

/tts status

Expected output: Provider: elevenlabs (configured).

Test with:

/tts audio Hello from ElevenLabs

Config fields

Field                         | Env fallback                    | Description
apiKey                        | ELEVENLABS_API_KEY / XI_API_KEY | ElevenLabs API key
baseUrl                       |                                 | Override API base URL (default: https://api.elevenlabs.io)
voiceId                       |                                 | Voice ID from ElevenLabs
modelId                       |                                 | Model (e.g. eleven_multilingual_v2)
seed                          |                                 | Integer 0..4294967295 (best-effort determinism)
applyTextNormalization        |                                 | auto, on, or off
languageCode                  |                                 | 2-letter ISO 639-1 (e.g. en, de)
voiceSettings.stability       |                                 | 0..1 (lower = more expressive)
voiceSettings.similarityBoost |                                 | 0..1 (higher = closer to original)
voiceSettings.style           |                                 | 0..1
voiceSettings.useSpeakerBoost |                                 | true or false
voiceSettings.speed           |                                 | 0.5..2.0 (1.0 = normal)

Output formats

  • Telegram: Opus voice note (opus_48000_64 — 48kHz / 64kbps, required for the round bubble).
  • Other channels: MP3 (mp3_44100_128 — 44.1kHz / 128kbps).

5. OpenAI TTS Setup

Step 1. Get an API key

Go to platform.openai.com and create an API key.

Step 2. Add to openclaw.json

{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      openai: {
        apiKey: "YOUR_OPENAI_API_KEY",  // or set env OPENAI_API_KEY
        model: "gpt-4o-mini-tts",
        voice: "alloy",                 // alloy, echo, fable, onyx, nova, shimmer
      },
    },
  },
}

Step 3. Restart the gateway

kill -USR1 $(pgrep -f openclaw)  # hot reload

Step 4. Verify

/tts status

Expected output: Provider: openai (configured).

Test with:

/tts audio Hello from OpenAI

Config fields

Field  | Env fallback   | Description
apiKey | OPENAI_API_KEY | OpenAI API key
model  |                | TTS model (e.g. gpt-4o-mini-tts)
voice  |                | Voice: alloy, echo, fable, onyx, nova, shimmer

Output formats

  • Telegram: Opus voice note (opus format).
  • Other channels: MP3 (mp3 format).

6. Edge TTS Setup (No API Key)

Edge TTS uses Microsoft Edge's online neural TTS service via the node-edge-tts library. No API key required. This is the simplest provider to set up and the default when no other API keys are configured.

Step 1. Add to openclaw.json

{
  messages: {
    tts: {
      auto: "always",
      provider: "edge",
      edge: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",
        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
        rate: "+10%",
        pitch: "-5%",
      },
    },
  },
}

Step 2. Restart the gateway

kill -USR1 $(pgrep -f openclaw)  # hot reload

Step 3. Verify

/tts status

Expected output: Provider: edge (configured).

Test with:

/tts audio Hello from Edge TTS

Config fields

Field         | Default                         | Description
enabled       | true                            | Allow Edge TTS usage
voice         | en-US-MichelleNeural            | Edge neural voice name
lang          | en-US                           | Language code
outputFormat  | audio-24khz-48kbitrate-mono-mp3 | Edge output format (see Microsoft docs)
rate          |                                 | Speed adjustment (e.g. +10%, -20%)
pitch         |                                 | Pitch adjustment (e.g. -5%, +10%)
volume        |                                 | Volume adjustment (e.g. +50%)
saveSubtitles |                                 | Write JSON subtitles alongside audio
proxy         |                                 | Proxy URL for Edge TTS requests
timeoutMs     |                                 | Request timeout override (ms)

Gotchas

  • Edge TTS is a public web service without a published SLA or quota. Treat it as best-effort.
  • Not all outputFormat values are supported by the Edge service. If the configured format fails, the framework retries with MP3.
  • Telegram sendVoice accepts OGG/MP3/M4A. Use OpenAI or ElevenLabs if you need guaranteed Opus voice notes.

To disable Edge TTS entirely

{
  messages: {
    tts: {
      edge: {
        enabled: false,
      },
    },
  },
}

7. Fallback Configuration

If multiple providers are configured, the framework uses the selected provider first and falls back to others automatically.

Example: OpenAI primary with ElevenLabs fallback

{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      openai: {
        apiKey: "YOUR_OPENAI_API_KEY",
        model: "gpt-4o-mini-tts",
        voice: "alloy",
      },
      elevenlabs: {
        apiKey: "YOUR_ELEVENLABS_API_KEY",
        voiceId: "YOUR_ELEVENLABS_VOICE_ID",
        modelId: "eleven_multilingual_v2",
      },
    },
  },
}

Provider priority (when provider is unset)

If provider is not specified, the framework picks automatically:

  1. openai (if OPENAI_API_KEY is set)
  2. elevenlabs (if ELEVENLABS_API_KEY is set)
  3. edge (always available, no key needed)
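The selection order above can be sketched as a small function (illustrative only; this helper is not part of OpenClaw's API):

```python
import os

def pick_provider(cfg):
    """Mirror the documented auto-selection when messages.tts.provider is unset."""
    if cfg.get("provider"):
        return cfg["provider"]          # explicit choice always wins
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    if os.environ.get("ELEVENLABS_API_KEY"):
        return "elevenlabs"
    return "edge"                       # always available, no key needed
```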

Auto-summary for long replies

When TTS is enabled and a reply exceeds maxLength (default: 1500 chars), the framework summarizes it first using summaryModel, then converts the summary to audio.

{
  messages: {
    tts: {
      auto: "always",
      maxTextLength: 4000,       // hard cap for TTS input (chars)
      timeoutMs: 30000,          // request timeout (ms)
      summaryModel: "openai/gpt-4.1-mini",
    },
  },
}

To disable auto-summary:

/tts summary off

Auto-TTS behavior

When enabled, the framework:

  • Skips TTS if the reply already contains media or a MEDIA: directive.
  • Skips very short replies (< 10 chars).
  • Summarizes long replies when enabled.
  • Attaches the generated audio to the reply.

Flow diagram

Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
          yes -> send text
          no  -> length > limit?
                   no  -> TTS -> attach audio
                   yes -> summary enabled?
                            no  -> send text
                            yes -> summarize -> TTS -> attach audio
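The same flow as a small decision function (an illustrative sketch of the documented behavior, not OpenClaw's actual code; the 1500-char default and 10-char minimum come from this section):

```python
def tts_decision(text, has_media=False, enabled=True, limit=1500, summarize=True):
    """Return what happens to one reply under the flow above."""
    if not enabled:
        return "send text"
    if has_media or "MEDIA:" in text or len(text) < 10:
        return "send text"              # existing media / directive / too short
    if len(text) <= limit:
        return "tts + attach audio"
    return "summarize -> tts + attach audio" if summarize else "send text"
```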

Model-driven overrides

By default, the model can emit [[tts:...]] directives to override the voice for a single reply:

Here you go.

[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]
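Directive parsing can be sketched as follows. This is an illustrative parser, not OpenClaw's implementation; it assumes key=value pairs are space-separated and leaves [[tts:text]] blocks alone since they carry no overrides:

```python
import re

def parse_tts_directive(reply):
    """Return (overrides, cleaned_reply) for a [[tts:key=value ...]] directive."""
    m = re.search(r"\[\[tts:([^\[\]]+)\]\]", reply)
    if not m or "=" not in m.group(1):
        return {}, reply  # no directive, or a [[tts:text]] block with no overrides
    overrides = dict(pair.split("=", 1) for pair in m.group(1).split())
    return overrides, reply.replace(m.group(0), "").strip()
```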

To enable provider switching via model directives:

{
  messages: {
    tts: {
      modelOverrides: {
        enabled: true,
        allowProvider: true,
      },
    },
  },
}

To disable all model overrides:

{
  messages: {
    tts: {
      modelOverrides: {
        enabled: false,
      },
    },
  },
}

8. Slash Commands Reference

Single command: /tts (Discord: /voice, since /tts is a built-in Discord command).

Command              | Effect
/tts off             | Disable auto-TTS for this session
/tts always          | Enable auto-TTS for every reply (alias: /tts on)
/tts inbound         | Only voice-reply after an inbound voice note
/tts tagged          | Only voice-reply when the LLM emits [[tts]] tags
/tts status          | Show current TTS provider and configuration
/tts provider openai | Switch provider (openai / elevenlabs / edge / volcano / fishaudio)
/tts limit 2000      | Set summary threshold (chars)
/tts summary off     | Disable auto-summary for long replies
/tts audio Hello     | Generate a one-off audio reply (does not toggle TTS on)

Notes:

  • Commands require an authorized sender (allowlist/owner rules still apply).
  • commands.text or native command registration must be enabled.
  • off|always|inbound|tagged are per-session toggles.
  • limit and summary are stored in local prefs, not the main config.

Per-user preferences

Slash commands write local overrides to a local preferences file.

Stored fields: enabled, provider, maxLength, summarize.

These override messages.tts.* for that host.
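The merge is a straightforward shallow override. An illustrative sketch (field names from the list above; the helper itself is hypothetical):

```python
def effective_tts_config(global_cfg, user_prefs):
    """Per-user prefs (enabled, provider, maxLength, summarize) win over messages.tts.*."""
    merged = dict(global_cfg)
    # Unset prefs (None) do not shadow global config values.
    merged.update({k: v for k, v in user_prefs.items() if v is not None})
    return merged
```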


9. Troubleshooting Checklist

No audio output at all:

  • Is TTS enabled? Check auto is not "off". Run /tts status.
  • Is the provider configured? Check the provider-specific section has required fields.
  • Are API keys set? Check openclaw.json or environment variables.
  • Is the reply too short? TTS skips replies under 10 chars.
  • Does the reply contain media? TTS skips replies with existing media or MEDIA: directives.

Audio but no emotion (Volcano v2):

  • Is version: "v2" set? Check /tts status shows volcano v2, not just volcano.
  • Is the LLM generating [bracket] markers? Check the system prompt includes emotion instructions.
  • Are markers being stripped before TTS? They should only be stripped from visible text.

Voice sent as document, not voice bubble (Telegram):

  • Fish Audio: output should be .ogg — check voiceCompatible: true.
  • Volcano v2: MP3 output should have voiceCompatible: true.
  • OpenAI/ElevenLabs: Telegram format should be Opus.

[brackets] showing in the user-visible text:

  • For Volcano v2: the emotion marker stripping step should remove them from display text.
  • For non-Volcano providers: [brackets] are not supported and will be spoken literally.

Duplicate voice messages:

  • The TTS tool returns "Audio delivered. Do not re-send." to prevent the LLM from re-sending via the message tool.

Edge TTS fails silently:

  • Edge TTS is a public service without SLA. It may be rate-limited or down.
  • Check if the configured outputFormat is supported. The framework retries with MP3 on failure.



© Xingfan Xia 2024 - 2026 · CC BY-NC 4.0