🐾 claw-stack

Last updated: March 2026 · Source: voice-call

Executive Voice Interface

A voice conversation interface for OpenClaw — talk to your AI assistant over WebRTC from a browser or phone. Fully local speech-to-text (MLX-Whisper on Apple Silicon), free TTS (Edge-TTS), Claude via OpenClaw OAuth, and self-hosted LiveKit for audio routing.


What

voice-call provides a voice conversation loop: a browser or phone connects via WebRTC, speech is detected by VAD, transcribed locally by Whisper, sent to Claude (via OpenClaw OAuth), and the response is synthesized by Edge-TTS and streamed back. Claude can call agent tools during the conversation — reading files, running commands, searching memory, and listing active sessions. The entire STT pipeline runs on-device; no audio leaves the local machine.
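The loop above can be sketched as a single turn function. This is an illustrative outline only; `vad`, `stt`, `llm`, and `tts` are hypothetical callables standing in for the real Silero, MLX-Whisper, Claude, and Edge-TTS components:

```python
def run_turn(audio, vad, stt, llm, tts):
    """One conversation turn, wiring the four layers of the pipeline.

    vad:  raw audio  -> list of speech segments
    stt:  segment    -> transcribed text (runs locally)
    llm:  text       -> reply text (only text leaves the machine)
    tts:  reply text -> synthesized audio for the caller
    """
    replies = []
    for segment in vad(audio):       # only detected speech is processed
        text = stt(segment)          # local transcription
        reply = llm(text)            # Claude call with transcript only
        replies.append(tts(reply))   # audio streamed back via LiveKit
    return replies
```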

Why

Text interfaces require a screen and keyboard. Voice enables hands-free interaction while mobile, cooking, commuting, or in situations where typing is impractical. The obvious solutions — the OpenAI Realtime API, Gemini Live — send audio to cloud servers, raising privacy concerns for anyone who discusses confidential projects, personal information, or unreleased work with their AI assistant.

This module keeps the STT pipeline entirely local. MLX-Whisper large-v3-mlx-4bit runs on Apple Silicon faster than real-time — a 10-second utterance transcribes in under a second. Only the transcribed text (and any tool results) travels to the Anthropic API. Spoken words never leave the machine.

Architecture

Four layers connected through LiveKit:

1. Audio Transport — LiveKit + WebRTC

LiveKit Server handles WebRTC audio routing. The browser or phone connects via WSS to the token server, which proxies to LiveKit. Tailscale provides a trusted TLS certificate so iOS Safari accepts the WebSocket connection from any network.

2. Speech-to-Text — MLX-Whisper (local)

Silero VAD detects speech in the audio stream. When speech ends, the STT module transcribes with Whisper large-v3-mlx-4bit running locally on Apple Silicon. No audio is sent to an external STT service.

3. LLM — Claude via OpenClaw OAuth

The LLM adapter loads the OAuth token from OpenClaw's credential store — no separate Anthropic API key needed. Claude can call agent tools during the conversation: file read, command execution, memory search, and session list.

4. Text-to-Speech — Edge-TTS (free)

Microsoft Edge-TTS synthesizes responses — free, no API key required — and streams audio back to the caller through LiveKit.

Source files:

agent.py — Main voice agent entry point (LiveKit agent loop)
stt_mlx.py — Custom STT plugin using MLX-Whisper large-v3-mlx-4bit
llm_anthropic.py — LLM adapter: Claude via OpenClaw OAuth, with tool calling
tts_edge.py — TTS via Microsoft Edge-TTS (free)
tools.py — Agent tools: read_file, run_command, search_memory, list_sessions, think_carefully
token_server.py — HTTPS server, WSS proxy to LiveKit, JWT token generation
gateway.py — OpenClaw Gateway WebSocket client (device identity auth)
web/index.html — Browser call UI

Agent tools available during voice conversations:

read_file — Read a file from the local filesystem (truncated to max_lines)
run_command — Execute a shell command with a configurable timeout
search_memory — Search OpenClaw memory via the qmd CLI
list_sessions — List active OpenClaw sessions via Gateway WebSocket
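One common way to wire such tools into an LLM's tool-calling loop is a registry that maps tool names to handlers. The sketch below is an assumption about the pattern, not the module's actual code; the `tool` decorator and `dispatch` helper are hypothetical, and the truncation format in `read_file` is illustrative:

```python
import subprocess
from pathlib import Path

# Hypothetical registry: tool name (as exposed to Claude) -> handler.
TOOLS = {}

def tool(fn):
    """Register a function as a callable agent tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def read_file(path: str, max_lines: int = 200) -> str:
    """Read a local file, truncated to max_lines."""
    lines = Path(path).read_text().splitlines()
    body = "\n".join(lines[:max_lines])
    if len(lines) > max_lines:
        body += f"\n... (truncated, {len(lines) - max_lines} more lines)"
    return body

@tool
def run_command(cmd: str, timeout: float = 30.0) -> str:
    """Execute a shell command with a timeout; return combined output."""
    try:
        out = subprocess.run(cmd, shell=True, capture_output=True,
                             text=True, timeout=timeout)
        return out.stdout + out.stderr
    except subprocess.TimeoutExpired:
        return f"command timed out after {timeout}s"

def dispatch(name: str, **kwargs) -> str:
    """Route a tool call from the LLM to the registered handler."""
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**kwargs)
```

Returning an error string (rather than raising) for an unknown tool lets the model see the failure and recover within the conversation.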

Key Design Decisions

Local STT — audio never leaves the machine

MLX-Whisper runs on Apple Silicon faster than real-time. The privacy guarantee is architectural: the audio pipeline is entirely local, so there is no API call that could leak spoken content. Only the transcribed text travels to the Anthropic API — and even then, only if the operator is comfortable with that.

LiveKit for WebRTC — don't implement WebRTC directly

WebRTC signaling, ICE negotiation, and codec handling are complex. LiveKit provides a production-grade abstraction that handles multi-device routing, reconnection, and audio quality management. The alternative (raw WebRTC) would require maintaining significantly more infrastructure code.

Edge-TTS — free, no key, good quality

Microsoft's Edge TTS endpoint is free and produces natural-sounding speech without requiring an API key or billing account. The tradeoff: it requires an outbound connection to Microsoft's servers for each response. For fully air-gapped use, substitute an on-device TTS (e.g. Kokoro, or system TTS).

Tailscale for remote access — trusted TLS without a public domain

iOS Safari requires valid TLS for WebSocket connections. Getting a Let's Encrypt certificate for a home server normally requires a public domain. Tailscale Serve proxies the token server behind the Tailscale HTTPS endpoint — which has a valid Let's Encrypt certificate issued to the Tailscale DNS name — without exposing anything to the public internet.

JWT call links — single-use tokens per call

Each call link is a short-lived signed JWT that authorizes one participant to join one LiveKit room. There is no persistent login session. Links can be generated on demand and expire automatically, making it easy to share a call link with a phone without creating a persistent credential.
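Minting such a link needs nothing beyond the standard library. The sketch below signs an HS256 JWT by hand; the claim layout (iss = API key, a "video" grant with roomJoin) follows LiveKit's documented access-token format, but verify it against the LiveKit docs before relying on it:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Unpadded base64url, as required by the JWT spec."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_call_token(api_key: str, api_secret: str, room: str,
                    identity: str, ttl_s: int = 600) -> str:
    """Mint a short-lived HS256 JWT authorizing one participant
    (identity) to join one LiveKit room. No persistent login session:
    the link simply stops working once exp passes."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": api_key,                 # LiveKit API key
        "sub": identity,                # participant identity
        "nbf": now,
        "exp": now + ttl_s,             # link expires automatically
        "video": {"room": room, "roomJoin": True},
    }
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    sig = hmac.new(api_secret.encode(), signing_input.encode(),
                   hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"
```

In production you would typically use the `livekit-api` package's token builder instead; the point here is that a call link is just a signed, expiring claim.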

How to Build Your Own

1. Use VAD before STT — don't transcribe continuous audio

Voice Activity Detection (Silero VAD or WebRTC VAD) detects when someone is speaking and when they've finished. Only the speech segment gets passed to Whisper. Without VAD, you'd either transcribe silence (waste) or need the user to press a button to speak (awkward).
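The gating logic can be illustrated with a toy energy-threshold VAD. This is a stand-in for Silero VAD, not a substitute for it (Silero is a trained model and far more robust); the frame length, threshold, and hangover values are arbitrary:

```python
def segment_speech(samples, frame_len=160, threshold=0.02, hangover=5):
    """Toy energy-gate VAD: return (start, end) sample ranges that
    contain speech. Frames whose RMS exceeds the threshold open a
    segment; `hangover` consecutive quiet frames close it, so short
    pauses don't split one utterance into several."""
    segments, start, quiet = [], None, 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            if start is None:
                start = i            # speech onset: open a segment
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:     # enough silence: close the segment
                segments.append((start, i))
                start, quiet = None, 0
    if start is not None:            # speech ran to the end of the buffer
        segments.append((start, len(samples)))
    return segments
```

Only the returned ranges would be handed to Whisper, so silence is never transcribed and the user never has to press a button.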

2. MLX-Whisper on Apple Silicon, faster-whisper on CUDA, API elsewhere

MLX-Whisper is Apple Silicon-specific. On CUDA hardware, faster-whisper achieves similar throughput. For cloud deployment or hardware without a GPU, use an API-based STT service and accept that audio will leave the device. Make the STT layer swappable — the rest of the architecture is independent of the STT choice.
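A swappable STT layer can be as small as one abstract interface. The sketch below is an assumed design, not the module's code; the `mlx_whisper.transcribe` call and its `path_or_hf_repo` keyword reflect the mlx-whisper package's public API, but check its docs before use:

```python
from abc import ABC, abstractmethod

class STTBackend(ABC):
    """The only interface the rest of the pipeline depends on. Concrete
    backends (MLX-Whisper, faster-whisper, an API client) plug in here
    without touching the VAD, LLM, or TTS code."""

    @abstractmethod
    def transcribe(self, pcm: bytes, sample_rate: int) -> str: ...

class MLXWhisperSTT(STTBackend):
    """Apple Silicon backend wrapping mlx-whisper. The import is
    deferred so the class can be defined on any platform."""

    def __init__(self, repo="mlx-community/whisper-large-v3-mlx-4bit"):
        self.repo = repo

    def transcribe(self, pcm, sample_rate):
        import mlx_whisper              # only needed when actually called
        import numpy as np
        # 16-bit PCM -> float32 in [-1, 1], the range Whisper expects
        audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768
        return mlx_whisper.transcribe(audio, path_or_hf_repo=self.repo)["text"]
```

A CUDA backend would implement the same two-argument `transcribe` over faster-whisper, and an API backend over an HTTP client; the agent loop never knows the difference.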

3. Token server pattern β€” keep LiveKit internal

The token server is a small HTTPS server that generates JWT tokens for LiveKit room access and proxies WebSocket connections. It's the only service exposed externally (via Tailscale). LiveKit itself runs on localhost β€” it doesn't need to be reachable from the browser directly.

4. iOS Safari requires valid TLS — plan for this

Safari on iOS refuses WebSocket connections to servers with self-signed certificates, even if the user manually accepts the cert. Tailscale Serve solves this cleanly. Without Tailscale, you need a public domain and Let's Encrypt, or a reverse proxy with a valid cert on a cloud host.

5. Inject MEMORY.md at session start for context continuity

Voice conversations are stateless by default — each call starts fresh. To maintain continuity with ongoing projects, inject the agent's MEMORY.md into the system prompt at the start of each session. The agent then has immediate context about active projects without needing to be briefed verbally at the start of every call.
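The injection step is a few lines. The file path, section header, and size cap below are illustrative choices, not the module's actual values; the cap keeps a large memory file from crowding out the rest of the context window:

```python
from pathlib import Path

def build_system_prompt(base_prompt: str,
                        memory_path: str = "MEMORY.md",
                        max_chars: int = 8000) -> str:
    """Prepend the session's system prompt with the agent's long-term
    memory file, truncated to max_chars, so each call starts with
    context about active projects. Missing file -> base prompt only."""
    path = Path(memory_path)
    memory = path.read_text()[:max_chars] if path.exists() else ""
    if not memory:
        return base_prompt
    return f"{base_prompt}\n\n## Memory\n\n{memory}"
```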

Frequently Asked Questions

Is speech transcription done locally?

Yes. MLX-Whisper runs entirely on your Apple Silicon Mac — audio is never sent to an external STT service. Only the transcribed text (and any tool results) travels to the Anthropic API.

Do I need a paid Anthropic API key?

No. The LLM adapter authenticates using the OAuth token that OpenClaw manages. As long as you have an active OpenClaw session with a valid Anthropic OAuth credential, no separate API key is needed.

Does this work outside my home network?

Yes, via Tailscale. The start script configures Tailscale Serve to expose the token server on your Tailscale DNS name with a Let's Encrypt certificate. Any device in your Tailnet can then open a call link from any network.

Can I run this on Intel Mac or Linux?

MLX-Whisper requires Apple Silicon. On an Intel Mac or Linux you would need to swap the STT module for a different provider — faster-whisper (CUDA), a system-level STT, or an API-based service. The rest of the architecture (LiveKit, Edge-TTS, the token server) is cross-platform.

Authors: Qiushi Wu & Orange 🍊