Last updated: March 2026 · Source: voice-call

# Executive Voice Interface

A voice conversation interface for OpenClaw: talk to your AI assistant over WebRTC from a browser or phone. Fully local speech-to-text (MLX-Whisper on Apple Silicon), free TTS (Edge-TTS), Claude via OpenClaw OAuth, and self-hosted LiveKit for audio routing.

## What

voice-call provides a voice conversation loop: a browser or phone connects via WebRTC, speech is detected by VAD, transcribed locally by Whisper, sent to Claude (via OpenClaw OAuth), and the response is synthesized by Edge-TTS and streamed back. Claude can call agent tools during the conversation: reading files, running commands, searching memory, and listing active sessions. The entire STT pipeline runs on-device; no audio leaves the local machine.
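In outline, the loop looks like the sketch below. The four callables are placeholders for the layers described under Architecture, not the module's actual API.

```python
from typing import Awaitable, Callable

async def conversation_loop(
    wait_for_utterance: Callable[[], Awaitable[bytes]],  # VAD layer
    transcribe: Callable[[bytes], str],                  # local STT
    ask_claude: Callable[[list], Awaitable[str]],        # LLM with tool calling
    speak: Callable[[str], Awaitable[None]],             # TTS back over LiveKit
) -> None:
    """One voice turn per iteration: VAD -> STT -> LLM -> TTS."""
    history: list[dict] = []
    while True:
        utterance = await wait_for_utterance()   # complete speech segment (PCM bytes)
        text = transcribe(utterance)             # runs on-device
        history.append({"role": "user", "content": text})
        reply = await ask_claude(history)        # only text leaves the machine
        history.append({"role": "assistant", "content": reply})
        await speak(reply)                       # synthesize and stream to caller
```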
## Why

Text interfaces require a screen and keyboard. Voice enables hands-free interaction while mobile, cooking, commuting, or in situations where typing is impractical. The obvious solutions (OpenAI Realtime API, Gemini Live) send audio to cloud servers, raising privacy concerns for anyone who discusses confidential projects, personal information, or unreleased work with their AI assistant.

This module keeps the STT pipeline entirely local. MLX-Whisper large-v3-mlx-4bit runs on Apple Silicon faster than real-time: a 10-second utterance transcribes in under a second. Only the transcribed text (and any tool results) travels to the Anthropic API. Spoken words never leave the machine.

## Architecture
Four layers connected through LiveKit:
### 1. Audio Transport: LiveKit + WebRTC
LiveKit Server handles WebRTC audio routing. The browser or phone connects via WSS to the token server, which proxies to LiveKit. Tailscale provides a trusted TLS certificate so iOS Safari accepts the WebSocket connection from any network.
### 2. Speech-to-Text: MLX-Whisper (local)
Silero VAD detects speech in the audio stream. When speech ends, the STT module transcribes with Whisper large-v3-mlx-4bit running locally on Apple Silicon. No audio is sent to an external STT service.
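A minimal transcription call with the mlx-whisper package looks roughly like this; the Hugging Face repo name is inferred from the model name above and should be verified:

```python
# Local transcription sketch: the model weights are fetched once from
# Hugging Face, then inference runs entirely on the Apple Silicon GPU.
import mlx_whisper

def transcribe(wav_path: str) -> str:
    result = mlx_whisper.transcribe(
        wav_path,
        path_or_hf_repo="mlx-community/whisper-large-v3-mlx-4bit",  # assumed repo name
    )
    return result["text"].strip()
```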
### 3. LLM: Claude via OpenClaw OAuth
The LLM adapter loads the OAuth token from OpenClaw's credential store; no separate Anthropic API key is needed. Claude can call agent tools during the conversation: file read, command execution, memory search, and session list.
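A sketch of such an adapter, assuming a hypothetical credential file (OpenClaw's actual store location and JSON layout may differ, and the model id is illustrative):

```python
import json
import pathlib

import anthropic

def load_oauth_token() -> str:
    # Hypothetical path and key layout; adjust to the real credential store.
    path = pathlib.Path("~/.openclaw/credentials.json").expanduser()
    return json.loads(path.read_text())["anthropic"]["access_token"]

# auth_token sends "Authorization: Bearer ..." instead of an x-api-key header.
client = anthropic.Anthropic(auth_token=load_oauth_token())
reply = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize the active sessions."}],
)
print(reply.content[0].text)
```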
### 4. Text-to-Speech: Edge-TTS (free)
Microsoft Edge-TTS synthesizes responses (free, no API key required) and streams audio back to the caller through LiveKit.
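Synthesis with the edge-tts package takes a few lines; the voice name below is one of the stock en-US voices, chosen for illustration:

```python
import asyncio

import edge_tts

async def synthesize(text: str, out_path: str = "reply.mp3") -> None:
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    with open(out_path, "wb") as f:
        # stream() interleaves audio chunks with word-boundary metadata events.
        async for chunk in communicate.stream():
            if chunk["type"] == "audio":
                f.write(chunk["data"])

asyncio.run(synthesize("Build finished. Two tests failed."))
```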
Source files:
| File | Responsibility |
|---|---|
| agent.py | Main voice agent entry point (LiveKit agent loop) |
| stt_mlx.py | Custom STT plugin using MLX-Whisper large-v3-mlx-4bit |
| llm_anthropic.py | LLM adapter: Claude via OpenClaw OAuth, with tool calling |
| tts_edge.py | TTS via Microsoft Edge-TTS (free) |
| tools.py | Agent tools: read_file, run_command, search_memory, list_sessions, think_carefully |
| token_server.py | HTTPS server + WSS proxy to LiveKit + JWT token generation |
| gateway.py | OpenClaw Gateway WebSocket client (device identity auth) |
| web/index.html | Browser call UI |
Agent tools available during voice conversations (a sketch of one tool definition follows the table):
| Tool | What it does |
|---|---|
| read_file | Read a file from the local filesystem (truncated to max_lines) |
| run_command | Execute a shell command with a configurable timeout |
| search_memory | Search OpenClaw memory via the qmd CLI |
| list_sessions | List active OpenClaw sessions via Gateway WebSocket |
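As noted above, here is a sketch of how one of these tools can be declared and implemented; the schema follows Anthropic's tool-calling format, and the max_lines default is illustrative rather than taken from tools.py:

```python
import pathlib

# Tool declaration passed to the Messages API in the `tools` parameter.
READ_FILE_SCHEMA = {
    "name": "read_file",
    "description": "Read a text file from the local filesystem.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "max_lines": {"type": "integer", "default": 200},
        },
        "required": ["path"],
    },
}

def read_file(path: str, max_lines: int = 200) -> str:
    """Executed when Claude emits a tool_use block naming read_file."""
    lines = pathlib.Path(path).expanduser().read_text().splitlines()
    body = "\n".join(lines[:max_lines])
    if len(lines) > max_lines:
        body += f"\n... truncated at {max_lines} lines ..."
    return body
```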
## Key Design Decisions

### Local STT: audio never leaves the machine

MLX-Whisper runs on Apple Silicon faster than real-time. The privacy guarantee is architectural: the audio pipeline is entirely local, so there is no API call that could leak spoken content. Only the transcribed text travels to the Anthropic API, and even then only if the operator is comfortable with that.

### LiveKit for WebRTC: don't implement WebRTC directly
WebRTC signaling, ICE negotiation, and codec handling are complex. LiveKit provides a production-grade abstraction that handles multi-device routing, reconnection, and audio quality management. The alternative (raw WebRTC) would require maintaining significantly more infrastructure code.
### Edge-TTS: free, no key, good quality
Microsoft's Edge TTS endpoint is free and produces natural-sounding speech without requiring an API key or billing account. The tradeoff: it requires an outbound connection to Microsoft's servers for each response. For fully air-gapped use, substitute an on-device TTS (e.g. Kokoro, or system TTS).
### Tailscale for remote access: trusted TLS without a public domain

iOS Safari requires valid TLS for WebSocket connections. Getting a Let's Encrypt certificate for a home server normally requires a public domain. Tailscale Serve proxies the token server behind the Tailscale HTTPS endpoint (which has a valid Let's Encrypt certificate issued to the Tailscale DNS name) without exposing anything to the public internet.

### JWT call links: single-use tokens per call
Each call link is a short-lived signed JWT that authorizes one participant to join one LiveKit room. There is no persistent login session. Links can be generated on demand and expire automatically, making it easy to share a call link with a phone without creating a persistent credential.
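Minting such a link with the livekit-api package can look like this; the key names, room name, and TTL are illustrative:

```python
from datetime import timedelta

from livekit import api

def make_call_link(api_key: str, api_secret: str, base_url: str) -> str:
    token = (
        api.AccessToken(api_key, api_secret)
        .with_identity("caller")
        .with_ttl(timedelta(minutes=10))  # link expires on its own
        .with_grants(api.VideoGrants(room_join=True, room="voice-call"))
        .to_jwt()
    )
    return f"{base_url}/?token={token}"
```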
## How to Build Your Own

### 1. Use VAD before STT: don't transcribe continuous audio
Voice Activity Detection (Silero VAD or WebRTC VAD) detects when someone is speaking and when they've finished. Only the speech segment gets passed to Whisper. Without VAD, you'd either transcribe silence (waste) or need the user to press a button to speak (awkward).
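A file-based sketch with the silero-vad package; a live pipeline feeds short audio frames instead, but the model usage is the same:

```python
from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

model = load_silero_vad()
wav = read_audio("call_audio.wav")  # expects 16 kHz mono
segments = get_speech_timestamps(wav, model, return_seconds=True)
for seg in segments:
    # Only these windows are handed to Whisper; silence is never transcribed.
    print(f"speech from {seg['start']}s to {seg['end']}s")
```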
### 2. MLX-Whisper on Apple Silicon, faster-whisper on CUDA, API elsewhere
MLX-Whisper is Apple Silicon-specific. On CUDA hardware, faster-whisper achieves similar throughput. For cloud deployment or hardware without a GPU, use an API-based STT service and accept that audio will leave the device. Make the STT layer swappable; the rest of the architecture is independent of the STT choice.
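One way to keep the layer swappable is a small Protocol that every backend satisfies; the class names and model identifiers below are illustrative:

```python
from typing import Protocol

class STT(Protocol):
    def transcribe(self, wav_path: str) -> str: ...

class MLXWhisperSTT:
    """Apple Silicon backend."""
    def transcribe(self, wav_path: str) -> str:
        import mlx_whisper
        return mlx_whisper.transcribe(
            wav_path, path_or_hf_repo="mlx-community/whisper-large-v3-mlx-4bit"
        )["text"]

class FasterWhisperSTT:
    """CUDA (or CPU) backend."""
    def transcribe(self, wav_path: str) -> str:
        from faster_whisper import WhisperModel
        segments, _info = WhisperModel("large-v3").transcribe(wav_path)
        return " ".join(segment.text for segment in segments)
```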
### 3. Token server pattern: keep LiveKit internal
The token server is a small HTTPS server that generates JWT tokens for LiveKit room access and proxies WebSocket connections. It's the only service exposed externally (via Tailscale). LiveKit itself runs on localhost; it doesn't need to be reachable from the browser directly.
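A rough sketch of the proxy leg with aiohttp; the route and port are illustrative, and close/error handling is omitted:

```python
import asyncio

import aiohttp
from aiohttp import web

LIVEKIT_WS = "ws://127.0.0.1:7880"  # LiveKit stays on localhost

async def proxy(request: web.Request) -> web.WebSocketResponse:
    browser = web.WebSocketResponse()
    await browser.prepare(request)
    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(LIVEKIT_WS + request.path_qs) as livekit:
            async def pump(src, dst):
                async for msg in src:
                    if msg.type == aiohttp.WSMsgType.BINARY:
                        await dst.send_bytes(msg.data)
                    elif msg.type == aiohttp.WSMsgType.TEXT:
                        await dst.send_str(msg.data)
            # Shuttle frames both ways until either side disconnects.
            await asyncio.gather(pump(browser, livekit), pump(livekit, browser))
    return browser

app = web.Application()
app.add_routes([web.get("/rtc", proxy)])  # route name illustrative
web.run_app(app, host="0.0.0.0", port=8443)
```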
### 4. iOS Safari requires valid TLS: plan for this
Safari on iOS refuses WebSocket connections to servers with self-signed certificates, even if the user manually accepts the cert. Tailscale Serve solves this cleanly. Without Tailscale, you need a public domain and Let's Encrypt, or a reverse proxy with a valid cert on a cloud host.
### 5. Inject MEMORY.md at session start for context continuity
Voice conversations are stateless by default: each call starts fresh. To maintain continuity with ongoing projects, inject the agent's MEMORY.md into the system prompt at the start of each session. The agent then has immediate context about active projects without needing to be briefed verbally at the start of every call.
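A sketch of the injection, with a placeholder path and base prompt:

```python
import pathlib

BASE_PROMPT = "You are a voice assistant. Keep replies short and speakable."

def build_system_prompt(memory_path: str = "~/.openclaw/MEMORY.md") -> str:
    # Hypothetical location; use wherever the agent's MEMORY.md actually lives.
    path = pathlib.Path(memory_path).expanduser()
    memory = path.read_text() if path.exists() else ""
    if not memory:
        return BASE_PROMPT
    return f"{BASE_PROMPT}\n\n## Current context\n{memory}"
```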
## Frequently Asked Questions

### Is speech transcription done locally?

Yes. MLX-Whisper runs entirely on your Apple Silicon Mac; audio is never sent to an external STT service. Only the transcribed text (and any tool results) travels to the Anthropic API.

### Do I need a paid Anthropic API key?
No. The LLM adapter authenticates using the OAuth token that OpenClaw manages. As long as you have an active OpenClaw session with a valid Anthropic OAuth credential, no separate API key is needed.
### Does this work outside my home network?
Yes, via Tailscale. The start script configures Tailscale Serve to expose the token server on your Tailscale DNS name with a Let's Encrypt certificate. Any device in your Tailnet can then open a call link from any network.
### Can I run this on Intel Mac or Linux?

MLX-Whisper requires Apple Silicon. On Intel Mac or Linux you would need to swap the STT module for a different provider: faster-whisper (CUDA), a system-level STT, or an API-based service. The rest of the architecture (LiveKit, Edge-TTS, the token server) is cross-platform.
Authors: Qiushi Wu & Orange