Yumi uses a real-time WebSocket connection to carry raw microphone audio from the user, system status updates, synthesized speech segments, and expression and motion cues.
- Default Endpoint: ws://localhost:8000/ws
- Data Types:
  - Client to Server: Transmits binary PCM arrays (recording chunks) and status metadata.
  - Server to Client: Transmits JSON event payloads and base64-encoded audio frames.
1. Client to Server (Input)
Raw Microphone Chunks
- Format: Raw binary buffers.
- Audio Coding: Int16 (16-bit PCM), single-channel (mono), sampled at 16,000 Hz.
- Interval: Sent every 250-300 milliseconds.
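As a rough illustration of the chunk format above, the following sketch converts floating-point samples to Int16 PCM bytes and splits them into 250 ms chunks. The function names, the little-endian byte order, and the choice of the 250 ms lower bound are assumptions made for this example; they are not taken from Yumi's client code.

```python
import math
import struct

SAMPLE_RATE_HZ = 16_000   # from the audio coding spec above
BYTES_PER_SAMPLE = 2      # Int16
CHUNK_MS = 250            # lower bound of the 250-300 ms interval


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to Int16 PCM bytes.

    Little-endian byte order is an assumption, not stated in the spec.
    """
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_into_chunks(pcm, chunk_ms=CHUNK_MS):
    """Split a PCM byte string into buffers of chunk_ms milliseconds each."""
    chunk_bytes = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


# One second of a 440 Hz test tone stands in for microphone input.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
        for n in range(SAMPLE_RATE_HZ)]
chunks = split_into_chunks(float_to_pcm16(tone))
```

At 16,000 Hz mono Int16, a 250 ms chunk holds 4,000 samples, or 8,000 bytes; each such buffer would be sent as one binary WebSocket message.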
2. Server to Client (Output)
The backend streams structured JSON events to all active WebSocket connections.
A. Speech Interruption (interrupt)
Fired as soon as the local voice activity detector (VAD) detects speech onset while Yumi is speaking:
{
  "type": "interrupt"
}
B. Streaming Audio Chunk (audio_chunk)
Sent repeatedly, one event per chunk, as raw TTS voice chunks are synthesized by the voice provider (e.g. CAMB.ai):
{
  "type": "audio_chunk",
  "data": "UklGRtbY... [base64 PCM16 bytes]"
}
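A client receiving this event has to parse the JSON and base64-decode the data field before playback. The sketch below shows one way to do that with the Python standard library; decode_audio_chunk is a hypothetical helper name, not part of Yumi. It returns the decoded bytes untouched, because whether each chunk carries a container header is not specified here.

```python
import base64
import json


def decode_audio_chunk(message: str) -> bytes:
    """Return the decoded audio bytes of an audio_chunk event.

    Raises ValueError if the message is some other event type.
    """
    event = json.loads(message)
    if event.get("type") != "audio_chunk":
        raise ValueError("not an audio_chunk event")
    return base64.b64decode(event["data"])


# Example with a tiny made-up payload (two Int16 samples).
sample = json.dumps({
    "type": "audio_chunk",
    "data": base64.b64encode(b"\x01\x00\x02\x00").decode("ascii"),
})
audio_bytes = decode_audio_chunk(sample)
```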
C. Streaming Audio End (audio_end)
Fired once the complete TTS vocal track has finished streaming, signaling the client to gracefully hide subtitles:
{
  "type": "audio_end"
}
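The three streaming events (interrupt, audio_chunk, audio_end) together imply a small amount of client-side state. The following is a minimal sketch of that state, assuming the client queues decoded chunks for playback and discards the queue when an interrupt arrives. The PlaybackState class and its field names are hypothetical, and the discard-on-interrupt behavior is an assumption about a reasonable client, not documented Yumi behavior.

```python
import base64


class PlaybackState:
    """Hypothetical client-side state driven by streaming events."""

    def __init__(self):
        self.pending = []               # decoded chunks awaiting playback
        self.stream_open = False        # True while a TTS stream is in progress
        self.subtitles_visible = False

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "audio_chunk":
            self.pending.append(base64.b64decode(event["data"]))
            self.stream_open = True
            self.subtitles_visible = True
        elif kind == "interrupt":
            # Assumption: the user started talking, so drop queued speech.
            self.pending.clear()
            self.stream_open = False
        elif kind == "audio_end":
            # The stream is complete; the client can now hide subtitles.
            self.stream_open = False
            self.subtitles_visible = False
```

A real client would likely wait for the queued audio to finish playing before hiding subtitles; the sketch flips the flag immediately to keep the state machine short.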
D. Legacy Static Audio Package
With a non-streaming (static) TTS provider such as ElevenLabs, Yumi sends the full response in a single message:
{
  "text": "Hello! I am Yumi, your companion.",
  "expression": "smile",
  "motion": "greeting",
  "audio": "UklGRtbY... [Full base64 MP3/WAV binary payload]"
}
E. Error Events
Sent when a cloud provider fails or credentials are invalid, so that the client can display the issue in its UI:
{
  "error": "ElevenLabs API Key is invalid or expired."
}
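In the examples above, the streaming events carry a type field, while the legacy package and error payloads do not. A client therefore needs to inspect the keys to tell them apart. The sketch below shows one possible classification; the function name and the "legacy_audio" and "unknown" labels are hypothetical, chosen only for this example.

```python
import json


def classify_message(raw: str) -> str:
    """Classify a server message by the keys seen in the examples above."""
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        return "unknown"
    if "error" in msg:
        return "error"
    if "type" in msg:
        return msg["type"]        # "interrupt", "audio_chunk", "audio_end"
    if "audio" in msg:
        return "legacy_audio"     # full static package without a type field
    return "unknown"
```

Checking for error first is a design choice in this sketch: it ensures a failure is surfaced even if a payload were to carry other fields as well.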
Proceed to the REST API Reference page to learn about HTTP endpoints!