WebSocket Events Protocol

Yumi uses a real-time WebSocket connection to carry raw microphone audio from the user, system status updates, synthesized speech segments, and avatar expression and motion cues.

Default Endpoint: ws://localhost:8000/ws
Data Types:
- Client to Server: Transmits binary PCM arrays (recording chunks) and status metadata.
- Server to Client: Transmits JSON event payloads, with audio carried as base64-encoded strings.

1. Client to Server (Input)

Raw Microphone Chunks

Format: Raw binary buffers.
Audio Coding: 16-bit signed PCM (Int16), mono, sampled at 16,000 Hz.
Interval: One chunk every 250 to 300 milliseconds.
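
As a rough size check: at 16,000 samples per second and 2 bytes per sample, a 250 ms chunk holds 4,000 samples (8,000 bytes) and a 300 ms chunk holds 4,800 samples (9,600 bytes). The sketch below converts floating-point samples to Int16 PCM and slices them into fixed-duration chunks. The little-endian byte order, the [-1.0, 1.0] input range, and the helper names `floats_to_pcm16` and `split_chunks` are assumptions for illustration; the protocol description does not specify them.

```python
import struct

SAMPLE_RATE = 16_000   # Hz, from the protocol description
BYTES_PER_SAMPLE = 2   # Int16


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to Int16 bytes (little-endian is an assumption)."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_chunks(pcm, chunk_ms=250):
    """Split PCM bytes into chunks of chunk_ms milliseconds; the last chunk may be shorter."""
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

Each element returned by `split_chunks` would then be sent as one binary WebSocket frame.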

2. Server to Client (Output)

The backend streams structured JSON events to all active WebSocket connections.

A. Speech Interruption (`interrupt`)

Sent as soon as the local VAD (voice activity detection) detects speech onset while Yumi is speaking:

{
  "type": "interrupt"
}

B. Streaming Audio Chunk (`audio_chunk`)

Sent repeatedly, one event per chunk, as the voice provider (e.g. CAMB.ai) synthesizes raw TTS audio:

{
  "type": "audio_chunk",
  "data": "UklGRtbY... [base64 PCM16 bytes]"
}
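
A client has to undo the base64 encoding before it can play a chunk. The following is a minimal sketch that assumes the `data` field holds headerless, little-endian PCM16 samples; the function name `decode_audio_chunk` is hypothetical, and the output sample rate is not stated in this document.

```python
import array
import base64
import json
import sys


def decode_audio_chunk(message):
    """Return Int16 samples from an audio_chunk event, or None for any other event."""
    event = json.loads(message)
    if event.get("type") != "audio_chunk":
        return None
    raw = base64.b64decode(event["data"])
    samples = array.array("h")
    # Drop a trailing odd byte, if any, so the buffer is a whole number of samples.
    samples.frombytes(raw[: len(raw) - len(raw) % 2])
    if sys.byteorder == "big":
        samples.byteswap()  # payload is assumed to be little-endian
    return samples
```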

C. Streaming Audio End (`audio_end`)

Sent once the complete TTS audio has finished streaming, signaling the client to hide subtitles:

{
  "type": "audio_end"
}
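
Taken together, `interrupt`, `audio_chunk`, and `audio_end` describe a small playback lifecycle. The sketch below tracks it on the client side. The class `StreamingPlayback` is hypothetical, and two behaviors are assumptions rather than documented rules: dropping queued audio on `interrupt`, and showing subtitles when the first chunk arrives.

```python
import json


class StreamingPlayback:
    """Tracks queued TTS chunks and subtitle visibility (hypothetical client state)."""

    def __init__(self):
        self.queue = []                 # base64 chunks waiting to be played
        self.subtitles_visible = False

    def handle(self, message):
        """Update state from one server message and return its event type."""
        event = json.loads(message)
        kind = event.get("type")
        if kind == "audio_chunk":
            self.queue.append(event["data"])
            self.subtitles_visible = True    # assumption: show subtitles while speaking
        elif kind == "interrupt":
            self.queue.clear()               # assumption: drop pending speech
            self.subtitles_visible = False
        elif kind == "audio_end":
            self.subtitles_visible = False   # queued audio may still be draining
        return kind
```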

D. Legacy Static Audio Package

With non-streaming (static) TTS providers such as ElevenLabs, Yumi sends the whole response as a single message, which has no `type` field:

{
  "text": "Hello! I am Yumi, your companion.",
  "expression": "smile",
  "motion": "greeting",
  "audio": "UklGRtbY... [Full base64 MP3/WAV binary payload]"
}
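
Because the `audio` field may hold either MP3 or WAV data, a client may want to inspect the first decoded bytes before choosing a decoder. The sketch below does that with well-known file signatures: WAV files begin with `RIFF` followed by `WAVE` at byte offset 8, and MP3 data begins with an `ID3` tag or an MPEG frame sync. The function name `parse_static_package` and the returned dictionary layout are hypothetical.

```python
import base64
import json


def parse_static_package(message):
    """Decode a static audio package and guess the audio container from its first bytes."""
    package = json.loads(message)
    audio = base64.b64decode(package["audio"])
    if audio[:4] == b"RIFF" and audio[8:12] == b"WAVE":
        container = "wav"
    elif audio[:3] == b"ID3" or (
        len(audio) > 1 and audio[0] == 0xFF and (audio[1] & 0xE0) == 0xE0
    ):
        container = "mp3"   # ID3 tag or MPEG frame sync (11 set bits)
    else:
        container = "unknown"
    return {
        "text": package.get("text", ""),
        "expression": package.get("expression"),
        "motion": package.get("motion"),
        "audio": audio,
        "container": container,
    }
```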

E. Error Events

Sent when a cloud provider fails or credentials are invalid. The client displays the message in its UI:

{
  "error": "ElevenLabs API Key is invalid or expired."
}
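
Since only the streaming events carry a `type` field, a client needs some rule for telling the message shapes in this section apart. One possible approach, sketched below, checks for an `error` key first, then a known `type`, then an `audio` key. The function name `classify_message` and the order of checks are assumptions, not part of the protocol.

```python
import json


def classify_message(message):
    """Return 'interrupt', 'audio_chunk', 'audio_end', 'static_package', 'error', or 'unknown'."""
    event = json.loads(message)
    if not isinstance(event, dict):
        return "unknown"
    if "error" in event:
        return "error"
    if event.get("type") in ("interrupt", "audio_chunk", "audio_end"):
        return event["type"]
    if "audio" in event:
        return "static_package"   # legacy package: no "type" field
    return "unknown"
```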

Proceed to the REST API Reference page to learn about HTTP endpoints!