Yumi does not rely on a static volume threshold to detect your voice. Fixed thresholds fail easily when you type on a mechanical keyboard, sigh, or have a fan running in the background.
Instead, Yumi utilizes Silero VAD, a lightweight, high-performance neural speech detector.
Technical Mechanics of Silero VAD
Silero VAD is loaded locally inside src/yumi/audio/stt.py using torch.hub. It runs a highly optimized 4-layer convolutional neural network that analyzes incoming audio chunks and outputs a speech probability score between 0.0 and 1.0.
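As a rough illustration of that loading step (a sketch, not the actual contents of src/yumi/audio/stt.py), the snippet below uses the torch.hub entry point published by the snakers4/silero-vad repository. The helper names load_vad, speech_probability, and chunk_duration_ms are hypothetical, and the 16kHz rate and 512-sample chunk size are taken from the pipeline described on this page.

```python
SAMPLE_RATE = 16_000   # Hz, matching the 16kHz mono stream
CHUNK_SAMPLES = 512    # samples handed to the model per call


def chunk_duration_ms(samples: int = CHUNK_SAMPLES, rate: int = SAMPLE_RATE) -> float:
    """Return the length of one chunk in milliseconds."""
    return samples * 1000 / rate


def load_vad():
    """Fetch Silero VAD through torch.hub (downloaded once, then cached locally)."""
    import torch  # imported lazily so the helper above works without PyTorch

    model, _utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    return model


def speech_probability(model, chunk) -> float:
    """Score one float32 tensor of CHUNK_SAMPLES samples scaled to [-1, 1]."""
    return model(chunk, SAMPLE_RATE).item()
```

At these settings each chunk covers 32 ms of audio, so the model produces a fresh probability roughly 31 times per second.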
[ Mic Stream ] ──► [ 512-sample Chunks ] ──► [ Silero VAD Model ] ──► Probability >= 0.5?
                                                                             │
[ Trigger STT ] ◄── [ Speech Boundary Finalized ] ◄──────────────────────────┘
- Audio Streaming: The web interface captures raw microphone data and streams it via WebSocket in 16kHz mono Int16 (PCM) chunks.
- Probability Thresholding: The backend collects these samples. If the model scores a speech probability above 0.5, it registers that speech has started.
- Silence Confirmation: When the probability drops below 0.35 for more than 500 milliseconds, Yumi registers that speech has ended, packages the accumulated audio buffer, and triggers the transcription pipeline.
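The two thresholds form a simple hysteresis: a higher bar to enter speech, a lower bar plus a timer to leave it. The sketch below shows one way this could be written. The EndpointDetector class and its feed method are hypothetical names for illustration, not Yumi's actual code; only the 0.5, 0.35, and 500 ms values come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SAMPLE_RATE = 16_000     # Hz
CHUNK_SAMPLES = 512      # samples per chunk
START_THRESHOLD = 0.5    # probability above this starts an utterance
END_THRESHOLD = 0.35     # probability below this counts as silence
MIN_SILENCE_MS = 500     # silence must last longer than this to end speech


@dataclass
class EndpointDetector:
    """Hypothetical state machine that turns per-chunk scores into utterances."""

    in_speech: bool = False
    silence_ms: float = 0.0
    buffer: List[bytes] = field(default_factory=list)

    def feed(self, chunk: bytes, probability: float) -> Optional[bytes]:
        """Consume one PCM chunk and its score; return audio when speech ends."""
        chunk_ms = CHUNK_SAMPLES * 1000 / SAMPLE_RATE

        if not self.in_speech:
            if probability > START_THRESHOLD:
                # Speech onset: start a fresh buffer with this chunk.
                self.in_speech = True
                self.silence_ms = 0.0
                self.buffer = [chunk]
            return None

        self.buffer.append(chunk)
        if probability < END_THRESHOLD:
            self.silence_ms += chunk_ms
        else:
            # Scores between the thresholds keep the utterance alive.
            self.silence_ms = 0.0

        if self.silence_ms > MIN_SILENCE_MS:
            utterance = b"".join(self.buffer)
            self.in_speech = False
            self.silence_ms = 0.0
            self.buffer = []
            return utterance  # hand this to the transcription pipeline
        return None
```

With 32 ms chunks, the silence timer first exceeds 500 ms on the sixteenth consecutive quiet chunk (512 ms), so a short pause between words does not cut the utterance off.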
Speech Interruption (Barge-In)
The VAD is also the trigger for the barge-in mechanism.
Because the VAD runs continuously (even while Yumi is active and speaking), she can hear you speak over her voice:
- When you speak, Silero VAD immediately flags speech onset (within 30-50ms).
- The engine immediately triggers self.interrupt_event.set() in engine.py.
- A WebSocket broadcast goes to the frontend: {"type": "interrupt"}.
- The browser instantly aborts the current audio node, closes Yumi's lips, and transitions back to listening, ready for your new input.
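The backend half of this sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in engine.py: the Engine class shape, the on_speech_onset and speak methods, and the broadcast callback are hypothetical, and interrupt_event is assumed here to be a threading.Event. Only the interrupt_event name and the {"type": "interrupt"} message come from the steps above.

```python
import json
import threading
from typing import Callable, Iterable, List


class Engine:
    """Hypothetical engine showing how an interrupt flag can stop playback."""

    def __init__(self, broadcast: Callable[[str], None]) -> None:
        self.interrupt_event = threading.Event()
        self._broadcast = broadcast  # stand-in for the WebSocket broadcast

    def on_speech_onset(self) -> None:
        """Called when the VAD flags that the user has started speaking."""
        self.interrupt_event.set()
        self._broadcast(json.dumps({"type": "interrupt"}))

    def speak(self, audio_chunks: Iterable[bytes]) -> List[bytes]:
        """Send reply audio chunk by chunk, stopping once interrupted."""
        self.interrupt_event.clear()
        sent: List[bytes] = []
        for chunk in audio_chunks:
            if self.interrupt_event.is_set():
                break  # the user spoke over the reply; drop the rest
            sent.append(chunk)
        return sent
```

Checking the flag between chunks keeps the stop latency bounded by the length of a single chunk, which is why streaming the reply in small pieces matters for barge-in.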
Proceed to the Local Whisper page to see how this speech is transcribed offline!