Local Whisper is fully private, but transcribing speech locally on older CPUs can add 1 to 2 seconds of latency.
For near-real-time responses, Yumi also supports Groq Cloud Whisper.
How Groq Whisper Works
When Groq is selected as your STT provider in the config, Yumi still runs voice activity detection (VAD) locally to find where your speech starts and ends:
- VAD Processing: Your mic streams audio to the backend. Silero VAD detects the exact boundary where you stop speaking.
- Audio Export: The backend converts the accumulated floating-point audio buffer into a compact in-memory WAV file (uncompressed 16-bit PCM, 16 kHz mono).
- API Streaming: Yumi sends this small byte stream via an HTTP POST request to Groq's high-speed endpoint:
  https://api.groq.com/openai/v1/audio/transcriptions
- Hardware Acceleration: Groq processes the audio on custom LPU (Language Processing Unit) hardware, returning the completed JSON text transcription in under 150 milliseconds.
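The export and upload steps can be sketched roughly as follows. This is an illustration under stated assumptions, not Yumi's actual code: the function names, the `whisper-large-v3` model ID, the `speech.wav` filename, and the `User-Agent` string are all hypothetical. The multipart fields (`file`, `model`) and the `text` field of the response follow the OpenAI-compatible transcription format that the endpoint path suggests.

```python
import io
import json
import struct
import urllib.request
import uuid
import wave

GROQ_STT_URL = "https://api.groq.com/openai/v1/audio/transcriptions"


def float_to_wav_bytes(samples, sample_rate=16000):
    """Encode float samples in [-1.0, 1.0] as an in-memory 16-bit PCM mono WAV."""
    # Clamp each sample, then scale it to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)  # 16 kHz by default
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()


def build_transcription_request(wav_bytes, api_key, model="whisper-large-v3"):
    """Build (but do not send) a multipart/form-data POST for the endpoint."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model.encode() + b"\r\n",  # assumed model ID
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="file"; filename="speech.wav"\r\n',
        b"Content-Type: audio/wav\r\n\r\n",
        wav_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        GROQ_STT_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "User-Agent": "yumi-example/0.1",  # hypothetical client identifier
        },
    )


def transcribe(wav_bytes, api_key):
    """Send the request and return the transcript text from the JSON reply."""
    req = build_transcription_request(wav_bytes, api_key)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["text"]
```

In this sketch, `transcribe(float_to_wav_bytes(utterance), api_key)` would be called once per utterance, after the VAD has marked the end of speech. Keeping the WAV in memory avoids a disk round trip, which matters when the whole exchange is meant to complete in a fraction of a second.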
Configuration & Key Sharing
To configure Groq Whisper, run the config wizard:
yumi --config
Under Listening Settings, choose Groq Whisper.
If you have already configured Groq as your LLM provider, Yumi reuses the existing GROQ_API_KEY stored in your keychain, so no duplicate setup is required.
Proceed to the Thinking (LLM) Providers page to see how Yumi formulates a response!