Local Whisper (Offline STT)

If you want a fully private, offline companion experience, Yumi's local speech-to-text pipeline runs entirely on your own machine. No voice data or recordings are transmitted to the cloud.

This is powered by faster-whisper, a highly optimized reimplementation of OpenAI's Whisper model built on CTranslate2.

Model Sizes & Quantization

To run Whisper efficiently on home CPUs, faster-whisper quantizes the floating-point weights into 8-bit integers (int8). This reduces the memory footprint (int8 weights take roughly a quarter of the space of 32-bit floats) and can accelerate matrix multiplications on standard processors by up to 4×, depending on the CPU and model size.
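To make the idea of int8 quantization concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only: the function names and sample weights are made up for this example, and CTranslate2's actual quantization scheme may differ in its details.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the int8 range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from the int8 values."""
    return [scale * v for v in q]


# Hypothetical 32-bit float weights (not taken from a real Whisper model).
weights = array("f", [0.02, -0.75, 0.31, 1.27, -1.0, 0.0])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

float_bytes = weights.itemsize * len(weights)  # 4 bytes per float32 weight
int8_bytes = q.itemsize * len(q)               # 1 byte per int8 weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"float32: {float_bytes} bytes, int8: {int8_bytes} bytes")
print(f"largest rounding error: {max_error:.5f}")
```

The storage drops by a factor of four, and each weight is off by at most half a quantization step, which is why accuracy usually holds up well.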

When attuning Yumi, you can choose from three model scales:

| Model Scale | Disk Footprint | CPU Memory Usage | Latency (Per Sentence) | Accuracy Vibe |
| :--- | :--- | :--- | :--- | :--- |
| tiny | ~75 MB | ~150 MB | ~200-500 ms | Blazing fast; struggles with accents. |
| base (Default) | ~140 MB | ~250 MB | ~400-800 ms | Recommended; excellent balance. |
| small | ~460 MB | ~800 MB | ~1-2.5 s | Highly accurate; requires modern multi-core CPU. |
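As a rough sketch, loading one of these scales with faster-whisper directly could look like the following. `WhisperModel` and its `device` and `compute_type` arguments are part of faster-whisper; the helper names `resolve_model_scale` and `load_model` are hypothetical and are not Yumi's own configuration code.

```python
MODEL_SCALES = ("tiny", "base", "small")
DEFAULT_SCALE = "base"  # the documented default


def resolve_model_scale(requested=None):
    """Return a supported model scale, falling back to the default when none is given."""
    if requested is None:
        return DEFAULT_SCALE
    scale = requested.strip().lower()
    if scale not in MODEL_SCALES:
        raise ValueError(f"unsupported model scale {requested!r}; choose from {MODEL_SCALES}")
    return scale


def load_model(requested=None):
    """Load a CPU model with int8 weights (hypothetical helper, not Yumi's API)."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel(resolve_model_scale(requested), device="cpu", compute_type="int8")
```

The first call to `load_model` downloads the chosen model if it is not already cached, so expect a one-time delay that grows with the disk footprint listed above.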

Speed Optimizations inside Yumi

To keep local transcription latency low (under about 1 second for the tiny and base models), Yumi applies two speed-oriented settings in src/yumi/audio/stt.py:

Greedy Decoding (beam_size=1): Instead of tracking multiple candidate transcriptions in parallel (beam search), the engine decodes greedily, selecting the single most probable token at each step. This lowers latency significantly, usually with only a small accuracy trade-off.
VAD Pre-Filtering: Silero VAD trims leading and trailing silence from the raw audio before it is fed to Whisper, so Whisper does not waste CPU cycles analyzing silent frames.
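The difference between greedy decoding and beam search can be sketched with a toy example. The probability table below is invented for illustration and has nothing to do with Whisper's real vocabulary or with the code in src/yumi/audio/stt.py; it only shows that a beam size of 1 keeps a single hypothesis per step, which is cheaper but can miss a better overall sequence.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"I": 0.6, "Hi": 0.4},
    "I": {"see": 0.55, "sea": 0.45},
    "Hi": {"there": 0.9, "their": 0.1},
}


def beam_decode(model, start, steps, beam_size):
    """Decode `steps` tokens, keeping the `beam_size` best hypotheses after each step."""
    beams = [([start], 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the highest-scoring hypotheses; beam_size=1 is greedy decoding.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0][1:]  # drop the start token


greedy = beam_decode(TOY_MODEL, "<s>", steps=2, beam_size=1)
beam = beam_decode(TOY_MODEL, "<s>", steps=2, beam_size=2)
print("greedy:", greedy)
print("beam of 2:", beam)
```

In this toy table the greedy path commits to the likelier first token and ends with a sequence probability of 0.33, while a beam of 2 finds an alternative worth 0.36. The beam does twice the work per step to get there, which is the cost the greedy setting avoids.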

Proceed to the Groq Whisper page to see how cloud acceleration reduces this latency even further!