Yumi is engineered with an emphasis on local performance, visual fluidity, and cryptographic security.
🎙️ Real-time Voice Capture (VAD & STT)
- Zero-Config Microphone Streams: Handled via standard HTML5 media devices over WebSockets.
- Silero Neural Speech Detection: Runs locally on the CPU with high accuracy, ignoring keyboard clicks and room echoes.
- Ultra-Fast Whisper Inference: Choose local Faster-Whisper (quantized to `int8` for fast CPU processing) or cloud-based Groq Whisper for 150 ms transcription latency.
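
As a rough illustration of the local path, the sketch below loads Faster-Whisper with `int8` weights on the CPU and joins the decoded segments into one transcript. The function names, the `"base"` model size, and the optional `model` parameter are assumptions made for this example, not Yumi's actual code.

```python
from typing import Any, Optional


def load_local_model(model_size: str = "base") -> Any:
    """Load a Faster-Whisper model with int8 weights for CPU inference."""
    # Imported lazily so this module can be loaded without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, device="cpu", compute_type="int8")


def transcribe(audio_path: str, model: Optional[Any] = None) -> str:
    """Return the transcript of one audio file as a single string."""
    if model is None:
        model = load_local_model()
    # transcribe() returns a lazy iterable of segments plus an info object;
    # decoding happens while the segments are consumed.
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)
```

Passing an already loaded model avoids reloading the weights for every utterance, which matters more for latency than the decoding itself on short clips.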
🧠 State-of-the-Art Brain (LangGraph)
- Structured Output Control: Leverages Pydantic schemas to bound LLM output format strictly, preventing structural errors.
- Tool Integration: LangChain tools allow Yumi to call local system tools, fetch weather, read dates, and coordinate functions.
- Hot-Swappable Persona Matrix: Instantly change Yumi's behavior mid-sentence. Features six customized personalities.
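
To make the structured-output idea concrete, here is a minimal sketch assuming Pydantic v2. The `YumiReply` schema, its field names, and the list of emotions are hypothetical; the point is only that a reply which does not match the schema is rejected before it reaches the voice or avatar layers.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class YumiReply(BaseModel):
    """Hypothetical reply schema; the field names are illustrative only."""

    text: str = Field(min_length=1, description="Sentence to be spoken aloud")
    emotion: Literal["neutral", "happy", "sad", "angry"] = "neutral"


def parse_reply(raw_json: str) -> Optional[YumiReply]:
    """Validate raw LLM output, returning None when it breaks the schema."""
    try:
        return YumiReply.model_validate_json(raw_json)
    except ValidationError:
        # Malformed JSON, missing fields, and unknown emotions all land here.
        return None
```

A caller might retry the LLM call or fall back to a neutral reply when `parse_reply` returns `None`.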
🗣️ Lifelike Lip Sync & Expressive Visuals (Live2D)
- Sub-Second Streaming Audio: ElevenLabs or CAMB.ai streaming audio chunks are pushed over WebSockets directly to the web client.
- Real-time Waveform RMS Lip Sync: Computes the amplitude of playing sound buffers on the fly to open/close her mouth naturally in perfect sync with the voice.
- Fluid Visuals: Powered by PixiJS 6 and Cubism SDK for high-performance GPU-accelerated rendering inside the browser.
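
The RMS lip sync described above can be sketched in a few lines. This is a language-agnostic illustration written in Python; in the browser the same arithmetic would run on Web Audio sample buffers. The `gain` and `smoothing` parameters and the function names are assumptions for the example, and samples are assumed to be normalized to the range -1.0 to 1.0.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of one audio buffer."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mouth_open(
    samples: Sequence[float],
    previous: float = 0.0,
    gain: float = 4.0,
    smoothing: float = 0.5,
) -> float:
    """Map buffer loudness to a mouth-open value between 0.0 and 1.0."""
    # Scale the loudness and clamp it so loud peaks do not over-open the mouth.
    target = min(1.0, rms(samples) * gain)
    # Move only part of the way toward the target to avoid jittery motion.
    return previous + (target - previous) * (1.0 - smoothing)
```

Calling `mouth_open` once per rendered frame, feeding back the previous result, yields a value that could drive a mouth-open parameter on the Live2D model.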
🔐 Hardware-Encrypted Security (OS Keychain)
- Zero Plaintext Keys on Disk: Unlike common setups that write API keys to `.env` or configuration JSON files, Yumi leverages the `keyring` package.
- OS-Level Vault Storage: Saves keys securely inside:
  - Windows: Windows Credential Manager.
  - macOS: macOS Keychain Access.
  - Linux: GNOME Keyring / KWallet via `libsecret`.
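
A minimal sketch of how keys might be stored and read through `keyring` follows. The service name `"yumi"`, the helper names, and the optional `backend` parameter are assumptions for illustration; only `set_password` and `get_password` are actual `keyring` calls.

```python
from typing import Any, Optional

SERVICE_NAME = "yumi"  # hypothetical service label for the OS vault


def _resolve(backend: Optional[Any]) -> Any:
    """Use the given backend, or fall back to the real keyring module."""
    if backend is not None:
        return backend
    import keyring  # imported lazily; picks the OS vault automatically

    return keyring


def save_api_key(provider: str, secret: str, backend: Optional[Any] = None) -> None:
    """Store one provider's API key in the OS credential vault."""
    _resolve(backend).set_password(SERVICE_NAME, provider, secret)


def load_api_key(provider: str, backend: Optional[Any] = None) -> Optional[str]:
    """Return the stored key, or None if nothing was saved for this provider."""
    return _resolve(backend).get_password(SERVICE_NAME, provider)
```

The `backend` parameter exists only so the helpers can be exercised with an in-memory stand-in during tests; normal use would omit it and rely on the system vault.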
Ready to get started? Head directly to Attuning Senses to set up your APIs!