Yumi is not just a text-to-speech reader. To feel like an active presence, she must move her body, blink, gesture, and show emotions that match the words she says.
This co-expressive capability is achieved via Structured Outputs bound through Pydantic.
The YumiResponse Schema
Instead of returning raw text paragraphs, the LLM is forced (using native tool calling / structured response APIs) to populate the YumiResponse schema defined in src/yumi/agent/llm.py:
    class YumiResponse(BaseModel):
        response_text: str  # The spoken reply (concise, <= 3 sentences)
        expression: str     # smile, angry, sad, surprise, scared, shy, normal
        motion: str         # nod, shakehead, tilthead, fidget, forward, lookaway, greeting, idle
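- Pydantic Enforcement: The reply is validated against this schema, so plain-text answers and malformed JSON are rejected instead of being passed downstream.
- Expression Restrictions: Both expression and motion are typed as plain str, so the schema itself does not limit their values. The prompt restricts the LLM to select ONLY from the valid expressions and motion states supported by the active avatar model.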
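Since expression and motion are plain str fields in the schema shown above, a value outside the listed sets would still pass type validation. The sketch below shows one way a server-side guard could catch that case. It is illustrative only: normalize_reply and the fallback values "normal" and "idle" are assumptions, not part of the Yumi codebase, and it uses only the standard library rather than Pydantic.

```python
import json

# Allowed values, copied from the comments in the YumiResponse schema.
VALID_EXPRESSIONS = {"smile", "angry", "sad", "surprise", "scared", "shy", "normal"}
VALID_MOTIONS = {
    "nod", "shakehead", "tilthead", "fidget",
    "forward", "lookaway", "greeting", "idle",
}


def _pick(value, allowed, fallback):
    """Return value if it is a known state name, otherwise the fallback."""
    return value if isinstance(value, str) and value in allowed else fallback


def normalize_reply(raw_json: str) -> dict:
    """Parse a structured reply and replace unknown states with neutral ones.

    Hypothetical helper: the neutral fallbacks ("normal", "idle") are assumptions.
    """
    data = json.loads(raw_json)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict) or "response_text" not in data:
        raise ValueError("reply must be a JSON object with a response_text field")
    return {
        "response_text": str(data["response_text"]),
        "expression": _pick(data.get("expression"), VALID_EXPRESSIONS, "normal"),
        "motion": _pick(data.get("motion"), VALID_MOTIONS, "idle"),
    }
```

Falling back to a neutral state instead of raising keeps the avatar responsive when the model invents a state name; a stricter setup could reject the reply and retry instead.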
How the WebUI Executes the Emotion
When the structured response is returned by the LangGraph workflow:
- The Python server packages it as an event and streams it over the WebSocket.
- The browser client receives the JSON payload.
- The frontend client translates these generic states using EXPRESSION_MAP and MOTION_MAP:
    const EXPRESSION_MAP = {
      "smile": "baozhen",
      "angry": "angry",
      "sad": "cry",
      "surprise": "baozhen",
      "scared": "cry",
      "shy": "qizi1",
      "normal": null
    };
- It immediately feeds these parameter identifiers to the PixiJS Live2D model renderer:
    live2dModel.expression("baozhen"); // Plays the smile expression
    live2dModel.motion("nod"); // Triggers a nodding motion
This results in Yumi smiling, nodding, or looking shy exactly at the moment she begins speaking her thoughts!
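The server-to-browser handoff described above can be sketched in Python as follows. The envelope layout (a "type" tag plus a "payload" object), the AvatarEvent class, and both helper functions are hypothetical; the actual wire format is not documented here. The lookup table is a Python copy of the client-side EXPRESSION_MAP, included only to show how a null entry could be read as "no expression to trigger", which is itself an assumption.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AvatarEvent:
    """Hypothetical event body; field names mirror YumiResponse."""
    response_text: str
    expression: str
    motion: str


def to_ws_message(event: AvatarEvent) -> str:
    """Serialize the event as a JSON string suitable for a WebSocket text frame."""
    # The "type" tag and "payload" key are assumptions for illustration.
    return json.dumps({"type": "avatar_response", "payload": asdict(event)})


# Python copy of the client-side EXPRESSION_MAP (null becomes None).
EXPRESSION_MAP = {
    "smile": "baozhen",
    "angry": "angry",
    "sad": "cry",
    "surprise": "baozhen",
    "scared": "cry",
    "shy": "qizi1",
    "normal": None,
}


def resolve_expression(name: str) -> Optional[str]:
    """Map a generic state to a model-specific identifier, or None if unmapped."""
    return EXPRESSION_MAP.get(name)


message = to_ws_message(AvatarEvent("Nice to see you!", "smile", "greeting"))
payload = json.loads(message)["payload"]          # what the browser would decode
model_expression = resolve_expression(payload["expression"])
```

Keeping the generic state names on the wire and resolving them only at the client means a different avatar model needs a new map, not a new prompt or schema.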
Proceed to Voice Synthesis to see how her voice audio is generated and synced!