R04 Exploratory

Multimodal Room Reading for Contextual Response

Integrating facial expression, gesture, and vocal tone

multimodal, vision, emotion, context

Abstract

Most current AI systems process text or speech but ignore the contextual signals humans rely on: facial expressions, micro-gestures, gaze direction, and vocal tone. We are investigating how to integrate these modalities to infer emotional state, engagement level, and conversational context, and how to use those inferences to shape more appropriate AI responses.

Problem Statement

A user who says 'that's fine' with a smile, with a flat expression, or with crossed arms is conveying three different meanings. Text-only or audio-only models miss this context and can produce responses that are tonally inappropriate. The challenge is not just recognition but integration: how should detected confusion or frustration change the AI's response strategy?

Approach

We use a multi-stream architecture: video encoders for face and gesture and audio encoders for tone and prosody, fused by a temporal attention mechanism. The output is not just emotion labels but contextual embeddings that condition the language model's generation style, for example simplifying explanations when confusion is detected or offering breaks when fatigue is detected.

Visual processing

Face detection runs at 15 fps using a lightweight model. We extract action units (facial muscle activations), gaze direction, head pose, and blink patterns. Privacy is maintained through on-device processing: raw video never leaves the local machine, and only high-level feature vectors are transmitted.
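To make the idea of a "high-level feature vector" concrete, here is a minimal sketch of what a per-frame record and one derived blink statistic might look like. The class name FrameFeatures, its field layout, and the blinks_per_minute helper are hypothetical illustrations, not the project's actual data format; only the 15 fps rate and the list of extracted features come from the text above.

```python
from dataclasses import dataclass
from typing import List, Sequence

FPS = 15  # frame rate stated in the text


@dataclass
class FrameFeatures:
    """High-level features for one video frame (hypothetical field layout)."""
    action_units: Sequence[float]  # facial muscle activation intensities
    gaze: Sequence[float]          # (yaw, pitch) of gaze direction
    head_pose: Sequence[float]     # (yaw, pitch, roll)
    eye_closed: bool               # per-frame flag used to derive blink patterns

    def to_vector(self) -> List[float]:
        # Flatten the features into a single numeric payload.
        # No pixel data is stored on this object at all.
        return [*self.action_units, *self.gaze, *self.head_pose,
                float(self.eye_closed)]


def blinks_per_minute(frames: Sequence[FrameFeatures], fps: int = FPS) -> float:
    """Count open-to-closed transitions and normalise by window duration."""
    if not frames:
        return 0.0
    blinks = sum(
        1 for prev, cur in zip(frames, frames[1:])
        if cur.eye_closed and not prev.eye_closed
    )
    minutes = len(frames) / fps / 60.0
    return blinks / minutes
```

In this sketch, a 10-second clip (150 frames) containing two blinks yields a rate of about 12 blinks per minute. Because the record holds only derived numbers, transmitting to_vector() output cannot leak the raw frame.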

Audio processing

Beyond transcription, we extract pitch contour, speech rate, energy dynamics, and spectral features associated with emotional valence and arousal. These features are aligned temporally with the visual stream using cross-modal attention.
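As a rough illustration of two of these prosodic features, the sketch below computes RMS energy and an autocorrelation-based pitch estimate for a single audio frame using only the Python standard library. Autocorrelation is a simple stand-in chosen for readability; the text does not say which pitch tracker is used, and the function names and the 50 to 500 Hz search range are assumptions.

```python
import math
from typing import Sequence


def rms_energy(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_pitch_hz(frame: Sequence[float], sample_rate: int,
                      fmin: float = 50.0, fmax: float = 500.0) -> float:
    """Estimate the fundamental frequency via the autocorrelation peak.

    Returns 0.0 when no positive autocorrelation peak is found
    (for example, for silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0
```

Applying estimate_pitch_hz to successive overlapping frames gives a pitch contour, and applying rms_energy the same way gives the energy dynamics. A production system would more likely use a robust tracker and voiced/unvoiced detection, which this sketch omits.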

Fusion architecture

Early fusion of raw pixels and waveforms is computationally expensive, so we use late fusion with a temporal coherence layer: each modality is encoded separately, and the encodings are then combined in a transformer that models inter-modal relationships over 5-second windows. The output is a contextual state vector.
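The windowing and fusion step can be sketched in a few lines. In the example below, a single attention-pooling operation stands in for the full transformer, and the query vector stands in for learned parameters; both are simplifying assumptions made for illustration. It also assumes each modality's encoder has already produced timestamped embeddings of the same dimension. The names Token, attention_pool, and fuse are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

WINDOW_S = 5.0  # window length stated in the text

# (timestamp in seconds, embedding from that modality's own encoder)
Token = Tuple[float, List[float]]


def attention_pool(query: Sequence[float],
                   tokens: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product attention of one query over a set of tokens."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * tok[j] for w, tok in zip(weights, tokens))
            for j in range(d)]


def fuse(visual: Sequence[Token], audio: Sequence[Token],
         query: Sequence[float],
         window_s: float = WINDOW_S) -> Dict[int, List[float]]:
    """Group both streams into fixed windows and pool each into a state vector."""
    windows: Dict[int, List[List[float]]] = defaultdict(list)
    for ts, emb in list(visual) + list(audio):
        windows[int(ts // window_s)].append(emb)
    return {w: attention_pool(query, embs)
            for w, embs in sorted(windows.items())}
```

With an all-zero query the attention weights are uniform, so the pooled vector reduces to the mean of the window's tokens; a non-zero query shifts weight toward the tokens it matches. The raw inputs are never combined here, only the per-modality encodings, which is what makes this late rather than early fusion.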

Contextual response conditioning

The state vector conditions the LLM through adapter layers that modulate attention heads. Detected confusion triggers explanatory elaboration. Detected engagement prompts deeper exploration. Detected stress triggers simplification and supportive language. The conditioning is subtle: responses should adapt without explicitly announcing the adaptation.
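The mapping from detected state to response strategy can be illustrated with a small decision function. This sketch covers only the decision logic; it does not model the adapter layers, which operate inside the language model. The strategy labels, the 0.6 threshold, and the "highest score wins" rule are all assumptions made for the example.

```python
from typing import Dict

# Detected state -> response strategy, following the three cases in the text.
# The strategy labels themselves are hypothetical.
STRATEGIES = {
    "confusion": "elaborate",
    "engagement": "explore_deeper",
    "stress": "simplify_and_support",
}


def choose_strategy(scores: Dict[str, float], threshold: float = 0.6) -> str:
    """Pick the strategy for the strongest detected state above a threshold.

    Falls back to "neutral" when no state is detected confidently, so that
    weak or noisy signals do not change the response style.
    """
    candidates = {state: score for state, score in scores.items()
                  if state in STRATEGIES and score >= threshold}
    if not candidates:
        return "neutral"
    strongest = max(candidates, key=candidates.get)
    return STRATEGIES[strongest]
```

For example, choose_strategy({"confusion": 0.8, "engagement": 0.3, "stress": 0.1}) selects "elaborate". The neutral fallback reflects the goal that adaptation should be subtle: when the evidence is weak, the response style is left unchanged.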

Privacy and ethics

Emotion recognition systems have known performance biases across demographic groups. We continuously evaluate performance across age, gender, and ethnicity subgroups. Users have explicit control: visual input can be disabled while keeping audio, and processing is local-first, with cloud features strictly opt-in.
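A subgroup evaluation of this kind can be sketched as a per-group accuracy breakdown plus the gap between the best- and worst-served groups. The record format (subgroup, predicted label, true label) and the function names are assumptions; the text does not specify which metrics the project tracks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (subgroup, predicted label, true label); the format is hypothetical
Record = Tuple[str, str, str]


def subgroup_accuracy(records: Iterable[Record]) -> Dict[str, float]:
    """Classification accuracy computed separately for each subgroup."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(accuracy: Dict[str, float]) -> float:
    """Difference between the best- and worst-performing subgroups."""
    if not accuracy:
        return 0.0
    return max(accuracy.values()) - min(accuracy.values())
```

Tracking the gap alongside overall accuracy makes it visible when an aggregate improvement hides a regression for one subgroup.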