R02 · Active research

Implicit Turn-Taking in Conversational AI

Inferring speech readiness without explicit wake words

interaction, acoustics, prediction, real-time

Abstract

Wake words and push-to-talk create friction in human-AI interaction. Humans, by contrast, detect turn-taking cues from prosody, breathing patterns, and conversational context. We are building models that predict when a user has finished speaking or is inviting the AI to contribute, so that interfaces can hold a conversation without explicit triggers.

Problem Statement

Current voice assistants rely on explicit triggers: wake words, buttons, or visual indicators. These mechanisms interrupt the conversational flow and prevent natural back-and-forth. The common alternative, silence-based endpoint detection, fails in noisy environments and triggers falsely when a speaker pauses mid-utterance.

Approach

We model turn-taking as a three-way classification problem over acoustic and linguistic features. The system observes acoustic cues (prosodic boundaries, breathing patterns, filler sounds) and linguistic cues (syntactic completion, discourse markers, question intonation) to predict whether the speaker is yielding the turn, holding it, or inviting a response.
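As a minimal sketch of the three-way decision described above, the snippet below maps class probabilities to a speak-or-listen decision. The names `TurnState` and `decide`, and the 0.8 threshold, are illustrative assumptions and not part of the system described here.

```python
from enum import Enum


class TurnState(Enum):
    YIELD = "yield"    # speaker has finished and is giving up the floor
    HOLD = "hold"      # speaker is pausing but intends to continue
    INVITE = "invite"  # speaker is asking, explicitly or implicitly, for a response


def decide(probs: dict[TurnState, float], speak_threshold: float = 0.8) -> bool:
    """Return True if the AI should take the turn.

    Hypothetical rule: speak only when the combined probability of YIELD
    and INVITE clears the threshold; otherwise keep listening.
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    p_yield = probs.get(TurnState.YIELD, 0.0)
    p_invite = probs.get(TurnState.INVITE, 0.0)
    return (p_yield + p_invite) / total >= speak_threshold
```

Treating yield and invite together as "safe to speak" is one possible design; a real system might respond differently to an invitation than to a plain turn end.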

Feature extraction

Acoustic features include F0 contour, energy envelope, pause duration, and spectral tilt. We use a 1-second sliding window with a 100 ms stride. Linguistic features are extracted from partial ASR hypotheses and include dependency parse depth, presence of conjunctions, and utterance-final part-of-speech (POS) tags.
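The windowing scheme can be illustrated with a short sketch. It assumes raw audio samples held in a Python list and uses RMS energy as a stand-in for the energy envelope; the function names and the choice of RMS are assumptions made for illustration, not the project's actual feature pipeline.

```python
import math


def sliding_windows(samples, sample_rate, window_s=1.0, stride_s=0.1):
    """Yield (start_index, window) pairs: 1 s windows advanced in 100 ms steps."""
    window = int(round(window_s * sample_rate))
    stride = int(round(stride_s * sample_rate))
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one sample")
    # Only full windows are emitted; a trailing partial window is dropped.
    for start in range(0, len(samples) - window + 1, stride):
        yield start, samples[start:start + window]


def rms_energy(window):
    """Root-mean-square energy of one window (illustrative feature)."""
    return math.sqrt(sum(x * x for x in window) / len(window))
```

With these defaults, consecutive windows overlap by 900 ms, so each new prediction reuses most of the previous context while still updating every 100 ms.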

Model architecture

A lightweight CNN-Transformer hybrid processes acoustic features in real time. A separate language model head consumes partial transcripts and predicts syntactic completion probability. The two streams are fused in a cross-attention layer before the final turn-taking classifier.
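To make the fusion step concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It assumes acoustic frame embeddings act as queries and transcript token embeddings as keys and values; that direction, the single head, and the omission of learned projection matrices are simplifying assumptions, not details confirmed by the description above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Fuse two streams with single-head scaled dot-product attention.

    queries: acoustic frame embeddings, shape [n_frames][d]
    keys:    transcript token embeddings, shape [n_tokens][d]
    values:  transcript token embeddings, shape [n_tokens][d_v]
    Returns one fused vector per acoustic frame, shape [n_frames][d_v].
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        # Attention weights of this acoustic frame over all transcript tokens.
        weights = softmax([dot(q, k) * scale for k in keys])
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused
```

In practice this would run inside a tensor library with learned projections and multiple heads; the plain-Python version only shows how each acoustic frame draws a weighted summary from the linguistic stream.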

Training data

We collected 340 hours of human-human task-oriented dialogue in English, Japanese, and Spanish. Annotations mark turn boundaries, backchannels, overlaps, and pauses. For AI-specific training, we use Wizard-of-Oz simulations in which one participant believes they are speaking to an AI.

Latency requirements

The model must emit a prediction within 300 ms of a potential turn boundary to feel responsive. The false positive rate (the AI speaking before the user has finished) must stay below 5% to maintain trust. These constraints drive our model size and architecture choices.
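One way to relate the false positive budget to a decision threshold is sketched below. It assumes the model outputs a "speak now" score for each moment at which the user was in fact not done, and picks the lowest threshold that keeps false positives within budget. The function names and the scoring setup are hypothetical.

```python
def false_positive_rate(scores_when_not_done, threshold):
    """Fraction of 'user not done' moments where the model would speak anyway."""
    if not scores_when_not_done:
        raise ValueError("need at least one negative example")
    false_positives = sum(1 for s in scores_when_not_done if s >= threshold)
    return false_positives / len(scores_when_not_done)


def lowest_threshold_within_budget(scores_when_not_done, max_fpr=0.05):
    """Smallest candidate threshold whose false positive rate is <= max_fpr.

    Candidates are the observed scores plus +inf, which never fires and so
    always satisfies the budget.
    """
    candidates = sorted(set(scores_when_not_done)) + [float("inf")]
    for t in candidates:
        if false_positive_rate(scores_when_not_done, t) <= max_fpr:
            return t
```

Choosing the lowest admissible threshold keeps the system as responsive as the budget allows; raising it further would cut false positives at the cost of more missed or delayed turns.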

Current results

On held-out dialogue, the model achieves 94% turn-prediction accuracy with 287 ms average latency. The false positive rate is 4.2% in quiet conditions but rises to 7.1% with background noise, above the 5% target. The system correctly identifies 78% of implicit invitations (questions, direct address) without explicit cues.