Implicit Turn-Taking in Conversational AI
Inferring speech readiness without explicit wake words
Abstract
Wake words and push-to-talk create friction in human-AI interaction. Humans, by contrast, detect turn-taking cues from prosody, breathing patterns, and conversational context. We are building models that predict when a user has finished speaking or is inviting an AI contribution, so that an interface can join the conversation without an explicit trigger.
Problem Statement
Current voice assistants rely on explicit triggers such as wake words, buttons, or visual indicators. These mechanisms interrupt the conversational flow and prevent natural back-and-forth. The common alternative, silence-based endpoint detection, fails in noisy environments and triggers falsely when the speaker merely pauses mid-utterance.
Approach
We model turn-taking as a classification problem over acoustic and linguistic features. The system observes acoustic cues (prosodic boundaries, breathing patterns, filler sounds) and linguistic cues (syntactic completion, discourse markers, question intonation) to predict whether the speaker is yielding, holding, or inviting response.
Feature extraction
Acoustic features include F0 contour, energy envelope, pause duration, and spectral tilt, computed over a 1-second sliding window with a 100 ms stride. Linguistic features are extracted from partial ASR hypotheses and include dependency parse depth, presence of conjunctions, and utterance-final POS tags.
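The windowing scheme can be illustrated with a minimal sketch. Only the 1-second window and 100 ms stride come from the description above; the function names, the 16 kHz sample rate in the usage note, and the use of RMS as a stand-in for one point of the energy envelope are assumptions made for illustration, not the project's actual feature pipeline.

```python
import math
from typing import List, Sequence, Tuple


def sliding_windows(num_samples: int, sample_rate: int,
                    window_s: float = 1.0,
                    stride_s: float = 0.1) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of every full analysis window."""
    window = int(round(window_s * sample_rate))
    stride = int(round(stride_s * sample_rate))
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Only windows that fit entirely inside the signal are returned.
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, stride)]


def rms_energy(samples: Sequence[float], start: int, end: int) -> float:
    """Root-mean-square energy of one window (hypothetical envelope feature)."""
    frame = samples[start:end]
    return math.sqrt(sum(x * x for x in frame) / len(frame))
```

Under these assumptions, two seconds of 16 kHz audio (32,000 samples) produce 11 overlapping windows, so a new feature vector becomes available every 100 ms once the first second has been buffered.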
Model architecture
A lightweight CNN-Transformer hybrid processes acoustic features in real time. A separate language model head consumes partial transcripts and predicts syntactic completion probability. The two streams are fused in a cross-attention layer before the final turn-taking classifier.
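The fusion step can be sketched in a few lines to make the data flow concrete. This is a toy version of scaled dot-product cross-attention in pure Python, not the project's model: it assumes the linguistic state acts as the query over acoustic frames, omits the learned query/key/value projections a trained layer would have, and uses a hypothetical linear classifier (`class_weights`) over the concatenated vectors. The label names follow the yielding, holding, and inviting classes described in the Approach section.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]
LABELS = ("yield", "hold", "invite")


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query: Vector, keys: List[Vector],
                 values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over a sequence of frames."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    # Weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def classify_turn(linguistic_state: Vector, acoustic_frames: List[Vector],
                  class_weights: List[Vector]) -> Dict[str, float]:
    """Fuse both streams, then score the three turn-taking classes."""
    # Assumption: identity projections, so frames serve as keys and values.
    context = cross_attend(linguistic_state, acoustic_frames, acoustic_frames)
    fused = list(linguistic_state) + context  # simple concatenation
    probs = softmax([dot(row, fused) for row in class_weights])
    return dict(zip(LABELS, probs))
```

In this sketch the linguistic stream decides which acoustic frames matter, so a syntactically complete hypothesis can weight the frames around a prosodic boundary more heavily. The opposite direction (acoustic queries over linguistic tokens) would be an equally plausible reading of the description.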
Training data
We collected 340 hours of human-human task-oriented dialogue in English, Japanese, and Spanish. Annotations mark turn boundaries, backchannels, overlaps, and pauses. For AI-specific training, we use Wizard-of-Oz simulations in which one participant believes they are speaking to an AI.
Latency requirements
The model must emit a prediction within 300 ms of a potential turn boundary to feel responsive. The false positive rate (the AI speaking before the user has finished) must stay below 5% to maintain trust. These constraints drive our model size and architecture choices.
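One way to act on these two constraints is to pick the decision threshold from validation data. The sketch below is an assumption about how that could be done, not the project's procedure: it treats the false positive rate as the share of not-yet-finished moments at which the system would speak, and all function and constant names are hypothetical. Only the 300 ms and 5% limits come from the requirements above.

```python
from typing import List, Optional, Sequence

MAX_LATENCY_MS = 300.0          # from the stated latency requirement
MAX_FALSE_POSITIVE_RATE = 0.05  # from the stated trust requirement


def false_positive_rate(scores: Sequence[float], finished: Sequence[bool],
                        threshold: float) -> float:
    """Share of 'user not done' moments where the system would speak."""
    negatives = [s for s, done in zip(scores, finished) if not done]
    if not negatives:
        return 0.0
    return sum(s >= threshold for s in negatives) / len(negatives)


def lowest_safe_threshold(scores: List[float], finished: List[bool],
                          budget: float = MAX_FALSE_POSITIVE_RATE
                          ) -> Optional[float]:
    """Smallest observed score that keeps the false positive rate in budget."""
    # The rate can only fall as the threshold rises, so the first
    # passing candidate in ascending order is the lowest safe one.
    for threshold in sorted(set(scores)):
        if false_positive_rate(scores, finished, threshold) < budget:
            return threshold
    return None  # no observed score satisfies the budget


def meets_requirements(latency_ms: float, fpr: float) -> bool:
    """Check a measured operating point against both constraints."""
    return latency_ms <= MAX_LATENCY_MS and fpr < MAX_FALSE_POSITIVE_RATE
```

Choosing the lowest threshold that stays within budget keeps the system as responsive as the trust constraint allows; a higher threshold would reduce interruptions further at the cost of more missed turn boundaries.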
Current results
On held-out dialogue, the model achieves 94% turn-prediction accuracy with 287 ms average latency, within the 300 ms budget on average. The false positive rate is 4.2% in quiet conditions but rises to 7.1% with background noise, which exceeds the 5% target. The system correctly identifies 78% of implicit invitations (questions, direct address) without explicit cues.