Real-Time Fact Correction in Generated Speech
Detecting and correcting factual errors without breaking flow
Abstract
Large language models hallucinate. In spoken dialogue, the cost of an error is high and the window for correction is short. We are researching how to detect likely factual errors in real-time generated speech and issue corrections that feel like natural repair sequences rather than interruptions.
Problem Statement
When a spoken AI system states a factual error, the standard options are: (1) ignore it, which damages trust; (2) stop and correct explicitly, which breaks conversational flow; or (3) prevent errors through heavy retrieval augmentation, which adds latency. We want a fourth option: detect, correct, and continue, as a human speaker would.
Approach
We use a speculative execution model. As the LLM generates tokens, a fact-checking module queries a knowledge base in parallel. Facts that are too slow to look up in time are predicted heuristically. When a likely error is detected mid-utterance, the system inserts a repair sequence ('actually...', 'correction...') rather than aborting the utterance.
Speculative fact checking
The system maintains a buffer of the last N generated tokens. For each entity mention, it initiates a knowledge base query in parallel with continued generation. Entity linking runs on partial text using fast heuristics. Queries are batched and cached aggressively.
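The buffering and parallel lookup described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the class name SpeculativeChecker, the in-memory KNOWLEDGE_BASE, the substring-based entity linker, and the simulated 10 ms lookup delay are all assumptions made for the example. Batching is omitted; caching is shown by reusing one in-flight task per entity.

```python
import asyncio
from collections import deque

# Hypothetical in-memory knowledge base; a real system would query a remote service.
KNOWLEDGE_BASE = {
    "Eiffel Tower": {"city": "Paris"},
    "Mount Everest": {"country": "Nepal"},
}


class SpeculativeChecker:
    def __init__(self, buffer_size=32):
        self.buffer = deque(maxlen=buffer_size)  # last N generated tokens
        self.cache = {}    # entity -> asyncio.Task (in-flight or finished lookup)
        self.lookups = 0   # number of knowledge base queries actually issued

    async def _query(self, entity):
        self.lookups += 1
        await asyncio.sleep(0.01)  # stand-in for lookup latency (assumed value)
        return KNOWLEDGE_BASE.get(entity)

    def link_entities(self):
        # Fast heuristic on partial text: substring match against known names.
        text = " ".join(self.buffer)
        return [name for name in KNOWLEDGE_BASE if name in text]

    def on_token(self, token):
        # Called once per generated token; never blocks generation.
        self.buffer.append(token)
        for entity in self.link_entities():
            if entity not in self.cache:  # cache hit: reuse the existing task
                self.cache[entity] = asyncio.create_task(self._query(entity))

    async def results(self):
        # Wait for every outstanding lookup and return entity -> facts.
        return {entity: await task for entity, task in self.cache.items()}


async def demo():
    checker = SpeculativeChecker()
    utterance = "The Eiffel Tower is in Berlin , and the Eiffel Tower is tall"
    for token in utterance.split():
        checker.on_token(token)
        await asyncio.sleep(0)  # yield control, as token generation would
    return checker, await checker.results()
```

Running asyncio.run(demo()) issues a single lookup for the entity even though it is mentioned twice, because the second mention finds the cached task.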
Confidence modeling
Not all errors need correction. We model correction necessity as a function of error severity (outright factual error vs. nuance), user expertise (novice vs. expert), and conversational context (instructional vs. casual). The model learns from human feedback which errors warrant interruption.
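One simple way to express such a function is a logistic score over the three factors. The sketch below is illustrative only: the feature encoding, the weight values, the sign chosen for user expertise, and the 0.5 threshold are all assumptions, standing in for parameters that would be learned from human feedback.

```python
import math
from dataclasses import dataclass


@dataclass
class Context:
    severity: float        # 0.0 = nuance, 1.0 = outright factual error
    user_expertise: float  # 0.0 = novice, 1.0 = expert
    instructional: float   # 0.0 = casual chat, 1.0 = instructional setting


# Hypothetical weights for illustration; a trained model would supply these.
WEIGHTS = {"severity": 4.0, "user_expertise": -1.0, "instructional": 2.0}
BIAS = -3.0


def correction_probability(ctx):
    """Logistic estimate of how likely a correction is warranted."""
    z = (
        BIAS
        + WEIGHTS["severity"] * ctx.severity
        + WEIGHTS["user_expertise"] * ctx.user_expertise
        + WEIGHTS["instructional"] * ctx.instructional
    )
    return 1.0 / (1.0 + math.exp(-z))


def should_correct(ctx, threshold=0.5):
    """Interrupt only when the estimated necessity clears the threshold."""
    return correction_probability(ctx) >= threshold
```

With these assumed weights, a clear factual error made to a novice in an instructional setting is corrected, while a minor nuance in casual talk with an expert is left alone.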
Repair strategies
Human dialogue uses specific repair patterns: same-turn self-correction ('I mean...'), next-turn other-correction, and embedded corrections. We implement these as learned strategies, generating the repair in the same prosodic contour as the surrounding speech to minimize disruption.
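To make the three patterns concrete, the sketch below picks a strategy and renders the repair text. It is a rule-based stand-in for the learned strategies described above, and it does not model prosody. The names Repair, choose_repair, and render_repair, the 0.5 severity cutoff, and the phrasing templates are assumptions for the example.

```python
from enum import Enum


class Repair(Enum):
    SAME_TURN = "same_turn"  # self-correction inside the current utterance
    EMBEDDED = "embedded"    # continue with the right term, no explicit marker
    NEXT_TURN = "next_turn"  # correction deferred to the following turn


def choose_repair(turn_in_progress, severity):
    """Rule-based placeholder for a learned strategy selector."""
    if not turn_in_progress:
        return Repair.NEXT_TURN
    # Assumed cutoff: flag serious errors explicitly, fix minor ones quietly.
    return Repair.SAME_TURN if severity >= 0.5 else Repair.EMBEDDED


def render_repair(strategy, wrong, right):
    """Return the words to speak for the chosen repair strategy."""
    if strategy is Repair.SAME_TURN:
        return f"{wrong}, I mean, {right}"
    if strategy is Repair.EMBEDDED:
        return right
    return f"Actually, I said {wrong} earlier; it should be {right}."
```

For example, an error caught mid-utterance with high severity yields a same-turn repair such as "Berlin, I mean, Paris".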
Latency engineering
The critical path is entity linking and knowledge retrieval. We use approximate nearest-neighbor search over entity embeddings (10 ms), speculative caching of likely follow-up facts, and tiered verification (fast check vs. deep check). Current end-to-end detection latency is 240 ms.
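Tiered verification can be sketched as a fast cache check followed by a deeper lookup that runs only if time remains. Everything named here is hypothetical: the functions fast_check, deep_check, and verify, the dictionary-backed stores, and the time budget (its default reuses the 240 ms figure purely for illustration; the source does not state that this is a budget).

```python
import time


def fast_check(claim, cache):
    """Tier 1: cache lookup. Returns True/False, or None if unknown."""
    return cache.get(claim)


def deep_check(claim, knowledge_base):
    """Tier 2: slower authoritative lookup (a dict stands in for it here)."""
    return knowledge_base.get(claim)


def verify(claim, cache, knowledge_base, budget_s=0.240, clock=time.monotonic):
    """Return (verdict, tier). verdict is None when the claim is unverified."""
    start = clock()
    verdict = fast_check(claim, cache)
    if verdict is not None:
        return verdict, "fast"
    if clock() - start >= budget_s:
        # Out of time: skip the deep check rather than stall the speech.
        return None, "skipped"
    verdict = deep_check(claim, knowledge_base)
    if verdict is not None:
        cache[claim] = verdict  # later mentions resolve in the fast tier
    return verdict, "deep"
```

A claim resolved by the deep tier is written back to the cache, so a repeated mention of the same claim is answered by the fast tier.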
Evaluation challenges
Standard NLP benchmarks do not capture the conversational repair dynamic. We use simulated dialogue where confederates introduce errors and measure: detection rate, correction acceptance, flow disruption (measured by user turn latency), and trust restoration. Human evaluators rate correction naturalness on a 5-point scale.
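The measures above could be aggregated from per-trial records roughly as follows. The Trial fields, the summarize function, and the definition of flow disruption as mean user turn latency on corrected trials minus a no-error baseline are assumptions for this sketch. Trust restoration is left out because the text does not define how it is scored.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Trial:
    error_injected: bool
    error_detected: bool
    correction_accepted: bool
    turn_latency_s: float       # user turn latency after the system utterance
    naturalness: Optional[int]  # 1-5 rating; None when no correction was issued


def summarize(trials, baseline_latency_s):
    """Aggregate detection, acceptance, disruption, and naturalness."""
    injected = [t for t in trials if t.error_injected]
    detected = [t for t in injected if t.error_detected]
    ratings = [t.naturalness for t in detected if t.naturalness is not None]
    return {
        "detection_rate": len(detected) / len(injected),
        "acceptance_rate": sum(t.correction_accepted for t in detected) / len(detected),
        # Assumed definition: extra user latency relative to a no-error baseline.
        "flow_disruption_s": mean(t.turn_latency_s for t in detected) - baseline_latency_s,
        "mean_naturalness": mean(ratings),
    }
```

The function assumes at least one injected and one detected error; a production version would guard against empty groups.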