Voice Identity Preservation in Neural Translation
Preserving speaker characteristics across language boundaries
Abstract
Most neural translation systems optimize for semantic accuracy and discard paralinguistic features—accent, prosody, rhythm, and vocal identity. We are researching how to preserve these characteristics when translating speech across languages, enabling real-time communication that sounds like the speaker, not a generic synthetic voice.
Problem Statement
Most current speech-to-speech translation pipelines chain separate ASR, MT, and TTS components. The TTS stage typically uses a language-specific voice or a speaker-agnostic model, stripping away the source speaker's vocal fingerprint. This creates an uncanny valley where the words are correct but the voice is wrong.
Approach
We treat voice as a separable latent representation. Our approach encodes speaker identity into a compact embedding that is language-agnostic, then conditions a multilingual neural vocoder on this embedding during synthesis. The challenge is disentangling content, language, and speaker while maintaining real-time latency constraints.
Architecture
The system consists of three trainable components: a content encoder that extracts phonetic representations, a speaker encoder that captures timbre and prosody, and a multilingual decoder that reconstructs speech conditioned on both encodings plus a language token.
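The data flow between the three components can be sketched as follows. This is a minimal illustration using plain Python lists, not the trained models: the function bodies are placeholders, and the names `content_encoder`, `speaker_encoder`, `decoder_inputs`, the language list, and the dimensions are all assumptions made for the example.

```python
import math
from typing import List

Vector = List[float]

LANGUAGES = ["en", "es", "ja"]  # hypothetical language inventory
CONTENT_DIM = 4                 # hypothetical sizes, chosen for readability
SPEAKER_DIM = 3


def content_encoder(frames: List[Vector]) -> List[Vector]:
    """Placeholder content encoder: returns one content vector per input frame."""
    return [[math.tanh(sum(f) * (k + 1)) for k in range(CONTENT_DIM)] for f in frames]


def speaker_encoder(frames: List[Vector]) -> Vector:
    """Placeholder speaker encoder: mean-pools over time into one fixed-size vector."""
    n = len(frames)
    pooled = [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]
    return pooled[:SPEAKER_DIM]


def language_token(lang: str) -> Vector:
    """One-hot vector identifying the target language."""
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language: {lang}")
    return [1.0 if lang == name else 0.0 for name in LANGUAGES]


def decoder_inputs(content: List[Vector], speaker: Vector, lang: str) -> List[Vector]:
    """Build per-frame decoder inputs: content vector + speaker vector + language token.

    The speaker vector and language token are constant across frames;
    only the content part changes over time.
    """
    token = language_token(lang)
    return [c + speaker + token for c in content]
```

The point of the sketch is the shape of the conditioning: content varies per frame, while the speaker embedding and language token are broadcast unchanged to every frame, so swapping the language token leaves the speaker part of the input untouched.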
Disentanglement challenges
Perfect disentanglement is impossible because accent carries both speaker and language information. We use adversarial training to minimize language information in the speaker embedding: a language classifier tries to predict the language from the embedding, and the weight on its loss sets the trade-off between language invariance and speaker fidelity. Current results show 73% speaker similarity retention while keeping translation BLEU within 2 points of the baseline.
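One way to write down such an adversarial objective is shown below. This is a sketch under assumptions, not the project's actual loss: the names `classifier_loss`, `encoder_loss`, and `adv_weight` are hypothetical, and the reconstruction loss is passed in as a plain number. The language classifier minimizes its cross-entropy, while the speaker encoder is rewarded when that cross-entropy is high.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def classifier_loss(lang_logits: List[float], lang_id: int) -> float:
    """Loss minimized by the language classifier (the adversary)."""
    return cross_entropy(lang_logits, lang_id)


def encoder_loss(recon_loss: float, lang_logits: List[float],
                 lang_id: int, adv_weight: float) -> float:
    """Loss minimized by the speaker encoder.

    Reconstruction is rewarded as usual; the adversary's cross-entropy is
    subtracted, so the encoder benefits when the language cannot be predicted
    from the speaker embedding. adv_weight (hypothetical name) is the trade-off.
    """
    return recon_loss - adv_weight * cross_entropy(lang_logits, lang_id)
```

With `adv_weight = 0` the encoder ignores the adversary entirely; larger values push harder toward language invariance at the cost of speaker detail. In gradient-based frameworks the same effect is commonly obtained with a gradient reversal layer between the speaker embedding and the classifier.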
Real-time constraints
For live translation, we use a chunked streaming approach with a 500 ms look-ahead. The speaker embedding is computed once at the start of the conversation and cached. Each chunk passes through the content encoder and decoder independently, with prosody smoothing applied across chunk boundaries.
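The chunking and caching logic described above might look roughly like this. It is a sketch only: the sample rate and chunk length are assumptions (the text specifies only the 500 ms look-ahead), the names `stream_chunks` and `CachedSpeakerEmbedding` are hypothetical, and prosody smoothing is left out.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
CHUNK_MS = 1_000       # assumed chunk length; not given in the text
LOOKAHEAD_MS = 500     # look-ahead stated in the text


def ms_to_samples(ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return ms * sample_rate // 1000


def stream_chunks(samples: Sequence[float], chunk_len: int,
                  lookahead: int) -> Iterator[Tuple[Sequence[float], Sequence[float]]]:
    """Yield (chunk, lookahead_context) pairs in order.

    The context is the audio immediately after the chunk; it is shorter
    (possibly empty) near the end of the signal.
    """
    for start in range(0, len(samples), chunk_len):
        end = start + chunk_len
        yield samples[start:end], samples[end:end + lookahead]


class CachedSpeakerEmbedding:
    """Compute the speaker embedding once, then reuse it for every chunk."""

    def __init__(self, encoder: Callable[[Sequence[float]], List[float]]):
        self._encoder = encoder
        self._embedding: Optional[List[float]] = None
        self.calls = 0  # number of times the encoder actually ran

    def get(self, enrollment_audio: Sequence[float]) -> List[float]:
        if self._embedding is None:
            self._embedding = self._encoder(enrollment_audio)
            self.calls += 1
        return self._embedding
```

Because each chunk needs its look-ahead context before it can be processed, the look-ahead is a lower bound on the added latency: with these settings a chunk cannot be emitted earlier than 500 ms after its last sample arrives.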
Evaluation methodology
We evaluate on three axes: translation quality (BLEU, chrF++), speaker similarity (cosine similarity of speaker embeddings, ABX tests), and naturalness (mean opinion score, MOS). Human evaluation uses pairwise comparison: a speaker's native speech in language A versus the same speaker's speech translated into language A from language B.
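The embedding-based part of the speaker similarity axis can be illustrated with a short sketch. Cosine similarity is computed as usual; the ABX function is an embedding-based analogue of an ABX trial (an assumption here, since ABX tests are often run with human listeners), and the names `abx_correct` and `abx_accuracy` are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def abx_correct(a: Sequence[float], b: Sequence[float], x: Sequence[float]) -> bool:
    """Embedding-based ABX trial.

    X is known to come from the same speaker as A. The trial counts as
    correct when X is closer to A than to B.
    """
    return cosine_similarity(a, x) > cosine_similarity(b, x)


def abx_accuracy(trials: Iterable[Tuple[Sequence[float], Sequence[float], Sequence[float]]]) -> float:
    """Fraction of (A, B, X) trials scored as correct."""
    results = [abx_correct(a, b, x) for a, b, x in trials]
    return sum(results) / len(results)
```

In this setting A would be an embedding of the speaker's native speech, X an embedding of the translated output, and B an embedding from a different speaker.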
Current limitations
Emotional prosody transfer remains challenging: translated speech sounds neutral even when the source speech is emotional. Code-switching within an utterance causes speaker embedding drift. Voices with extreme pitch (very high or very low) show degraded similarity metrics.