R01 · Active research

Voice Identity Preservation in Neural Translation

Preserving speaker characteristics across language boundaries

speech, translation, tts, voice-cloning

Abstract

Most neural translation systems optimize for semantic accuracy and discard paralinguistic features—accent, prosody, rhythm, and vocal identity. We are researching how to preserve these characteristics when translating speech across languages, enabling real-time communication that sounds like the speaker, not a generic synthetic voice.

Problem Statement

Current speech-to-speech translation pipelines use separate ASR, MT, and TTS components. The TTS stage typically uses a language-specific voice or a speaker-agnostic model, stripping away the source speaker's vocal fingerprint. This creates an uncanny valley where the words are correct but the voice is wrong.

Approach

We treat voice as a separable latent representation. Our approach encodes speaker identity into a compact embedding that is language-agnostic, then conditions a multilingual neural vocoder on this embedding during synthesis. The challenge is disentangling content, language, and speaker while maintaining real-time latency constraints.

Architecture

The system consists of three trainable components: a content encoder that extracts phonetic representations, a speaker encoder that captures timbre and prosody, and a multilingual decoder that reconstructs speech conditioned on both encodings plus a language token.
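To make the data flow concrete, here is a minimal wiring sketch of the three components. Everything in it is illustrative: the class name `VoicePreservingTranslator`, the `toy_*` functions, and the list-based signal types are assumptions standing in for trained networks, and the translation step itself is left out. It only shows that the decoder receives content features, one speaker embedding, and a language token.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class VoicePreservingTranslator:
    """Hypothetical wiring of the three components; not the project's actual code."""
    content_encoder: Callable[[List[float]], List[Vector]]        # audio -> per-frame content features
    speaker_encoder: Callable[[List[float]], Vector]              # audio -> one fixed-size embedding
    decoder: Callable[[List[Vector], Vector, str], List[float]]   # (content, speaker, language token) -> audio

    def synthesize(self, source_audio: List[float],
                   reference_audio: List[float],
                   target_lang: str) -> List[float]:
        content = self.content_encoder(source_audio)
        speaker = self.speaker_encoder(reference_audio)
        return self.decoder(content, speaker, target_lang)


# Toy stand-ins so the sketch runs; real components would be neural networks.
def toy_content_encoder(audio: List[float]) -> List[Vector]:
    # Split the signal into frames of 4 samples.
    return [audio[i:i + 4] for i in range(0, len(audio), 4)]


def toy_speaker_encoder(audio: List[float]) -> Vector:
    # Two crude statistics: mean and peak amplitude.
    return [sum(audio) / len(audio), max(abs(x) for x in audio)]


def toy_decoder(content: List[Vector], speaker: Vector, lang_token: str) -> List[float]:
    # Scale every sample by the speaker's peak amplitude; the language token is unused here.
    gain = speaker[1]
    return [gain * x for frame in content for x in frame]


model = VoicePreservingTranslator(toy_content_encoder, toy_speaker_encoder, toy_decoder)
out = model.synthesize([1.0] * 8, [0.5, -0.5], "es")
```

Keeping the speaker encoder's input separate from the content encoder's input reflects the idea that identity can be taken from any reference audio of the speaker, not only the utterance being translated.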

Disentanglement challenges

Perfect disentanglement is impossible because accent carries both speaker and language information. We use adversarial training to minimize language information in the speaker embedding; the weight on the language classifier loss acts as the trade-off parameter. Current results show 73% speaker similarity retention while keeping translation BLEU scores within 2 points of baseline.
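One common way to set up this kind of adversarial objective is sketched below. It is an assumption, not necessarily the formulation used in this project: the language classifier minimizes its own cross-entropy, while the speaker encoder minimizes the reconstruction loss minus a weighted copy of that cross-entropy, so the encoder is rewarded when the classifier fails. The names `encoder_objective` and `adv_weight` are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-likelihood of the true class, clamped to avoid log(0)."""
    return -math.log(max(probs[target_index], 1e-12))


def encoder_objective(reconstruction_loss: float,
                      lang_probs: Sequence[float],
                      true_lang: int,
                      adv_weight: float) -> float:
    """Loss the speaker encoder minimizes (illustrative formulation).

    lang_probs is the language classifier's predicted distribution given the
    speaker embedding. A larger adv_weight pushes harder toward embeddings
    from which the language cannot be predicted.
    """
    classifier_loss = cross_entropy(lang_probs, true_lang)
    return reconstruction_loss - adv_weight * classifier_loss


# The encoder's loss is lower when the classifier is confused (uniform guess)
# than when it identifies the language confidently.
confused = encoder_objective(1.0, [0.5, 0.5], true_lang=0, adv_weight=0.1)
confident = encoder_objective(1.0, [0.99, 0.01], true_lang=0, adv_weight=0.1)
```

Setting `adv_weight` to zero recovers plain reconstruction training; raising it trades speaker detail for language invariance, which is the tension the paragraph above describes.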

Real-time constraints

For live translation, we use a chunked streaming approach with 500 ms look-ahead. The speaker embedding is computed once at conversation start and cached. Each chunk passes through the content encoder and decoder independently, with prosody smoothing applied across chunk boundaries.
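The chunking and caching described above can be sketched as follows. This is a simplified illustration under stated assumptions: `StreamingSession` and `chunks_with_lookahead` are hypothetical names, audio is a plain list of samples, the chunk length is a free parameter (the text only fixes the look-ahead), and prosody smoothing across boundaries is omitted. At an assumed 16 kHz sample rate, 500 ms of look-ahead would be 8000 samples.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

Vector = List[float]


def chunks_with_lookahead(samples: Sequence[float],
                          chunk_len: int,
                          lookahead_len: int) -> Iterator[Tuple[Sequence[float], Sequence[float]]]:
    """Yield (chunk, lookahead) pairs; the lookahead is future context only."""
    for start in range(0, len(samples), chunk_len):
        end = min(start + chunk_len, len(samples))
        yield samples[start:end], samples[end:end + lookahead_len]


class StreamingSession:
    """Caches the speaker embedding once, then processes chunks independently."""

    def __init__(self,
                 speaker_encoder: Callable[[Sequence[float]], Vector],
                 process_chunk: Callable[[Sequence[float], Sequence[float], Vector], List[float]]):
        self._speaker_encoder = speaker_encoder
        self._process_chunk = process_chunk  # stand-in for content encoder + decoder
        self._speaker_embedding: Optional[Vector] = None

    def run(self, enrollment_audio: Sequence[float], samples: Sequence[float],
            chunk_len: int, lookahead_len: int) -> List[float]:
        if self._speaker_embedding is None:
            # Computed once at conversation start, reused for every later chunk.
            self._speaker_embedding = self._speaker_encoder(enrollment_audio)
        output: List[float] = []
        for chunk, future in chunks_with_lookahead(samples, chunk_len, lookahead_len):
            output.extend(self._process_chunk(chunk, future, self._speaker_embedding))
        return output


# Toy usage: count how often the speaker encoder runs across two calls.
encoder_calls = []


def toy_speaker_encoder(audio: Sequence[float]) -> Vector:
    encoder_calls.append(1)
    return [1.0]


def toy_process(chunk, future, embedding):
    return [x * embedding[0] for x in chunk]


session = StreamingSession(toy_speaker_encoder, toy_process)
first = session.run([0.0], list(range(10)), chunk_len=4, lookahead_len=2)
second = session.run([0.0], list(range(10)), chunk_len=4, lookahead_len=2)
```

The look-ahead adds directly to end-to-end latency, since a chunk cannot be processed until its future context has arrived.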

Evaluation methodology

We evaluate on three axes: translation quality (BLEU, chrF++), speaker similarity (cosine similarity of speaker embeddings, ABX tests), and naturalness (MOS scores). Human evaluation uses pairwise comparison: the same speaker speaking language A natively versus that speaker's speech translated into language A from language B.
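For reference, the cosine similarity used for the speaker-similarity axis can be computed as below. This is a generic sketch: the embeddings are plain lists of floats, and how they are extracted (which speaker encoder, which layer) is outside its scope.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Identical directions score 1, orthogonal directions score 0.
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the measure ignores vector length, it compares the direction of two embeddings only, so a louder or quieter recording of the same voice is not penalized if the encoder maps it to a scaled version of the same vector.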

Current limitations

Emotional prosody transfer remains challenging—translated speech sounds neutral even when source speech is emotional. Code-switching within utterances causes speaker embedding drift. Extreme pitch ranges (very high or very low voices) show degradation in similarity metrics.