Speech-to-Speech Translation


Speech-to-speech translation (S2ST) systems aim to facilitate communication between speakers of different languages. Unlike speech-to-text translation models, they enable more natural and direct conversations between multiple people. Applications are countless: from live video conferences between businesses worldwide to live international events.

Early S2ST systems were divided into three components: automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. These cascaded systems suffer from three main pitfalls: error propagation, high latency, and, most importantly, the loss of the speaker's emotions and voice characteristics. To solve these issues, direct models, also called end-to-end models, which transform audio in one language into audio in another using a single sequence-to-sequence model, have been widely adopted in recent years. The same period has also seen the development of large multimodal models combining several AI tasks, including end-to-end S2ST.
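
To make the contrast concrete, here is a minimal structural sketch of the two designs in Python. Every function in it (asr, mt, tts, direct_s2st) is a hypothetical stub standing in for a real model, not an actual API:

    # Cascaded vs. direct S2ST, structurally. All functions are hypothetical stubs.
    def asr(audio: bytes) -> str: ...            # source speech -> source text
    def mt(text: str) -> str: ...                # source text -> target text
    def tts(text: str) -> bytes: ...             # target text -> target speech
    def direct_s2st(audio: bytes) -> bytes: ...  # a single sequence-to-sequence model

    def cascaded(audio: bytes) -> bytes:
        # Each stage sees only the previous stage's output: recognition errors
        # propagate, latency accumulates, and the speaker's voice is lost at
        # the text bottleneck.
        return tts(mt(asr(audio)))

    def end_to_end(audio: bytes) -> bytes:
        # One model maps source speech directly to target speech, so voice
        # and prosody can be preserved.
        return direct_s2st(audio)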

What Sets Our Data Apart?

Both end-to-end S2ST models and multimodal models face a significant challenge: their size and the complexity of their tasks require vast amounts of data. To partially overcome the scarcity of S2ST data, these models use auxiliary ASR and MT losses during training and leverage large quantities of weakly labeled S2ST data. For instance, text-to-speech models can generate synthetic target voices, which are then used as pseudo-labels during training. While such data offers a vast volume of parallel hours, it often lacks consistency and expressiveness in the speakers' voices within a given pair of parallel audio. At LandoSpeech, we provide datasets that address these challenges.
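
As a rough illustration of this multi-task setup, the sketch below combines the main S2ST objective with weighted auxiliary ASR and MT losses. The scalar loss values are dummies, and the 0.3 weights are an assumption for illustration rather than a recommended setting:

    import torch

    # Dummy scalars standing in for the losses of an end-to-end S2ST model's
    # three heads; in real training each comes from a decoder over a batch.
    loss_s2st = torch.tensor(2.3)  # main speech-to-speech objective
    loss_asr = torch.tensor(1.1)   # auxiliary loss on source transcription
    loss_mt = torch.tensor(0.9)    # auxiliary loss on text translation

    # Assumed weights; these are tuned hyperparameters in practice.
    lambda_asr, lambda_mt = 0.3, 0.3

    total_loss = loss_s2st + lambda_asr * loss_asr + lambda_mt * loss_mt
    print(f"total training loss: {total_loss.item():.2f}")  # -> 2.90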

Same Speaker

The primary objective of speech translation systems is to faithfully reproduce not only the speaker's words but also their voice, rhythm, and emotional expression. Our S2ST datasets provide audio pairs recorded by the same speaker, with matched expressiveness, under identical recording conditions.

Thousands of Hours

Non-synthetic S2ST data is exceptionally scarce, and datasets containing parallel audio from the same speaker are often limited to just a few hours. At LandoSpeech, we are actively collecting thousands of hours of such data across numerous language pairs.

Diversity

Our dataset comprises a diverse range of speech types, including read, spontaneous, conversational and meeting speech, featuring speakers from various backgrounds discussing different topics in diverse environments. This diversity enables the training of robust models that are resilient to distribution shifts.

Text Labels and Metadata

The training complexity of S2ST is mitigated by using intermediate ASR and MT losses, as illustrated above. To support such multi-task training, we provide transcriptions for all audio, along with detailed metadata about the speakers, their language proficiency, and the talk topics.

Elevate Your Performance Metrics

In the following table, we compare the characteristics of our datasets with the most commonly used public S2ST datasets. We have excluded well-known speech translation datasets such as Fisher-Callhome, MuST-C, and CoVoST 2 because they lack target speech, containing only translated text as targets. The table highlights that the unique features of LandoSpeech datasets significantly enhance your model's performance, as measured by metrics such as ASR-BLEU, speaker embedding similarity, prosody consistency, and mean opinion score.

  • The unparalleled diversity, exceptional translation quality, and extensive volume of our datasets will significantly enhance your translation accuracy.

  • Because your model is trained on the same speakers in both languages, the speaker embeddings it extracts will exhibit very high cosine similarity for each audio pair (see the sketch after this list).

  • Since our speakers maintained a consistent tone and prosody across both languages for each utterance pair, your model's prosody consistency will be significantly enhanced. Moreover, since speaking rhythm and pauses are consistently matched in each pair, the rhythm of your model's output will be greatly improved.

  • Trained on a vast array of accents, ages, and tones from real human voices, your model will produce speech of unmatched naturalness, clarity, and quality.
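
To make two of these metrics concrete: speaker embedding similarity is typically the cosine similarity between embeddings of the source and translated audio, extracted by a speaker encoder (for example an x-vector or ECAPA-TDNN model), while ASR-BLEU transcribes the generated speech with an ASR system and scores the transcript with BLEU. Below is a minimal sketch of the cosine part; the 4-dimensional vectors are toy values, as real embeddings typically have a few hundred dimensions:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # 1.0 means the embeddings point in the same direction, i.e. the
        # source and translated audio sound like the same speaker.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy embeddings of a source utterance and its translated counterpart.
    emb_source = np.array([0.90, 0.10, 0.40, 0.20])
    emb_translated = np.array([0.85, 0.15, 0.38, 0.22])

    print(f"speaker similarity: {cosine_similarity(emb_source, emb_translated):.3f}")
    # -> speaker similarity: 0.998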

Want to learn more?