Speech-to-Speech Translation



The goal of Speech-to-Speech Translation (SST) is to enable real-time interpersonal communication through natural spoken language between people who do not share a common language. It translates a speech signal in a source language into a speech signal in a target language.

The evaluation of SST systems can be considered an extension of MT evaluation (namely Spoken Language Translation, SLT) that includes speech recognition and speech synthesis in the evaluation loop, making it an end-to-end evaluation.



The SLT component usually operates on the output produced by the automatic speech recognition (ASR) component and provides the input for the speech synthesis component. Speech translation evaluation can be either single-component or end-to-end. The former assesses quality from the output of the respective component, while the latter assesses quality from the final output of the whole system.
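The cascade described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `PipelineTrace`, `run_pipeline`, and the three toy components are hypothetical stand-ins, not part of any real SST system. The point is that keeping the intermediate outputs makes both kinds of evaluation possible: single-component evaluation looks at `transcript` or `translation`, while end-to-end evaluation looks only at the final `audio`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineTrace:
    """Outputs of each stage, kept so each one can be evaluated separately."""
    transcript: str    # ASR output: source-language text
    translation: str   # SLT output: target-language text
    audio: bytes       # synthesis output: target-language speech


def run_pipeline(
    speech: bytes,
    asr: Callable[[bytes], str],
    slt: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> PipelineTrace:
    """Run the cascade ASR -> SLT -> speech synthesis and record every stage."""
    transcript = asr(speech)
    translation = slt(transcript)
    audio = tts(translation)
    return PipelineTrace(transcript, translation, audio)


# Toy components (assumptions for illustration only): "speech" is just
# UTF-8 encoded text, and translation is a word-by-word dictionary lookup.
TOY_LEXICON = {"hello": "bonjour", "world": "monde"}


def toy_asr(speech: bytes) -> str:
    return speech.decode("utf-8")


def toy_slt(text: str) -> str:
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


trace = run_pipeline(b"hello world", toy_asr, toy_slt, toy_tts)
# Single-component evaluation would inspect trace.transcript or
# trace.translation; end-to-end evaluation would inspect trace.audio only.
```

In a real system each stage would be a trained model, and an error in `transcript` would propagate into `translation` and `audio`, which is one reason the two evaluation modes can give different pictures of the same system.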

End-to-end evaluations examine a system in its whole configuration and functionality. Single-component evaluations focus on the individual speech translation modules: speech recognition, machine translation, and speech synthesis. Each component's own metrics are then used, although their interpretation may differ in the SST context.
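The text does not name the component metrics. As one illustration, word error rate (WER) is a commonly used metric for speech recognition output: the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization, both of which are simplifications; the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                       # deletion
                    curr[j - 1] + 1,                   # insertion
                    prev[j - 1] + substitution_cost,   # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.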



Depending on the evaluation criteria, several measures can be used for the end-to-end evaluation. They typically fall into two main categories: the first estimates the audio quality of the output, while the second estimates how well its meaning is preserved. Evaluating audio quality is relatively simple, since it uses metrics very similar to those of speech synthesis evaluation. Evaluating meaning preservation is more complex and can be done with either subjective or objective measures.

Subjective evaluation uses the assessments of human judges (users and/or experts) to estimate the loss of meaning between the input, in the source language, and the output, in the target language. Several methods can be employed, such as asking the judges questions about the content or asking them to rewrite what they heard. Generally, the SST system is compared (directly or indirectly) to a reference, typically a human interpreter.
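As a hedged illustration of the question-based approach, the sketch below scores meaning preservation as the fraction of comprehension questions a judge answers correctly after listening to the output, and then relates the SST system's score to that of a human interpreter. The scoring scheme, the function names, and the sample data are assumptions made for this example; the text does not prescribe a specific formula.

```python
from typing import Sequence


def comprehension_score(answers_correct: Sequence[bool]) -> float:
    """Fraction of comprehension questions answered correctly."""
    if not answers_correct:
        raise ValueError("at least one question is required")
    return sum(answers_correct) / len(answers_correct)


def relative_preservation(
    system_answers: Sequence[bool],
    interpreter_answers: Sequence[bool],
) -> float:
    """System score divided by the human interpreter's score on the same questions."""
    reference = comprehension_score(interpreter_answers)
    if reference == 0:
        raise ValueError("interpreter score is zero; the ratio is undefined")
    return comprehension_score(system_answers) / reference


# Hypothetical judgments for four questions about the same source speech.
system = [True, True, False, False]          # after hearing the SST output
interpreter = [True, True, True, False]      # after hearing the interpreter

system_score = comprehension_score(system)
interpreter_score = comprehension_score(interpreter)
ratio = relative_preservation(system, interpreter)
```

A ratio close to 1 would suggest that listeners understood the system output about as well as the interpreter's rendering; in practice, scores would be averaged over many judges and utterances.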

Objective evaluation produces the same kind of measure, but without relying on judges' assessments. One or several experts check the SST output against a reference, so that the results are not biased by human factors (such as fatigue, noise, etc.).




- TC-STAR, Technology and Corpora for Speech to Speech Translation, 6th FP project (2004-2007).

- LC-STAR, Lexica and Corpora for Speech-to-Speech Translation Components (2002-2005).

- NESPOLE!, NEgotiating through SPOken Language in E-commerce, 5th FP project (2000-2002).

- TONGUES, Rapid Development of Speech-to-Speech Translation System (2000-2002).

- Verbmobil, German project on Mobile Speech-to-Speech Translation of Spontaneous Dialogs (1996-2000).






For further information on research, campaigns, conferences, software and data regarding speech-to-speech translation and its evaluation, please refer to the Machine Translation Archive.