Speech Synthesis, also often referred to as Text-To-Speech (TTS) processing, consists of converting written input into spoken output by automatically generating synthetic speech.
TTS systems generally consist of three modules:
- Text Processing,
- Prosody Generation,
- Acoustic Synthesis.
The first step in a TTS system is text processing. The input text is analyzed and transformed into a linguistic representation containing all the information needed by the subsequent TTS steps. Typical text processing operations are:
- Special words or symbols (numbers, acronyms, abbreviations, etc.) are identified in the input text and normalized (usually expanded in full text form).
- Each word in the input text is assigned a part-of-speech category (a POS tag) that determines its grammatical function.
- A phonetic lexicon and a set of rules are used to produce the appropriate phonetic transcription of the input text (Grapheme-to-phoneme conversion).
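Grapheme-to-phoneme conversion can be sketched as a lexicon lookup with a letter-to-sound rule fallback. The following is a toy illustration; the lexicon entries, phoneme symbols, and single-letter rules are purely illustrative (real systems use context-dependent rules or statistical models):

```python
# Illustrative phonetic lexicon (ARPAbet-like symbols, hypothetical entries).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "the": ["DH", "AH"],
}

# Naive single-letter fallback rules; real letter-to-sound rules
# depend on the surrounding letter context.
LETTER_RULES = {"c": "K", "a": "AE", "t": "T", "s": "S"}

def g2p(word):
    """Return a phoneme sequence: lexicon lookup first, rule fallback otherwise."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_RULES.get(ch, ch.upper()) for ch in word]

print(g2p("speech"))  # lexicon hit: ['S', 'P', 'IY', 'CH']
print(g2p("cats"))    # rule fallback: ['K', 'AE', 'T', 'S']
```

Words missing from the lexicon are the main source of G2P errors, which is why this component is evaluated separately (see the error rates below).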
Prosody is the set of speech features that allows the same phonetic sound to be uttered in very different ways. These features include intonation (tone, pitch contour), speech rate, segment duration, phrase breaks, stress level and voice quality. Prosody plays a fundamental role in conveying meaning, attitude and intention, and in producing natural speech. The objective of the prosodic TTS module is to generate prosodic features that bring the intonation of the final synthesized speech as close as possible to natural human intonation. In most TTS applications it is essential to produce expressive speech.
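As a minimal sketch of what a prosody module produces, the toy function below generates a pitch (F0) contour as a linear declination line with an optional accent peak. The model, frame counts, and Hz values are illustrative assumptions, not a real prosody generation algorithm:

```python
def pitch_contour(n_frames, f0_start=220.0, f0_end=180.0,
                  accent_frame=None, accent_gain=30.0):
    """Toy F0 contour: linear declination plus an optional triangular
    accent bump centered on `accent_frame` (all values in Hz)."""
    contour = []
    for i in range(n_frames):
        # Linear declination from f0_start down to f0_end.
        f0 = f0_start + (f0_end - f0_start) * i / (n_frames - 1)
        if accent_frame is not None:
            # Triangular bump: full gain at the accent, fading over 3 frames.
            f0 += max(0.0, accent_gain * (1 - abs(i - accent_frame) / 3))
        contour.append(round(f0, 1))
    return contour

print(pitch_contour(5))                   # plain declination
print(pitch_contour(5, accent_frame=2))   # declination with a pitch accent
```

Real prosody modules predict such contours (plus durations and pauses) from the linguistic representation, typically with statistical or neural models trained on recorded speech.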
The acoustic module physically generates the final speech signal (the synthesized voice) by rendering the appropriate sequence of phonetic units with the desired prosodic features resulting from the preceding processing steps.
A first approach is to evaluate the components of these modules separately (glass-box evaluation):
- Evaluation of the Text Processing components,
- Evaluation of the Prosody Generation module,
- Evaluation of the Acoustic Synthesis module.
Another (complementary) approach consists of measuring the overall quality of the synthesized speech (black-box evaluation).
TTS evaluation campaigns generally combine both approaches to investigate all objective and subjective aspects of speech synthesis technologies.
The complexity of TTS evaluation comes from the fact that it consists of separate evaluation tasks, each requiring a specific protocol and test collection.
In addition, other specific methods are required to evaluate other TTS-related research tasks (voice conversion, expressive speech synthesis, etc.).
The evaluation of the text processing components relies on automatic metrics (objective measures) that compare system outputs against a reference:
- Normalization of Non-Standard-Words (NSWs): Word Error Rate (percentage of words not correctly disambiguated);
- End-of-Sentence Detection: Sentence Error Rate (percentage of sentences not correctly segmented);
- POS Tagging: POS-tag Error Rate (percentage of incorrect tags);
- Grapheme-to-Phoneme Conversion: Phoneme Error Rate (percentage of erroneous phonemes) and Word Error Rate (percentage of words containing at least one erroneous phoneme).
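All of the error rates above reduce to comparing a system output against a reference. For sequence outputs such as phoneme strings, this is typically done with an edit (Levenshtein) distance. A minimal sketch (the reference/hypothesis data are invented for illustration):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences:
    minimum number of insertions, deletions and substitutions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical example: one substituted phoneme out of four.
ref = ["S", "P", "IY", "CH"]
hyp = ["S", "B", "IY", "CH"]
print(phoneme_error_rate(ref, hyp))  # 0.25
```

The word-level error rates (NSW normalization, G2P word error rate) follow the same pattern with words as the sequence elements.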
Subjective Listening Tests
The global (black-box) evaluation and the evaluation of the other modules (Prosody and Acoustic Synthesis) mainly rely on subjective tests conducted by human judges.
A typical subjective evaluation procedure is as follows:
- Test sentences (input text) are processed by the system.
- Resulting synthesized speech excerpts are collected.
- Subjective judgment tests are performed by human listeners.
Subjects are asked to rate the quality of the synthesized sentences they listen to, according to a series of pre-defined criteria (naturalness, intelligibility, pleasantness, etc.). The TTS systems or modules under scrutiny are compared based on these scores.
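Listener ratings on such scales are commonly aggregated into a Mean Opinion Score (MOS) per system. A minimal sketch of the aggregation, with invented system names and ratings:

```python
def mean_opinion_score(ratings):
    """Average listener ratings (e.g. a 1-5 naturalness scale) per system."""
    return {system: sum(scores) / len(scores)
            for system, scores in ratings.items()}

# Hypothetical ratings from four listeners for two systems.
ratings = {
    "systemA": [4, 5, 4, 3],
    "systemB": [3, 3, 2, 4],
}
print(mean_opinion_score(ratings))  # {'systemA': 4.0, 'systemB': 3.0}
```

In practice such comparisons also report confidence intervals or significance tests, since subjective scores vary considerably across listeners and test sentences.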
MUSSLAP (Multimodal Human Speech and Sign Language Processing for Human-Machine Communication)
HUMAINE (Human-Machine Interaction Network on Emotion)
TC-STAR (Technology and Corpora for Speech to Speech Translation): included a text-to-speech task.
EvaSy (in French "Evaluation des systèmes de Synthèse de parole": Speech Synthesis System Evaluation): Evaluation of speech synthesis in French.
MBROLA: a toolkit to build TTS systems in many different languages.
Festival: University of Edinburgh’s Festival Speech Synthesis System is a free, multilingual speech synthesis workbench.
Festvox tools: Festvox documentation and scripts.
Praat: speech analysis, synthesis, and manipulation package which can perform general numerical and statistical analysis.
- H. Höge, Z. Kacic, B. Kotnik, M. Rojc, N. Moreau, H.-U. Hain, "Evaluation of Modules and Tools for Speech Synthesis - The ECESS Framework", LREC 2008, Marrakech, Morocco, 2008.
- I. Luengo, I. Saratxaga, E. Navas, I. Hernáez, J. Sanchez, I. Sainz, "Evaluation of Pitch Detection Algorithms under Real Conditions", ICASSP 2007, Honolulu, Hawaii, USA, 2007.
- A. Bonafonte, H. Höge, I. Kiss, A. Moreno, U. Ziegenhain, H. van den Heuvel, H.-U. Hain, X. S. Wang, M. N. Garcia, "TC-STAR: Specifications of Language Resources and Evaluation for Speech Synthesis", LREC 2006, Genoa, Italy, 2006.
- D. Mostefa, M.-N. Garcia, O. Hamon, N. Moreau, "Evaluation Report", Technology and Corpora for Speech to Speech Translation (TC-STAR) project, Deliverable D16, June 2006.