Machine Translation



Machine Translation (MT) technologies convert text from a source language (L1) into a target language (L2).

One of the hardest problems in Machine Translation is evaluating a proposed system. Natural language is inherently ambiguous, which makes an objective evaluation difficult. In particular, a given source text rarely has a single correct translation.

Van Slype (1979) distinguished macro evaluation, designed to measure product quality, from micro evaluation, designed to assess the improvability of the system.

Macro evaluation, also called total evaluation, enables comparison of the performance of two translation systems or of two versions of the same system. Micro evaluation, also known as detailed evaluation, seeks to assess the improvability of the translation system.



The performance of a translation system is usually measured by the quality of its translated texts. Since no single translation of a given text is definitively correct, the challenge of machine translation evaluation is to provide an objective and economical assessment.

Given the difficulty of the task, most translation quality assessments in the history of MT evaluation have been based on human judgement. Automatic procedures, however, allow a quicker, cheaper, repeatable and more objective evaluation.

Automatic MT evaluation consists of comparing the MT system output with one or more human reference translations. In manual evaluation, by contrast, human judges assign scores according to the adequacy, fluency or informativeness of the translated text.

In automatic evaluation, the fluency and adequacy of MT output can be estimated by n-gram analysis, that is, by counting the word sequences that the output shares with the reference translations.
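The n-gram comparison can be illustrated with a small Python sketch. It computes a clipped n-gram precision, the kind of quantity that n-gram co-occurrence metrics such as BLEU build on. It is not a full BLEU implementation (there is no brevity penalty and no averaging over n-gram orders), and the function names and example sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in a reference.

    Each n-gram count is clipped to the highest count observed in any
    single reference, so repeating a correct word is not rewarded.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # Highest count of each n-gram over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical MT output and two human reference translations.
hyp = "the cat is on the mat".split()
refs = ["the cat sits on the mat".split(),
        "there is a cat on the mat".split()]

unigram_p = clipped_ngram_precision(hyp, refs, 1)  # every word is covered
bigram_p = clipped_ngram_precision(hyp, refs, 2)   # 3 of 5 bigrams match
print(unigram_p, bigram_p)
```

Unigram matches roughly reflect adequacy (the right words are present), while longer n-gram matches roughly reflect fluency (the words appear in a plausible order).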



Some of the most common automatic evaluation metrics are:

| Metric | Description | Reference |
|---|---|---|
| BLEU | BiLingual Evaluation Understudy (IBM), an n-gram co-occurrence scoring procedure | (Papineni et al., 2001) |
| NIST | A variation of BLEU used in the NIST HLT evaluation | (Doddington, 2002) |
| EvalTrans | Tool for the automatic and manual evaluation of translations | (Niessen et al., 2000) |
| GTM | General Text Matcher, based on accuracy measures such as precision, recall and F-measure | (Turian et al., 2003) |
| mWER | Multiple-reference Word Error Rate, the word-level edit distance between the MT system output and the most similar of several human reference translations | (Niessen et al., 2000) |
| mPER | Multiple-reference Position-independent word Error Rate | (Tillmann et al., 1997) |
| METEOR | Metric for Evaluation of Translation with Explicit ORdering, based on the harmonic mean of unigram precision and recall | (Banerjee & Lavie, 2005) |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation, based on an n-gram co-occurrence measure | (Lin, 2004) |
| TER | Translation Error Rate | (Snover et al., 2006) |
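As a rough illustration of the error-rate metrics listed above, the following Python sketch computes a word error rate, a position-independent error rate and their multiple-reference variants. It assumes whitespace tokenization, non-empty references, and one common formulation of PER (bag-of-words overlap); the published metrics differ in normalization details, and all function names and example sentences here are hypothetical.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)


def per(hyp, ref):
    """Position-independent error rate: compares bags of words, ignoring order.

    Assumed formulation: (max(|hyp|, |ref|) - matching words) / |ref|.
    """
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)


def multi_reference(metric, hyp, refs):
    """Multiple-reference variant: score against the most similar reference."""
    return min(metric(hyp, ref) for ref in refs)


# Hypothetical MT output and two human reference translations.
hyp = "the cat sat on mat".split()
refs = ["the cat sat on the mat".split(),
        "a cat was sitting on the mat".split()]

mwer = multi_reference(wer, hyp, refs)  # one missing word vs. the first reference
mper = multi_reference(per, hyp, refs)

# Reordering words changes WER but not PER.
reordered = "on the mat the cat sat".split()
print(mwer, mper, wer(reordered, refs[0]), per(reordered, refs[0]))
```

Taking the minimum over references means a hypothesis is not penalized for differing from one valid translation as long as it is close to another.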

For human evaluation, fluency and adequacy are two commonly used notions of translation quality (LDC2002, White et al. 1994). Fluency refers to the degree to which the system output is well-formed according to the target language’s grammar. Adequacy refers to the degree to which the output communicates the information present in the reference translation. More recently, other measures have been tested, such as the comprehensibility of an MT-translated segment (NIST MT09) or the preference between the MT translations of different systems (NIST MT08).



- WMT (2006-2011): 2006, 2007, 2008, 2009, 2010, 2011.

- EuroMatrixPlus (2009-2012)

- The annual IWSLT (2004-2009): 2004, 2005, 2006, 2007, 2008, 2009, 2010.


- EuroMatrix (2006-2009)

- FEMTI, the Framework for the Evaluation of Machine Translation in ISLE (2001-2009).

- GALE evaluation (2006-2008): 2007, 2008, 2009.

- The TC-STAR evaluation campaigns, Technology and Corpora for Speech to Speech Translation, 6th FP project (2004-2007): 2005, 2006, 2007.

- Swiss National Fund: Quality models and resources for the evaluation of MT (2004 - 2008).

- The CESTA evaluation campaigns (in French), Evalda project, French Technolangue program (2002-2006).

- The annual NIST Open Machine Translation Evaluation (2001-2009): 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009.

- The C-STAR evaluation campaigns (2001, 2002, 2003).

- EAGLES, Evaluation of Natural Language Processing Systems (1993-1995).

- 863 Evaluation, HTRDP Evaluation of Chinese Language Processing and Intelligent Human Machine Interface (1986).



- ACL 2010, joint Fifth Workshop on Statistical Machine Translation and Metrics MATR

- Machine Translation Summit XII.

- AMTA 2009

- EACL 2009, Fourth Workshop on "Statistical Machine Translation".

- IWSLT 2009.

- ACL-IJCNLP 2009.

- AMTA 2008, Workshop on "Metrics MATR: NIST Metrics for Machine Translation Challenge".

- LREC 2008, Tutorial on "Evaluating Machine Translation in Use: From theory to practice".

- ACL 2008, Third Workshop on "Statistical Machine Translation".

- IWSLT 2008.

- MT Summit XI (2007), Workshop on "Automatic Procedures in MT Evaluation".

- MT Summit XI (2007), Tutorial on "Context-based evaluation of MT systems: Principles and Tools".

- ACL 2007, Second Workshop on "Statistical Machine Translation".

- IWSLT 2007.

- AMTA 2006, Workshop on "MT Evaluation: the Black Box in the Hall of Mirrors".

- HLT-NAACL 2006, Workshop on "Statistical Machine Translation".

- IWSLT 2006.

- HLT Evaluation Workshop in Malta (2005).

- IWSLT 2005.

- IWSLT 2004.

- MT Summit IX (2003), Workshop on "Towards Systematizing MT Evaluation".

- LREC 2002, Workshop on "Machine Translation Evaluation: Human Evaluators meet Automatic Metrics".

- MT Summit VIII (2001), Workshop on "Who did What to Whom".

- LREC 2000, Workshop on "The evaluation of Machine Translation".

- AMTA 2000, MT Evaluation Workshop on "Hands-on Evaluation".

- MT Summit VI (1997), Tutorial on "MT Evaluation: Old, New and Recycled".

- AMTA 1998, Tutorial on "MT Evaluation: Old, New and Recycled".

- Machine Translation Vol. 8, nos. 1-2 (1993), Special Issue On Evaluation Of MT Systems.

- AMTA 1992, NSF-sponsored Workshop on "MT Evaluation: Basis for Future Directions".



Open-source Machine Translation Systems

- Apertium open-source machine translation platform

- GenPar Toolkit for Research on Generalized Parsing

- JosHUa open-source decoder for parsing-based machine translation

- Matxin open-source transfer machine translation engine

- Moses open-source statistical machine translation system

Automatic Metrics


- EvalTrans







For further information on research, campaigns, conferences, software and data regarding statistical machine translation and its evaluation, please refer to the European Association for Machine Translation.

The Machine Translation Archive also offers a repository and bibliography on machine translation.


  • Lin C.-Y., Cao G., Gao J. and Nie J.-Y. (2006). An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 463–470, New York, New York.
  • Snover M., Dorr B., Schwartz R., Micciulla L. and Makhoul J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas (AMTA-2006), Cambridge, Massachusetts.
  • Banerjee S. and Lavie A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Turian J. P., Shen L. and Melamed I. D. (2003). Evaluation of Machine Translation and Its Evaluation. In Proceedings of MT Summit 2003, pages 386–393, New Orleans, Louisiana.
  • Doddington G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of ARPA Workshop on Human Language Technology.
  • Papineni K., Roukos S., Ward T. and Zhu W.-J. (2001). BLEU: a method for automatic evaluation of machine translation. Technical report, IBM Research Division, Thomas J. Watson Research Center.
  • Niessen S., Och F. J., Leusch G. and Ney H. (2000). An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.
  • Tillmann C., Vogel S., Ney H., Zubiaga A. and Sawaf H. (1997). Accelerated DP based search for statistical translation. In Fifth European Conf. on Speech Communication and Technology, pages 2667–2670, Rhodos, Greece, September.
  • White J. S., O’Connell T. A. and O’Mara F. (1994). The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, USA.
  • Van Slype G. (1979). Critical study of methods for evaluating the quality of machine translation. Technical report, Final report BR 19142, Brussels: Bureau Marcel van Dijk.