Multilingual Text Alignment

Description

Multilingual text alignment consists of identifying correspondences between text units (words, sentences, paragraphs, etc.) in parallel texts.

Approach

The main approach to alignment evaluation is to compare a system-computed alignment with a manually produced reference alignment, usually called a gold standard. Different tasks have been defined in previous evaluation exercises such as Blinker, ARCADE, and the HLT-NAACL 2003 and ACL 2005 workshops.

Measures

Alignment evaluations have generally used traditional information retrieval (IR) measures:
- Precision
- Recall
- F-measure
- Alignment Error Rate (AER; Och and Ney, 2000), derived from the F-measure
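
The measures listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the scoring code of any of the evaluation campaigns. It assumes an alignment is a set of (source index, target index) link pairs, and that the gold standard separates "sure" links from "possible" links, as in the usual formulation of AER. The function name `alignment_scores` and its parameters are hypothetical.

```python
def alignment_scores(predicted, sure, possible=None):
    """Score a system alignment against a gold standard.

    Each alignment is an iterable of (source_index, target_index) links.
    `possible` is assumed to include every sure link; if it is omitted,
    the sure links are used for both roles.
    """
    predicted = set(predicted)
    sure = set(sure)
    # Enforce the assumption that sure links are also possible links.
    possible = (set(possible) | sure) if possible is not None else set(sure)

    if not predicted or not sure:
        raise ValueError("predicted and sure alignments must be non-empty")

    # Precision is measured against possible links, recall against sure links.
    precision = len(predicted & possible) / len(predicted)
    recall = len(predicted & sure) / len(sure)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    # AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    aer = 1 - (len(predicted & sure) + len(predicted & possible)) / (
        len(predicted) + len(sure)
    )
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "aer": aer}


# Toy word-alignment example (made-up links, for illustration only).
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = gold_sure | {(2, 3)}
system = {(0, 0), (1, 1), (2, 3), (3, 2)}

scores = alignment_scores(system, gold_sure, gold_possible)
print(scores)
```

When the gold standard contains only sure links, AER under this formulation reduces to 1 minus the F-measure, which is the sense in which it is derived from the F-measure.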

Projects

Past

- ARCADE I (1996-1999) and ARCADE II (2003-2006), multilingual text alignment evaluation campaigns.
- Blinker (1998-2001).

Language resources (LRs)

- ARCADE II Evaluation package
- Data from the HLT-NAACL 2003 workshop on parallel texts (English, Romanian, French)
- The Bible, parallel biblical texts available in several languages, including Chinese, Danish, English, French, Greek and Swahili.
- The MULTEXT corpora (English, French, German, Italian and Spanish) and MULTEXT-East corpora (English, Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovenian).
- The ARCADE/ROMANSEVAL multilingual corpora (English, French, German, Italian, Spanish, Arabic, Chinese, Japanese, Greek, Persian, Russian)
- Data from the ACL 2005 workshop on Building and Using Parallel Texts (English, Inuktitut, Romanian and Hindi).

References

  • Och F. J. and Ney H. (2000) A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-ACL 2000), pp. 1086-1090, Saarbrücken, Germany.