Information Extraction

Description

Information Extraction (IE) is a technology that extracts pieces of information salient to the user’s needs. The kinds of information that systems extract vary in detail and reliability: named entities, attributes, facts and events.

Approach

Due to the complexity of the IE task and the limited performance of current tools, there are few comparative evaluations in IE.

The Message Understanding Conference (MUC) series can be considered the starting point of IE evaluation: most of the evaluation methodology used in the field was defined there.
The performance of a system is measured by scoring filled templates with the classical information retrieval (IR) evaluation metrics: precision, recall and the F-measure. Another evaluation metric, based on the classification error rate, is also used for IE evaluation. Annotated data are required for both training and testing.
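
For reference, a standard way to compute these metrics, writing C for the number of correctly filled slots, A for the number of slots the system attempted and K for the number of fills in the answer key (these counts are spelled out under Measures below):

\[
P = \frac{C}{A}, \qquad R = \frac{C}{K}, \qquad
F_{\beta} = \frac{(\beta^{2} + 1)\, P\, R}{\beta^{2} P + R}
\]

MUC scoring conventionally reports the balanced case beta = 1, i.e. the harmonic mean F1 = 2PR / (P + R).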

Measures

Given a system response and a human-generated answer key, the system’s precision is defined as the number of slots it filled correctly, divided by the number of slots it attempted to fill. Recall is defined as the number of slots it filled correctly, divided by the number of possible correct fills taken from the human-generated key. One general issue is how to define filler boundaries, which comes down to the question of how an extracted fragment should be assessed. Freitag (1998) proposes three criteria for matching reference occurrences against extracted ones (a small sketch follows the list):

- The system output exactly matches a reference
- The system output strictly contains a reference plus at most k neighbouring tokens
- The system output overlaps a reference
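
As an illustration, a minimal sketch of these three criteria over half-open token spans (start, end); the function names, the span representation and the reading of "at most k neighbouring tokens" as a total over both sides are assumptions, not part of Freitag's scorer:

    def exact_match(sys_span, ref_span):
        # Criterion 1: the system output matches the reference exactly.
        return sys_span == ref_span

    def strictly_contains(sys_span, ref_span, k=2):
        # Criterion 2: the system output strictly contains the reference
        # and adds at most k neighbouring tokens (counted over both sides;
        # a per-side reading of k is also possible).
        (ss, se), (rs, re) = sys_span, ref_span
        extra = (rs - ss) + (se - re)
        return ss <= rs and re <= se and 0 < extra <= k

    def overlaps(sys_span, ref_span):
        # Criterion 3: the system output overlaps the reference at all.
        (ss, se), (rs, re) = sys_span, ref_span
        return max(ss, rs) < min(se, re)

For example, strictly_contains((3, 8), (4, 7)) holds for k = 2, since the system output adds one extra token on each side of the reference.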

In the Automatic Content Extraction (ACE) and MUC evaluation conferences, the criteria used for assessing each system output item are: correct, partial, incorrect, spurious, missing and non-committal.
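
A rough sketch of how per-item judgements in this scheme can be turned into slot-level scores, assuming the common MUC convention that a partial item earns half credit and that non-committal items are excluded from scoring (the function name and label strings are illustrative, not the official scorer):

    from collections import Counter

    def slot_scores(judgements, partial_credit=0.5):
        # judgements: iterable of 'correct', 'partial', 'incorrect',
        # 'spurious', 'missing'; 'non-committal' items are dropped.
        counts = Counter(j for j in judgements if j != 'non-committal')
        scored = counts['correct'] + partial_credit * counts['partial']
        # Slots the system attempted vs. slots present in the answer key.
        actual = counts['correct'] + counts['partial'] + counts['incorrect'] + counts['spurious']
        possible = counts['correct'] + counts['partial'] + counts['incorrect'] + counts['missing']
        precision = scored / actual if actual else 0.0
        recall = scored / possible if possible else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

For example, slot_scores(['correct', 'correct', 'partial', 'spurious', 'missing']) gives precision and recall of 2.5/4 each.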

Projects

- EVALITA 2009: two tasks related to Named Entity Recognition and to Local Entity Detection and Recognition
- BOEMIE (2006-2008)
- ACE (1999-2008)
- MUC (1987-1998)

Tools

- BALIE
- JULIE Labs NLP Toolsuite

LRs

- Domain-independent annotated corpora:

  • MUC corpora (newswire articles, also available in Spanish, Chinese and Japanese for the multilingual entity task)
  • ACE corpora (broadcast news, newswire, translated documents from the Chinese and Arabic Treebanks)

- Domain-specific annotated corpora:

  • job postings from the WWW (Califf and Mooney 1998)
  • seminar announcements (Freitag 1998)

References

- RISE

Bibliography

  • Maynard D., Peters W., and Li Y. (2006). Metrics for evaluation of ontology-based information extraction. In WWW 2006 Workshop on "Evaluation of Ontologies for the Web" (EON), Edinburgh, Scotland.
  • Freitag D. (1998). Information Extraction from HTML: Application of a General Machine Learning Approach. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
  • Califf M. E. and Mooney R. J. (1998). Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, Stanford, CA, March.