Information Retrieval



Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. The representation and organization of the information items should provide the user with easy access to the information in which they are interested.

IR systems allow a user to retrieve, from a data collection, the relevant documents which (partially) match their information need, expressed as a query. The system yields a list of documents, ranked according to their estimated relevance to the user's query. It is the user's task to look for the information within the relevant documents themselves once they are retrieved. An IR system is generally optimized to perform in a specific domain: newswire, medical reports, patents, law texts, etc.

In recent years, the impressive growth of available multimedia data (audio, video, photos…) has required the development of new Multimodal IR strategies in order to deal with:
- annotated image collections (images with captions, etc.).
- multimedia documents combining text and pictures.
- speech transcriptions (e.g. transcribed TV programs), etc.

Multimedia and audio-visual data are processed by combining information extracted from the different modalities: text, audio transcriptions, images, video key-frames, etc.

Moreover, in a globalized world, IR systems increasingly have to cope with multilingual information sources. In a multilingual context, we talk of Cross-Language Information Retrieval (CLIR) [1]: the language of the query (source language) is not necessarily the same as the language(s) used in the documents (target language(s)).

Question Answering (QA) is another, more focused approach to IR. In a QA system, information needs are expressed as natural language statements or questions. In contrast to classical IR, where complete documents are considered relevant to the information need, QA systems return concise answers. Often, automated reasoning is needed to identify potentially correct answers. The explosive demand for better information access from a large public of users fosters R&D on QA systems. The interest of QA is to provide inexperienced users with flexible access to information, allowing them to write a natural question and directly obtain a concise answer.

Other types of applications are often considered to be part of the IR domain: Information Extraction, Document Filtering, etc.



Evaluation Methodology

Most IR evaluation campaigns carried out until now rely on a comparative approach. Unlike objective evaluation (How well does a method work?), comparative evaluation focuses on the comparison of the results obtained with different systems (Which method works best?). To be compared, IR systems must be tested under similar conditions.

An IR comparative evaluation usually relies on a test collection consisting of:

  • a set of documents to be searched,
  • a set of test queries,
  • if available: the set of relevant documents for each query.

Once a test collection has been created, the general evaluation methodology proceeds in three main steps:

  1. Evaluation run: each IR system to be evaluated searches the test collection using the pre-defined test queries. This yields a ranked list of documents for each test query.
  2. Relevance judgments: human evaluators examine each retrieved document and decide whether it is relevant, i.e. whether it satisfies the information need expressed by the query. If the set of relevant documents for each query is known a priori, this step can be done automatically by comparing the set of retrieved documents with the reference set of relevant documents.
  3. Scoring: performance measures are computed based on the relevance judgments.

As long as they are tested on the same test collection (same set of documents and queries), the performance of different systems can be compared based on their final performance measures.
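As a minimal sketch, the three steps above can be reproduced in a few lines of Python when the reference set of relevant documents is known a priori (all queries, document identifiers, and judgments below are invented):

```python
# Step 1 -- evaluation run: each system returns a ranked list per test query.
run = {
    "q1": ["d3", "d7", "d1", "d9"],  # ranked documents for query q1
    "q2": ["d2", "d5", "d8"],
}

# Reference relevance judgments: the relevant documents for each query.
qrels = {
    "q1": {"d1", "d3"},
    "q2": {"d4"},
}

# Step 2 -- relevance judgments, done automatically against the reference set.
judgments = {
    qid: [doc in qrels[qid] for doc in ranked]
    for qid, ranked in run.items()
}

# Step 3 -- scoring: here, the precision of each ranked list.
scores = {qid: sum(judged) / len(judged) for qid, judged in judgments.items()}
```

With human assessors instead of a reference set, only step 2 changes: the boolean judgments are collected manually rather than computed.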

The human relevance judgment step represents the most time- and resource-consuming part of an IR evaluation procedure:

  • It requires the hiring of a team of objective experts who have to behave as if they were real users and judge the relevance of each retrieved document with regard to the queries.
  • A human evaluation framework (computer interface, evaluation guidelines, training sessions) must be carefully designed to ensure that all evaluators work under the same conditions.



Performance Measures

As early as 1966, Cleverdon [2] listed six measurable features that reflect users' ability to use an IR system:

  1. Coverage of information;
  2. Form of output presentation;
  3. Time efficiency;
  4. Effort required for the user;
  5. Precision;
  6. Recall.

In general, the objective evaluation of IR performance relies on the last two of these features, the effectiveness measures Precision and Recall, which are based on the number of relevant documents retrieved.

Considering the ranked list of retrieved documents for a given query, these two values are computed by considering only the first N retrieved documents (let's call it the N-list):

  • Precision is defined as the number of relevant documents retrieved in the N-list divided by N (i.e. the proportion of relevant items in the N-list of retrieved documents).
  • Recall is defined as the number of relevant documents retrieved in the N-list divided by the total number of existing relevant documents in the collection (i.e. the proportion of retrieved items in the set of all relevant documents).

Precision and Recall measures are computed for different values of N resulting in a Precision/Recall curve that reflects the IR effectiveness.
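These definitions can be sketched directly in Python (the document identifiers below are invented):

```python
def precision_recall_at(ranked, relevant, n):
    """(Precision, Recall) computed over the first n retrieved documents (the N-list)."""
    n_list = ranked[:n]
    hits = sum(1 for doc in n_list if doc in relevant)
    return hits / n, hits / len(relevant)

ranked = ["d3", "d7", "d1", "d9", "d2"]  # system output, best match first
relevant = {"d1", "d3", "d4"}            # all relevant documents in the collection

# Varying N from 1 to the list length traces the Precision/Recall curve.
curve = [precision_recall_at(ranked, relevant, n) for n in range(1, len(ranked) + 1)]
```

Note the usual trade-off: as N grows, Recall can only increase while Precision tends to drop.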

Usually, a single-value metric is derived from the Precision/Recall plots and used as a final indicator of retrieval effectiveness. Usual metrics are:

  • Mean Average Precision (MAP);
  • Expected search length (rank of the first relevant document);
  • E-measure;
  • F-measure.

These metrics can be computed for a single query, but they are generally averaged over the whole set of test queries.

Detailed descriptions of classical IR evaluation measures can be found in [3].

Other specific IR tasks may require other performance measures. For example, the performance of QA systems is measured as the percentage of correct answers obtained on a set of test questions.



Evaluation Campaigns and Projects

- TREC (Text Retrieval Conference): TREC-1, TREC-2, TREC-3, TREC-4, TREC-5, TREC-6, TREC-7, TREC-8, TREC-9, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
- CLEF (Cross-Language Evaluation Forum), cross-language IR and QA, European languages: 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
- NTCIR (NII Test Collection for IR Systems), cross-language IR and QA, Asian languages: NTCIR-1, NTCIR-2, NTCIR-3, NTCIR-4, NTCIR-5, NTCIR-6, NTCIR-7, NTCIR-8, NTCIR-9
- Quaero Collaborative R&D program on Multimedia Information Retrieval Evaluation (2008-2013).
- TrebleCLEF 2-year EC project supporting CLEF activities (2008-2010).
- FIRE (Forum for Information Retrieval Evaluation): IR initiative for Indian languages
- CLIA (Cross Language Information Access): Consortium project focusing on IR, Summarization, and machine translation for Indian languages
- INEX (Initiative for the Evaluation of XML Retrieval).
- QALL-ME (Question Answering Learning technologies in a multiLingual and Multimodal Environment)
- AQUAINT (Advanced Question Answering for Intelligence)
- TRECVID Automatic Segmentation, Indexing, and Content-based Retrieval of Digital Video, originated by TREC.
- CHORUS FP6 Coordination Action on Multimedia Content Search Engines.

- ImagEVAL 2006 Evaluation of Content-Based Image Retrieval (CBIR)
- TIPSTER Document Detection, Information Extraction and Summarization
- much.more Cross-lingual information access for the medical domain
- AMARYLLIS IR Evaluation for French
- EQueR (Evaluation campaign for Question-Answering systems): evaluation of QA in French
- TIDES (Translingual Information Detection Extraction and Summarization).

TIDES included several evaluation projects:

  • Information Retrieval: HARD (High Accuracy Retrieval from Documents).
  • Information Detection: TDT (Topic Detection and Tracking).
  • Information Extraction: ACE (Automatic Content Extraction).
  • Summarization: DUC (Document Understanding Conference).




Conferences

- CLEF 2010 (CLEF Conference).
- SIGIR 2010 (Conference of the ACM’s Special Interest Group on Information Retrieval).
- ECDL 2010 (European Conference on Research and Advanced Technology for Digital Libraries)
- NTCIR-8 (the 8th NTCIR Workshop)
- ECIR 2010 (European Conference on Information Retrieval)
- FIRE 2010 (Conference of the Forum for Information Retrieval Evaluation)
- MIR 2010 (ACM SIGMM International Conference on Multimedia Information Retrieval)
- CBMI’2010 (8th International Workshop on Content-Based Multimedia Indexing)
- CIVR 2010 (ACM International Conference on Image and Video Retrieval)


- AIRS 2009 (Asia Information Retrieval Symposium)
- CLEF 2009 (10th CLEF evaluation Campaign).
- ECDL 2009 (European Conference on Research and Advanced Technology for Digital Libraries)
- SIGIR’09 Conference (ACM’s Special Interest Group on Information Retrieval).
- CIVR 2009 (ACM International Conference on Image and Video Retrieval).
- JCDL 2009 (Joint Conference on Digital Libraries)
- ECIR 2009 (European Conference on Information Retrieval)
- IRF Symposium
- NTCIR-7 (the 7th NTCIR Workshop)
- FIRE 2008 (Conference of the Forum for Information Retrieval Evaluation)
- TRECVID 2008 (TREC Video Retrieval Evaluation Workshop).
- MIR 2008 (ACM International Conference on Multimedia Information Retrieval).
- RIAO 2007 (Conference on Large-Scale Semantic Access to Content: Text, Image, Video and Sound)



Tools

- trec_eval is the most commonly used scoring tool for IR evaluations.
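trec_eval compares a run file against a relevance-judgment (qrels) file, both in the standard TREC formats sketched below; a typical invocation is `trec_eval qrels_file run_file`. The document identifiers, scores, and run tag here are invented examples:

```
qrels file (one judgment per line):      query_id  iteration  doc_id  relevance
  q1  0  d3  1
  q1  0  d7  0

run file (one retrieved doc per line):   query_id  Q0  doc_id  rank  score  run_tag
  q1  Q0  d3  1  12.7  myrun
  q1  Q0  d7  2  10.2  myrun
```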

- QASTLE is a tool created in Perl by ELDA to perform human evaluation of Question-Answering systems.



Test Collections

- CLEF Evaluation Packages: CLEF test suites are distributed through the ELRA Catalogue of Language Resources.

- TREC Collections: TREC mostly deals with information retrieval in English.

- NTCIR Test Collections: News corpora in English (Taiwan News, China Times English News, Hong Kong Standard, etc.) and other evaluation corpora: collections of patent application documents, web crawls…

- Amaryllis Test Collections: news articles in French; plus titles and summaries of scientific articles. Distributed through the ELRA Catalogue of Language Resources.

- EQueR Test Collections: news articles in French and domain specific medical corpus (scientific articles and guidelines for good medical practice). Distributed through the ELRA Catalogue of Language Resources.

Domain Specific Corpora

- OHSUMED collection in medicine.

- Cranfield collection in aeronautics.

- CACM collection (Communications of the ACM) in computer science.

- ISI collection (Institute of Scientific Information) in library science, also referred to as CISI.

- Chinese Web test collection, composed of documents, queries and relevance judgments.


References

  [1] Moreau, N. et al., "Best Practices in Language Resources for Multilingual Information Access", Public report of the TrebleCLEF project (Deliverable 5.2), March 2009.
  [2] Cleverdon, C., Keen, M., "Factors Affecting the Performance of Indexing Systems", Vol. 2, ASLIB, Cranfield Research Project, Bedford, UK, 1966, 37-59.
  [3] Baeza-Yates, R., Ribeiro-Neto, B., "Modern Information Retrieval", Addison-Wesley, 1999.