Task Description


The aim of this task is to provide an evaluation framework for the objective evaluation and comparison of Word Sense Disambiguation and Induction algorithms in an end-user application, namely Web Search Result Clustering.

Given an ambiguous query, the top ranking snippets returned by a search engine will be provided as bags of words. Systems will be asked to associate a sense id with each bag of words, one for each snippet. Snippets will be clustered based on the output sense associations (either cluster ids for WSI systems or dictionary sense ids for WSD systems) and an evaluation will be performed against a gold-standard clustering obtained on the basis of the manual associations of snippets with Wikipedia senses.



Word Sense Disambiguation (WSD) is the task of automatically associating meaning with words. In WSD the possible meanings for a given word are drawn from an existing sense inventory (e.g., WordNet). In contrast, Word Sense Induction (WSI) aims at automatically identifying the meanings of a given word from raw text (see (Navigli, 2009) for a survey of both paradigms). While WSD can easily be evaluated in vitro by means of the popular measures of precision, recall and F1, the performance of WSI systems can hardly be evaluated in an objective way. In fact, all the measures proposed in the literature for in vitro evaluation tend to favour specific cluster shapes (e.g., singletons or all-in-one clusters) among the sense groups produced as output. Indeed, WSI evaluation is an instance of the more general and difficult problem of evaluating clustering algorithms.

To deal with the above issue and to enable a fair comparison of word sense disambiguation and induction algorithms, we propose to evaluate such systems in a single framework, driven by an end-user application. The proposed application is Web Search Result Clustering, a task consisting of grouping into clusters the results returned by a search engine for an input query. Results in a given cluster are assumed to be semantically related to each other, and each cluster is expected to represent a specific meaning of the input query (even though multiple clusters may represent the same meaning). For instance, given the query beagle and the following 3 snippets:

  1. Beagle is a search tool that ransacks your...
  2. ...the beagle disappearing in search of game...
  3. Beagle indexes your files and searches...

The task consists of producing a clustering that groups together snippets conveying the same meaning of the input query beagle, i.e., ideally {1, 3} and {2}.
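Either kind of system output induces such a clustering: each snippet receives a sense label, and snippets sharing a label form a cluster. A minimal sketch of this mapping (the function name and the sense labels are our own, for illustration only):

```python
from collections import defaultdict

def clusters_from_senses(sense_ids):
    """Group snippet indices by the sense id assigned to each snippet.

    sense_ids: list where sense_ids[i] is the label given to snippet i+1
    (a cluster id for a WSI system, a dictionary sense id for a WSD system).
    """
    groups = defaultdict(list)
    for snippet, sense in enumerate(sense_ids, start=1):
        groups[sense].append(snippet)
    return [set(members) for members in groups.values()]

# The three beagle snippets above: 1 and 3 convey the software sense,
# 2 the dog sense (labels here are purely illustrative).
print(clusters_from_senses(["software", "dog", "software"]))
# → [{1, 3}, {2}]
```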

A WSI system will be asked to identify the meanings of the input query and cluster the snippets into semantically related groups according to those meanings. A WSD system, instead, will be requested to sense-tag the above snippets with the appropriate senses of the input query, which again implicitly determines a clustering of the snippets (i.e., one cluster per sense).

WSD and WSI systems will then be evaluated in an end-user application, i.e., according to their ability to diversify the search results for the input query. This evaluation scheme, previously proposed for WSI by Navigli and Crisafulli (2010) and Di Marco and Navigli (2013), is extended here to WSD and WSI systems and is aimed at overcoming the limitations of in vitro evaluations. In fact, the quality of the output clusters will be assessed in terms of their ability to diversify the snippets across the query meanings.


Details on the dataset

No training data will be provided. The test data will be created by:

  • manually choosing ambiguous queries of different lengths;
  • querying Google;
  • retrieving the top 64 results for each query;
  • associating each resulting snippet with the most appropriate Wikipedia sense (i.e., page) for that query. The annotations will be obtained by crowdsourcing, with further checks by the authors.



Evaluation methodology

We will obtain a gold-standard clustering from the manual association of each snippet with the most appropriate Wikipedia sense. Nothing would prevent us from using WordNet instead, but given the nature of the end-user application (Information Retrieval) we need an up-to-date sense inventory covering both concepts and named entities.

For evaluation we will use the following measures (see (Di Marco and Navigli, 2013) for details):

  • To assess clustering quality, we will use the classical measures of (Adjusted) Rand Index, Jaccard Index and F1. However, these evaluation measures suffer from the above-mentioned problems.
  • To assess the ability of systems to diversify search results, we will use Subtopic Recall@K and Subtopic Precision@r. These measures determine the ability of search engines to diversify the top-ranking K snippets. Given that WSD/WSI systems produce clusters of snippets, we will flatten the output clustering into a list as follows: we add to the initially empty list the first element of each cluster; we then iterate the process by selecting the second element of each cluster (if any), and so on. The remaining snippets not included in any cluster (but returned by the search engine) are appended to the bottom of the list in their original order. For the flattening procedure to work, WSD/WSI systems must provide the snippets in each cluster already sorted by the confidence with which each snippet belongs to the cluster, and must rank the clusters according to their diversity.
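The round-robin flattening procedure described in the second bullet can be sketched as follows (the function name is our own; clusters are assumed pre-sorted by confidence and pre-ranked by diversity, as required above):

```python
def flatten_clusters(clusters, all_snippets):
    """Flatten a ranked clustering into a single list, round-robin style.

    clusters: list of clusters, each a list of snippets sorted by
    membership confidence; the clusters themselves are ranked by diversity.
    all_snippets: the full search-engine result list, in original order.
    """
    flat = []
    depth = 0
    # Take the first element of each cluster, then the second, and so on.
    while any(depth < len(c) for c in clusters):
        for cluster in clusters:
            if depth < len(cluster):
                flat.append(cluster[depth])
        depth += 1
    # Snippets the system left unclustered go to the bottom, original order.
    clustered = set(flat)
    flat.extend(s for s in all_snippets if s not in clustered)
    return flat

# Beagle example: clusters {1, 3} and {2}, snippet 4 left unclustered.
print(flatten_clusters([[1, 3], [2]], [1, 2, 3, 4]))
# → [1, 2, 3, 4]
```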

The reason why Subtopic Recall@K and Precision@r should provide an objective evaluation of WSD and WSI systems is that they measure what the user expects from the search engine, i.e., its ability to retrieve snippets diversified across the query meanings. Thus all-in-one clusters will not win unless there is a single meaning for the query, and singleton clusters will not win unless they are ranked in a way that diversifies their content.
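Under its usual definition, Subtopic Recall@K is the fraction of the query's gold subtopics (meanings) covered by the top K results of the flattened list; a sketch under that assumption (function name ours):

```python
def subtopic_recall_at_k(ranked_subtopics, all_subtopics, k):
    """Fraction of gold subtopics covered by the top-k results.

    ranked_subtopics: the gold subtopic (query meaning) of each snippet
    in the flattened output list, best-ranked first.
    all_subtopics: the full set of gold subtopics for the query.
    """
    covered = set(ranked_subtopics[:k]) & set(all_subtopics)
    return len(covered) / len(all_subtopics)

# Two meanings of "beagle": a diversified list covers both in the top 2,
# so an all-in-one or poorly ranked clustering scores lower at small K.
print(subtopic_recall_at_k(["software", "dog", "software"],
                           {"software", "dog"}, 2))
# → 1.0
```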



Trial data and the evaluator for the task can be found on the Data page.

Contact Info


Roberto Navigli
 Sapienza University of Rome, Italy
Daniele Vannella
 Sapienza University of Rome, Italy

Other Info


  • February 22, 2013: Registration Deadline
  • March 5, 2013: Start of evaluation period
  • The files must be uploaded by 11:59pm PST (at most 3 system runs per team)
  • March 15, 2013: End of evaluation period
  • April 15, 2013: Paper submission deadline
  • April 22, 2013: Reviews Due
  • April 29, 2013: Camera-ready due
  • June 14-15, 2013: SemEval Workshop (and June 13-14: *SEM Conference), Atlanta, Georgia