Task Description



Numerous past tasks have focused on leveraging the meaning of word types (noun categorization, TOEFL test), or of words in context (WSD, metonymy resolution, lexical substitution). These tasks have enjoyed a lot of success. A natural progression is the pursuit of models that can perform similar tasks in the face of multiword expressions and complex compositional structure.


We present two subtasks which are designed to evaluate such phrasal models. Participating systems may attempt any or all of the subtasks, in any or all of the languages provided in the datasets. However, we expect that systems which perform well on the more basic task will provide a good starting point for the harder one. Language-independent models are of special interest. The subtasks (introduced in detail further below) are as follows:

a) Semantic similarity of words and compositional phrases

b) Evaluating the compositionality of phrases in context


Expected Scientific Benefit

The aim of these tasks is two-fold. Firstly, given the recent widespread interest in phrasal semantics in its various guises, they provide an opportunity to draw together approaches to numerous related problems under a common set of evaluations. It is intended that after the competition, the evaluation setting and the datasets will constitute an ongoing benchmark for the evaluation of these phrasal models. Secondly, we anticipate that these tasks, by bridging the gap between established lexical semantics and full-blown linguistic inference, will stimulate increased interest in the general issue of phrasal semantics (as opposed to, say, lexical compounds or compositional semantics). This could provoke novel approaches to certain established tasks such as lexical entailment and paraphrase identification, and ultimately lead to improvements in a wide range of applications in Natural Language Processing, e.g. document retrieval, clustering and classification, question answering, query expansion, synonym extraction, relation extraction, automatic translation, or textual advertisement matching in search engines, all of which depend on phrasal semantics.


The range of applicable methods is deliberately not limited to any specific family (e.g. distributional or vector models of semantic compositionality), as we believe that the tasks herein can be tackled from different directions. Indeed, we expect a great deal of the scientific benefit to lie in the comparison of very different approaches, as well as in how these approaches can be combined.



All subtasks are based on items drawn from the large-scale, freely available WaCky corpora (Baroni et al., 2009). As the evaluation data only contains very small annotated samples from freely available web documents, and the original source is provided, we can provide them without violating copyrights.


The size of these corpora allows for reliable distributional models to be trained. Sentences in the corpora are also already lemmatized and POS tagged. Participants whose approaches make use of distributional methods, POS tags or lemmas are strongly encouraged to use these corpora and their shared preprocessing, to ensure the highest possible comparability of results. Additionally, this may considerably reduce the workload on the participants' side. To further lower the barrier to entering the task, we will provide UIMA-based components for working with the data.


Targeted Languages

  • English
  • German
  • One Romance language



(a) Semantic Similarity of Words and Compositional Phrases

The aim of this subtask is to evaluate how well systems can judge the semantic similarity of a word and a short sequence of (two or more) words. For example, in the word-sequence pair:

(contact, close interaction)

the meaning of the sequence as a whole is semantically close to the meaning of the word. Conversely, in the word-sequence pair:

(megalomania, great madness)

the meaning of the word is semantically different from the meaning of the sequence, although it is not entirely unrelated. This subtask addresses a core problem, since satisfactory performance appears to be fundamental for compositional models of meaning, and is arguably the basis for the subsequent subtasks.

Participants will be provided with a training and a held-out test set of word-sequence pairs whose components occur in the large-scale freely available WaCky (Baroni et al., 2009) corpora. Each training set pair will be annotated as positive or negative. Participating systems will be judged as successful if they predict correctly whether the components of each test instance, i.e. word-sequence pair, are semantically similar or distinct.

Systems are allowed to use or ignore the training data, i.e. can be supervised or unsupervised. Unsupervised systems can use the training data for development and parameter tuning. Since this is a core task, participating systems will not be able to use dictionaries or other prefabricated lists to address it. Instead, they might use distributional similarity models, selectional preferences, measures of semantic similarity and others.
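As an illustration of the kind of distributional approach mentioned above, the following is a minimal sketch of an unsupervised baseline: it composes the sequence by vector addition and compares the result to the word by cosine similarity. The vectors, the additive composition, the threshold of 0.9 and the function names are all assumptions made for this example; real vectors would be trained on the WaCky corpora, and the threshold could be tuned on the training pairs.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# In practice these would be distributional vectors trained on a corpus.
VECTORS = {
    "contact": [0.9, 0.8, 0.1],
    "close": [0.7, 0.6, 0.2],
    "interaction": [0.8, 0.9, 0.1],
    "megalomania": [0.1, 0.2, 0.9],
    "great": [0.6, 0.1, 0.2],
    "madness": [0.2, 0.9, 0.3],
}


def compose(words):
    """Additive composition: element-wise sum of the word vectors."""
    return [sum(component) for component in zip(*(VECTORS[w] for w in words))]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_similar(word, sequence, threshold=0.9):
    """Predict 'positive' (True) if the word and the composed sequence are close."""
    return cosine(VECTORS[word], compose(sequence.split())) >= threshold


print(is_similar("contact", "close interaction"))
print(is_similar("megalomania", "great madness"))
```

With these toy vectors the first pair scores well above the threshold and the second well below it, mirroring the two examples given earlier. Other composition functions or similarity measures could be substituted without changing the overall shape of the baseline.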


For each language, the organisers will provide positive and negative word-sequence pairs: 60% annotated as positive/negative for training, and 40% unannotated for testing. The training set will be provided much earlier than the test set. All pairs will occur with useful frequency in the WaCky corpus.

Positive word-sequence equivalences will be extracted from readily available resources, e.g. from definitions in lexica and dictionaries, such as WordNet or Wiktionary. Negative pairs will be approximated by random equivalences. The validity of extracted equivalences will be checked manually by human curators.
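One possible reading of "random equivalences" is to re-pair each word with a sequence drawn from a different positive pair. The sketch below shows that reading only; it is not the organisers' actual procedure. The function name, the fixed seed, and the second and third example pairs are hypothetical, and any candidates produced this way would still need the manual check described above.

```python
import random


def make_negative_pairs(positive_pairs, seed=0):
    """Pair each word with a sequence taken from a different positive pair.

    Requires at least two positive pairs. The output is only a list of
    candidates: a randomly chosen sequence may still happen to be related
    to the word, which is why manual validation is needed.
    """
    rng = random.Random(seed)
    negatives = []
    for i, (word, _) in enumerate(positive_pairs):
        others = [seq for j, (_, seq) in enumerate(positive_pairs) if j != i]
        negatives.append((word, rng.choice(others)))
    return negatives


positives = [
    ("contact", "close interaction"),  # example from the task description
    ("bachelor", "unmarried man"),     # hypothetical pair
    ("puppy", "young dog"),            # hypothetical pair
]
for word, sequence in make_negative_pairs(positives):
    print(word, "|", sequence)
```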

Evaluation Criteria

System responses will be scored in terms of precision/recall/F1. Systems are encouraged to submit solutions for all languages, but submissions for fewer languages are accepted. Overall performance scores will be computed for systems participating in more than one language.
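For reference, precision, recall and F1 over binary labels can be computed as in the sketch below. The label strings and the choice of "positive" as the scored class are assumptions; the official scorer may differ in such details.

```python
def precision_recall_f1(gold, predicted, positive="positive"):
    """Precision, recall and F1 of `predicted` against `gold` for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["positive", "positive", "negative", "negative", "positive"]
pred = ["positive", "negative", "negative", "negative", "positive"]
print(precision_recall_f1(gold, pred))
```

In this toy run the system finds two of the three positive pairs and makes no false positive predictions, so precision is 1.0, recall is 2/3 and F1 is 0.8.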

Subtask Organizers
  • Yannis Korkontzelos (National Centre for Text Mining, University of Manchester, UK)
  • Fabio Massimo Zanzotto (University of Rome "Tor Vergata", Italy)


(b) Semantic Compositionality in Context

An interesting sub-problem of semantic compositionality is to decide whether a phrase is used in its literal or figurative meaning in a given context. For example “big picture” might be used literally as in

“Click here for a bigger picture”

or figuratively as in

“To solve this problem, you have to look at the bigger picture.”

Another example is “old school” which can also be used literally or figuratively:

“He will go down in history as one of the old school, a true gentlemen.”

“During the 1970's the hall of the old school was converted into the library.”


Being able to detect whether a phrase is used literally or figuratively is especially important for applications such as information retrieval, where figuratively used words should be treated separately to avoid false positives. For example, the sentence “He will go down in history as one of the old school, a true gentlemen.” should probably not be retrieved for the query “school”. Rather, the insights generated from subtask (a) could be utilized to retrieve sentences using a similar phrase such as “gentleman-like behaviour”.


Participants of this subtask will be provided with a list of target phrases together with real usage examples sampled from the large-scale freely available WaCky (Baroni et al., 2009) corpora [1]. For each usage example, the task is to make a binary decision whether the target phrase is used literally or figuratively in this context.


Participants might make use of pre-fabricated lists of phrases annotated with their probability of being used figuratively from publicly available sources. They might use selectional preferences or deep semantic parsing for deciding whether a phrase might be used figuratively in most cases (e.g. “kick the bucket”). Assessing how well the phrase suits its context might be tackled using measures of semantic relatedness as well as distributional models learned from the underlying corpus. The task may also be of interest to the related research fields of metaphor detection and idiom identification.
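To make the idea of "how well the phrase suits its context" concrete, here is a minimal sketch of a relatedness-based baseline: the phrase is called literal when the summed vector of its components is close to the summed vector of the surrounding context lemmas. The 2-dimensional vectors, the threshold of 0.8, the fallback label and all function names are assumptions for illustration; the contexts are lemmatized versions of the two "old school" examples above.

```python
import math

# Toy vectors, invented for illustration; real ones would be learned
# from the underlying corpus.
VECTORS = {
    "old": [0.5, 0.5],
    "school": [0.9, 0.1],
    "hall": [0.8, 0.2],
    "library": [0.9, 0.2],
    "history": [0.2, 0.8],
    "gentleman": [0.1, 0.9],
}


def add(vectors):
    """Element-wise sum of a non-empty list of vectors."""
    return [sum(component) for component in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(phrase_lemmas, context_lemmas, threshold=0.8, default="literal"):
    """Label a usage as 'literal' or 'figurative' from context relatedness."""
    phrase_vec = add([VECTORS[w] for w in phrase_lemmas])
    # Use only context lemmas we have vectors for, excluding the phrase itself.
    context = [VECTORS[w] for w in context_lemmas
               if w in VECTORS and w not in phrase_lemmas]
    if not context:
        return default  # arbitrary fallback when no context evidence exists
    related = cosine(phrase_vec, add(context)) >= threshold
    return "literal" if related else "figurative"


literal_ctx = ["during", "the", "1970s", "the", "hall", "of", "the", "old",
               "school", "be", "convert", "into", "the", "library"]
figurative_ctx = ["he", "will", "go", "down", "in", "history", "as", "one",
                  "of", "the", "old", "school", "a", "true", "gentleman"]
print(classify(["old", "school"], literal_ctx))
print(classify(["old", "school"], figurative_ctx))
```

A supervised system could instead learn the threshold, or a full classifier, per phrase from the annotated contexts.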



For each language, the organizers will extract about 100 target phrases and a set of usage contexts from the WaCky corpus. Each context will be annotated using the binary classification scheme (literally vs. figuratively) by native speakers of the respective language. The annotators will be recruited through web-based services such as Amazon Mechanical Turk or similar tools. The organizers have previously carried out similar annotation work in the context of the DiSCo 2011 shared task and other data acquisition projects.


The data will be split into training, validation and test sets. We define two different subsets that will be scored separately and in combination:

  • One subset of target phrases will be accompanied by a large number of contexts, which allows learning single classifiers for each phrase. The validation/test sets will only contain phrases that occur in the training set. This is comparable to the lexical sample task, which encourages the participation of supervised systems.
  • The other subset will contain target phrases together with a smaller number of usage contexts, probably favoring unsupervised approaches. The validation/test sets will contain new target phrases. This is comparable to the all-words task. Here, systems must grasp a notion of literal vs. idiomatic use in general, without training classifiers for each phrase.

The training and validation portions will be made available to the participants, together with a scoring infrastructure. For the challenge, participants submit their system's output on the test sets to the task organizers, who score the systems and provide the official scores.
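The difference between the two subsets comes down to how examples are partitioned: within each phrase (lexical-sample style) or across phrases (all-words style). The following sketch illustrates both; the function names, the tuple layout and the test fraction of 0.3 are assumptions, since the actual split proportions for this subtask are not specified here.

```python
import random
from collections import defaultdict


def split_within_phrases(examples, test_fraction=0.3, seed=0):
    """Lexical-sample style: each phrase's contexts are divided between
    train and test, so every test phrase also occurs in training."""
    rng = random.Random(seed)
    by_phrase = defaultdict(list)
    for example in examples:
        by_phrase[example[0]].append(example)
    train, test = [], []
    for phrase in sorted(by_phrase):
        items = by_phrase[phrase][:]
        rng.shuffle(items)
        cut = max(1, int(len(items) * (1 - test_fraction)))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


def split_across_phrases(examples, test_fraction=0.3, seed=0):
    """All-words style: whole phrases are held out, so test phrases
    never occur in training."""
    rng = random.Random(seed)
    phrases = sorted({example[0] for example in examples})
    rng.shuffle(phrases)
    cut = max(1, int(len(phrases) * (1 - test_fraction)))
    train_phrases = set(phrases[:cut])
    train = [e for e in examples if e[0] in train_phrases]
    test = [e for e in examples if e[0] not in train_phrases]
    return train, test


# Each example is (phrase, context_id, label); contents are placeholders.
examples = [(phrase, i, "literal" if i % 2 else "figurative")
            for phrase in ("big picture", "old school", "kick the bucket")
            for i in range(5)]
seen_train, seen_test = split_within_phrases(examples)
new_train, new_test = split_across_phrases(examples)
print({e[0] for e in seen_test} <= {e[0] for e in seen_train})
print({e[0] for e in new_test} & {e[0] for e in new_train})
```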


Evaluation Criteria

System performance will be measured in terms of precision/recall/F1 of the classifier. Systems are encouraged to use both data subsets, but submissions using a single subset are accepted. For systems tackling both data subsets or both languages, combined scores reflecting the overall performance will be computed.

Besides the overall performance score on the full test set, we will provide separate results for phrases already seen in the training data as well as for the new target phrases.
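A breakdown of this kind could be produced as sketched below: test items are grouped by whether their phrase occurred in training, and each group is scored separately. Treating "figurative" as the scored class, the tuple layout and the function names are assumptions, and only F1 is shown for brevity.

```python
def f1_score(pairs, positive="figurative"):
    """F1 for one class over a list of (gold, predicted) label pairs."""
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def breakdown(test_items, training_phrases):
    """Score all test items, plus the seen and unseen phrase groups.

    test_items: list of (phrase, gold_label, predicted_label).
    """
    groups = {"all": [], "seen": [], "unseen": []}
    for phrase, gold, predicted in test_items:
        key = "seen" if phrase in training_phrases else "unseen"
        groups[key].append((gold, predicted))
        groups["all"].append((gold, predicted))
    return {name: f1_score(pairs) for name, pairs in groups.items()}


items = [
    ("old school", "figurative", "figurative"),
    ("old school", "literal", "literal"),
    ("big picture", "figurative", "literal"),
    ("big picture", "literal", "literal"),
    ("big picture", "figurative", "figurative"),
]
print(breakdown(items, training_phrases={"old school"}))
```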

Subtask Organizers
  • Chris Biemann, UKP Lab, Technische Universität Darmstadt, Germany
  • Eugenie Giesbrecht, Karlsruhe Institute of Technology, Germany
  • Torsten Zesch, UKP Lab, Technische Universität Darmstadt, Germany



  • Yannis Korkontzelos, National Centre for Text Mining, University of Manchester, UK
  • Torsten Zesch, UKP Lab, Technische Universität Darmstadt, Germany



  1. M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43 (3): 209-226.

Contact Info

Yannis Korkontzelos
Torsten Zesch

email (subtask b): zesch@ukp.informatik.tu-darmstadt.de
