SemEval-2010 Word Sense Induction & Disambiguation Task

Task Description

For each target word whose senses are to be induced, participants will be given a training set and a testing set. Neither set will be sense-annotated. Both sets will consist of instances of the target word.

Participants will be required to learn the senses of the target word using only the training set. No other resources may be used to acquire word senses; only the data that we provide is allowed for that purpose. This restriction applies only to the data used for acquiring word senses: NLP components for morphology and syntax are allowed, whether they were built with supervised or unsupervised methods. The parameters of systems can be tuned manually, for example to generate the 'right' number of clusters or to set certain thresholds. However, if participants use other development sets for parameter tuning, thereby making their system somewhat supervised, this must be declared in the description paper.

We recommend that participants use the senses induced during training to assign a sense ID to each test instance. The sense assignment can also be probabilistic (see below). Note, however, that finding clusters in the training data is not required. Given that we evaluate systems' output only on the test data, participants can treat the training data as development data (e.g. for parameter tuning) and apply their learning method to the test data.

The testing set will be used only for evaluation and should not be used as a complement to the training set. Participants are required to tag each instance of the target word in the testing set with one of its induced senses (assuming senses have been learned using the training set). The tagged instances will be sent to the organisers, who will perform the evaluation.

The output of systems should follow the usual Senseval-3 & SemEval-2007 WSI task format. The labels for learned senses can be arbitrary names; however, the label of each induced sense must be unique. For instance, assume that a participant's system has induced 2 senses for the verb "absorb", i.e. absorb.cluster.1 and absorb.cluster.2. These are example outputs for two instances of the word "absorb":

absorb.v absorb.v.1 absorb.cluster.1

absorb.v absorb.v.2 absorb.cluster.1/0.8 absorb.cluster.2/0.2

In the first line, the system assigns sense absorb.cluster.1 to instance absorb.v.1 with weight 1 (the default). In the second line, the system assigns to instance absorb.v.2 (i) sense absorb.cluster.1 with weight 0.8 and (ii) sense absorb.cluster.2 with weight 0.2.

We recommend that participants return all induced senses per instance with associated weights, as this will enable a more objective supervised evaluation.
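
As an illustration of the answer format described above, the following is a minimal Python sketch of a parser for one output line. The function name parse_answer_line is hypothetical and is not part of any official scorer. The sketch assumes whitespace-separated fields, sense labels that contain no "/" other than the one before the weight, and a default weight of 1 when no weight is given.

```python
from typing import Dict, Tuple


def parse_answer_line(line: str) -> Tuple[str, str, Dict[str, float]]:
    """Parse one answer line into (lemma, instance_id, {sense_label: weight}).

    Hypothetical helper for illustration only; not the official scorer.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(
            f"expected a lemma, an instance ID and at least one sense: {line!r}"
        )
    lemma, instance_id, *sense_fields = fields
    senses: Dict[str, float] = {}
    for field in sense_fields:
        # A sense is written either as "label" or as "label/weight".
        label, sep, weight = field.rpartition("/")
        if sep:
            senses[label] = float(weight)
        else:
            senses[field] = 1.0  # the weight defaults to 1 when omitted
    return lemma, instance_id, senses
```

Applied to the two example lines above, this sketch would return a single sense with weight 1.0 for absorb.v.1, and the two senses with weights 0.8 and 0.2 for absorb.v.2.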

Evaluation Setting

The evaluation scheme consists of the following assessment methodologies:

  • Unsupervised Evaluation

    The induced senses are evaluated as clusters of examples and compared to sets of examples that have been tagged with gold standard (GS) senses. The evaluation metric used, V-measure (Rosenberg & Hirschberg, 2007), attempts to measure both the homogeneity and the completeness of a clustering solution. Perfect homogeneity is achieved if every cluster of a clustering solution contains only data points that are elements of a single GS class. Perfect completeness, on the other hand, is achieved if all the data points that are members of a given class are also elements of the same cluster. Homogeneity and completeness can be treated in a similar fashion to precision and recall, where increasing the former often results in decreasing the latter (Rosenberg & Hirschberg, 2007).

  • Supervised Evaluation

    The second evaluation setting, supervised evaluation, assesses WSI systems in a WSD task. Specifically, the training corpus is used to create a matrix that maps clusters to GS senses. The mapping matrix is then used to tag each instance in the testing corpus (which has also been tagged with an automatically induced cluster) with the most probable GS sense as defined by the mapping matrix. The usual recall/precision measures for WSD are then used. This evaluation methodology was part of the SemEval-2007 WSI task (Agirre & Soroa, 2007).

    For more information, please refer to Manandhar & Klapaftis (2009).
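
The two evaluation settings described above can be sketched in Python as follows. This is a simplified illustration, not the official scorer: all function names are hypothetical, each instance is assumed to carry exactly one cluster label (weighted assignments are ignored), entropies use the natural logarithm, and homogeneity or completeness is set to 1 when the corresponding entropy is zero, following the convention in Rosenberg & Hirschberg (2007). The supervised part reduces the mapping matrix to its most frequent GS sense per cluster.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple


def _entropy(labels: Sequence[str]) -> float:
    """Entropy of a label sequence, in nats."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def _conditional_entropy(target: Sequence[str], given: Sequence[str]) -> float:
    """H(target | given), estimated from co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    return -sum((k / n) * math.log(k / marginal[g]) for (g, _), k in joint.items())


def v_measure(gold: Sequence[str], clusters: Sequence[str]) -> Tuple[float, float, float]:
    """Return (homogeneity, completeness, V-measure) for parallel label lists."""
    h_gold, h_clusters = _entropy(gold), _entropy(clusters)
    homogeneity = (
        1.0 if h_gold == 0 else 1.0 - _conditional_entropy(gold, clusters) / h_gold
    )
    completeness = (
        1.0 if h_clusters == 0
        else 1.0 - _conditional_entropy(clusters, gold) / h_clusters
    )
    total = homogeneity + completeness
    # V-measure is the harmonic mean of homogeneity and completeness.
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


def build_mapping(clusters: Sequence[str], gold: Sequence[str]) -> Dict[str, str]:
    """Map each cluster to the GS sense it co-occurs with most often."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for cluster, sense in zip(clusters, gold):
        counts[cluster][sense] += 1
    return {c: senses.most_common(1)[0][0] for c, senses in counts.items()}


def supervised_recall(
    mapping: Dict[str, str], test_clusters: Sequence[str], test_gold: Sequence[str]
) -> float:
    """Fraction of test instances whose mapped sense equals the GS sense.

    Instances tagged with a cluster that is absent from the mapping count as wrong.
    """
    correct = sum(mapping.get(c) == g for c, g in zip(test_clusters, test_gold))
    return correct / len(test_gold)
```

Under these assumptions, a solution that places every instance in its own cluster has perfect homogeneity but low completeness, while a solution that places all instances in one cluster has perfect completeness but zero homogeneity (when more than one GS sense is present), which is the precision/recall-like trade-off noted above.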


    Andrew Rosenberg and Julia Hirschberg. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, June 2007. ACL.

    Eneko Agirre and Aitor Soroa. SemEval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations, pp. 7-12, Prague, Czech Republic, June 2007. ACL.

    Suresh Manandhar and Ioannis P. Klapaftis. SemEval-2010 Task 14: Evaluation Setting for Word Sense Induction & Disambiguation Systems. In NAACL-HLT 2009 Workshop on Semantic Evaluations: Recent Achievements and Future Directions, Boulder, Colorado, USA, 2009.