SemEval-2013 Task 4: Free Paraphrases of Noun Compounds
The task of classifying English noun compounds explores the idea of interpreting the semantics of noun compounds via paraphrases. Given a two-word noun compound, the participants are asked to produce an explicitly ranked list of its free paraphrases. The list will be automatically compared and evaluated against a similarly ranked list of paraphrases proposed by human annotators, recruited and managed through Amazon's Mechanical Turk. The comparison of raw paraphrases will be sensitive to syntactic and morphological variations. The ranking of paraphrases will be based on their relative popularity among annotators. To achieve a reliable ranking, highly similar paraphrases will be grouped so as to downplay superficial differences in syntax and morphology.
I. Introduction and Motivation
A noun compound (NC) is a sequence of nouns that act as a single noun [Downing:1977], as in these examples: colon cancer, suppressor protein, colon cancer tumor suppressor protein. NCs are frequent in English, where compounding is a very productive process. NCs comprise 3.9% and 2.6% of all tokens in the Reuters corpus and the British National Corpus (BNC), respectively [Baldwin:Tanaka:2004]. Because the frequency spectrum of compound types follows a Zipfian distribution [Ó Séaghdha:2008], many NC tokens belong to a "long tail" of low-frequency types. Over half of the two-noun types in the BNC occur just once [Kim:Baldwin:2006]. Their high frequency and high productivity make robust NC interpretation an important goal for broad-coverage semantic processing. Systems which ignore NCs discard salient information about the semantic relationships implicit in a text. At the same time, compositional interpretation is the only way to achieve broad NC coverage, since it is not feasible to list in a lexicon all compounds which one is likely to encounter. Even for relatively frequent NCs occurring 10 times or more in the BNC, static English dictionaries provide only 27% coverage [Tanaka:Baldwin:2003].
Understanding the syntax and semantics of NCs is important for many natural language processing applications. NCs may appear superficially similar, like caffeine headache and ice-cream headache, but have very different meanings: a lack of caffeine causes the former, while an excess of ice-cream causes the latter. Different interpretations can lead to different inferences, query expansions, paraphrases, translations, and so on. For example, a question-answering system may need to determine whether protein acting as a tumor suppressor is a good paraphrase for tumor suppressor protein, and an information extraction system might need to decide whether neck vein thrombosis and neck thrombosis could co-refer when used in the same document. A machine translation system might paraphrase the unknown NC WTO Geneva headquarters as Geneva headquarters of the WTO or as WTO headquarters located in Geneva. Given a query such as migraine treatment, an information retrieval system could use suitable paraphrasing verbs like relieve and prevent for page ranking and query refinement.
Most work on NC interpretation has focused on the most common two-word NCs, and there have been, broadly speaking, two directions. One line of attack derives the semantics of an NC from the semantics of its component nouns [Finin:1980, Rosario:Hearst:2001, Moldovan et al. 2004, Ó Séaghdha:Copestake:2009, Tratz:Hovy:2010]; the other models the relationship between the nouns directly [Vanderwende:1994, Girju:2007, Nakov:2008a, Nakov:2008b, Nakov:2008c, Butnariu et al. 2010]. The semantics of NCs is typically expressed using one or more abstract relations, such as CAUSE (malaria mosquito), SOURCE (olive oil) and PURPOSE (migraine drug). While relations may come from a small fixed list, some researchers have argued for a more fine-grained, even open-ended, inventory [Downing:1977]. In this endeavour, strong verbs can be particularly useful, because they capture elements of meaning which abstract relations cannot. Many NCs can be roughly paraphrased by generic patterns, but it may be advantageous to consider more specific patterns, for example be squeezed from (rather than the generic be made of) for orange juice, and be topped with (rather than the generic be composed of) for bacon pizza. The idea of using fine-grained paraphrasing verbs for NC semantics has grown in popularity lately [Butnariu:Veale:2008, Nakov:2008a, Nakov:2008b, Nakov:2008c], leading to a paraphrase-oriented shared task for NC interpretation (Task 9) at SemEval-2010 [Butnariu et al. 2010].
The present task builds on SemEval-2010 Task 9. We aim for a shared evaluation of broader significance for the community of those who study lexical semantics (and beyond). In Task 9, annotators were required to obey a restrictive template for their paraphrases, and participating systems were simply asked to rank a set of possible paraphrases for each compound. The current task will give both annotators and systems more freedom in the production of paraphrases. The design of the task is informed by previous work on compound annotation and interpretation. It is also influenced by related efforts, such as the English Lexical Substitution task at SemEval-2007 [McCarthy:Navigli:2007] and various evaluation exercises in the fields of paraphrasing and machine translation.
We believe that the overall advancement of the field will be significantly helped by a public-domain dataset of free-style paraphrases of NCs. That is why we pose as our primary objective the challenging task of preparing and releasing such a dataset to the research community. The common evaluation task which we establish will also enable researchers to compare their algorithms and their empirical results.
II. Task Description
This is an English noun compound interpretation task, which explores the idea of interpreting the semantics of noun compounds via free paraphrases. Given a two-word noun compound such as onion tears, the participants are asked to produce an explicitly ranked list of free paraphrases, as in the following example:
1 tears from onions
2 tears due to cutting onion
3 tears induced when cutting onions
4 tears that onions induce
5 tears which come from chopping onions
6 tears that sometimes flow when onions are chopped
7 tears which raw onions give you
Such a list is then automatically compared and evaluated against a similarly ranked list of paraphrases proposed by human annotators (recruited and managed using Amazon’s Mechanical Turk). The comparison of raw paraphrases is sensitive to syntactic and morphological variations. The ranking of paraphrases is based on their relative popularity among different annotators. To achieve a reliable ranking, highly similar paraphrases are grouped so as to downplay superficial differences in syntax and morphology.
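The grouping-and-ranking idea described above can be sketched roughly as follows. This is only an illustration, not the organizers' procedure: the normalisation rule (lowercasing, dropping a few determiners, stripping a trailing plural "s"), the function names, and the sample annotations are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical normalisation; the task's real grouping procedure is not
# specified in this description.
DETERMINERS = {"a", "an", "the"}


def normalize(paraphrase):
    """Map a paraphrase to a crude key that ignores case, determiners and plural -s."""
    tokens = []
    for token in paraphrase.lower().split():
        if token in DETERMINERS:
            continue
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        tokens.append(token)
    return tuple(tokens)


def rank_paraphrases(annotations):
    """Group near-identical paraphrases and rank the groups by popularity."""
    counts = Counter()
    representative = {}
    for paraphrase in annotations:
        key = normalize(paraphrase)
        counts[key] += 1
        # Keep the first surface form seen as the label of the group.
        representative.setdefault(key, paraphrase)
    # most_common() sorts by count, keeping first-seen order among ties.
    return [(representative[key], n) for key, n in counts.most_common()]


# Made-up annotator output for "onion tears", for illustration only.
annotations = [
    "tears from onions",
    "tears from the onion",
    "tears due to cutting onion",
    "tears from onions",
    "tears due to cutting onions",
    "tears that onions induce",
]
ranked = rank_paraphrases(annotations)
for rank, (text, votes) in enumerate(ranked, start=1):
    print(rank, text, votes)
```

With this toy input, the three "tears from ..." variants fall into one group and therefore outrank the two "tears due to cutting ..." variants.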
III. Data
Training Data: We have released paraphrases generated for 200 noun compounds, each paraphrased by about 30 people.
Test Data: The test data will consist of another 200 noun compounds, each paraphrased by about 30 people.
License: All data is released under the Creative Commons Attribution 3.0 Unported license.
IV. Evaluation Method
Noun compounding is a generative aspect of language, but so too is the process of noun-compound interpretation: human speakers typically generate a range of possible interpretations for a given compound, each emphasising a different aspect of the relationship between the nouns. Our evaluation framework reflects the belief that there is rarely a single right answer for a given noun-noun pairing. Participating systems will thus be expected to demonstrate some generativity of their own, and will be scored not just on the accuracy of individual interpretations, but on the overall breadth of their output.
Evaluation is performed using a Scorer implemented as a Java class (see the README file). For each noun compound to be evaluated, the Scorer compares a list of your system's suggested paraphrases against a gold-standard reference list, compiled and rank-ordered from paraphrases suggested by human judges. The score assigned to your system is the mean of your system's performance across all test compounds. The score assigned to your system for a specific compound is calculated in two different ways: using isomorphic matching of your paraphrases to gold-standard reference paraphrases (on a one-to-one basis); and using non-isomorphic matching of your system's paraphrases to reference paraphrases (in a potentially many-to-one mapping).
Isomorphic mapping rewards both precision and recall. It rewards your system for accurately reproducing the paraphrases suggested by human judges, and it rewards your system for reproducing as many of these as it can, and in much the same order (the reference paraphrases in the gold standard are ordered by rank, where the highest rank is assigned to the paraphrases most frequently suggested by human judges). Please see the README for the Scorer to better appreciate the nuances of isomorphic mapping (e.g., because your system's paraphrases are matched 1-to-1 to reference paraphrases on a first-come basis, the ordering of your system's paraphrases is very important in an isomorphic mapping).
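To make the first-come, one-to-one matching concrete, here is a minimal sketch. It is not the Scorer itself (which is a Java class documented in its README): the word-overlap similarity, the normalisation by the longer of the two lists, and the function names are all assumptions chosen for illustration, and any rank weighting is left out.

```python
def similarity(a, b):
    """Hypothetical similarity: Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def isomorphic_score(system, reference):
    """Greedy one-to-one matching, processing system paraphrases in order."""
    if not system or not reference:
        return 0.0
    unused = list(reference)
    total = 0.0
    for paraphrase in system:  # first come, first served
        if not unused:
            break
        best = max(unused, key=lambda ref: similarity(paraphrase, ref))
        total += similarity(paraphrase, best)
        unused.remove(best)  # each reference can be matched only once
    # Assumed normalisation: dividing by the longer list penalises both
    # missing references (recall) and surplus system output (precision).
    return total / max(len(system), len(reference))


reference = ["tears from onions", "tears due to cutting onions"]
good_order = ["tears from onions", "tears from cutting onions"]
bad_order = ["tears from cutting onions", "tears from onions"]

print(isomorphic_score(good_order, reference))
print(isomorphic_score(bad_order, reference))
```

In this toy case the second ordering scores lower, because the first system paraphrase claims the reference that the second one would have matched exactly.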
Non-isomorphic mapping rewards just precision. It rewards your system for accurately reproducing the top-ranked human paraphrases in the gold standard. Your system will achieve a higher score in a non-isomorphic match if it reproduces the top-ranked human paraphrases as opposed to lower-ranked human paraphrases. The ordering of your system's paraphrases is not important for non-isomorphic matching. See the README for the Scorer to better appreciate the nuances of non-isomorphic mapping.
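A many-to-one, precision-only match could look something like the sketch below. Again, this is an illustration under stated assumptions rather than the Scorer's actual formula: the word-overlap similarity, the weighting of each match by the reference's annotator count relative to the most popular reference, and the sample counts are all hypothetical.

```python
def similarity(a, b):
    """Hypothetical similarity: Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def non_isomorphic_score(system, reference):
    """Many-to-one matching; reference is a list of (paraphrase, annotator_count)."""
    if not system or not reference:
        return 0.0
    top_count = max(count for _, count in reference)
    total = 0.0
    for paraphrase in system:
        # Several system paraphrases may pick the same reference.
        best_text, best_count = max(
            reference, key=lambda ref: similarity(paraphrase, ref[0])
        )
        # Assumed weighting: matches to popular references count for more.
        total += similarity(paraphrase, best_text) * best_count / top_count
    # Averaging over system output only makes this a precision-style score.
    return total / len(system)


# Made-up gold standard with annotator counts, for illustration only.
reference = [("tears from onions", 3), ("tears that onions induce", 1)]

print(non_isomorphic_score(["tears from onions"], reference))
print(non_isomorphic_score(["tears that onions induce"], reference))
print(non_isomorphic_score(["tears from onions", "tears from the onions"], reference))
```

Because the score is an average over the system's paraphrases, reordering the system list does not change it, and an exact match to the top-ranked reference is worth more than an exact match to a lower-ranked one.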
Your system will be evaluated using the Scorer in both modes, isomorphic and non-isomorphic. We expect most systems will focus on achieving a high non-isomorphic score (i.e., we expect most systems to emphasize precision over precision-and-recall), though some may also aim to achieve high isomorphic scores.
V. Important Dates
August 1, 2012 Training data released
September 12, 2012 First Call for participation
February 15, 2013 Test set ready
February 15, 2013 Registration Deadline [for Task Participants]
March 7, 2013 onwards Start of evaluation period [Task Dependent]
March 15, 2013 End of evaluation period
April 9, 2013 Paper submission deadline [TBC]
May 4, 2013 Camera ready Due [TBC]
June 13-14, 2013 Workshop co-located with ACL or NAACL [TBC]
VI. Organizers
Iris Hendrickx (email@example.com) Universidade de Lisboa
Zornitsa Kozareva (firstname.lastname@example.org) University of Southern California, ISI
Preslav Nakov (email@example.com) QCRI, Qatar Foundation
Diarmuid Ó Séaghdha (firstname.lastname@example.org) University of Cambridge
Stan Szpakowicz (email@example.com) University of Ottawa
Tony Veale (firstname.lastname@example.org) University College Dublin
VII. References
[Baldwin:Tanaka:2004] Baldwin, Timothy and Takaaki Tanaka (2004). Translation by machine of compound nominals: Getting it right. In Proceedings of ACL-2004 Workshop on Multiword Expressions: Integrating Processing, 24-31.
[Butnariu:Veale:2008] Butnariu, Cristina and Tony Veale (2008). A Concept-Centered Approach to Noun-Compound Interpretation. In Proceedings of COLING-2008, 81-88.
[Butnariu et al. 2010] Butnariu, Cristina, Su Nam Kim, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz and Tony Veale (2010). SemEval-2 Task 9: The Interpretation of Noun Compounds Using Paraphrasing Verbs and Prepositions. In Proceedings of the 5th International Workshop on Semantic Evaluation, 39-44.
[Downing:1977] Downing, Pamela (1977). On the creation and use of English compound nouns. Language vol. 53, 810-842.
[Finin:1980] Finin, Timothy (1980). The Semantic Interpretation of Nominal Compounds. AAAI, 310-312.
[Girju:2007] Girju, Roxana (2007). Improving the interpretation of noun phrases with cross-linguistic information. In Proceedings of the 45th Annual Meeting of the ACL, 568-575.
[Kim:Baldwin:2006] Kim, Su Nam and Timothy Baldwin (2006). Interpreting Semantic Relation in Noun Compound via Verb Semantics. In Proceedings of ACL/COLING-2006, 491-498.
[McCarthy:Navigli:2007] McCarthy, Diana and Roberto Navigli (2007). SemEval-2007 Task 10: English Lexical Substitution Task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 48-53.
[Moldovan et al. 2004] Moldovan, Dan, Adriana Badulescu, Marta Tatu, Daniel Antohe, and Roxana Girju (2004). Models for the semantic classification of noun phrases. In Proceedings of the HLT-NAACL 2004: Workshop on Computational Lexical Semantics, Boston. 60-67.
[Nakov:2008a] Nakov, Preslav (2008). Improved Statistical Machine Translation Using Monolingual Paraphrases. In Frontiers in Artificial Intelligence and Applications, vol. 178 (ECAI-2008), 338-342.
[Nakov:2008b] Nakov, Preslav (2008). Noun Compound Interpretation Using Paraphrasing Verbs: Feasibility Study. In AIMSA-2008, LNAI 5253. 103-117.
[Nakov:2008c] Nakov, Preslav and Marti Hearst (2008). Solving Relational Similarity Problems Using the Web as a Corpus. In Proceedings of ACL-2008. 452-460.
[Ó Séaghdha:2008] Ó Séaghdha, Diarmuid (2008). Learning Compound Noun Semantics. PhD Thesis, University of Cambridge.
[Ó Séaghdha:Copestake:2009] Ó Séaghdha, Diarmuid and Ann Copestake (2009). Using lexical and relational similarity to classify semantic relations. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009). Athens, Greece.
[Rosario:Hearst:2001] Rosario, Barbara and Marti Hearst (2001). Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy. In Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing, 82-90.
[Tanaka:Baldwin:2003] Tanaka, Takaaki and Tim Baldwin (2003). Noun-Noun Compound Machine Translation: A Feasibility Study on Shallow Processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, 17-24.
[Tratz:Hovy:2010] Tratz, Stephen and Eduard Hovy (2010). A taxonomy, dataset, and classifier for automatic noun compound interpretation. In Proceedings of the 48th Annual Meeting of the ACL, Uppsala, 678-687.
[Vanderwende:1994] Vanderwende, Lucy (1994). Algorithm for automatic interpretation of noun sequences. In Proceedings of the 15th Conference on Computational Linguistics, Kyoto, 782–788.
[Wubben et al. 2010] Wubben, S., Bosch, A. van den, and Krahmer, E.J. (2010). Paraphrase generation as monolingual translation: Data and evaluation. In Proceedings of the 10th International Natural Language Generation Conference, ACL. 203-208.
[JNLEspecial:2013] Journal of Natural Language Engineering, Special Issue on the Semantics of Noun Compounds (planned for early 2013), http://sites.google.com/site/jnlencsemantics2013/
VIII. Contact Person
Iris Hendrickx (email@example.com)