Update!

SemEval Task #7 has concluded: see the full RESULTS

Task Description

Abstract

We present a new evaluation for open-domain commonsense reasoning, the Choice of Plausible Alternatives (COPA), focusing specifically on commonsense causal reasoning about everyday events: What was the cause of the event? What were its effects? The task is to select the alternative that is more plausibly the cause (or effect) of the situation described by the premise, as in the following example:

Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.

We provide development and test sets of 500 questions each. Systems will be evaluated with a statistical significance test that determines whether two systems differ significantly in their performance on the COPA datasets. The significance test is based on approximate randomization, using a stratified shuffling approach to build a distribution of differences in performance between the two systems.

I. Introduction and Motivation

Open-domain commonsense reasoning is one of the grand challenges of artificial intelligence, and has been the subject of research since the inception of the field. Until recently, this research history has been dominated by formal approaches, where logical formalizations of commonsense theories were hand-authored by expert logicians and evaluated using a handful of commonsense challenge problems (Morgenstern, 2011). Progress via this approach has been slow, both because of the inherent difficulty of authoring suitably broad-coverage formal theories of the commonsense world and because of the lack of evaluation metrics for comparing systems from different labs and research traditions.

Radically different approaches to the commonsense reasoning problem have recently been explored by Natural Language Processing researchers. Speer et al. (2008) describe a novel reasoning approach that applies dimensionality reduction to the space of millions of English-language commonsense facts in a crowd-sourced knowledge base (Liu and Singh, 2004). Gordon et al. (2010) describe a method for extracting millions of commonsense facts from parse trees of English sentences. Jung et al. (2010) describe a novel approach to the extraction of commonsense knowledge about activities by mining online how-to articles. We believe that these new NLP-based approaches hold enormous potential for overcoming the knowledge acquisition bottleneck that has limited progress in commonsense reasoning in previous decades.

Given the growth of and enthusiasm for these new approaches, there is an increasing need for a common evaluation metric. A common evaluation suite would allow researchers to gauge the performance of new versions of their own systems, and to compare their approaches with those of other research groups. Evaluations for these new NLP-based approaches should themselves be based in natural language, and must be suitably large to truly evaluate the breadth of different reasoning approaches. Still, each evaluation should be focused on one dimension of the overall commonsense reasoning task, so as not to create a new challenge at which no single research group could hope to succeed.

In this task, we present a new evaluation for open-domain commonsense reasoning, focusing specifically on commonsense causal reasoning about everyday events: What was the cause of the event? What were its effects?

II. Task Description

The Choice of Plausible Alternatives (COPA) is a tool for evaluating open-domain commonsense causal reasoning. COPA consists of a large set of two-choice questions, each formulated as a premise and two alternatives written as simple English sentences. The task is to select the alternative that is more plausibly the cause (or effect) of the situation described by the premise, as in the following examples:

Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.

Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.

Premise: I knocked on my neighbor's door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.
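
To make the format concrete, the following Python sketch shows one way a COPA item and the selection step could be represented in code. The CopaItem fields and the toy word-overlap plausibility() scorer are illustrative assumptions, not the official data format or a serious scoring method.

# A minimal sketch only: the item fields and the toy plausibility() scorer are
# illustrative assumptions, not the official COPA data format or a real system.
from dataclasses import dataclass

@dataclass
class CopaItem:
    premise: str       # e.g. "The man broke his toe."
    asks_for: str      # "cause" or "effect"
    alternative1: str  # e.g. "He got a hole in his sock."
    alternative2: str  # e.g. "He dropped a hammer on his foot."

def plausibility(premise: str, alternative: str, asks_for: str) -> float:
    # Toy placeholder: word overlap between premise and alternative.
    # A real system would score causal plausibility using corpus statistics
    # or a commonsense knowledge repository.
    p = set(premise.lower().split())
    a = set(alternative.lower().split())
    return len(p & a) / max(len(a), 1)

def answer(item: CopaItem) -> int:
    # Return 1 or 2: the alternative judged more plausible as the cause or
    # effect (depending on asks_for) of the premise.
    s1 = plausibility(item.premise, item.alternative1, item.asks_for)
    s2 = plausibility(item.premise, item.alternative2, item.asks_for)
    return 1 if s1 >= s2 else 2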

III. Differences from Other Tasks

The COPA evaluation is most similar in style to the Recognizing Textual Entailment challenge, but differs in its focus on causal implication rather than entailment. In this respect, COPA overlaps in its aims with the task of recognizing causal relations in text through automated discourse processing. However, instead of trying to learn the lexical-syntactic patterns that are most correlated with causal relations (e.g., in the Penn Discourse Treebank), COPA encourages competitors to capture commonsense causal knowledge from any available corpus or existing knowledge repository.

IV. Dataset Creation

We have authored one thousand COPA questions, with the correct alternative in each question validated by two human raters. The methodology used to author these questions is described in detail in the following publication:

Roemmele, M., Bejan, C., and Gordon, A. (2011) Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, March 21-23, 2011.

The set has been divided into development and test sets of 500 questions each. The position of the correct alternative has been randomized so that the expected performance of random guessing is 50%.
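
To make the scoring explicit, COPA performance is reported as accuracy over the question set; the short Python sketch below assumes answers encoded as lists of 1s and 2s, which is an illustrative convention rather than the distributed file format.

# Scoring is simple accuracy: the fraction of questions for which the
# system selects the validated alternative (answers encoded as 1 or 2).
def accuracy(gold, predicted):
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# With the correct alternative balanced between positions 1 and 2,
# random guessing is expected to score about 0.5.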

The COPA evaluation aims to advance research on corpus-based approaches to the extraction and application of open-domain commonsense knowledge. Rather than providing a training corpus, competitors are encouraged to experiment with any available corpus, existing knowledge repository, knowledge extraction technique, and automated reasoning method.

We have shown in the paper that humans can easily achieve nearly 100% accuracy on the COPA task, whereas a reasonable baseline barely scores above random chance. Given the ease with which humans can quickly identify the correct alternative, there is nothing to be gained from delaying the release of the test set, as might be appropriate in other SemEval-2012 tasks.

V. Evaluation Methodology

To determine whether two systems are significantly different in their performance on the COPA datasets, we provide software that implements a statistical significance test. The test is based on approximate randomization, using a stratified shuffling approach to build a distribution of differences in performance between the two systems.

The software is implemented in Java and can be executed with a bash script (copa-eval.sh) that takes as arguments the file with the gold-standard answers and the files with the choices of the two reasoning systems to be compared.
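
As a rough illustration of this kind of test (not the distributed Java implementation), the Python sketch below shows one common formulation of a paired approximate randomization test: the two systems' answers to the same item are randomly swapped many times to build a null distribution of accuracy differences. The number of shuffles and the answer encoding are assumptions.

import random

def accuracy(gold, answers):
    # Proportion of items where the chosen alternative matches the gold answer.
    return sum(g == a for g, a in zip(gold, answers)) / len(gold)

def approximate_randomization(gold, sys_a, sys_b, shuffles=10000, seed=0):
    # Two-sided paired approximate randomization test: under the null
    # hypothesis that the systems are interchangeable, swapping their
    # answers on any item should not matter, so we repeatedly swap the
    # paired answers at random and count how often the resulting accuracy
    # difference is at least as large as the one actually observed.
    rng = random.Random(seed)
    observed = abs(accuracy(gold, sys_a) - accuracy(gold, sys_b))
    at_least_as_large = 0
    for _ in range(shuffles):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:   # swap this item's paired answers
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        diff = abs(accuracy(gold, shuf_a) - accuracy(gold, shuf_b))
        if diff >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (shuffles + 1)   # estimated p-value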

We have also generated an answer set for a strong baseline system, as described in Roemmele et al. (2011). This baseline gathers pointwise mutual information (PMI) statistics between words in COPA premises and alternatives from all English-language documents available in Project Gutenberg, and selects the alternative whose average PMI with the words in the premise is highest. This approach significantly outperforms the random baseline, scoring 58.8% on the test set. We will make this baseline answer set available to the public.
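
As a rough sketch of such a PMI baseline (not the exact system of Roemmele et al., 2011), one could score each alternative by the average PMI between its words and the premise's words; in the Python sketch below, the count, cooccur, and total arguments stand in for whatever corpus statistics are actually collected.

import math

def pmi(x, y, count, cooccur, total):
    # Pointwise mutual information between two word types, given unigram
    # counts, pair co-occurrence counts, and the total number of
    # observations in the corpus; unseen pairs score 0.0.
    c_xy = cooccur(x, y)
    if c_xy == 0 or count(x) == 0 or count(y) == 0:
        return 0.0
    return math.log((c_xy * total) / (count(x) * count(y)))

def choose_alternative(premise_words, alt1_words, alt2_words,
                       count, cooccur, total):
    # Pick the alternative whose words have the higher average PMI with
    # the words of the premise (a sketch of the Gutenberg PMI baseline).
    def avg_pmi(alt_words):
        pairs = [(p, a) for p in premise_words for a in alt_words]
        total_pmi = sum(pmi(p, a, count, cooccur, total) for p, a in pairs)
        return total_pmi / max(len(pairs), 1)
    return 1 if avg_pmi(alt1_words) >= avg_pmi(alt2_words) else 2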

We expect that other systems will be developed between now and the SemEval-2012 workshop that outperform these baselines. Where possible, the answer sets for systems described in published reports will also be made available to SemEval-2012 competitors for comparison.

Organizers

Andrew S. Gordon, University of Southern California (gordon@ict.usc.edu)
Zornitsa Kozareva, University of Southern California (kozareva@isi.edu)
Melissa Roemmele, Indiana University (msroemme@gmail.com)

References

Jung, Y., Ryu, J., Kim, K., and Myaeng, S. (2010) Automatic Construction of a Large-Scale Situation Ontology by Mining How-to Instructions from the Web. Journal of Web Semantics 8(2-3):110-124.
 
Gordon, J., Van Durme, B., and Schubert, L. K. (2010) Learning from the Web: Extracting General World Knowledge from Noisy Text. Proceedings of the AAAI 2010 Workshop on Collaboratively-built Knowledge Sources and Artificial Intelligence (WikiAI 2010).
 
Liu, H. and Singh, P. (2004) ConceptNet: A Practical Commonsense Reasoning Toolkit. BT Technology Journal 22(4):211-226.
 
Morgenstern, L. (2011) Common Sense Problem Page. Retrieved April 2011 from http://www-formal.stanford.edu/leora/commonsense/
 
Roemmele, M., Bejan, C., and Gordon, A. (2011) Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, March 21-23, 2011.
 
Speer, R., Havasi, C. and Lieberman, H. (2008) AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge. Proceedings of AAAI 2008.