Task Description


The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge at SemEval 2013 is the result of a joint effort by the educational technology and textual inference communities to present a unified scientific challenge addressing researchers in both fields.



General computational linguistics and educational technology perspective

The goal of the task is to produce an assessment of student answers to explanation and definition questions typically asked in problems seen in practice exercises, tests or tutorial dialogue.

Specifically, given a question, a known correct "reference answer" and a 1- or 2-sentence "student answer", the goal is to evaluate the answer’s accuracy. This task offers the opportunity to evaluate the usefulness of approaches for semantic analysis (especially textual entailment) for e-learning applications.

Two main challenges are likely. First, only a small amount of domain-specific training data is available, so participants will need to find ways to combine it with larger out-of-domain resources. Second, systems must deal with ill-formed student utterances.

Previous approaches to student response analysis include methods based on text classification, latent semantic analysis and other semantic similarity measurements, textual entailment, and, in small domains, parsing and rule-based methods. We invite participants to use any methods they consider suitable. 

Textual Entailment perspective

From a textual inference perspective, the Student Response Analysis task closely relates to the notion of textual entailment and is therefore likely to benefit from entailment recognition capabilities.

According to the standard definition of Textual Entailment, given two text fragments called 'Text' (T) and 'Hypothesis' (H), it is said that T entails H (T ⇒ H) if, typically, a human reading T would infer that H is most likely true (Dagan et al., 2006).

In a typical answer assessment scenario, we expect that a correct student answer would entail the reference answer, while an incorrect answer would not. However, students often skip details that are mentioned in the question or may be inferred from it, while reference answers often repeat or make explicit information that appears in or is implied by the question. Hence, a more precise textual entailment scenario would be to consider the entailing text T as consisting of both the student answer and the original question, while H consists of the reference answer.

Still, even though there is a clear relation between textual entailment and answer correctness, the correlation between answer assessment judgments and entailment judgments in this setting is not perfect. Sometimes a teacher would regard a student answer as correct even though it does not strictly entail the reference answer (even along with the information in the question). This may happen, for example, because the assessing teacher realizes that the student does understand the correct answer, but relies for this judgment on external background material of the domain that is not available in the given textual parts. Hence, we cannot regard teachers' assessments of student answers as strict textual entailment judgments, but rather as “student understanding” entailment judgments. However, the correlation between assessment judgments of the two types is high, and we further expect that in most correct answers at least a substantial portion of the hypothesis will be entailed by the text, even if not all of it. Thus, we challenge the textual inference community to address the answer assessment task at varying levels of granularity, using textual entailment techniques, and explore how well these techniques can help in this real-world educational setting.
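The pairing described above, with T built from the question together with the student answer and H taken to be the reference answer, could be represented along the following lines. This is only a minimal sketch: the names EntailmentPair and build_pair are hypothetical, the example strings are invented for illustration, and simple concatenation is just one assumed way of combining the question with the student answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntailmentPair:
    """A Text/Hypothesis pair for an entailment system (hypothetical type)."""
    text: str        # T: the question combined with the student answer
    hypothesis: str  # H: the reference answer


def build_pair(question: str, student_answer: str, reference_answer: str) -> EntailmentPair:
    """Build T from question + student answer, and H from the reference answer.

    Plain concatenation is an assumption made for this sketch; a real system
    might combine the question and the answer in a more careful way.
    """
    text = f"{question.strip()} {student_answer.strip()}"
    return EntailmentPair(text=text, hypothesis=reference_answer.strip())


if __name__ == "__main__":
    # Invented example strings, not taken from the challenge data.
    pair = build_pair(
        question="Why does the bulb stay off?",
        student_answer="There is a gap in the path.",
        reference_answer="The bulb is not in a closed path with the battery.",
    )
    print("T:", pair.text)
    print("H:", pair.hypothesis)
```

An entailment engine would then be asked whether pair.text entails pair.hypothesis, either as a whole or only in part.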


Given a question, a known correct "reference answer" and a 1- or 2-sentence "student answer", the main task consists of assessing the correctness of a student’s answer at different levels of granularity, namely:

- 5-way task, where the system is required to classify the student answer according to one of the following judgments:

  • correct
  • partially correct but incomplete
  • contradictory (it contradicts content in the reference answer)
  • irrelevant (it does not contain information directly relevant to the answer)
  • not in the domain (e.g., expressing a request for help).

- 3-way task, where the system is required to classify the student answer according to one of the following judgments:

  • correct
  • contradictory (if it contains information contradicting the content of the reference answer)
  • incorrect (conflating the categories partially correct but incomplete, irrelevant and not in the domain in the 5-way classification)

- 2-way task, where the system is required to classify the student answer according to one of the following judgments:

  • correct
  • incorrect (conflating the categories contradictory and incorrect in the 3-way classification)
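The way the three label sets nest, as described in the lists above, can be sketched as a pair of lookup tables. The label strings used here are hypothetical identifiers chosen for this illustration; the labels actually used in the released data may be spelled differently.

```python
# Hypothetical label identifiers; the released data may use other spellings.
FIVE_TO_THREE = {
    "correct": "correct",
    "partially_correct_incomplete": "incorrect",
    "contradictory": "contradictory",
    "irrelevant": "incorrect",
    "non_domain": "incorrect",
}

THREE_TO_TWO = {
    "correct": "correct",
    "contradictory": "incorrect",
    "incorrect": "incorrect",
}


def to_three_way(five_way_label: str) -> str:
    """Collapse a 5-way judgment into the 3-way label set."""
    return FIVE_TO_THREE[five_way_label]


def to_two_way(five_way_label: str) -> str:
    """Collapse a 5-way judgment into the 2-way label set via the 3-way set."""
    return THREE_TO_TWO[to_three_way(five_way_label)]


if __name__ == "__main__":
    for label in FIVE_TO_THREE:
        print(label, "->", to_three_way(label), "->", to_two_way(label))
```

Because each coarser label set is obtained by merging categories of the finer one, a system that produces 5-way judgments could in principle derive 3-way and 2-way output with a mapping of this kind.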

Participants can opt to carry out the task at any level of granularity, using whatever approach they consider most suitable. Textual entailment engine developers are encouraged to exploit textual entailment techniques to test the potential contributions of their systems to the student response assessment problem.


For more details please visit the Main task page.

Training data are currently available at the Data Page.


From both the entailment technology perspective and the educational setting perspective, this applied scenario offers an opportunity to explore notions of partial entailment. Therefore, an additional pilot task on partial entailment is offered as part of the challenge, where systems may recognize that specific parts of the hypothesis are entailed by the text, even though entailment is not recognized for the hypothesis as a whole.

Such recognition of partial entailment may be useful in the educational setting, for example for identifying the parts that are missing from a student answer, and may similarly have value in other applications such as summarization or question answering.


For more details please visit the Pilot task page.

Training data are currently available at the Data Page.




  • November 1, 2012: Full training data available to participants
  • February 15, 2013: Registration deadline
  • February 25, 2013: Main task test set release
  • March 6, 2013: Main task submissions
  • March 7, 2013: Pilot task test set release
  • March 15, 2013: Pilot task submissions
  • March 20, 2013: Results to participants
  • April 9, 2013: Paper submission deadline
  • April 23, 2013: Reviews due
  • April 29, 2013: Camera-ready due
  • June 14-15, 2013 [TBC]: SemEval Workshop, associated with the *SEM Conference (June 13-14, 2013, TBC), co-located with NAACL HLT 2013, Atlanta, Georgia, USA




You can register for the Joint Challenge by filling in the online registration form at the SemEval 2013 website.


If you are interested in the task, please also join the Joint Challenge discussion group.




Ido Dagan, Oren Glickman, and Bernardo Magnini. (2006). The PASCAL Recognising Textual Entailment Challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc (Eds.), Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, Springer.


Myroslava O. Dzikovska, Gwendolyn Campbell, Charles Callaway, Natalie Steinhauser, Elaine Farrow, Johanna Moore, Leslie Butler, and Colin Matheson. (2008). Diagnosing natural language answers to support adaptive tutoring. In Proceedings of the 21st FLAIRS Conference, Miami, Florida, May 2008.


Myroslava O. Dzikovska, Diana Bental, Johanna D. Moore, Natalie Steinhauser, Gwendolyn Campbell, Elaine Farrow, and Charles B. Callaway. (2010). Intelligent tutoring with natural language support in the BEETLE II system. In Proceedings of the Fifth European Conference on Technology Enhanced Learning (EC-TEL 2010), October, Barcelona.


Art Graesser, Phanni Penumatsa, Matthew Ventura, Zhiqiang Cai, and Xiangen Hu. (2007). Using LSA in AutoTutor: Learning through mixed initiative dialogue in natural language. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 243-262). Mahwah, NJ: Erlbaum. http://psyc.memphis.edu/graesser/publications/Graesserlsa.doc.


Pamela Jordan, Maxim Makatchev, and Kurt VanLehn. (2004). Combining Competing Language Understanding Approaches in an Intelligent Tutoring System. In Proceedings of the Intelligent Tutoring Systems Conference, Maceió, Brazil, 2004, Springer LNCS, Vol. 3220, pp. 346-357.


Lawrence Hall of Science. (2005). Full Option Science System (FOSS), University of California at Berkeley, Delta Education, Nashua, NH.


Philip M. McCarthy, Vasile Rus, Scott A. Crossley, Arthur C. Graesser, and Danielle S. McNamara. (2008). Assessing Forward-, Reverse-, and Average-Entailer Indices on Natural Language Input from the Intelligent Tutoring System, iSTART. In Proceedings of the FLAIRS Conference 2008, pp. 165-170.


Rodney D. Nielsen, Wayne Ward, James H. Martin, and Martha Palmer. (2008). Annotating students' understanding of science concepts. In Proceedings of the Sixth International Language Resources and Evaluation Conference, (LREC'08), Marrakech, Morocco, May 28-30, 2008. Published by the European Language Resources Association, (ELRA), Paris, France.


Rodney D. Nielsen, Wayne Ward, and James H. Martin. (2008). Learning to assess low-level conceptual understanding. In David Wilson and H. Chad Lane (Eds.), Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference (FLAIRS-08), pp. 427-432, Coconut Grove, Florida, May 15-17, 2008. Published by the Association for the Advancement of Artificial Intelligence (AAAI Press), Menlo Park, California.
