• System runs available in the Data page
  • Task was selected to be the Shared task of *SEM 2013
  • Check related DARPA workshop 
  • Proceedings available online 
  • Semeval program posted (2012 May 16)
  • Updated results now available (2012 April 10)
  • (Superseded) Results now available (2012 April 4)
  • Gold standard now available (2012 April 4)
  • Instructions for submission of system description papers with new deadline - 16th April (2012 Mar. 27)
  • Instructions for registration made available (2012 Mar. 10)
  • Instructions for participation made available (2012 Feb. 7)
  • Train data updated, with examples from SMT evaluation  (2012 Feb. 7)
  • New dates! Following the changes in SemEval 2012 (2012 Feb. 7)
  • Train data updated, with new code for official scorer - bug fixed (2012 Jan. 20)
  • Train data now available!! (2012 Jan. 13)
    (We plan to release Machine Translation data in a few days, please stay tuned)
  • Release of training data delayed to end of Dec. (2011 Dec. 8)
  • Trial data now available (2011 Oct. 20)



We solicit participation in the first Semantic Textual Similarity (STS) shared task. Participants will submit systems that examine the degree of semantic equivalence between two sentences. The goal of the STS task is to create a unified framework for the evaluation of semantic textual similarity modules and to characterize their impact on NLP applications. We particularly encourage submissions from the lexical semantics, summarization, machine translation evaluation, and textual entailment communities.



Semantic Textual Similarity (STS) measures the degree of semantic equivalence between two texts. We are proposing this STS task as an initial attempt at creating a unified framework that allows for an extrinsic evaluation of multiple semantic components, which have historically tended to be evaluated independently and without characterization of their impact on NLP applications. As a pilot task for SemEval 2012, we will focus on refining the task definition as well as producing experimental results on how well existing approaches to semantic equivalence perform. In parallel, we will gather feedback from the community about establishing a shared software framework for building STS annotation systems. The shared STS framework will allow researchers across the globe to more easily replicate and improve upon innovations developed at other sites.

STS is related to both Textual Entailment (TE) and Paraphrase, but differs in a number of ways and is more directly applicable to a number of NLP tasks. STS differs from TE in that it assumes bidirectional graded equivalence between the pair of textual snippets. In TE the equivalence is directional: e.g. a car is a vehicle, but a vehicle is not necessarily a car. STS also differs from both TE and Paraphrase in that, rather than being a binary yes/no decision (e.g. a vehicle is not a car), STS is a graded similarity notion (e.g. a vehicle and a car are more similar than a wave and a car). This graded, bidirectional nature of STS is useful for NLP tasks such as MT evaluation, information extraction, question answering, and summarization.

Current textual similarity systems are limited in the scope of similarity they can address, mostly lexical and syntactic similarity. Other linguistic phenomena have only rarely been addressed, and then in isolated efforts: e.g. metaphorical or idiomatic language [John spilled his guts to Mary vs. John told Mary all about his stories/life], scoping and underspecification [Every representative of the company saw every sample], sentences with very divergent structure [The annihilation of Rome in 2000 BC was incurred by an insurgency of the slaves vs. The slaves' revolution 2 millennia before Christ destroyed the capital of the Roman Empire], and various modality phenomena such as committed belief, permission, or negation. The STS task aims to bring together these, to date, fragmented efforts.

We envision that several semantic tasks already in SemEval could be used as modules in the STS framework, such as word sense disambiguation and induction, lexical substitution, semantic role labeling, multiword expression detection and handling, anaphora and coreference resolution, time and date resolution, and named-entity handling, inter alia. In addition, we would like to add components such as modules that address underspecification, hedging, semantic scoping, and discourse analysis. In the SemEval 2012 task we will focus on defining the task, and in 2013 we plan to define the Unified Framework for the Evaluation of Modular Semantic Components. We plan to provide both vanilla components and a vanilla combination pipeline, which will allow for a low learning curve, making participation easy for the community at large. We will also encourage participants to share their resources and components on an online STS resource-sharing website.

The STS systems will be tuned and evaluated by reusing pairs of sentences drawn from paraphrasing datasets and machine translation evaluation datasets.

By design, we are currently including only English sentence pairs and not addressing this problem within a multilingual framework, though STS is naturally extensible to a multilingual setting.


Description of 2012 pilot task

Given two sentences, s1 and s2, participants will return a score quantifying how similar s1 and s2 are. Participants will also provide a confidence score indicating their confidence level in the result returned for each pair. Participants will be asked to explicitly characterize why a pair is considered similar, i.e. which semantic component(s) contributed to the similarity score.
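To make the expected system behavior concrete, here is a minimal sketch of a hypothetical baseline that maps a sentence pair to a similarity score in [0, 5] plus a per-pair confidence. The token-overlap measure and the 0-100 confidence scale are illustrative assumptions, not the official scoring scheme:

```python
# Hypothetical baseline STS system: Jaccard token overlap scaled to [0, 5].
# This is only an illustration of the required output (score, confidence);
# real systems would use richer semantic components.

def jaccard_similarity(s1, s2):
    """Jaccard overlap of lowercased token sets, in [0, 1]."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)

def score_pair(s1, s2):
    """Return (similarity in [0, 5], confidence in [0, 100])."""
    sim = 5.0 * jaccard_similarity(s1, s2)
    confidence = 100.0  # a real system would estimate this per pair
    return sim, confidence

score, conf = score_pair("A man is playing a guitar.",
                         "A man is playing a guitar.")
```

A real submission would of course replace the overlap measure with the system's own semantic similarity components; the point here is only the output format.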

The output of participant systems will be compared to the manual scores, which range from 5 (semantic equivalence) to 0 (no relation). 
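The comparison against manual scores can be illustrated with Pearson correlation, a standard way to compare graded system scores against graded gold scores (the official scorer shipped with the training data is authoritative; the gold and system values below are made up for illustration):

```python
# Sketch: comparing system similarity scores against gold-standard scores
# with Pearson correlation. Illustrative only; use the official scorer
# distributed with the training data for actual evaluation.
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [5.0, 3.2, 0.0, 4.1]    # manual scores: 0 (no relation) to 5 (equivalence)
system = [4.8, 2.9, 0.5, 3.7]  # hypothetical system output
r = pearson(system, gold)
```

Higher correlation with the manual scores indicates better agreement with human judgments of similarity.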

Please see the detailed task description (also included in the training data) for further details.


Instructions for participation

At the start of the evaluation period (18 March) the test pairs will be made available. Participants will need to register in the submission system (details will be announced when ready). Once the test pairs are downloaded, participants will have 5 days (120 hours) to upload the results of their systems. In other words, participants choose when to download the test pairs, but they have a 5-day window to upload the results. The system will cease to accept results on 1 April (23:59 UTC-11).

The test dataset will contain pairs of sentences from the same collections as the training data (ca. 750 pairs from each), plus one or two surprise sentence-pair datasets drawn from other sources (ca. 750 pairs from each).

The system will accept three different runs per participant.


Instructions for registration

The registration for SemEval-2012 is now open. Please visit the SemEval-2012 website.

The registration mechanism allows all task participants to receive the test data and scripts and to upload their results. Registration is open until 13 March. You only need to register once for multiple tasks. By 14 March the organizers will email all participants with task-specific passwords for downloading the test data. The organizers will also send you individual passwords for uploading your results to an FTP server.


Important dates

Trial dataset ready: 20 October
Call for participation : 25 October
Full training dataset + test scripts ready - 31 December
Start of Evaluation period - 18 March
End of Evaluation period - 1 April (23:59 UTC-11)
Paper due - 16 April
Reviews due - 23 April
Camera Ready - 4 May



Mailing list: sts-semeval googlegroups com

If you are interested in the task, please join the mailing list for updates.



Eneko Agirre, University of the Basque Country, Basque Country
Dan Cer, Stanford University, USA
Mona Diab, Columbia University, USA
Bill Dolan, Microsoft Research, USA