Task Description

Abstract:
 

In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous.  While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share opinions and sentiments that people have about what is going on in the world around them.  We propose this task and the development of a Twitter sentiment corpus to promote research that will lead to a better understanding of how sentiment is conveyed in tweets and texts.  There will be two sub-tasks: an expression-level task and a message-level task.  Participants may choose to participate in either or both tasks.
 
Task A: Contextual Polarity Disambiguation
Given a message containing a marked instance of a word or phrase from a sentiment lexicon, determine whether that instance is positive, negative or neutral in that context.
 
Task B: Message Polarity Classification
Given a message and its topic, classify whether the message expresses positive, negative, or neutral sentiment towards the topic. For messages conveying both positive and negative sentiment towards the topic, the stronger of the two sentiments should be chosen as the classification.


I. Introduction and Motivation

In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous.  While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share opinions and sentiments that people have about what is going on in the world around them.

Working with these informal text genres presents challenges for natural language processing beyond those typically encountered when working with more traditional text genres, such as newswire data.  Tweets and texts are short: a sentence or a headline rather than a document.  The language used is very informal, with creative spelling and punctuation, misspellings, slang, new words, URLs, and genre-specific terminology and abbreviations, such as RT for “re-tweet” and #hashtags, which are a type of tagging for Twitter messages.  How to handle such challenges so as to automatically mine and understand the opinions and sentiments that people are communicating has only very recently become a subject of research (Jansen et al., 2009; Barbosa and Feng, 2010; Bifet and Frank, 2010; Davidov et al., 2010; O’Connor et al., 2010; Pak and Paroubek, 2010; Tumasjan et al., 2010; Kouloumpis et al., 2011).

Another aspect of social media data such as Twitter messages is that it includes rich structured information about the individuals involved in the communication. For example, Twitter maintains information about who follows whom, and re-tweets and tags inside tweets provide discourse information. Modeling such structured information is important because (i) it can lead to more accurate tools for extracting semantic information, and (ii) it provides a means for empirically studying properties of social interactions (e.g., we can study properties of persuasive language or what properties are associated with influential users).

To promote research that will lead to a better understanding of how sentiment is conveyed in tweets and texts, a freely available, annotated corpus that can be used as a common testbed is needed.  Our primary goal will be to create such a resource: a corpus of tweets and texts with sentiment expressions marked with their contextual polarity and message-level polarity annotated with respect to topic.  The few corpora with detailed opinion and sentiment annotation that have been made freely available, e.g., the MPQA Corpus (Wiebe et al., 2005) of newswire data, have proved to be valuable resources for learning about the language of sentiment.  While a few Twitter sentiment datasets have been created, they are either small and proprietary, such as the i-sieve corpus (Kouloumpis et al., 2011), or they rely on noisy labels obtained from emoticons or hashtags.  Furthermore, no Twitter or text-message corpus with expression-level sentiment annotations has yet been made available.

II. Task Description

There will be two sub-tasks: an expression-level task and a message-level task.  Participants may choose to participate in either or both tasks.

Task A: Contextual Polarity Disambiguation
Given a message containing a marked instance of a word or phrase from a sentiment lexicon, determine whether that instance is positive, negative or neutral in that context.  

Task B: Message Polarity Classification
Given a message and its topic, classify whether the message expresses positive, negative, or neutral sentiment towards the topic. For messages conveying both positive and negative sentiment towards the topic, the stronger of the two sentiments should be chosen as the classification.


III. Data

Collectively, we have access to several large datasets of tweets and a corpus of SMS messages (http://wing.comp.nus.edu.sg:8080/SMSCorpus/).  From this data, we propose to create a corpus of 12-20K messages on a range of topics.  Topics will include a mixture of entities (e.g., Gadafi, Steve Jobs), products (e.g., kindle, android phone), and events (e.g., Japan earthquake, NHL playoffs).  Keywords and Twitter hashtags will be used to identify messages relevant to each selected topic.

Using Amazon Mechanical Turk, we will annotate the corpus for sentiment expressions and message-level polarity (positive, negative, neutral).  Assuming a cost of $0.05 per HIT (Human Intelligence Task), 3-5 annotations per HIT, and 5 annotators per HIT, we estimate that these annotations will cost between $1000 and $1500.  This nominal annotation cost will be covered using existing research funds of the organizers.
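
The estimate can be sanity-checked with a short calculation. The sketch below assumes one possible reading of the figures, which the proposal does not spell out: each HIT bundles several messages, every HIT is completed by 5 annotators, and each completed assignment is paid $0.05. The function name and its parameters are illustrative only.

```python
import math


def estimate_annotation_cost(num_messages, messages_per_hit,
                             annotators_per_hit=5, cents_per_assignment=5):
    """Estimated total cost in dollars under the assumptions stated above."""
    # Number of HITs needed to cover every message once.
    num_hits = math.ceil(num_messages / messages_per_hit)
    # Assumption: each HIT is completed, and paid for, once per annotator.
    # Working in whole cents avoids floating-point rounding on $0.05.
    total_cents = num_hits * annotators_per_hit * cents_per_assignment
    return total_cents / 100


low = estimate_annotation_cost(12_000, messages_per_hit=3)
high = estimate_annotation_cost(20_000, messages_per_hit=4)
print(f"${low:.2f} to ${high:.2f}")
```

Under this reading, 12,000 messages at 3 per HIT come to $1000 and 20,000 messages at 4 per HIT come to $1250, both within the stated range. Other combinations of the quoted figures fall outside it, so the calculation should be treated as a rough check rather than a reconstruction of the organizers' budget.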

The message corpus will then be divided as follows:
trial data: 1000 Twitter messages
training data: 8000-12,000 Twitter messages
test data #1: 2000-4000 Twitter messages
test data #2: 2000-4000 SMS messages

Participants will be notified that there will be two test datasets: one composed of Twitter messages, and another composed of SMS messages, for which no explicit training data will be provided.  The purpose of having a separate test set of SMS messages is to see how well systems trained on Twitter data generalize to other types of message data.

IV. Evaluation Method

Each participating team will initially have access to the training data only.  Later, the unlabelled test data will be released.  After SemEval-3, the labels for the test data will be released. 

Participants’ systems will be evaluated using average F-measure.  F-measure for each class (Positive, Negative, Neutral) will also be reported, which can be illuminating when comparing performance between systems.  We will ask the participants to submit their predictions, and the organizers will calculate the results for each participant. For each sub-task, systems will be ranked based on their average F-measure.  Separate rankings for each test dataset will be produced.
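
As an illustration of the evaluation described above, the following is a minimal sketch of a scorer, not the official one. It assumes that "average F-measure" means the unweighted mean of the per-class F1 scores over the three classes, and that gold labels and predictions are given as parallel lists of strings. The function name and label strings are hypothetical.

```python
from collections import Counter

CLASSES = ("positive", "negative", "neutral")


def f_measures(gold, predicted, classes=CLASSES):
    """Return ({class: F1}, unweighted mean of the per-class F1 scores)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    per_class = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        per_class[c] = 2 * precision * recall / denom if denom else 0.0
    # Assumption: "average" is the plain mean over all listed classes.
    average = sum(per_class.values()) / len(classes)
    return per_class, average


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]
per_class, average = f_measures(gold, pred)
print(per_class, average)
```

Averaging per-class scores without weighting gives each class equal influence on the ranking regardless of how often it occurs in the test data. Running the function once per test dataset would yield the separate rankings mentioned above.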

V. Schedule (tentative)

June 15, 2011 - Sample topics and messages collected; Mechanical Turk HIT design finalized
July 15, 2011 - Topics selected and messages for corpus collected
August 30, 2011 - Trial data (~1000 messages) and scorer released
April 10, 2012 - Training data released
January 9, 2013 - Test data released
January 16, 2013 - Results submission deadline
February 1, 2013 - Organizers send test results
March 1, 2013 - Paper submission deadline


References

Barbosa, L. and Feng, J. 2010. Robust sentiment detection on twitter from biased and noisy data.  Proceedings of Coling.

Bifet, A. and Frank, E. 2010. Sentiment knowledge discovery in twitter streaming data.  Proceedings of 14th International Conference on Discovery Science.

Davidov, D., Tsur, O., and Rappoport, A. 2010.  Enhanced sentiment learning using twitter hashtags and smileys.  Proceedings of Coling.

Jansen, B.J., Zhang, M., Sobel, K., and Chowdury, A. 2009.  Twitter power: Tweets as electronic word of mouth.  Journal of the American Society for Information Science and Technology 60(11):2169-2188.

Kouloumpis, E., Wilson, T., and Moore, J. 2011. Twitter Sentiment Analysis: The Good the Bad and the OMG! Proceedings of ICWSM.

O’Connor, B., Balasubramanyan, R., Routledge, B., and Smith, N. 2010.  From tweets to polls: Linking text sentiment to public opinion time series.  Proceedings of ICWSM.

Pak, A. and Paroubek, P. 2010.  Twitter as a corpus for sentiment analysis and opinion mining.  Proceedings of LREC.

Tumasjan, A., Sprenger, T.O., Sandner, P., and Welpe, I. 2010.  Predicting elections with twitter: What 140 characters reveal about political sentiment.  Proceedings of ICWSM.

Wiebe, J., Wilson, T., and Cardie, C. 2005.  Annotating expressions of opinions and emotions in language.  Language Resources and Evaluation 39(2-3):165-210.

VIII. Contact Person

Theresa Wilson, Research Scientist
Human Language Technology Center of Excellence
Johns Hopkins University
810 Stieff Building
Baltimore, MD 21211

email: taw@jhu.edu
phone: 410-516-8244