Data

 

Downloads

Trial data (released on 31 July 2012) are available here.

Test data, official keys and baselines, and the keys of all participating submissions are available here (updated in May 2014 with a minor fix).

 

Datasets and formats

Trial and test set format

The trial and test datasets will adhere to the following format:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE corpus SYSTEM "coarse-all-words.dtd">
<corpus lang="en">
<text id="d001" source="wsj_2465"> 
<sentence id="d001.s001">
.
.
</sentence>
.
.
<sentence id="d001.s010">
  <wf lemma="they" pos="PP">They</wf>
  <wf lemma="bomb" pos="VVD">bombed</wf>
  <wf lemma="the" pos="DT">the</wf>
  <instance id="d001.s010.t001" lemma="Bogota" pos="NP">Bogota</instance>
  <instance id="d001.s010.t002" lemma="office" pos="NNS">offices</instance>
  <wf lemma="last" pos="JJ">last</wf>
  <instance id="d001.s010.t003" lemma="month" pos="NN">month</instance>
  <wf lemma="," pos=",">,</wf>
  <wf lemma="destroy" pos="VVG">destroying</wf>
  <wf lemma="its" pos="PP$">its</wf>
  <instance id="d001.s010.t004" lemma="computer" pos="NN">computer</instance>
  <wf lemma="and" pos="CC">and</wf>
  <wf lemma="cause" pos="VVG">causing</wf>
  <wf lemma="$" pos="$">$</wf>
  <wf lemma="@card@" pos="CD">2.5</wf>
  <instance id="d001.s010.t005" lemma="million" pos="CD">million</instance>
  <wf lemma="in" pos="IN">in</wf>
  <instance id="d001.s010.t006" lemma="damage" pos="NN">damage</instance>
  <wf lemma="." pos="SENT">.</wf>
</sentence>
.
.
</text>
<text id="d002" source="wsj_2466"> 
.
.
</text>
</corpus>


where each <text> tag specifies a text source whose identifier is provided by the id attribute. <sentence> tags represent single sentences within each text (again identified by an id attribute). Each <sentence> tag contains zero or more target words, each tagged with an <instance> element. Each instance specifies its unique identifier (id), lemma (lemma) and part-of-speech tag (pos). Instances are assumed to have an appropriate sense in the adopted sense inventory, i.e. any of BabelNet, WordNet or Wikipedia (see below). Words with no corresponding sense in the inventory are instead enclosed within <wf> tags, which also specify a lemma (lemma) and part-of-speech tag (pos).
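As an illustration of how this format can be read, the following minimal Python sketch extracts the target instances with the standard library's xml.etree.ElementTree. The SAMPLE constant is an abridged copy of the English snippet above, and the function name read_instances is our own choice, not part of the task distribution. ElementTree does not validate against the DTD named in the DOCTYPE line.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the English sample above. It is passed as bytes so that the
# parser applies the encoding named in the XML declaration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE corpus SYSTEM "coarse-all-words.dtd">
<corpus lang="en">
<text id="d001" source="wsj_2465">
<sentence id="d001.s010">
  <wf lemma="they" pos="PP">They</wf>
  <wf lemma="bomb" pos="VVD">bombed</wf>
  <wf lemma="the" pos="DT">the</wf>
  <instance id="d001.s010.t001" lemma="Bogota" pos="NP">Bogota</instance>
  <instance id="d001.s010.t002" lemma="office" pos="NNS">offices</instance>
  <wf lemma="." pos="SENT">.</wf>
</sentence>
</text>
</corpus>
"""


def read_instances(xml_bytes):
    """Return (language code, list of instance records) for one corpus file."""
    corpus = ET.fromstring(xml_bytes)
    instances = []
    for text in corpus.iter("text"):
        for sentence in text.iter("sentence"):
            # <wf> elements are skipped: only <instance> elements are targets.
            for inst in sentence.iter("instance"):
                instances.append({
                    "doc_id": text.get("id"),
                    "sentence_id": sentence.get("id"),
                    "instance_id": inst.get("id"),
                    "lemma": inst.get("lemma"),
                    "pos": inst.get("pos"),
                    "surface": inst.text,
                })
    return corpus.get("lang"), instances


lang, instances = read_instances(SAMPLE)
for item in instances:
    print(lang, item["instance_id"], item["lemma"], item["pos"])
```

To read a downloaded file instead, the same function can be called on the bytes returned by open(path, "rb").read().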

Note on pre-processing and non-English files

Lemmas and part-of-speech information are obtained automatically by pre-processing the data with TreeTagger, a state-of-the-art PoS tagger available for many languages. The data format described above is the same for all languages. For instance, the French dataset will contain the following data (corresponding to the English snippet above):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE corpus SYSTEM "coarse-all-words.dtd">
<corpus lang="fr">
<text id="d001" source="wsj_2465">
<sentence id="d001.s001">
.
.
</sentence>
.
.
<sentence id="d001.s010">
  <wf lemma="il" pos="PRO:PER">Ils</wf>
  <wf lemma="avoir" pos="VER:pres">ont</wf>
  <wf lemma="bombarder" pos="VER:pper">bombardé</wf>
  <wf lemma="le" pos="DET:ART">les</wf>
  <instance id="d001.s010.t001" lemma="bureau" pos="NOM">bureaux</instance>
  <wf lemma="de" pos="PRP">de</wf>
  <instance id="d001.s010.t002" lemma="Bogota" pos="NAM">Bogota</instance>
  <wf lemma="le" pos="DET:ART">le</wf>
  <instance id="d001.s010.t003" lemma="mois" pos="NOM">mois</instance>
  <wf lemma="dernier" pos="ADJ">dernier</wf>
  <wf lemma="," pos="PUN">,</wf>
  <wf lemma="détruire" pos="VER:ppre">détruisant</wf>
  <wf lemma="le" pos="DET:ART">les</wf>
  <instance id="d001.s010.t004" lemma="ordinateur" pos="NAM">ordinateurs</instance>
  <wf lemma="et" pos="KON">et</wf>
  <wf lemma="causer" pos="VER:ppre">causant</wf>
  <wf lemma="@card@" pos="NUM">2,5</wf>
  <instance id="d001.s010.t005" lemma="million" pos="NOM">millions</instance>
  <wf lemma="de" pos="PRP">de</wf>
  <instance id="d001.s010.t006" lemma="dollar" pos="NOM">dollars</instance>
  <wf lemma="en" pos="PRP">en</wf>
  <instance id="d001.s010.t007" lemma="dommage" pos="NOM">dommages</instance>
  <wf lemma="." pos="SENT">.</wf>
 </sentence>
.
.
</text>
<text id="d002" source="wsj_2466"> 
.
.
</text>
</corpus>

 

Multilingual sense inventory

We will annotate the data using BabelNet [1] as the sense inventory. Since BabelNet is obtained by merging WordNet with Wikipedia, we provide a comprehensive sense inventory for all three resources in a separate file. Each line of the file contains (1) a word; (2) its language; (3) the number of its senses in BabelNet and their listing (as Babel synset offsets); (4) the number of its senses in WordNet and their listing (as WordNet sense keys); (5) the number of its senses in Wikipedia and their listing (as Wikipedia page titles). A sample list of entries from the sense inventory follows (".." marks senses omitted here for brevity):

president#n EN 16 bn:01432541n .. 6 president%1:18:00:: .. 19 President_(U.S.) ..
journal#n FR 11 bn:00051830n .. 0 2 Journal_(système_de_fichiers) ..

where EN and FR denote the language codes for English and French, respectively.
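A line of this file can be split into its five parts by reading each count and then consuming that many labels. The Python sketch below does so under two assumptions that the description above does not state: fields are separated by whitespace, and no sense label contains whitespace (Wikipedia titles use underscores in the sample). The function name parse_inventory_line is ours, and the example line is the French entry above with its counts reduced to the labels actually shown, so it is illustrative rather than a real inventory entry.

```python
def parse_inventory_line(line):
    """Split one sense-inventory line into word, language and three sense lists.

    Assumes whitespace-separated fields (not confirmed by the task description).
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected at least a word and a language code")
    entry = {"word": tokens[0], "lang": tokens[1]}
    position = 2
    for resource in ("babelnet", "wordnet", "wikipedia"):
        # Each block is a count followed by exactly that many sense labels.
        count = int(tokens[position])
        labels = tokens[position + 1 : position + 1 + count]
        if len(labels) != count:
            raise ValueError(f"{resource}: expected {count} labels, got {len(labels)}")
        entry[resource] = labels
        position += 1 + count
    if position != len(tokens):
        raise ValueError("unexpected trailing fields")
    return entry


# Illustrative line: the French sample entry with counts cut down to match
# the labels that are visible in the excerpt above.
example = "journal#n FR 1 bn:00051830n 0 1 Journal_(système_de_fichiers)"
entry = parse_inventory_line(example)
print(entry)
```

The sample entries above cannot be parsed as printed, because the ".." placeholders stand for omitted labels and the counts therefore do not match.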

 

BabelNet

The evaluation will use version 1.1.1 of BabelNet, available at http://babelnet.org/. All BabelNet-based solutions should use the synset keys provided in version 1.1.1.

 

Answer file format

The answer format follows that of the previous Senseval evaluation exercises:

doc_id instance_id sense_label !! lemma=lemma_string#n

where doc_id and instance_id are the document and instance ids associated with each target word, sense_label is the stringified sense representation associated with it, and !! indicates a comment (e.g., used to optionally provide the target word's lemma). We allow as sense labels any of the WordNet sense keys, Wikipedia page titles or Babel synset offsets found within the sense inventory (see above). Systems will be evaluated separately depending on the type of sense label they output. A sample answer for our example French sentence using Babel synset offsets as sense labels follows:

d001 d001.s010.t001 bn:00014169n !! lemma=bureau#n
d001 d001.s010.t002 bn:00011812n !! lemma=Bogota#n
d001 d001.s010.t003 bn:00014710n !! lemma=mois#n
d001 d001.s010.t004 bn:00021464n !! lemma=ordinateur#n
d001 d001.s010.t005 bn:00000013n !! lemma=million#n
d001 d001.s010.t006 bn:00028114n !! lemma=dollar#n
d001 d001.s010.t007 bn:00003333n !! lemma=dommage#n
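Answer lines in this format can be produced and read back with a few lines of Python. The sketch below is only illustrative: the function names are ours, it handles a single sense label per line, and it assumes that the document id is the part of the instance id before the first dot, as in the sample (d001 and d001.s010.t001).

```python
def format_answer(instance_id, sense_label, lemma=None, pos_tag="n"):
    """Build one answer line; the '!! lemma=...' comment is optional."""
    # Assumption: the document id is the instance id's prefix up to the first dot.
    doc_id = instance_id.split(".", 1)[0]
    line = f"{doc_id} {instance_id} {sense_label}"
    if lemma is not None:
        line += f" !! lemma={lemma}#{pos_tag}"
    return line


def parse_answer(line):
    """Return (doc_id, instance_id, sense_label, comment) from one answer line."""
    body, _, comment = line.partition("!!")
    doc_id, instance_id, sense_label = body.split()
    return doc_id, instance_id, sense_label, comment.strip()


answer_line = format_answer("d001.s010.t001", "bn:00014169n", lemma="bureau")
print(answer_line)
print(parse_answer(answer_line))
```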

 

Evaluation

Evaluation will be performed in terms of standard precision, recall and F1 scores using the official Senseval scorer.
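For orientation, the three measures can be sketched as follows, taking precision as the share of attempted instances that are correct and recall as the share of all gold instances that are correct. This is not the official Senseval scorer and may differ from it in details such as the handling of multiple labels. The function name score and the toy data, which reuse labels from the sample answer above and are not official keys, are ours.

```python
def score(gold, system):
    """Compute precision, recall and F1.

    gold:   dict mapping instance id -> set of acceptable sense labels
    system: dict mapping instance id -> the single label the system output
    """
    # Only answers for instances that exist in the gold key are counted.
    attempted = [inst for inst in system if inst in gold]
    correct = sum(1 for inst in attempted if system[inst] in gold[inst])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data for illustration only (not official keys).
gold = {
    "d001.s010.t001": {"bn:00014169n"},
    "d001.s010.t002": {"bn:00011812n"},
    "d001.s010.t003": {"bn:00014710n"},
}
system = {
    "d001.s010.t001": "bn:00014169n",  # matches the gold label
    "d001.s010.t002": "bn:00014710n",  # does not match the gold label
}
precision, recall, f1 = score(gold, system)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

A system that leaves instances unanswered, as in the toy data, gets a recall below its precision.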

 

References

  1. Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence 193, 217-250.

Contact Info

Organizers

Roberto Navigli
 Sapienza University of Rome, Italy
David A. Jurgens
 Sapienza University of Rome, Italy

Other Info

Announcements

  • March 3, 2013: BabelNet 1.1.1 Released for Task Sense Inventory.
  • July 2012: Trial data are out! You can find them here.
  • July 2013: Test data are now publicly available. Please download the data here.