Entity Salience Collection

Datasets description


Reuters-128 (salience dataset)

The Reuters-128 salience corpus is an extension of the entity linking corpus Reuters-128, which is part of the N3 datasets collection [1]. Reuters-128 is an English corpus in the NLP Interchange Format (NIF) containing 128 economic news articles. The dataset provides information for 880 named entities, including their position in the document (beginOffset, endOffset) and the URI of a DBpedia resource identifying each entity. The salience dataset additionally extends Reuters-128 with 3,551 common entities.
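
Since the corpus is distributed as NIF/Turtle, it can be inspected with any RDF library. The following is a minimal sketch using rdflib; the predicate names (nif:beginIndex, nif:endIndex, itsrdf:taIdentRef) follow the standard NIF vocabulary and are an assumption, so check the actual predicates (including the one carrying the salience label) in the downloaded file.

```python
# Minimal sketch: listing entity mentions from the Reuters-128 salience TTL with rdflib.
from rdflib import Graph, Namespace

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

g = Graph()
g.parse("reuters-128-entity-salience.ttl", format="turtle")

# Every annotated span, with its character offsets and the linked DBpedia resource (if any).
for mention in g.subjects(NIF.beginIndex, None):
    begin = g.value(mention, NIF.beginIndex)
    end = g.value(mention, NIF.endIndex)
    resource = g.value(mention, ITSRDF.taIdentRef)
    print(mention, begin, end, resource)
```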

Entity salience information was obtained by crowdsourcing using the CrowdFlower platform. For each named and common entity in the Reuters-128 dataset, we collected at least three judgements. Only judgements from contributors with a trust score higher than 70% were considered trusted; if a contributor's trust score fell below 70%, all of their judgements were disregarded. The annotators were asked to carefully read the document and then determine how salient the given entity is for the document, classifying each entity into one of the predefined salience classes.
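
The sketch below illustrates how such judgements could be aggregated per entity. The 70% trust threshold and the minimum of three judgements follow the description above; the record layout, the majority-vote rule, and the example class labels are assumptions made purely for illustration.

```python
# Illustrative aggregation of crowdsourced salience judgements for one entity.
from collections import Counter

TRUST_THRESHOLD = 0.70  # contributors below this trust score are disregarded

def aggregate_salience(judgements):
    """judgements: list of {"contributor_trust": float, "label": str} for a single entity."""
    trusted = [j["label"] for j in judgements if j["contributor_trust"] > TRUST_THRESHOLD]
    if len(trusted) < 3:
        return None  # not enough trusted judgements to decide
    label, _count = Counter(trusted).most_common(1)[0]
    return label

example = [
    {"contributor_trust": 0.92, "label": "salient"},
    {"contributor_trust": 0.81, "label": "salient"},
    {"contributor_trust": 0.65, "label": "not salient"},  # below threshold, discarded
    {"contributor_trust": 0.75, "label": "not salient"},
]
print(aggregate_salience(example))  # -> "salient" (majority of the three trusted judgements)
```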

Download location: https://github.com/KIZI/ner-eval-collection/blob/master/reuters-128-entity-salience.ttl

[1] Michael Röder, Ricardo Usbeck, Sebastian Hellmann, Daniel Gerber, and Andreas Both. N3 - A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format. In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC), 26-31 May 2014, Reykjavik, Iceland. (link)

New York Times (salience dataset)

The New York Times dataset is an entity salience dataset provided by Google [2]. The salience annotations in the NYT dataset were generated automatically by aligning the entities in the abstract with those in the document and treating every entity that occurs in the abstract as salient. The dataset consists of two partitions: a training partition containing about 90% of the data and a testing partition containing the remaining 10%. The NYT dataset provides only the begin and end index of each entity, the entity name, the document ID, and the salience information; the annotations are shared without the underlying documents' content. We converted the dataset into the NLP Interchange Format (NIF).
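
The following is a rough sketch of the abstract-alignment heuristic described above: an entity is marked salient if it also appears in the document's abstract. Matching by exact surface string is a simplification; the actual alignment used in [2] is more involved.

```python
# Toy version of the "entity occurs in the abstract => salient" labelling heuristic.
def label_salience(doc_entities, abstract_text):
    """doc_entities: iterable of entity name strings found in the article body."""
    abstract_lower = abstract_text.lower()
    return {name: (name.lower() in abstract_lower) for name in doc_entities}

abstract = "Shares of Acme Corp fell sharply after the earnings report."
entities = ["Acme Corp", "New York Stock Exchange"]
print(label_salience(entities, abstract))
# {'Acme Corp': True, 'New York Stock Exchange': False}
```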

Download location:

[2] Dan Gillick and Jesse Dunietz. A New Entity Salience Task with Millions of Training Examples. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014. (link)