Evaluation Framework for Benchmarking NER Systems

This page provides downloads of the two benchmark datasets.

News dataset (download) - consists of a small number of standard-length news articles freely available from the BBC and New York Times. The dataset is a derivative of the originally published WEKEX dataset. This dataset is released under the Creative Commons BY-SA 3.0 license.
Tweets dataset (download) - contains a large number of very short texts (tweets). The dataset is a derivative of the originally published MSM (Making Sense of Microposts Challenge) dataset. This dataset is released under the Creative Commons BY-NC-SA 3.0 license.

Both datasets were reannotated for evaluating systems that perform Wikipedia-based entity classification and Wikipedia-based entity linking: entities recognized in the original datasets were enriched with a link to Wikipedia and the most specific type from the DBpedia Ontology. The annotations were created by two annotators and a judge.

Size metrics for the Tweets and News datasets
	Number of documents	Total number of entities	Entities with a CoNLL type	Entities with a DBpedia Ontology type	Entities with a Wikipedia URL
News	10	588	580	367	440
Tweets	1044	1523	1523	1379	1354
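
The counts in the table can be turned into coverage ratios, which makes the two datasets easier to compare. The following is a minimal sketch that hard-codes the figures from the table; the dictionary keys and function names are illustrative and not part of the datasets.

```python
# Counts copied from the size-metrics table; key names are illustrative.
SIZE_METRICS = {
    "News": {
        "documents": 10,
        "entities": 588,
        "conll_type": 580,
        "dbpedia_type": 367,
        "wikipedia_url": 440,
    },
    "Tweets": {
        "documents": 1044,
        "entities": 1523,
        "conll_type": 1523,
        "dbpedia_type": 1379,
        "wikipedia_url": 1354,
    },
}

ANNOTATION_KEYS = ("conll_type", "dbpedia_type", "wikipedia_url")


def coverage(counts):
    """Return the share of entities carrying each kind of annotation."""
    total = counts["entities"]
    return {key: counts[key] / total for key in ANNOTATION_KEYS}


def entities_per_document(counts):
    """Return the average number of annotated entities per document."""
    return counts["entities"] / counts["documents"]


if __name__ == "__main__":
    for name, counts in SIZE_METRICS.items():
        shares = ", ".join(
            f"{key}={value:.1%}" for key, value in coverage(counts).items()
        )
        print(f"{name}: {entities_per_document(counts):.2f} entities/doc, {shares}")
```

On these figures the News dataset averages 58.8 entities per article, while the Tweets dataset averages about 1.5 entities per tweet.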

Description of fields

Common fields

Entity: name of the entity as it appears in the text.
Link: Link to English Wikipedia.
Tag: a CoNLL category tag (Person, Location, Organization, Miscellaneous).
Type: a class from DBpedia Ontology 3.8.
Most frequent sense: set if the correct Wikipedia page is simply the first hit of a Wikipedia search for the entity name; otherwise empty.
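
The common fields can be used to summarize a dataset, for example to see how many linked entities a "first search hit" baseline would already resolve. The sketch below assumes the data has been exported as a tab-separated file whose header row uses the field names listed above; the file layout, the value "1" for a set flag, and the sample rows are assumptions made for illustration and are not taken from either dataset.

```python
import csv
import io


def read_entities(handle, delimiter="\t"):
    """Read entity rows from a delimited file with a header row (assumed layout)."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def is_set(row, field):
    """Treat any non-blank value as 'set'; missing or empty means unset."""
    return bool((row.get(field) or "").strip())


def summarize(rows):
    """Count linked and typed entities and the most-frequent-sense share."""
    linked = [row for row in rows if is_set(row, "Link")]
    mfs = [row for row in linked if is_set(row, "Most frequent sense")]
    return {
        "entities": len(rows),
        "linked": len(linked),
        "typed": sum(is_set(row, "Type") for row in rows),
        "mfs_share_of_linked": len(mfs) / len(linked) if linked else 0.0,
    }


# Made-up rows for illustration only; not taken from either dataset.
SAMPLE = (
    "Entity\tLink\tTag\tType\tMost frequent sense\n"
    "Paris\thttp://en.wikipedia.org/wiki/Paris\tLocation\tCity\t1\n"
    "Mercury\thttp://en.wikipedia.org/wiki/Mercury_(planet)\tMiscellaneous\tPlanet\t\n"
    "Acme\t\tOrganization\t\t\n"
)

if __name__ == "__main__":
    print(summarize(read_entities(io.StringIO(SAMPLE))))
```

To run this on a real file, open it with `newline=""` and the correct encoding and pass the handle to `read_entities`; adjust the delimiter if the download uses a different format.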

News dataset specific fields

id_entity: id of the entity from the original WEKEX dataset.
article: URL to the article from which this entity was extracted.
Common entity: 1 if the entity is not a named entity, empty otherwise.
Full name: if this specific entity is a part of a full entity name that appears in the article, this field lists the full entity name.
Partial: set if the recognized string is only a part of the entity name and this part does not appear as a full standalone reference to the entity in the document.
Text: content of the article. Due to copyright restrictions, the content of the articles is not distributed along with the News dataset. The textual content is freely available from the official BBC and New York Times websites and can be retrieved from there and added to the "Text" field.
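
Because the "Text" field ships empty, it has to be filled from the URLs in the "article" field. The following is a rough sketch of one way to do that, assuming the rows are dictionaries keyed by the field names above. The helper names `fetch_html` and `fill_text` are hypothetical, and the default fetcher stores the raw page source: reducing that to the article body is a separate step that is not described here.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download a page and decode it, falling back to UTF-8 if no charset is given."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fill_text(rows, fetch=fetch_html):
    """Fill the "Text" field of each row from its "article" URL.

    Many entities come from the same article, so each URL is fetched once
    and the result is reused for every row that points to it.
    """
    cache = {}
    for row in rows:
        url = row["article"]
        if url not in cache:
            cache[url] = fetch(url)
        row["Text"] = cache[url]
    return rows
```

Passing the fetcher as a parameter keeps the network access replaceable, for example by a function that reads previously saved copies of the articles from disk.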

Tweets dataset specific fields

EntityID: id of the entity from the original MSM dataset.
TweetID: id of the tweet from the original MSM dataset.
Text: Content of the tweet.
Incorrect capitalization: 1 if at least one letter in the entity name has incorrect case, otherwise empty.