This page provides downloads of the two benchmark datasets.
- News dataset (download) - consists of a small number of standard-length news articles freely available from the BBC and New York Times. The dataset is a derivative of the originally published WEKEX dataset. This dataset is released under the Creative Commons BY-SA 3.0 license.
- Tweets dataset (download) - contains a large number of very short texts (tweets). The dataset is a derivative of the originally published MSM (Making Sense of Microposts Challenge) dataset. This dataset is released under the Creative Commons BY-NC-SA 3.0 license.
Both datasets were reannotated to support the evaluation of systems that perform Wikipedia-based entity classification and Wikipedia-based entity linking: entities recognized in the original datasets were enriched with a link to Wikipedia and the most specific type from the DBpedia Ontology. The annotations were created by two annotators and a judge.
Dataset | Number of documents | Total number of entities | Entities with a CoNLL type | Entities with a DBpedia Ontology type | Entities with a Wikipedia URL |
---|---|---|---|---|---|
News | 10 | 588 | 580 | 367 | 440 |
Tweets | 1044 | 1523 | 1523 | 1379 | 1354 |
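The counts in the table can be turned into annotation coverage ratios. The following is a minimal sketch that only uses the numbers shown above; the dictionary keys and function names are illustrative choices, not part of the datasets.

```python
# Counts copied from the statistics table; the key names are illustrative.
STATS = {
    "News": {"documents": 10, "entities": 588, "conll": 580,
             "dbpedia": 367, "wikipedia": 440},
    "Tweets": {"documents": 1044, "entities": 1523, "conll": 1523,
               "dbpedia": 1379, "wikipedia": 1354},
}


def coverage(dataset: str, key: str) -> float:
    """Fraction of the dataset's entities that carry the given annotation."""
    row = STATS[dataset]
    return row[key] / row["entities"]


def entities_per_document(dataset: str) -> float:
    """Average number of annotated entities per document."""
    row = STATS[dataset]
    return row["entities"] / row["documents"]


for name in STATS:
    print(
        f"{name}: Wikipedia URL coverage {coverage(name, 'wikipedia'):.1%}, "
        f"DBpedia type coverage {coverage(name, 'dbpedia'):.1%}, "
        f"{entities_per_document(name):.1f} entities per document"
    )
```

The News documents are long and dense in entities, while a tweet contains between one and two entities on average.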
Description of fields
Common fields
- Entity: name of the entity as it appears in the text.
- Link: Link to English Wikipedia.
- Tag: a CoNLL category tag (Person, Location, Organization, Miscellaneous).
- Type: a class from DBpedia Ontology 3.8.
- Most frequent sense: set if the correct Wikipedia page is the first hit of a Wikipedia search for the entity name, otherwise empty.
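The common fields map naturally onto a small record type. The sketch below assumes the data is available as CSV with a header row whose column names match the field names above; the actual file format and header spelling of the downloads may differ, and the `Entity` class, the `parse_entities` helper, and the sample rows are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Entity:
    entity: str                # surface form as it appears in the text
    link: str                  # English Wikipedia link, may be empty
    tag: str                   # CoNLL category
    type: str                  # DBpedia Ontology class, may be empty
    most_frequent_sense: bool  # True when the field is non-empty


def parse_entities(handle):
    """Read the common fields from a CSV stream (assumed column names)."""
    reader = csv.DictReader(handle)
    return [
        Entity(
            entity=row["Entity"],
            link=row["Link"],
            tag=row["Tag"],
            type=row["Type"],
            most_frequent_sense=bool(row["Most frequent sense"].strip()),
        )
        for row in reader
    ]


# Made-up rows for illustration only; they are not taken from either dataset.
SAMPLE = """Entity,Link,Tag,Type,Most frequent sense
Example Person,http://en.wikipedia.org/wiki/Example_Person,Person,Person,1
Example Place,,Location,,
"""

entities = parse_entities(io.StringIO(SAMPLE))
linked = [e for e in entities if e.link]
print(f"{len(linked)} of {len(entities)} sample entities have a Wikipedia link")
```

Treating empty strings as "no annotation" mirrors the field descriptions, where a missing link or type is left empty.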
News dataset specific fields
- id_entity: id of the entity from the original WEKEX dataset.
- article: URL to the article from which this entity was extracted.
- Common entity: 1 if the entity is not a named entity, empty otherwise.
- Full name: if this specific entity is part of a full entity name that appears in the article, this field lists the full entity name.
- Partial: set if the recognized string is only part of the entity name and this part does not appear as a full standalone reference to the entity in the document.
- Text: content of the article. Due to copyright restrictions, the content of the articles is not distributed with the News dataset. The textual content is freely available from the official BBC and New York Times websites and can be retrieved and added to the "Text" field.
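Since the "Text" field ships empty, it has to be filled in after the article content has been retrieved separately. The sketch below assumes each record is a dictionary keyed by the field names above and that the article texts have already been collected into a mapping from article URL to text; the `attach_text` helper, the placeholder URLs, and the sample records are hypothetical.

```python
def attach_text(rows, text_by_url):
    """Return copies of the rows with "Text" filled in from text_by_url.

    Rows whose article URL is missing from the mapping keep their
    existing "Text" value (or an empty string if they have none).
    """
    filled = []
    for row in rows:
        updated = dict(row)  # copy, so the input rows stay unchanged
        updated["Text"] = text_by_url.get(row["article"], row.get("Text", ""))
        filled.append(updated)
    return filled


# Made-up records and placeholder URLs, for illustration only.
rows = [
    {"id_entity": "1", "Entity": "Example Person",
     "article": "https://example.org/article-1", "Text": ""},
    {"id_entity": "2", "Entity": "Example Place",
     "article": "https://example.org/article-2", "Text": ""},
]
texts = {"https://example.org/article-1": "Text of the first article."}

filled = attach_text(rows, texts)
missing = [r["article"] for r in filled if not r["Text"]]
print(f"Articles still without text: {missing}")
```

Because many entities share the same article, keying the retrieved texts by URL means each article only needs to be fetched once.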
Tweets dataset specific fields
- EntityID: id of the entity from the original MSM dataset.
- TweetID: id of the tweet from the original MSM dataset.
- Text: Content of the tweet.
- Incorrect capitalization: 1 if at least one letter in the entity name has incorrect case, otherwise empty.