Linked Hypernyms Dataset

v2016-04
The downloads are provided as gzipped N-Triples files. The numbers correspond to instance counts (in thousands).
Download highlights

Dataset                      Dutch  English  German
Core Dataset                 nt     nt       nt
    Most accurate - result of pattern matching
Inference Dataset            nt     nt       nt
    Types are in the DBpedia ontology namespace - merge of Core, STI
Extension Dataset            nt     nt       nt
    Types are in the DBpedia resource namespace - highest type specificity
Raw "Plain Text" Dataset     nt     nt       nt
    All hypernyms are string literals (the original extracted word).
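
Since every download is a gzipped N-Triples file, it can be streamed line by line without unpacking it first. The following is a minimal sketch, not part of the LHD framework: the function names are made up for illustration, and the regular expression covers only simple statements with an IRI subject and predicate (no blank nodes).

```python
import gzip
import re
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

# One N-Triples statement with an IRI subject and predicate; the object is
# either an IRI or a literal with an optional language tag or datatype.
_TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+'
    r'(<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)


def parse_lines(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield (subject, predicate, object) for each well-formed line.

    Subject and predicate are returned without angle brackets; the object
    is returned verbatim so IRIs and literals stay distinguishable.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _TRIPLE.match(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


def iter_triples(path: str) -> Iterator[Triple]:
    """Stream triples from a gzipped N-Triples file without unpacking it."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_lines(handle)


def count_instances(path: str) -> int:
    """Number of distinct subjects, i.e. typed instances, in the file."""
    return len({subject for subject, _, _ in iter_triples(path)})
```

Calling `count_instances` on a downloaded file (for example a hypothetical local copy named `core.en.nt.gz`) gives the instance count for that file.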

Core dataset

This file contains reliable triples with DBpedia Ontology types. These were obtained by linking hypernyms parsed from the first sentence of the article to the DBpedia Ontology by strict match.

Accuracy
According to our evaluation on the 3.8 release [1], the dataset assigns a type to entities that are untyped in DBpedia with an accuracy of 0.94. If an entity already has a type in DBpedia, it assigns a different type with an accuracy of 0.82. In another evaluation, done on the 3.9 release [2], we found the quality of the types in the Core dataset to be comparable to DBpedia infobox-based extraction: Core had 0.1 (absolute) higher precision than DBpedia in determining the exact type of an entity, and 0.05 lower hierarchical precision (where a more relaxed match was permitted).
Time to generate
2-3 days on a single CPU (one language).
State of automation
Fully automated; the code is available on GitHub.
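
The difference between exact and hierarchical precision mentioned under Accuracy can be illustrated with a toy example. The sketch below assumes one common definition of hierarchical precision (overlap of the predicted and gold classes together with their ancestors, divided by the size of the predicted set); the evaluation in [2] may differ in detail. The three-class hierarchy and the function names are made up for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy child -> parent map (illustrative only, not the full DBpedia Ontology).
TOY_PARENT: Dict[str, str] = {
    "City": "PopulatedPlace",
    "Village": "PopulatedPlace",
    "PopulatedPlace": "Place",
}


def ancestors_or_self(cls: str, parent: Dict[str, str]) -> Set[str]:
    """Return the class together with all of its ancestors."""
    seen: Set[str] = set()
    current = cls
    while current is not None and current not in seen:
        seen.add(current)
        current = parent.get(current)
    return seen


def exact_precision(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of (predicted, gold) pairs where the classes are identical."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(predicted == gold for predicted, gold in pairs) / len(pairs)


def hierarchical_precision(
    pairs: Iterable[Tuple[str, str]], parent: Dict[str, str]
) -> float:
    """Ancestor-set overlap divided by the size of the predicted sets."""
    overlap = 0
    total = 0
    for predicted, gold in pairs:
        predicted_set = ancestors_or_self(predicted, parent)
        gold_set = ancestors_or_self(gold, parent)
        overlap += len(predicted_set & gold_set)
        total += len(predicted_set)
    return overlap / total if total else 0.0
```

With the pairs `[("City", "City"), ("City", "Village")]`, the exact measure counts the second prediction as fully wrong, while the hierarchical measure gives partial credit for the shared ancestors `PopulatedPlace` and `Place`.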

Inference dataset

This file contains less reliable triples with DBpedia Ontology types. It was generated by our STI algorithm, which used as input i) the hypernym from the first sentence. It is analogous to the DBpedia Heuristics dataset; however, due to the different source data (article text vs. links in Heuristics), the overlap with the Heuristics dataset is small.

Accuracy
In the evaluation done on the 3.9 release [2], the hierarchical precision of the Inference dataset was less than 0.1 (absolute) lower than that of statements in DBpedia.
Time to generate
3-4 days on a single CPU (one language).
State of automation
STI is already part of the LHD framework, which is available on GitHub.
Languages
English, German, Dutch (STI)

Extension dataset

This file contains triples with DBpedia Resource types. It covers the same set of instances as the Core and Inference datasets combined. The difference is that no mapping to the DBpedia Ontology was performed: the types are in the dbpedia/resource namespace, which ensures the highest specificity of the types.

Accuracy
Not directly estimated; it is likely to lie between the values reported for the Core and Inference datasets.
Time to generate
A byproduct of "Core" dataset generation.
Languages
English, German, Dutch
State of automation
A byproduct of "Core" dataset generation.
Open issues
The type (DBpedia resource) is connected with the subject (DBpedia entity = Wikipedia article) using a dummy <?> predicate. The DBpedia community may wish to use either rdf:type or a more specific predicate such as http://linguistics-ontology.org/gold/2010/hypernym.
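
Until a predicate is agreed on, a consumer can substitute one while reading the file. The following is a minimal sketch under the assumption that each line is a simple N-Triples statement with an IRI subject; the function name is made up for illustration, and the default target predicate is rdf:type.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
GOLD_HYPERNYM = "http://linguistics-ontology.org/gold/2010/hypernym"

# Group 1: subject plus trailing whitespace; group 2: everything after
# the predicate. The predicate itself is matched but not captured.
_SPO = re.compile(r'^(\s*<[^>]*>\s+)<[^>]*>(\s+.*)$')


def retarget_predicate(line: str, new_predicate: str = RDF_TYPE) -> str:
    """Replace the predicate of one N-Triples line, leaving the rest as-is.

    Lines that do not look like '<s> <p> ...' are returned unchanged
    (minus the trailing newline).
    """
    stripped = line.rstrip("\n")
    match = _SPO.match(stripped)
    if match is None:
        return stripped
    return f"{match.group(1)}<{new_predicate}>{match.group(2)}"
```

Passing `GOLD_HYPERNYM` as `new_predicate` produces the more specific variant instead of rdf:type.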

Raw "Plain text" dataset

This file contains hypernyms in a plain text form as they were obtained by the pattern matching component.

Accuracy
Greater than 0.9 for all languages (German, English, Dutch) [1].
Time to generate
A byproduct of "Core" dataset generation.
Languages
English, German, Dutch
State of automation
A byproduct of "Core" dataset generation.
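
Because the objects in this dataset are string literals rather than IRIs, they need slightly different handling when parsed. The sketch below extracts the literal value and optional language tag from each line and counts how often each hypernym occurs. It is an illustration only: the function names are made up, and the predicate in the usage example is a placeholder, since the predicate used in the actual files is not stated here.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

# '<subject> <predicate> "literal"' with an optional @lang or ^^<datatype>.
_LIT_TRIPLE = re.compile(
    r'^<([^>]*)>\s+<[^>]*>\s+'
    r'"((?:[^"\\]|\\.)*)"(?:@([A-Za-z0-9-]+)|\^\^<[^>]*>)?'
    r'\s*\.\s*$'
)
_ESCAPE = re.compile(r'\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.)')
_SIMPLE = {"t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f",
           '"': '"', "'": "'", "\\": "\\"}


def unescape(value: str) -> str:
    """Decode N-Triples string escapes such as \\" and \\uXXXX."""
    def replace(match: "re.Match[str]") -> str:
        code = match.group(1)
        if code[0] in "uU" and len(code) > 1:
            return chr(int(code[1:], 16))
        return _SIMPLE.get(code, code)
    return _ESCAPE.sub(replace, value)


def iter_hypernyms(
    lines: Iterable[str],
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (subject IRI, hypernym text, language tag or None)."""
    for line in lines:
        match = _LIT_TRIPLE.match(line.strip())
        if match:
            yield match.group(1), unescape(match.group(2)), match.group(3)


def hypernym_counts(lines: Iterable[str]) -> Counter:
    """Count how many subjects were assigned each hypernym string."""
    return Counter(text for _, text, _ in iter_hypernyms(lines))
```

For instance, feeding it lines of the form `<http://dbpedia.org/resource/X> <http://example.org/hypernym> "city"@en .` (placeholder predicate) and calling `most_common(10)` on the result lists the ten most frequent extracted hypernyms.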

Publications

  1. T. Kliegr. Linked Hypernyms: Enriching DBpedia with Targeted Hypernym Discovery. Web Semantics, Volume 31, March 2015, Pages 59-69.
  2. T. Kliegr, O. Zamazal. Towards Linked Hypernyms Dataset 2.0: complementing DBpedia with hypernym discovery. In 9th International Language Resources and Evaluation Conference (LREC'14), Reykjavik, Iceland, May 2014.
  3. O. Zamazal, T. Kliegr. Type Inference in DBpedia from Free Text. Under review.