The downloads are provided as gzipped N-Triples files. The numbers give the instance counts (in thousands).
Download highlights
Dataset	Description	Dutch	English	German
Core Dataset	Most accurate; result of pattern matching	nt	nt	nt
Inference Dataset	Types are in the DBpedia ontology namespace; merge of Core, STI	nt	nt	nt
Extension Dataset	Types are in the DBpedia resource namespace; highest type specificity	nt	nt	nt
Raw "Plain Text" Dataset	All hypernyms are string literals (the original extracted word)	nt	nt	nt
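Since every download is a gzipped N-Triples file, a file can be streamed line by line without unpacking it first. The following is a minimal sketch, not part of the LHD tooling: the function names `read_triples` and `count_instances` are hypothetical, the regular expression handles only simple one-line triples, and the sample data is made up for illustration.

```python
import gzip
import io
import re

# Matches simple N-Triples lines: <subject> <predicate> object .
# (Assumption: one triple per line, no blank-node subjects.)
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

def read_triples(lines):
    """Yield (subject, predicate, object) tuples from an iterable of text lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match:
            yield match.groups()

def count_instances(lines):
    """Count distinct subjects (typed instances) in a stream of N-Triples lines."""
    return len({subject for subject, _, _ in read_triples(lines)})

# Made-up sample standing in for a downloaded file.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
sample = (
    f"<http://dbpedia.org/resource/Example_A> <{RDF_TYPE}> <http://dbpedia.org/ontology/Person> .\n"
    f"<http://dbpedia.org/resource/Example_A> <{RDF_TYPE}> <http://dbpedia.org/ontology/Agent> .\n"
    f"<http://dbpedia.org/resource/Example_B> <{RDF_TYPE}> <http://dbpedia.org/ontology/Place> .\n"
)
compressed = gzip.compress(sample.encode("utf-8"))

# gzip.open accepts a path or a file object; "rt" decodes to text lines.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    n = count_instances(handle)
```

For a real download, the `io.BytesIO(...)` argument would be replaced by the path of the `.gz` file.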

Core dataset

This file contains reliable triples with DBpedia Ontology types. These were obtained by linking hypernyms parsed from the first sentence of the article to the DBpedia Ontology by strict match.

Accuracy

According to our evaluation on the 3.8 release [1], the dataset assigns a type to entities that are untyped in DBpedia with an accuracy of 0.94. If an entity already has a type in DBpedia, it assigns a different type with an accuracy of 0.82. In another evaluation, on the 3.9 release [2], we found the quality of the types in the Core dataset to be comparable to DBpedia infobox-based extraction: Core had 0.1 (absolute) higher precision than DBpedia in determining the exact type of an entity, and 0.05 lower hierarchical precision (where a more relaxed match was permitted).
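To make the difference between exact and hierarchical precision concrete, the sketch below uses one common definition of hierarchical precision: the overlap between the ancestor sets of the predicted and the gold type, divided by the size of the predicted type's ancestor set. This definition is an assumption and may differ from the measure used in [2]; the toy class hierarchy and the example predictions are likewise made up.

```python
# Toy class hierarchy (child -> parent); an assumption for illustration only.
PARENT = {
    "SoccerPlayer": "Athlete",
    "Athlete": "Person",
    "Person": "Agent",
    "Agent": None,
    "Place": None,
}

def ancestors(cls):
    """Return the class itself plus all of its superclasses."""
    result = set()
    while cls is not None:
        result.add(cls)
        cls = PARENT[cls]
    return result

def exact_precision(pairs):
    """Fraction of (predicted, gold) pairs where the types match exactly."""
    return sum(pred == gold for pred, gold in pairs) / len(pairs)

def hierarchical_precision(pairs):
    """Shared ancestors divided by predicted ancestors, summed over all pairs."""
    shared = sum(len(ancestors(pred) & ancestors(gold)) for pred, gold in pairs)
    predicted = sum(len(ancestors(pred)) for pred, _ in pairs)
    return shared / predicted

# (predicted type, gold type)
pairs = [
    ("Athlete", "SoccerPlayer"),  # too general, but on the right branch
    ("Person", "Person"),         # exact match
    ("Place", "Person"),          # wrong branch
]
exact = exact_precision(pairs)
hier = hierarchical_precision(pairs)
```

In this toy example the exact precision is 1/3 while the hierarchical precision is 5/6, because the too-general prediction `Athlete` still receives full credit for its ancestors.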

Time to generate

2-3 days on a single CPU (one language).

State of automation

Fully automated; available on GitHub.

Inference dataset

This file contains less reliable triples with DBpedia Ontology types. It was generated by our STI algorithm, which used as input i) the hypernym from the first sentence. It is analogous to the DBpedia Heuristics dataset; however, due to the different source data (article text vs. links in Heuristics), the overlap with the Heuristics dataset is small.

Accuracy

In the evaluation on the 3.9 release [2], the hierarchical precision of the Inference dataset was less than 0.1 (absolute) lower than that of statements in DBpedia.

Time to generate

3-4 days on a single CPU (one language).

State of automation

STI is already part of the LHD framework, available on GitHub.

Languages

English, German, Dutch (STI)

Extension dataset

This file contains triples with DBpedia Resource types. It covers the same set of instances as the Core and Inference datasets combined. The difference is that no mapping to the DBpedia Ontology was performed: the types are in the dbpedia/resource namespace, which ensures the highest specificity of the types.
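The namespace distinction can be checked mechanically on the object URI of each triple. The sketch below assumes the English DBpedia prefixes `http://dbpedia.org/ontology/` and `http://dbpedia.org/resource/`; localized editions may use different prefixes. The helper names and the two example type URIs are hypothetical.

```python
# Assumed English DBpedia namespace prefixes.
ONTOLOGY_NS = "http://dbpedia.org/ontology/"
RESOURCE_NS = "http://dbpedia.org/resource/"

def type_namespace(type_uri):
    """Classify a type URI as 'ontology', 'resource', or 'other'."""
    if type_uri.startswith(ONTOLOGY_NS):
        return "ontology"
    if type_uri.startswith(RESOURCE_NS):
        return "resource"
    return "other"

def local_name(type_uri):
    """Return the part of the URI after the last slash."""
    return type_uri.rsplit("/", 1)[-1]

# Made-up examples: a coarse ontology class vs. a more specific resource type.
core_type = ONTOLOGY_NS + "Athlete"
extension_type = RESOURCE_NS + "Goalkeeper"

core_kind = type_namespace(core_type)
extension_kind = type_namespace(extension_type)
```

A type such as the hypothetical `Goalkeeper` may have no counterpart class in the ontology, which is why leaving it in the resource namespace preserves specificity.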

Accuracy

Not directly estimated; it will likely lie between the values reported for the Core and Inference datasets.

Time to generate

Is a byproduct of "Core" dataset generation.

Languages

English, German, Dutch

State of automation

Is a byproduct of "Core" dataset generation.

Open issues

The type (a DBpedia resource) is connected with the subject (a DBpedia entity, i.e. a Wikipedia article) using a dummy <?> predicate. The DBpedia community may wish to use either rdf:type or a more specific predicate such as http://linguistics-ontology.org/gold/2010/hypernym.
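A consumer who prefers one of these predicates can rewrite the triples while reading the file. The following is a minimal sketch under the assumption that the placeholder appears literally as `<?>` in the predicate position; the function `replace_predicate` and the example triple are hypothetical.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
GOLD_HYPERNYM = "http://linguistics-ontology.org/gold/2010/hypernym"

# Captures the subject and everything after the predicate of a one-line triple.
_TRIPLE = re.compile(r'^(<[^>]*>)\s+<[^>]*>\s+(.*)$')

def replace_predicate(line, new_predicate):
    """Return the triple with its predicate swapped; unparsable lines pass through."""
    match = _TRIPLE.match(line.strip())
    if match is None:
        return line
    subject, rest = match.groups()
    return f"{subject} <{new_predicate}> {rest}"

# Made-up example triple with the dummy predicate.
example = (
    "<http://dbpedia.org/resource/Example_Entity> <?> "
    "<http://dbpedia.org/resource/Example_Type> ."
)
typed = replace_predicate(example, RDF_TYPE)
hypernym = replace_predicate(example, GOLD_HYPERNYM)
```

Which predicate is appropriate depends on whether the target types should be treated as classes (`rdf:type`) or only as linguistic hypernyms.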

Raw "Plain text" dataset

This file contains hypernyms in plain text form, as obtained by the pattern matching component.

Accuracy

Greater than 0.9 for all languages (German, English, Dutch) [1].

Time to generate

Is a byproduct of "Core" dataset generation.

Languages

English, German, Dutch

State of automation

Is a byproduct of "Core" dataset generation.

Publications

[1] T. Kliegr. Linked Hypernyms: Enriching DBpedia with Targeted Hypernym Discovery. Web Semantics, Volume 31, March 2015, Pages 59-69.
[2] T. Kliegr, O. Zamazal. Towards Linked Hypernyms Dataset 2.0: Complementing DBpedia with Hypernym Discovery. In 9th International Language Resources and Evaluation Conference (LREC'14), Reykjavik, Iceland, May 2014.
[3] O. Zamazal, T. Kliegr. Type Inference in DBpedia from Free Text. Under review.
