Annotated GATE corpora for measuring accuracy of hypernym discovery
The experimental results can be verified by comparing the "aggr" and "thd" annotation sets in the provided GATE corpora using the GATE Corpus Quality Assurance tool.
The annotations are provided in GATE document format (XML-serialized).
The individual documents contain the following annotations:
- aggr annotation set: h annotation on hypernym if at least two annotators agreed on it
- ann1 annotation set: h annotation on hypernym assigned by annotator 1
- ann2 annotation set: h annotation on hypernym assigned by annotator 2
- ann3 annotation set: h annotation on hypernym assigned by annotator 3 *
- thd annotation set: annotations created by THD
  - h: hypernym assigned by THD
  - harticle: helper annotation on the article preceding the hypernym assigned by THD
- Token (contains POS tag as feature), SpaceToken, Split (EN only), Sentence (EN only) annotations created by GATE and TreeTagger (NL, DE).
* Note: Annotator 3 processed all documents in the English corpus, but only those documents on which annotator 1 and annotator 2 did not agree in the German and Dutch corpora.
The documents have the following document features:
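The "at least two annotators agreed" rule behind the aggr annotation set can be illustrated with a minimal Python sketch. It is not the script used to build the corpora. It assumes that agreement means identical (start, end) offsets of the h annotation, and the function name aggregate_spans is hypothetical.

```python
from collections import Counter


def aggregate_spans(ann1, ann2, ann3=None):
    """Return the h annotation spans marked by at least two annotators.

    Each argument is a collection of (start, end) offsets. ann3 may be None
    for documents that annotator 3 did not process.
    Assumption: two annotators agree only when their offsets are identical.
    """
    counts = Counter()
    for spans in (ann1, ann2, ann3):
        if spans:
            # set() so that one annotator cannot vote twice for the same span
            counts.update(set(spans))
    return sorted(span for span, votes in counts.items() if votes >= 2)
```

When annotators 1 and 2 disagree, the span chosen by annotator 3 decides which one, if any, reaches the aggr set.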
- article_title: title of the Wikipedia article
- dbpedia_url: DBpedia resource URL
- lang: language of the source Wikipedia (en/nl/de)
- thd_type: the result of inter-annotator agreement on the hypernym
  - '' (empty string): at least two annotators have not marked any hypernym
  - '__ANNDISAGR__': no agreement between at least two annotators
  - other value: the hypernym at least two annotators have agreed upon
- db_type_x (x=0,...,n): the set of resources linked to dbpedia_url by the rdf:type relation in DBpedia
- thdMatchType: result of comparing thd_type with the last part of each db_type_x URL (after the last forward slash) using the comparison script
  - precise: one of the db_type_x names matches the hypernym,
  - ends with: one of the db_type_x names ends with the hypernym,
  - contains: the hypernym is a substring of one of the db_type_x names,
  - no rdf:type: there is no db_type_x, but a hypernym was discovered,
  - no hypernym: no hypernym was discovered,
  - no match: none of the above applies.
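The thdMatchType categories can be sketched in Python as follows. This is not the original comparison script. It assumes a case-insensitive comparison and the precedence precise, ends with, contains. The helper names local_name and match_type are hypothetical. The '__ANNDISAGR__' marker is not treated specially, since the description above does not say how it is classified.

```python
def local_name(url):
    """Return the part of a URL after its last forward slash."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def match_type(thd_type, db_types):
    """Classify a hypernym against the rdf:type URLs of a DBpedia resource.

    thd_type: the hypernym string ('' when none was found).
    db_types: list of db_type_x URLs (may be empty).
    Assumptions: case-insensitive matching; categories checked in the
    order precise, ends with, contains.
    """
    hypernym = thd_type.strip().lower()
    if not hypernym:
        return "no hypernym"
    if not db_types:
        return "no rdf:type"
    names = [local_name(url).lower() for url in db_types]
    if any(name == hypernym for name in names):
        return "precise"
    if any(name.endswith(hypernym) for name in names):
        return "ends with"
    if any(hypernym in name for name in names):
        return "contains"
    return "no match"
```

For example, with the illustrative type URL http://dbpedia.org/ontology/SoccerPlayer, the hypernym "player" would be classified as "ends with" and "soccer" as "contains".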
The titles of documents with inter-annotator agreement can be listed with this simple Groovy script: