This page provides all the information you need to evaluate a Wikipedia-based NER system using the framework. We also present preliminary results from the evaluation of the EntityClassifier.eu NER system (also known as THD).
Step-by-Step evaluation
- Load a benchmark dataset into GATE: to load the benchmark News and Tweets datasets into GATE, use the provided plugins. Download the plugins and unpack them into the GATE plugins folder (../GATE/plugins/), then enable them in the GATE CREOLE plugin manager. In GATE, create an instance of each plugin, create a corpus pipeline, add the plugins to the pipeline, check the configuration of each plugin and run the pipeline. The plugins will create a document corpus for each dataset.
- Run your favorite NER system on the benchmark corpus: to run a NER system on the benchmark corpus, create an instance of a GATE client plugin for that system. We provide a reference implementation of a GATE client plugin for the EntityClassifier.eu NER system. Download and unpack the plugin into the GATE plugins folder, enable it in the GATE CREOLE plugin manager and add it to a corpus pipeline. Finally, run the pipeline on the chosen benchmark corpus. A GATE Embedded sketch of these first two steps is given after this list.
- Perform type alignment: since we assume that the evaluated NER system provides entity types from the DBpedia Ontology, you need to align the entity types in the benchmark dataset with the types returned by the evaluated NER system. For this task you can use the provided OntologyAwareFeatureDiffPR GATE plugin. The plugin aligns the entity types and creates a new feature `aligned-type` whose value is the aligned type (a DBpedia Ontology type URI).
- Perform comparison using the Corpus Quality Assurance tool: finally, use the Corpus Quality Assurance tool to evaluate the performance of the NER system. The tool is already included in the GATE software and reports strict and lenient precision, recall and F-measure (see also the metric sketch after this list).
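If you prefer to script the setup instead of clicking through the GATE Developer GUI, the same loading and NER steps can be driven from GATE Embedded. The sketch below is only illustrative: the GATE home path, the plugin directory names and the PR class names (`NewsBenchmarkCorpus`, `EntityClassifierClient`, `benchmark.NewsCorpusLoaderPR`, `entityclassifier.ClientPR`) are placeholders, so substitute the ones shipped with the downloaded plugins.

```java
import java.io.File;

import gate.Corpus;
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

public class RunBenchmarkPipeline {
  public static void main(String[] args) throws Exception {
    // Point GATE Embedded at the local installation; plugins live in <GATE home>/plugins.
    Gate.setGateHome(new File("/opt/GATE"));   // assumed install path
    Gate.init();

    // Register the unpacked plugin directories, as the CREOLE plugin manager does in the GUI.
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "NewsBenchmarkCorpus").toURI().toURL());    // placeholder dir
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "EntityClassifierClient").toURI().toURL()); // placeholder dir

    // Instantiate the processing resources (class names below are placeholders --
    // use the PR classes declared by the downloaded plugins).
    ProcessingResource corpusLoader =
        (ProcessingResource) Factory.createResource("benchmark.NewsCorpusLoaderPR"); // placeholder
    ProcessingResource nerClient =
        (ProcessingResource) Factory.createResource("entityclassifier.ClientPR");    // placeholder

    // Build a corpus pipeline, add the PRs in order and run it over the benchmark corpus.
    SerialAnalyserController pipeline = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");
    pipeline.add(corpusLoader);
    pipeline.add(nerClient);

    Corpus corpus = Factory.newCorpus("News benchmark");
    pipeline.setCorpus(corpus);
    pipeline.execute();
  }
}
```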
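The Corpus Quality Assurance tool covers the comparison entirely from the GUI. If you also want the strict/lenient figures programmatically (e.g. for batch runs), GATE's `AnnotationDiffer` computes the same scores. In the sketch below the document name, the annotation set name `Key` and the annotation type `Mention` are assumptions for illustration, not names prescribed by the benchmark plugins.

```java
import java.io.File;
import java.util.Collections;

import gate.AnnotationSet;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.util.AnnotationDiffer;

public class StrictLenientScores {
  public static void main(String[] args) throws Exception {
    Gate.init(); // initialise GATE Embedded (set the GATE home first, as in the previous sketch)

    // A benchmark document that already carries both the gold and the system annotations.
    Document doc = Factory.newDocument(
        new File("annotated-news-001.xml").toURI().toURL());   // placeholder file

    AnnotationSet key = doc.getAnnotations("Key").get("Mention");   // gold standard (names assumed)
    AnnotationSet response = doc.getAnnotations().get("Mention");   // NER output in the default set

    AnnotationDiffer differ = new AnnotationDiffer();
    // Compare on the aligned DBpedia type written by OntologyAwareFeatureDiffPR,
    // so classification mistakes count as mismatches.
    differ.setSignificantFeaturesSet(Collections.singleton("aligned-type"));
    differ.calculateDiff(key, response);

    // Strict scores count only exact-span matches; lenient scores also accept partial overlaps.
    System.out.printf("P  %.2f / %.2f%n", differ.getPrecisionStrict(), differ.getPrecisionLenient());
    System.out.printf("R  %.2f / %.2f%n", differ.getRecallStrict(), differ.getRecallLenient());
    System.out.printf("F1 %.2f / %.2f%n", differ.getFMeasureStrict(1.0), differ.getFMeasureLenient(1.0));

    Factory.deleteResource(doc);
  }
}
```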
Below we present the preliminary results from the evaluation of the EntityClassifier.eu NER system on the News and Tweets benchmark datasets.
| | Precision (strict/lenient) | Recall (strict/lenient) | F1.0 score (strict/lenient) |
|---|---|---|---|
| Entity recognition | 0.45/0.56 | 0.67/0.84 | 0.54/0.67 |
| Entity disambiguation | 0.24/0.26 | 0.36/0.39 | 0.29/0.31 |
| Entity classification | 0.12/0.13 | 0.17/0.19 | 0.14/0.15 |
| | Precision (strict/lenient) | Recall (strict/lenient) | F1.0 score (strict/lenient) |
|---|---|---|---|
| Entity recognition | 0.69/0.78 | 0.33/0.38 | 0.45/0.51 |
| Entity disambiguation | 0.37/0.41 | 0.18/0.20 | 0.24/0.27 |
| Entity classification | 0.69/0.78 | 0.33/0.38 | 0.45/0.51 |
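The F1.0 column follows from precision and recall in the standard way, F1 = 2PR / (P + R). For example, the strict entity recognition scores in the first table give 2 · 0.45 · 0.67 / (0.45 + 0.67) ≈ 0.54, matching the reported value. A minimal check:

```java
public class F1Check {
  // Standard F1: the harmonic mean of precision and recall.
  static double f1(double precision, double recall) {
    return 2 * precision * recall / (precision + recall);
  }

  public static void main(String[] args) {
    // Entity recognition row of the first table: P = 0.45/0.56, R = 0.67/0.84.
    System.out.printf("strict  F1 = %.2f%n", f1(0.45, 0.67));   // prints 0.54
    System.out.printf("lenient F1 = %.2f%n", f1(0.56, 0.84));   // prints 0.67
  }
}
```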