Probabilistic Soft Logic for Entity Annotation
This page provides additional details on PSL4EA, an approach powered by ontological knowledge and based on Probabilistic Soft Logic for jointly revising multiple NLP entity annotations.
The approach has been fully implemented, and it is described and evaluated in the following paper:
- An Ontology-driven Probabilistic Soft Logic Approach to Improve NLP Entity Annotations
By Marco Rospocher.
In Proceedings of the 17th International Semantic Web Conference, ISWC 2018, Monterey, CA, USA, October 8-12, 2018
[bib] [pre-print/mirror]
PSL4EA has been evaluated on three reference datasets for Named Entity Recognition and Classification (NERC) and Entity Linking (EL):
- AIDA CoNLL-YAGO: This dataset consists of 1,393 English newswire articles from Reuters, with 34,999 mentions hand-annotated with named entity types (PER, ORG, LOC, MISC) for the CoNLL-2003 shared task on named entity recognition, and later hand-annotated with YAGO2 entities and the corresponding Wikipedia page URLs. It is split into three parts: eng.train (946 docs), eng.testa (216 docs), eng.testb (231 docs).
- MEANTIME: The NewsReader MEANTIME corpus consists of 480 news articles from Wikinews, in four languages. In our evaluation, we used only the English section and its 120 articles. The dataset, used as part of the SemEval 2015 task on TimeLine extraction, includes manual annotations for named entity types (only PER, ORG, LOC) and DBpedia entity links.
- TAC-KBP: Developed for the TAC KBP 2011 Knowledge Base Population Track, this dataset consists of 2,231 English documents, including newswire articles and posts to blogs, newsgroups, and discussion fora. For each document, all mentions of one or a few query entities are known to link to a given Wikipedia page and to have a specific NERC type (only PER, ORG, LOC), which yields a (partially) annotated gold standard for NERC and EL.
The following PSL4EA resources are made available:
- ZIP (~152MB) containing the (soft-truth values of) ground atoms for predicates ImplClN (file impCl_NERC_obs.200.tsv.gz) and ImplClE (file impCl_EL_obs.200.tsv.gz). Values for ImplClN were learned from AIDA CoNLL-YAGO eng.train, while those for ImplClE were obtained deterministically from the DBpedia-YAGO mappings and the YAGO TBox. Each file contains three columns:
  - a NERC or EL annotation;
  - a YAGO class;
  - the soft-truth value of the corresponding ground atom.
- PDF (~66KB) file reporting all evaluation metrics for all measures, with and without PSL4EA, computed by:
  - micro-averaging, considering only mentions in the gold standard;
  - micro-averaging, considering all mentions returned by the system;
  - macro-averaging by document;
  - macro-averaging by NERC type.
- ZIP (~616KB) package of the evaluation folder, containing:
  - the official TAC scorer;
  - commands for computing scores (and statistical significance) for all metrics and measures considered (cf. the paper for details on interpreting the values);
  - gold, standard, and PSL4EA annotations for all datasets (excluding TAC-KBP, under LDC copyright).
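The three-column ground-atom files listed above (impCl_NERC_obs.200.tsv.gz and impCl_EL_obs.200.tsv.gz) can be read with a few lines of Python. The following is a minimal sketch, not part of the PSL4EA distribution: it assumes the files are tab-separated, UTF-8 encoded, and ordered as annotation, YAGO class, soft-truth value. The function names and the sample rows are hypothetical and only illustrate the layout.

```python
import csv
import gzip
from collections import defaultdict


def parse_ground_atoms(lines):
    """Map each annotation to a {yago_class: soft_truth_value} dictionary.

    `lines` is any iterable of text lines with three tab-separated columns
    (assumed order: annotation, YAGO class, soft-truth value).
    """
    atoms = defaultdict(dict)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        annotation, yago_class, value = row
        try:
            truth = float(value)
        except ValueError:
            continue  # skip non-numeric rows, e.g. a possible header line
        atoms[annotation][yago_class] = truth
    return dict(atoms)


def load_ground_atoms(path):
    """Read a gzip-compressed TSV file of ground atoms from `path`."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return parse_ground_atoms(handle)


# Hypothetical rows, made up for illustration (not taken from the real files).
example = parse_ground_atoms([
    "PER\tclass_a\t0.9\n",
    "PER\tclass_b\t0.25\n",
    "LOC\tclass_c\t1.0\n",
])
print(example["PER"])
```

Once the ZIP is unpacked, a call such as load_ground_atoms("impCl_NERC_obs.200.tsv.gz") would return the same kind of nested dictionary, provided the layout assumptions above hold.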
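The difference between the micro- and macro-averaged figures in the PDF above can be illustrated with a short sketch. It is not the official TAC scorer and makes several assumptions: per-group counts of true positives, false positives, and false negatives are already available; a "group" is a document (macro-averaging by document) or a NERC type (macro-averaging by NERC type); and the macro F1 is taken as the mean of the per-group F1 values, which is only one of the conventions in use. Restricting the counts to gold-standard mentions or to all system mentions is left out. All names and numbers are hypothetical.

```python
from collections import namedtuple

# Per-group counts: true positives, false positives, false negatives.
Counts = namedtuple("Counts", ["tp", "fp", "fn"])


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def micro_average(groups):
    """Pool the counts of all groups, then compute P/R/F1 once."""
    tp = sum(c.tp for c in groups.values())
    fp = sum(c.fp for c in groups.values())
    fn = sum(c.fn for c in groups.values())
    return prf(tp, fp, fn)


def macro_average(groups):
    """Compute P/R/F1 per group, then take the unweighted mean of each."""
    if not groups:
        raise ValueError("macro_average needs at least one group")
    scores = [prf(c.tp, c.fp, c.fn) for c in groups.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical counts for two documents, made up for illustration.
per_document = {
    "doc1": Counts(tp=8, fp=2, fn=0),
    "doc2": Counts(tp=1, fp=1, fn=3),
}
print("micro:", micro_average(per_document))
print("macro:", macro_average(per_document))
```

With these made-up counts the micro-averaged precision is 0.75 (9 correct out of 12 returned), while the macro-averaged precision is about 0.65, because the small, poorly scored second document weighs as much as the first.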