Evaluation using predicate models TBox
Evaluation results updated in November 2015
Overview
We evaluate the performance of PIKES with respect to the state of the art, assessing precision and recall in the extraction of frame structures described using traditional predicate models - PropBank (PB), NomBank (NB), VerbNet (VN), and FrameNet (FN) - as done by tools such as FRED [2], Lodifier, and NewsReader. Specifically, we compute the precision and recall of PIKES in extracting the following knowledge graph components from a gold standard text: instances, edges (i.e., unlabeled instance-to-instance relations), and triples, the latter considered both globally and divided by category: links to DBpedia, VN/FN/PB/NB types, VN/FN/PB/NB participation relations, and owl:sameAs relations. We also compare the performance of PIKES with that of FRED, the state-of-the-art tool most similar to PIKES, on the same gold text.
This evaluation was first carried out in September 2015 for the SAC 2016 paper, and we report its results here. However, as PIKES continues to evolve, we also report the results obtained with the latest version of PIKES (as of November 2015). The latest results are overall better, especially for FrameNet types and properties, thanks to the combined use of two SRL tools, Semafor and Mate.
In the following, we describe the sentences and graphs used in the evaluation, covering both how we produced the gold graphs and how we obtained the graphs for FRED and PIKES (SAC 2016 and latest graphs). Then, we report the results of the separate evaluation of PIKES against the gold standard (SAC 2016 and latest results), which covers most of the features provided by PIKES, with the exclusion of FrameBase types and properties, which are evaluated separately. Finally, we report the results of the comparative evaluation of PIKES and FRED on a simplified gold standard (derived from the manually built one) on which both tools are comparable (SAC 2016 and latest results).
This page, the detailed alignment reports linked from this page, and all the compared graphs (gold graphs, FRED graphs, PIKES graphs) are available in a downloadable ZIP file.
Sentences and graphs
The following table lists the sentences of the gold standard used for the evaluation, each associated with multiple knowledge graphs: the manually annotated gold graph, the graph produced by FRED, and the graphs produced by PIKES (SAC 2016 and latest). The same 8 sentences used in [1] have been used, with the minor exception of sentence S7, which was slightly shortened because its full version could not be processed by the FRED online demo (tested Sep. 2015).
Sentence | Text | Gold graph | FRED graph | PIKES graph (SAC) | PIKES graph (latest) |
---|---|---|---|---|---|
S1 | The lone Syrian rebel group with an explicit stamp of approval from Al Qaeda has become one of the uprising's most effective fighting forces, posing a stark challenge to the United States and other countries that want to support the rebels but not Islamic extremists. | .ttl | .ttl | .ttl | .ttl
S2 | Money flows to the group, the Nusra Front, from like-minded donors abroad. | .ttl | .ttl | .ttl | .ttl |
S3 | Its fighters, a small minority of the rebels, have the boldness and skill to storm fortified positions and lead other battalions to capture military bases and oil fields. | .ttl | .ttl | .ttl | .ttl |
S4 | As their successes mount, they gather more weapons and attract more fighters. | .ttl | .ttl | .ttl | .ttl |
S5 | The group is a direct offshoot of Al Qaeda in Iraq, Iraqi officials and former Iraqi insurgents say, which has contributed veteran fighters and weapons. | .ttl | .ttl | .ttl | .ttl |
S6 | This is just a simple way of returning the favor to our Syrian brothers that fought with us on the lands of Iraq, said a veteran of Al Qaeda in Iraq, who said he helped lead the Nusra Front's efforts in Syria. | .ttl | .ttl | .ttl | .ttl |
S7 | The United States, sensing that time may be running out for Syria president Bashar al-Assad, hopes to isolate the group to prevent it from inheriting Syria. | .ttl | .ttl | .ttl | .ttl |
S8 | As the United States pushes the Syrian opposition to organize a viable alternative government, it plans to blacklist the Nusra Front as a terrorist organization, making it illegal for Americans to have financial dealings with the group and prompting similar sanctions from Europe. | .ttl | .ttl | .ttl | .ttl |
The gold graphs, collaboratively built by two annotators, consist of the relevant RDF triples that should be included in the output of a frame-oriented KE system when applied to the corresponding evaluation sentences:
- The nodes of a graph are the instances mentioned in the corresponding sentence (entities, frames, attributes). Each instance is anchored to exactly one mention, with coreferring mentions giving rise to distinct instances. Instances are linked by owl:sameAs triples to matching entities in DBpedia, and typed with respect to classes encoding VN, FN, PB, and NB frame types (only the most specific types are represented).
- The edges of a graph are given by triples connecting different instances. They express owl:sameAs equivalence relations (to explicitly represent and evaluate coreference resolution), instance-attribute association relations, and frame-argument participation relations whose RDF properties encode VN, FN, PB and NB thematic roles.
In order to simplify the manual construction of gold graphs, the link between an instance in a gold graph and the corresponding mention is implicit and given by the instance URI, whose local name corresponds to the head token of the mention in the text. In case of ambiguity, i.e., if there are multiple occurrences of a word in the sentence, a sequential index is added (e.g., in sentence S7, :syria_1 and :syria_2 refer to the first and second occurrences of Syria, respectively).
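As a concrete illustration, the following minimal Python sketch mimics this naming convention; the helper function and the list of mention heads are hypothetical, since the gold graphs were built by hand.

```python
from collections import Counter

def mint_local_names(head_tokens):
    """Mint gold-graph local names from mention head tokens: the lowercased
    head word, plus a sequential index when the same word occurs more than
    once in the sentence (hypothetical helper, for illustration only)."""
    totals = Counter(t.lower() for t in head_tokens)
    seen = Counter()
    names = []
    for token in head_tokens:
        word = token.lower()
        if totals[word] > 1:
            seen[word] += 1
            names.append(f":{word}_{seen[word]}")
        else:
            names.append(f":{word}")
    return names

# Invented list of mention heads, loosely based on sentence S7:
print(mint_local_names(["States", "Syria", "Assad", "group", "Syria"]))
# [':states', ':syria_1', ':assad', ':group', ':syria_2']
```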
FRED graphs were obtained by invoking the public online demo of FRED. The RDF graph for an input sentence was produced according to the following process:
- We invoked FRED a first time without requesting FN triples, which causes FRED to extract frames based on VerbNet.
- We invoked FRED a second time requesting FN triples, which causes FRED to omit VN data and, instead, to return frame types encoded in the URIs of frame instances (e.g., :Hostile_encounter_1). As the FRED authors claim to perform frame detection w.r.t. FN, we extracted these frame types from the instance URIs and attached them by means of rdf:type to the instances obtained from the first invocation of FRED, resulting in a single RDF file that includes both VN and FN frame data (see the sketch after this list).
- Finally, we rewrote the RDF file, introducing some new prefixes and reordering its triples so as to improve readability.
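The merge performed in the second step could be scripted roughly as follows. This is only a sketch based on rdflib: the file names and the FN namespace are hypothetical, and frame instances are recognized and aligned across the two invocations by a simplistic heuristic (a capitalized local name with a numeric suffix), which in practice may require a more careful alignment.

```python
import re

from rdflib import Graph, Namespace, RDF

FN = Namespace("http://example.org/framenet/")  # assumed namespace for FN frame classes

vn_graph = Graph().parse("s1_vn.ttl", format="turtle")  # 1st invocation: VN-based frames
fn_graph = Graph().parse("s1_fn.ttl", format="turtle")  # 2nd invocation: FN types in URIs

merged = vn_graph  # start from the VN-based graph

# FN frame types are encoded in instance local names such as 'Hostile_encounter_1';
# strip the numeric suffix to recover the frame name and re-attach it via rdf:type.
for node in set(fn_graph.subjects()):
    match = re.match(r".*[#/]([A-Z][A-Za-z_]*)_\d+$", str(node))
    if match:
        merged.add((node, RDF.type, FN[match.group(1)]))

merged.serialize("s1_merged.ttl", format="turtle")
```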
PIKES graphs were obtained using the public demo of PIKES. No specific post-processing was necessary, apart from a conversion from TriG to Turtle to discard provenance information that is not needed for this evaluation.
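A minimal sketch of this conversion with rdflib (file names are hypothetical): the TriG quads are flattened into plain triples, dropping the named graphs that carry the provenance metadata.

```python
from rdflib import Dataset, Graph

# Parse the TriG output of PIKES (quads partitioned into named graphs).
ds = Dataset()
ds.parse("s1_pikes.trig", format="trig")

# Keep subject/predicate/object only, discarding the graph component
# (i.e., the provenance contexts not needed for this evaluation).
flat = Graph()
for s, p, o, _graph in ds.quads((None, None, None, None)):
    flat.add((s, p, o))

flat.serialize("s1_pikes.ttl", format="turtle")
```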
Separate evaluation results
The following tables report the results (SAC 2016 first, then latest) of evaluating PIKES alone against the full gold standard described above, including the number of gold elements (instances, triples, edges), true positives (# TP), false positives (# FP), and false negatives (# FN) for each evaluated component. The links in the 'Alignments' column lead to separate reports showing how elements of the gold standard and elements extracted by PIKES were matched in the evaluation, indicating where false positive and false negative errors occurred.
SAC 2016 results:
Component | # Gold items | # TP | # FP | # FN | Precision | Recall | F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 153 | 148 | 9 | 5 | .943 | .967 | .955 | view |
Triples | 596 | 303 | 122 | 293 | .713 | .508 | .594 | (see below) |
DBpedia links | 18 | 14 | 6 | 4 | .700 | .778 | .737 | view |
types (VN) | 44 | 24 | 10 | 20 | .706 | .545 | .615 | view |
types (FN) | 53 | 19 | 7 | 34 | .731 | .358 | .481 | view |
types (PB) | 53 | 38 | 7 | 15 | .844 | .717 | .776 | view |
types (NB) | 37 | 29 | 13 | 8 | .690 | .784 | .734 | view |
roles (VN) | 94 | 46 | 16 | 48 | .742 | .489 | .590 | view |
roles (FN) | 108 | 28 | 28 | 80 | .500 | .259 | .341 | view |
roles (PB) | 119 | 68 | 14 | 51 | .829 | .571 | .677 | view |
roles (NB) | 55 | 32 | 19 | 23 | .627 | .582 | .604 | view |
owl:sameAs | 15 | 5 | 2 | 10 | .714 | .333 | .455 | view |
Edges | 171 | 131 | 16 | 40 | .891 | .766 | .824 | view |
Latest results (November 2015):
Component | # Gold items | # TP | # FP | # FN | Precision | Recall | F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 153 | 147 | 13 | 6 | .919 | .961 | .939 | view |
Triples | 596 | 335 | 136 | 261 | .711 | .562 | .628 | (see below) |
DBpedia links | 18 | 14 | 6 | 4 | .700 | .778 | .737 | view |
types (VN) | 44 | 24 | 10 | 20 | .706 | .545 | .615 | view |
types (FN) | 53 | 38 | 25 | 15 | .603 | .717 | .655 | view |
types (PB) | 53 | 37 | 7 | 16 | .841 | .698 | .763 | view |
types (NB) | 37 | 24 | 7 | 13 | .774 | .649 | .706 | view |
roles (VN) | 94 | 47 | 15 | 47 | .758 | .500 | .603 | view |
roles (FN) | 108 | 47 | 32 | 61 | .595 | .435 | .503 | view |
roles (PB) | 119 | 67 | 15 | 52 | .817 | .563 | .667 | view |
roles (NB) | 55 | 31 | 18 | 24 | .633 | .564 | .596 | view |
owl:sameAs | 15 | 6 | 1 | 9 | .857 | .400 | .545 | view |
Edges | 171 | 134 | 21 | 37 | .865 | .784 | .822 | view |
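For reference, precision, recall, and F1 in the tables above follow the standard definitions over the TP/FP/FN counts; the sketch below, with counts copied from the 'Instances' row of the SAC 2016 table, reproduces the reported figures.

```python
def prf1(tp, fp, fn):
    """Standard precision, recall, and balanced F1 from alignment counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 'Instances' row of the SAC 2016 table: # TP = 148, # FP = 9, # FN = 5.
print(tuple(round(v, 3) for v in prf1(148, 9, 5)))  # (0.943, 0.967, 0.955)
```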
Comparative evaluation results
In order to compare PIKES with FRED fairly, we had to modify and simplify the gold standard to account for two aspects of FRED:
- FRED does not return PB/NB frame types and PB/NB/FN frame roles, so we removed them from the gold standard;
- FRED does not support nominal predicates and argument nominalization, although it often represents the associated participation relations with arbitrary triples. For example, the span 'Its fighters' in sentence S3 is represented by FRED with the triple :fighter_1 :fighterOf :neuter_1 (where :fighter_1 and :neuter_1 are the instances denoted by the mentions 'fighters' and 'Its', respectively), whereas the gold standard and PIKES employ the nominal frame (disambiguated w.r.t. FN) :fighter_frame rdf:type fn:Irregular_combatants; fn:combatant :fighter_1; fn:side1 :neuter_1 (where :fighter_frame is a frame instance also denoted by 'fighters'). Thus, we automatically transformed the latter representation - both in the gold standard and in the PIKES output - into the FRED one (see the sketch after this list).
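A minimal rdflib sketch of this transformation on the example above; the namespace, the file name, and the way the 'reflexive' participant and the derived predicate are chosen are hypothetical simplifications of the actual rewriting.

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/gold#")  # hypothetical gold-graph namespace

def flatten_nominal_frame(graph, frame, self_participant):
    """Replace a nominal frame node with a FRED-style binary triple, e.g.,
    :fighter_frame (fn:combatant :fighter_1; fn:side1 :neuter_1) becomes
    :fighter_1 :fighterOf :neuter_1 (hypothetical rewriting rule)."""
    noun = str(self_participant).rsplit("#", 1)[-1].split("_")[0]
    predicate = EX[noun + "Of"]  # e.g., :fighterOf, derived from the noun
    others = [o for _, p, o in graph.triples((frame, None, None))
              if p != RDF.type and o != self_participant]
    for participant in others:
        graph.add((self_participant, predicate, participant))
    graph.remove((frame, None, None))  # drop the frame node and its triples

g = Graph().parse("s3_gold.ttl", format="turtle")  # hypothetical file
flatten_nominal_frame(g, EX["fighter_frame"], EX["fighter_1"])
# :fighter_1 :fighterOf :neuter_1 now stands in for the fn:Irregular_combatants frame
```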
The following tables show the results (SAC 2016 first, then latest) of evaluating FRED and PIKES against the simplified gold standard, including the number of gold elements for each evaluated knowledge graph component and links to separate reports showing how gold elements and elements extracted by FRED and PIKES have been aligned. PIKES exhibits better precision and recall than FRED for all the considered components, with differences in terms of F1 ranging from .059 to .221 for the SAC 2016 results and from .042 to .221 for the latest results.
SAC 2016 results:
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 137 | .930 | .869 | .898 | .937 | .978 | .957 | view |
Triples | 166 | .543 | .416 | .471 | .713 | .554 | .624 | (see below) |
DBpedia links | 18 | .615 | .444 | .516 | .700 | .778 | .737 | view |
types (VN) | 31 | .593 | .516 | .552 | .667 | .581 | .621 | view |
types (FN) | 26 | .550 | .423 | .478 | .762 | .615 | .681 | view |
roles (VN) | 76 | .547 | .382 | .450 | .722 | .513 | .600 | view |
owl:sameAs | 15 | .357 | .333 | .345 | .714 | .333 | .455 | view |
Edges | 155 | .869 | .555 | .677 | .937 | .768 | .844 | view |
Latest results (November 2015):
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 137 | .930 | .869 | .898 | .911 | .971 | .940 | view |
Triples | 166 | .543 | .416 | .471 | .698 | .584 | .636 | (see below) |
DBpedia links | 18 | .615 | .444 | .516 | .700 | .778 | .737 | view |
types (VN) | 31 | .593 | .516 | .552 | .667 | .581 | .621 | view |
types (FN) | 26 | .550 | .423 | .478 | .731 | .667 | .644 | view |
roles (VN) | 76 | .547 | .382 | .450 | .741 | .526 | .615 | view |
owl:sameAs | 15 | .357 | .333 | .345 | .857 | .400 | .545 | view |
Edges | 155 | .869 | .555 | .677 | .910 | .787 | .844 | view |
In line with the approach of [1], we also compared PIKES and FRED against an additional gold graph obtained by merging the outputs of both tools, cleaned of incorrect triples. By definition, this gold graph is a subset of the simplified gold standard discussed above. The goal of this additional evaluation, as noted in [1], is to comparatively evaluate each tool within the knowledge extraction tool space, i.e., considering only the correct triples that can be extracted by at least one tool. The following tables report the results obtained (SAC 2016 first, then latest). Again, PIKES exhibits better precision and recall than FRED for all the considered components, with differences in terms of F1 ranging from .059 to .269 for the SAC 2016 results and from .042 to .381 for the latest results. The larger differences are due to the recall of PIKES being generally higher than that of FRED, which means that the gold graph defined here tends to coincide with the correct answers of PIKES (this can be seen as a limitation of this kind of evaluation, which favors the system with higher recall).
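Set-theoretically, this additional gold graph can be characterized as in the sketch below, which assumes, for clarity only, that both tool outputs have already been rewritten in terms of gold-standard URIs via the alignment reports (file names are hypothetical).

```python
from rdflib import Graph

simplified_gold = Graph().parse("s1_gold_simplified.ttl", format="turtle")
fred = Graph().parse("s1_fred_aligned.ttl", format="turtle")    # hypothetical files
pikes = Graph().parse("s1_pikes_aligned.ttl", format="turtle")

# Correct triples extractable by at least one tool: the union of both outputs,
# restricted to the triples sanctioned by the simplified gold standard. By
# construction, the result is a subset of the simplified gold standard.
merged_gold = Graph()
for triple in (set(fred) | set(pikes)) & set(simplified_gold):
    merged_gold.add(triple)
```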
SAC 2016 results:
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 136 | .930 | .875 | .902 | .937 | .985 | .961 | view |
Triples | 115 | .543 | .600 | .570 | .713 | .800 | .754 | (see below) |
DBpedia links | 14 | .615 | .571 | .593 | .700 | 1 | .824 | view |
types (VN) | 27 | .593 | .593 | .593 | .667 | .667 | .667 | view |
types (FN) | 17 | .550 | .647 | .595 | .762 | .941 | .842 | view |
roles (VN) | 51 | .547 | .569 | .558 | .722 | .765 | .743 | view |
owl:sameAs | 6 | .357 | .833 | .500 | .714 | .833 | .769 | view |
Edges | 134 | .869 | .642 | .738 | .937 | .888 | .912 | view |
Latest results (November 2015):
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 135 | .930 | .881 | .905 | .911 | .985 | .947 | view |
Triples | 118 | .543 | .585 | .563 | .698 | .822 | .755 | (see below) |
DBpedia links | 14 | .615 | .571 | .593 | .700 | 1 | .824 | view |
types (VN) | 27 | .593 | .593 | .593 | .667 | .667 | .667 | view |
types (FN) | 19 | .550 | .579 | .564 | .613 | 1 | .760 | view |
roles (VN) | 51 | .547 | .569 | .558 | .741 | .784 | .762 | view |
owl:sameAs | 7 | .357 | .714 | .476 | .857 | .857 | .857 | view |
Edges | 136 | .869 | .632 | .732 | .910 | .897 | .904 | view |
References
1. A Comparison of Knowledge Extraction Tools for the Semantic Web.
By Aldo Gangemi.
In ESWC 2013 Proceedings, Springer Berlin Heidelberg, volume 7882, pages 351-366, 2013.
[online version]
2. Knowledge Extraction Based on Discourse Representation Theory and Linguistic Frames.
By Valentina Presutti, Francesco Draicchio, Aldo Gangemi.
In EKAW 2012 Proceedings, Springer-Verlag Berlin, pages 114-129, 2012.
[online version] [web site]