Evaluation using predicate models TBox
Evaluation results updated in November 2015
Overview
We evaluate the performance of PIKES with respect to the state of the art, assessing precision and recall in the extraction of frame structures described using traditional predicate models - PropBank (PB), NomBank (NB), VerbNet (VN), and FrameNet (FN) - as done by tools such as FRED [2], Lodifier, and NewsReader. Specifically, we compute the precision and recall of PIKES in extracting the following knowledge graph components from a gold standard text: instances, edges (i.e., unlabeled instance-to-instance relations), and triples, the latter considered both globally and divided by category: links to DBpedia, VN/FN/PB/NB types, VN/FN/PB/NB participation relations, and owl:sameAs relations. We also compare the performance of PIKES with that of FRED, the state-of-the-art tool most similar to PIKES, on the same gold text.
This evaluation was first carried out in September 2015 for the SAC 2016 paper, and we report its results here. However, as PIKES continues to evolve, we also report the results obtained with the latest version of PIKES (as of November 2015). The latest results are overall better, especially for FrameNet types and properties, thanks to the combined use of two SRL tools, Semafor and Mate.
In the following, we describe the sentences and graphs used in the evaluation, covering both how we produced the gold graphs and how we obtained the graphs for FRED and PIKES (SAC 2016 and latest graphs). Then, we report the results of the separate evaluation of PIKES against the gold standard (SAC 2016 and latest results), which covers most of the features provided by PIKES, with the exclusion of FrameBase types and properties, which are evaluated separately. Finally, we report the results of the comparative evaluation of PIKES and FRED on a simplified gold standard (derived from the manually built one) on which both tools are comparable (SAC 2016 and latest results).
This page, the detailed alignment reports linked from this page, and all the compared graphs (gold graphs, FRED graphs, PIKES graphs) are available in a downloadable ZIP file.
Sentences and graphs
The following table lists the sentences of the gold standard used for the evaluation, each associated with multiple knowledge graphs: the manually annotated gold graph, the graph produced by FRED, and the graphs produced by PIKES (SAC 2016 and latest). The same 8 sentences used in [1] have been used, with the minor exception of sentence S7, which was slightly shortened because its full version could not be processed by the FRED online demo (tested Sep. 2015).
Sentence | Text | Gold graph | FRED graph | PIKES graph (SAC) | PIKES graph (latest) |
---|---|---|---|---|---|
S1 | The lone Syrian rebel group with an explicit stamp of approval from Al Qaeda has become one of the uprising's most effective fighting forces, posing a stark challenge to the United States and other countries that want to support the rebels but not Islamic extremists. | .ttl | .ttl | .ttl | .ttl
S2 | Money flows to the group, the Nusra Front, from like-minded donors abroad. | .ttl | .ttl | .ttl | .ttl |
S3 | Its fighters, a small minority of the rebels, have the boldness and skill to storm fortified positions and lead other battalions to capture military bases and oil fields. | .ttl | .ttl | .ttl | .ttl |
S4 | As their successes mount, they gather more weapons and attract more fighters. | .ttl | .ttl | .ttl | .ttl |
S5 | The group is a direct offshoot of Al Qaeda in Iraq, Iraqi officials and former Iraqi insurgents say, which has contributed veteran fighters and weapons. | .ttl | .ttl | .ttl | .ttl |
S6 | This is just a simple way of returning the favor to our Syrian brothers that fought with us on the lands of Iraq, said a veteran of Al Qaeda in Iraq, who said he helped lead the Nusra Front's efforts in Syria. | .ttl | .ttl | .ttl | .ttl |
S7 | The United States, sensing that time may be running out for Syria president Bashar al-Assad, hopes to isolate the group to prevent it from inheriting Syria. | .ttl | .ttl | .ttl | .ttl |
S8 | As the United States pushes the Syrian opposition to organize a viable alternative government, it plans to blacklist the Nusra Front as a terrorist organization, making it illegal for Americans to have financial dealings with the group and prompting similar sanctions from Europe. | .ttl | .ttl | .ttl | .ttl |
The gold graphs, collaboratively built by two annotators, consist of the relevant RDF triples that should be included in the output of a frame-oriented KE system when applied to the corresponding evaluation sentences:
- The nodes of a graph are the instances mentioned in the corresponding sentence (entities, frames, attributes). Each instance is anchored to exactly one mention, with coreferring mentions giving rise to distinct instances. Instances are linked by owl:sameAs triples to matching entities in DBpedia, and typed with respect to classes encoding VN, FN, PB, and NB frame types (only the most specific types are represented).
- The edges of a graph are given by triples connecting different instances. They express owl:sameAs equivalence relations (to explicitly represent and evaluate coreference resolution), instance-attribute association relations, and frame-argument participation relations whose RDF properties encode VN, FN, PB and NB thematic roles.
In order to simplify the manual construction of gold graphs, the link between an instance in a gold graph and the corresponding mention is implicit and given by the instance URI, whose local name corresponds to the head token of the mention in the text. In case of ambiguity, i.e., if there are multiple occurrences of a word in the sentence, a sequential index is added (e.g., in sentence S7, :syria_1 and :syria_2 refer to the first and second occurrences of Syria, respectively).
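As a concrete illustration, the following minimal Python sketch mimics this naming convention; the helper function and the list of mention heads are hypothetical, since the gold graphs were built by hand.

```python
from collections import Counter

def mint_local_names(head_tokens):
    """Mint gold-graph local names from mention head tokens: the lowercased
    head word, plus a sequential index when the same word occurs more than
    once in the sentence (hypothetical helper, for illustration only)."""
    totals = Counter(t.lower() for t in head_tokens)
    seen = Counter()
    names = []
    for token in head_tokens:
        word = token.lower()
        if totals[word] > 1:
            seen[word] += 1
            names.append(f":{word}_{seen[word]}")
        else:
            names.append(f":{word}")
    return names

# Invented list of mention heads, loosely based on sentence S7:
print(mint_local_names(["States", "Syria", "Assad", "group", "Syria"]))
# [':states', ':syria_1', ':assad', ':group', ':syria_2']
```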
FRED graphs were obtained by invoking the public online demo of FRED. The RDF graph for an input sentence was produced according to the following process:
- We invoked FRED a first time without requesting FN triples, which causes FRED to extract frames based on VerbNet.
- We invoked FRED a second time requesting FN triples, which causes FRED to omit VN data and, instead, to return frame types encoded in the URIs of frame instances (e.g., :Hostile_encounter_1). As the FRED authors claim to perform frame detection w.r.t. FN, we extracted these frame types from the instance URIs and attached them by means of rdf:type to the instances obtained from the first invocation of FRED, resulting in a single RDF file that includes both VN and FN frame data (see the sketch after this list).
- Finally, we rewrote the RDF file, introducing some new prefixes and reordering its triples so as to improve readability.
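The merge performed in the second step could be scripted roughly as follows. This is only a sketch based on rdflib: the file names and the FN namespace are hypothetical, and frame instances are recognized and aligned across the two invocations by a simplistic heuristic (a capitalized local name with a numeric suffix), which in practice may require a more careful alignment.

```python
import re

from rdflib import Graph, Namespace, RDF

FN = Namespace("http://example.org/framenet/")  # assumed namespace for FN frame classes

vn_graph = Graph().parse("s1_vn.ttl", format="turtle")  # 1st invocation: VN-based frames
fn_graph = Graph().parse("s1_fn.ttl", format="turtle")  # 2nd invocation: FN types in URIs

merged = vn_graph  # start from the VN-based graph

# FN frame types are encoded in instance local names such as 'Hostile_encounter_1';
# strip the numeric suffix to recover the frame name and re-attach it via rdf:type.
for node in set(fn_graph.subjects()):
    match = re.match(r".*[#/]([A-Z][A-Za-z_]*)_\d+$", str(node))
    if match:
        merged.add((node, RDF.type, FN[match.group(1)]))

merged.serialize("s1_merged.ttl", format="turtle")
```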
PIKES graphs were obtained using the public demo of PIKES. No specific post-processing was necessary, apart from a conversion from TriG to Turtle to discard provenance information that is not needed for this evaluation.
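A minimal sketch of this conversion with rdflib (file names are hypothetical): the TriG quads are flattened into plain triples, dropping the named graphs that carry the provenance metadata.

```python
from rdflib import Dataset, Graph

# Parse the TriG output of PIKES (quads partitioned into named graphs).
ds = Dataset()
ds.parse("s1_pikes.trig", format="trig")

# Keep subject/predicate/object only, discarding the graph component
# (i.e., the provenance contexts not needed for this evaluation).
flat = Graph()
for s, p, o, _graph in ds.quads((None, None, None, None)):
    flat.add((s, p, o))

flat.serialize("s1_pikes.ttl", format="turtle")
```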
Separate evaluation results
The following tables report the results (SAC 2016 first, then latest) of evaluating PIKES alone against the full gold standard described above, including the number of gold elements (instances, triples, edges), true positives (# TP), false positives (# FP), and false negatives (# FN) for each evaluated component. The links in the 'Alignments' column lead to separate reports showing how elements of the gold standard and elements extracted by PIKES were matched in the evaluation, indicating where false positive and false negative errors occurred.
SAC 2016 results:
Component | # Gold items | # TP | # FP | # FN | Precision | Recall | F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 153 | 148 | 9 | 5 | .943 | .967 | .955 | view |
Triples | 596 | 303 | 122 | 293 | .713 | .508 | .594 | (see below) |
DBpedia links | 18 | 14 | 6 | 4 | .700 | .778 | .737 | view |
types (VN) | 44 | 24 | 10 | 20 | .706 | .545 | .615 | view |
types (FN) | 53 | 19 | 7 | 34 | .731 | .358 | .481 | view |
types (PB) | 53 | 38 | 7 | 15 | .844 | .717 | .776 | view |
types (NB) | 37 | 29 | 13 | 8 | .690 | .784 | .734 | view |
roles (VN) | 94 | 46 | 16 | 48 | .742 | .489 | .590 | view |
roles (FN) | 108 | 28 | 28 | 80 | .500 | .259 | .341 | view |
roles (PB) | 119 | 68 | 14 | 51 | .829 | .571 | .677 | view |
roles (NB) | 55 | 32 | 19 | 23 | .627 | .582 | .604 | view |
owl:sameAs | 15 | 5 | 2 | 10 | .714 | .333 | .455 | view |
Edges | 171 | 131 | 16 | 40 | .891 | .766 | .824 | view |
Latest results (November 2015):
Component | # Gold items | # TP | # FP | # FN | Precision | Recall | F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 153 | 147 | 13 | 6 | .919 | .961 | .939 | view |
Triples | 596 | 335 | 136 | 261 | .711 | .562 | .628 | (see below) |
DBpedia links | 18 | 14 | 6 | 4 | .700 | .778 | .737 | view |
types (VN) | 44 | 24 | 10 | 20 | .706 | .545 | .615 | view |
types (FN) | 53 | 38 | 25 | 15 | .603 | .717 | .655 | view |
types (PB) | 53 | 37 | 7 | 16 | .841 | .698 | .763 | view |
types (NB) | 37 | 24 | 7 | 13 | .774 | .649 | .706 | view |
roles (VN) | 94 | 47 | 15 | 47 | .758 | .500 | .603 | view |
roles (FN) | 108 | 47 | 32 | 61 | .595 | .435 | .503 | view |
roles (PB) | 119 | 67 | 15 | 52 | .817 | .563 | .667 | view |
roles (NB) | 55 | 31 | 18 | 24 | .633 | .564 | .596 | view |
owl:sameAs | 15 | 6 | 1 | 9 | .857 | .400 | .545 | view |
Edges | 171 | 134 | 21 | 37 | .865 | .784 | .822 | view |
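For reference, precision, recall, and F1 in the tables above follow the standard definitions over the TP/FP/FN counts; the sketch below, with counts copied from the 'Instances' row of the SAC 2016 table, reproduces the reported figures.

```python
def prf1(tp, fp, fn):
    """Standard precision, recall, and balanced F1 from alignment counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 'Instances' row of the SAC 2016 table: # TP = 148, # FP = 9, # FN = 5.
print(tuple(round(v, 3) for v in prf1(148, 9, 5)))  # (0.943, 0.967, 0.955)
```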
Comparative evaluation results
In order to compare PIKES with FRED fairly, we had to modify and simplify the gold standard to account for two aspects of FRED:
- FRED does not return PB/NB frame types and PB/NB/FN frame roles, so we removed them from the gold standard;
- FRED does not support nominal predicates and argument nominalization, although it often represents the associated participation relations with arbitrary triples. For example, the span 'Its fighters' in sentence S3 is represented by FRED with the triple :fighter_1 :fighterOf :neuter_1 (where :fighter_1 and :neuter_1 are the instances denoted by the mentions 'fighters' and 'Its', respectively), whereas the gold standard and PIKES employ the nominal frame (disambiguated w.r.t. FN) :fighter_frame rdf:type fn:Irregular_combatants; fn:combatant :fighter_1; fn:side1 :neuter_1 (where :fighter_frame is a frame instance also denoted by 'fighters'). Thus, we automatically transformed the latter representation - both in the gold standard and in the PIKES output - into the FRED one (see the sketch after this list).
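A minimal rdflib sketch of this transformation on the example above; the namespace, the file name, and the way the 'reflexive' participant and the derived predicate are chosen are hypothetical simplifications of the actual rewriting.

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/gold#")  # hypothetical gold-graph namespace

def flatten_nominal_frame(graph, frame, self_participant):
    """Replace a nominal frame node with a FRED-style binary triple, e.g.,
    :fighter_frame (fn:combatant :fighter_1; fn:side1 :neuter_1) becomes
    :fighter_1 :fighterOf :neuter_1 (hypothetical rewriting rule)."""
    noun = str(self_participant).rsplit("#", 1)[-1].split("_")[0]
    predicate = EX[noun + "Of"]  # e.g., :fighterOf, derived from the noun
    others = [o for _, p, o in graph.triples((frame, None, None))
              if p != RDF.type and o != self_participant]
    for participant in others:
        graph.add((self_participant, predicate, participant))
    graph.remove((frame, None, None))  # drop the frame node and its triples

g = Graph().parse("s3_gold.ttl", format="turtle")  # hypothetical file
flatten_nominal_frame(g, EX["fighter_frame"], EX["fighter_1"])
# :fighter_1 :fighterOf :neuter_1 now stands in for the fn:Irregular_combatants frame
```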
The following tables show the results (SAC 2016 first, then latest) of evaluating FRED and PIKES against the simplified gold standard, including the number of gold elements for each evaluated knowledge graph component and links to separate reports showing how gold elements and elements extracted by FRED and PIKES have been aligned. PIKES exhibits better precision and recall than FRED for all the considered components, with differences in terms of F1 ranging from .059 to .221 for the SAC 2016 results and from .042 to .221 for the latest results.
SAC 2016 results:
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 137 | .930 | .869 | .898 | .937 | .978 | .957 | view |
Triples | 166 | .543 | .416 | .471 | .713 | .554 | .624 | (see below) |
DBpedia links | 18 | .615 | .444 | .516 | .700 | .778 | .737 | view |
types (VN) | 31 | .593 | .516 | .552 | .667 | .581 | .621 | view |
types (FN) | 26 | .550 | .423 | .478 | .762 | .615 | .681 | view |
roles (VN) | 76 | .547 | .382 | .450 | .722 | .513 | .600 | view |
owl:sameAs | 15 | .357 | .333 | .345 | .714 | .333 | .455 | view |
Edges | 155 | .869 | .555 | .677 | .937 | .768 | .844 | view |
Latest results (November 2015):
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 137 | .930 | .869 | .898 | .911 | .971 | .940 | view |
Triples | 166 | .543 | .416 | .471 | .698 | .584 | .636 | (see below) |
DBpedia links | 18 | .615 | .444 | .516 | .700 | .778 | .737 | view |
types (VN) | 31 | .593 | .516 | .552 | .667 | .581 | .621 | view |
types (FN) | 26 | .550 | .423 | .478 | .731 | .667 | .644 | view |
roles (VN) | 76 | .547 | .382 | .450 | .741 | .526 | .615 | view |
owl:sameAs | 15 | .357 | .333 | .345 | .857 | .400 | .545 | view |
Edges | 155 | .869 | .555 | .677 | .910 | .787 | .844 | view |
In line with the approach of [1], we also compared PIKES and FRED against an additional gold graph obtained by merging the outputs of both tools, cleaned of incorrect triples. By definition, this gold graph is a subset of the simplified gold standard discussed above. The goal of this additional evaluation, as noted in [1], is to comparatively evaluate each tool within the knowledge extraction tool space, i.e., considering only the correct triples that can be extracted by at least one tool. The following tables report the results obtained (SAC 2016 first, then latest). Again, PIKES exhibits better precision and recall than FRED for all the considered components, with differences in terms of F1 ranging from .059 to .269 for the SAC 2016 results and from .042 to .381 for the latest results. The larger differences are due to the recall of PIKES being generally higher than that of FRED, which means that the gold graph defined here tends to coincide with the correct answers of PIKES (this can be seen as a limitation of this kind of evaluation, which favors the system with higher recall).
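Set-theoretically, this additional gold graph can be characterized as in the sketch below, which assumes, for clarity only, that both tool outputs have already been rewritten in terms of gold-standard URIs via the alignment reports (file names are hypothetical).

```python
from rdflib import Graph

simplified_gold = Graph().parse("s1_gold_simplified.ttl", format="turtle")
fred = Graph().parse("s1_fred_aligned.ttl", format="turtle")    # hypothetical files
pikes = Graph().parse("s1_pikes_aligned.ttl", format="turtle")

# Correct triples extractable by at least one tool: the union of both outputs,
# restricted to the triples sanctioned by the simplified gold standard. By
# construction, the result is a subset of the simplified gold standard.
merged_gold = Graph()
for triple in (set(fred) | set(pikes)) & set(simplified_gold):
    merged_gold.add(triple)
```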
SAC 2016 results:
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 136 | .930 | .875 | .902 | .937 | .985 | .961 | view |
Triples | 115 | .543 | .600 | .570 | .713 | .800 | .754 | (see below) |
DBpedia links | 14 | .615 | .571 | .593 | .700 | 1 | .824 | view |
types (VN) | 27 | .593 | .593 | .593 | .667 | .667 | .667 | view |
types (FN) | 17 | .550 | .647 | .595 | .762 | .941 | .842 | view |
roles (VN) | 51 | .547 | .569 | .558 | .722 | .765 | .743 | view |
owl:sameAs | 6 | .357 | .833 | .500 | .714 | .833 | .769 | view |
Edges | 134 | .869 | .642 | .738 | .937 | .888 | .912 | view |
Latest results (November 2015):
Component | # Gold items | FRED Precision | FRED Recall | FRED F1 | PIKES Precision | PIKES Recall | PIKES F1 | Alignments |
---|---|---|---|---|---|---|---|---|
Instances | 135 | .930 | .881 | .905 | .911 | .985 | .947 | view |
Triples | 118 | .543 | .585 | .563 | .698 | .822 | .755 | (see below) |
DBpedia links | 14 | .615 | .571 | .593 | .700 | 1 | .824 | view |
types (VN) | 27 | .593 | .593 | .593 | .667 | .667 | .667 | view |
types (FN) | 19 | .550 | .579 | .564 | .613 | 1 | .760 | view |
roles (VN) | 51 | .547 | .569 | .558 | .741 | .784 | .762 | view |
owl:sameAs | 7 | .357 | .714 | .476 | .857 | .857 | .857 | view |
Edges | 136 | .869 | .632 | .732 | .910 | .897 | .904 | view |
References
1. A Comparison of Knowledge Extraction Tools for the Semantic Web.
By Aldo Gangemi.
In ESWC 2013 Proceedings, Springer Berlin Heidelberg, volume 7882, pages 351-366, 2013.
[online version]
2. Knowledge Extraction Based on Discourse Representation Theory and Linguistic Frames.
By Valentina Presutti, Francesco Draicchio, Aldo Gangemi.
In EKAW 2012 Proceedings, Springer-Verlag Berlin, pages 114-129, 2012.
[online version] [web site]