Quality Metrics For RDF Graph Summarization

Tracking #: 1648-2860

Mussab Zneika
Dan Vodislav
Dimitris Kotzinos

Responsible editor: 
Guest Editors IE of Semantic Data 2017

Submission type: 
Full Paper
RDF Graph Summarization pertains to the process of extracting concise but meaningful summaries from RDF Knowledge Bases (KBs) representing as close as possible the actual contents of the KB. RDF Summarization allows for better exploration and visualization of the underlying RDF graphs, optimizstion of queries or query evaluation in multiple steps, better understanding of connections in Linked Datasets and many other applications. In the literature, there are efforts reported presenting algorithms for extracting summaries from RDF KBs. These efforts though provide different results while applied on the same KB, thus a way to compare the produced summaries and decide on their quality, in the form of a quality framework, is necessary. So in this work, we propose a comprehensive Quality Framework for RDF Graph Summarization that would allow a better, deeper and more complete understanding of the quality of the different summaries and facilitate their comparison. We work at two levels: the level of the ideal summary (or ideal schema) of the KB that could be provided by an expert user and the level of the instances contained by the KB. For the first level, we are computing how close the proposed summary is to the ideal solution (when this is available) by computing its precision and recall against the ideal solution. For the second level, we are computing if the existing instances are covered (i.e. can be retrieved) and in what degree by the proposed summary. We use our quality framework to test the results of three of the best RDF Graph Summarization algorithms, when summarizing different (in terms of content) and diverse (in terms of total size and number of instances, classes and predicates) KBs and we present comparative results for them. We conclude this work by discussing these results and the suitability of the proposed quality framework in order to get useful insights for the quality of the presented results.
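
As a rough illustration of the two levels described in the abstract, the sketch below computes precision, recall, and F-measure of a proposed summary against an ideal one, and the fraction of instances covered by the summary. It is a simplified sketch, not the authors' equations: it assumes a summary can be reduced to a set of class names, and the function names and toy data are hypothetical.

```python
def precision_recall_f1(proposed, ideal):
    """Schema level: compare a proposed summary with the ideal summary.

    Both arguments are iterables of class names (a simplifying assumption).
    """
    proposed, ideal = set(proposed), set(ideal)
    matched = proposed & ideal
    precision = len(matched) / len(proposed) if proposed else 0.0
    recall = len(matched) / len(ideal) if ideal else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def instance_coverage(summary_classes, instances_by_class):
    """Instance level: fraction of distinct instances whose class is in the summary."""
    summary_classes = set(summary_classes)
    all_instances = set()
    covered = set()
    for cls, instances in instances_by_class.items():
        all_instances |= set(instances)
        if cls in summary_classes:
            covered |= set(instances)
    return len(covered) / len(all_instances) if all_instances else 0.0


# Toy data, invented for illustration only.
proposed = {"Person", "Painter", "Museum"}
ideal = {"Person", "Painting", "Museum", "Artist"}
instances_by_class = {
    "Person": {"a", "b"},
    "Museum": {"m"},
    "Painting": {"p1", "p2"},
}

p, r, f1 = precision_recall_f1(proposed, ideal)
coverage = instance_coverage(proposed, instances_by_class)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} coverage={coverage:.3f}")
```

In this toy case two of the three proposed classes appear in the ideal summary, and three of the five instances belong to a class kept in the summary.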

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 18/May/2017
Major Revision
Review Comment:

This paper proposes a quality framework for RDF graph summarization. It measures both the schema-level and the instance-level coverage of an RDF dataset achieved by RDF graph summarization approaches. The connectivity of a summary is also considered. The framework is used to evaluate three existing approaches based on a number of real-world RDF datasets.

The paper addresses an important research problem in the Semantic Web area. There have been many and various approaches to dataset summarization, but there is a lack of widely accepted evaluation criteria or an extensive empirical evaluation. This paper has the potential to meet the challenge, though its current form requires major revision.

I have two concerns about the proposed framework.

1. All the evaluation criteria are defined over knowledge patterns. The authors claim that their framework can be used to evaluate any RDF summarization algorithm. How could you prove that all such algorithms can be appropriately transformed into knowledge patterns? In particular, some approaches not mentioned in the related work section define a summary of an RDF graph as a subgraph extracted from it, rather than in an aggregate schema-like form, say "Structural Properties as Proxy for Semantic Relevance in RDF Graph Sampling" (ISWC '14) and "Generating Illustrative Snippets for Open Data on the Web" (WSDM '17). I'm not sure if it is possible and appropriate to regard such summaries as knowledge patterns.

2. Although it is claimed that the framework presents a *comprehensive* way to measure the quality of RDF summaries, all the evaluation criteria are essentially based on the same principle, i.e., a good summary *accurately* characterizes the original data. This has been thoroughly discussed in [11] and [28], in which various accuracy metrics have been proposed and used. What is the difference between those metrics and the ones proposed in this paper? In addition, apart from accuracy, some other factors also influence the quality of a summary, such as conciseness and comprehensibility, which are not addressed in this *comprehensive* framework.

Some details in the framework and the experiments are to be clarified.

3. In Equation (5), how is Nps computed? What do you mean by *represent* a class?

4. Prior to Equation (10), it is true that the algorithms do not invent new properties, but isn't it possible that an algorithm chooses a property that is not included in the ideal summary? Do you assume that the ideal summary covers all the properties? Is this assumption reasonable? Why don't you make a similar assumption on classes?

5. In Section 6.2, what is an untyped dataset? By removing all the entity and property types, how do your evaluation criteria work? They are exactly based on classes and properties.

6. In Section 6.2.1, how are the ideal summaries generated? Is it possible that different experts would generate different ideal summaries?

7. I applaud the considerable number of datasets used in the experiments. However, they are not as heterogeneous as claimed. In particular, DBpedia is not considered, although it uses many more classes and properties than any other dataset used in the experiments. If for some reason DBpedia could not be tested, a discussion would be appreciated.

The writing should be significantly improved. Minor issues include but are not limited to the following:
- Abstract: optimizstion --> optimization
- Page 1: build and described --> built and described
- Page 2: RDG --> RDF
- Page 2: PDF --> RDF
- Page 3: RDF Schema describe --> RDF Schema describes
- Page 4: There is a question mark in a \cite environment.
- Page 6: In Equation (4), an unpaired bracket should be removed.
- Page 14: This is also explains --> This also explains
- Page 15: This is explain --> This explains

Review #2
By Melike Sah submitted on 11/Jun/2017
Major Revision
Review Comment:

This paper presents metrics for RDF graph summarization. They introduce schema and instance level precision, recall and F-Measure-based quality metrics. They also use connectivity as a quality measure. The proposed metrics are evaluated on different datasets using three RDF summarization methods. Although the paper addresses a gap in the literature, it requires substantial revisions as follows:

- The precision, recall, and F-measure metrics, as well as the connectivity metric, should be mentioned in the abstract.
- The novelty of the proposed metrics should be explicitly stated in the introduction. There are already class-based (and precision-based) metrics. Is the novelty of the framework the use of properties in the calculations, together with both instance and schema metrics? Or is it the addition of coverage and connectivity? Please clarify.
- Bisimilarity in Section 2 is not clear. Please give an example, as you did in the previous case.
- Figure 1 has very low resolution.
- In the quality assessment model, it is very hard to follow the metrics without an example RDF graph and an ideal summary (ground truth summary). I suggest adding a detailed example here. In addition to the formulas, explain what each formula means on this example, so that the reader can follow easily.
- Precision and recall for classes: equation (1) directly uses properties. Why do you not use direct class matches? Instead, you count property-level matches. Therefore, the title “precision and recall for classes” seems incorrect; it is more like “property level recall”. You need to justify this.
- After reading the paper, I do not clearly understand the difference between a knowledge graph and a knowledge pattern. Please explain.
- The typeof link in equation 3 is not clear. Why do you differentiate typeof links? Is there an error in equation 3 (or in typeof(pa))?
- Equation 10, as well as schema precision at the property level, needs more explanation.
- The definition of property instances seems wrong to me. Give an example of a property instance.
- Instances(pa) in equations 15 and 16: what is the relation to typeof here? Please explain.
- Why did you use the absolute value in equation 17? Can the coverage be negative?
- Precision and recall at the property level: I do not understand this section. You were actually using properties at the class instance level (or are you using only typeof in class precision/recall?), and here you use the rest of the properties. You need to state this explicitly in the text. Again, why did you use coverage here? Please clarify.
- Connectivity, equation 31: how do you actually measure this? Clarify.
- In my opinion, the major drawback of the paper is that it does not mention how the quality framework is implemented. Do you have software to assess this? Do you use SPARQL queries, etc.? There must be a quality framework section that explains these details.
- In the experiments, it is not clear how the RDF summarization algorithms are implemented. Did you implement them according to their papers, or did you use the authors' software? If the latter, links to the software should be given. If you implemented them yourselves, you should discuss the procedure.
- Datasets: the whole paper is about measuring the quality of RDF summaries with respect to an ideal summary, but the datasets section gives no details of the ideal summaries. How did you generate the ground truth RDF summaries (did you take them from somewhere, or produce them manually for each dataset)?
- Again, how did you use these datasets? Did you store them locally (if so, some are very large datasets containing millions of triples; how did you store them)? Or did you use online versions?
- How did you delete typeof links from these datasets (from the stored copies or from the online ones)?
- Why did you compare quality with and without typeof links? To me, this is not justified in sufficient detail.
- Why are the typed results not good for Explod in Table 4? Why does Explod perform worse on the Bank dataset? Again, clarify.
- In the instance-level evaluations, I do not understand how you measure the class/property instance metrics when the data is untyped. In that case the information used in the equations does not exist. Does this mean that you modified the equations for the untyped case, or omitted some equations from the calculations? This needs to be clarified.
- I think there is a mistake in the connectivity metric in Table 6. Isn’t it normalized?
- The following typos should be corrected as well: ? (unknown references), propeties, PDF, untped

Review #3
Anonymous submitted on 17/Jul/2017
Major Revision
Review Comment:

The paper proposes a quality framework for RDF graph summarization to allow a better understanding of the quality of different RDF summaries and facilitate their comparison. The framework measures the quality of an RDF dataset at two levels: at the schema level, where an ideal summary is used to measure the precision and recall, and at the instance level, where it calculates the coverage that an RDF summary provides for classes and property instances.

The paper is well-written but needs better positioning in the research field. In particular, one of the major issues with the paper is its lack of justification of novelty. For instance, the paper needs to better position itself relative to the previous precision-based quality models. The authors should also deal with this reasonable criticism: “adding recall to previous models can’t be considered novel enough, especially since whenever a benchmark exists, precision and recall can be identified”.

Furthermore, although the paper describes two groups of metrics for evaluating RDF summaries, it fails to describe a systematic approach for applying the metrics. In other words, the proposed framework would have even better utility if it were described in a generic way (for example, by defining an algorithm that takes an RDF summary produced by a summarization algorithm as input and then checks different conditions to see which metrics are appropriate, etc.), or through a figure that illustrates the overall framework. If the paper is accepted, I would encourage the authors to add an implementation section to describe the applicability of the proposed framework, including running the various metrics with the corresponding SPARQL queries.

As for the experiments, the diverse characteristics of the datasets that have been selected for experimentation are noteworthy. However, I was expecting more qualitative analysis of the results. For instance, the authors have to better justify the poor class precision Pc reported in Table 4.

There are some typos (a few examples below) and grammatical issues throughout the paper that the authors must address in the revision:
Page 15/section 6.2.2: This is explain why it is important
Page 15/section 6.2.3: while the two others always have always 1