Data-driven Assessment of Structural Evolution of RDF Graphs

Tracking #: 2160-3373

Carlos Bobed
Pierre Maillot
Peggy Cellier
Sébastien Ferré1

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Since the birth of the Semantic Web, numerous semantic data repositories have appeared. The applications that exploit them rely on the quality of their data through time. In this regard, one of the main dimensions of data quality is conformance to the expected usage of the vocabulary. However, the vocabulary usage (i.e., how classes and properties are actually populated) can vary from one base to another. Moreover, through time, such usage can evolve within a base and diverge from the previous practices. Methods have been proposed to follow the evolution of a knowledge base by following the evolution of their intentional schema (or ontology); however, they do not capture the evolution of their actual data, which can vary greatly in practice. In this paper, we propose a data-driven approach to assess the global evolution of vocabulary usage in large RDF graphs. Our proposal relies on two structural measures defined at different granularities (dataset vs update), which are based on pattern mining techniques. We have performed a thorough experimentation which shows that our approach is scalable, and can capture structural evolution through time of both synthetic (LUBM) and real knowledge bases (different snapshots and updates of DBpedia).
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 13/Apr/2019
Major Revision
Review Comment:

The paper presents a data-driven approach to asses the structural evolution of RDF graphs. The approach relies on a pattern mining approach, i.e., minimal description length, to asses the evolution of a knowledge base. The authors propose 3 measures for calculating the structural similarity between a given graph and its updated versions, i.e., compression (already introduced in a previous work); classification; and structural information across states. These measures are then used for data evolution assessment on a global level between snapshots, and local level for fine-grained updates. The approach is evaluated on one synthetic dataset, generated using the LUBM data generator; and on the DBpedia dataset, using different snapshots in the period between 2010 and 2016.

The paper is well structured, well written and easy to follow and understand. The paper addresses an interesting and important problem, and the approach introduces some interesting ideas, however there are several major drawbacks.

- The proposed approach seems a bit trivial and it requires further motivation and explanation what makes this approach so suitable for the given problem. For example, many existing approaches could be used to measure the coherency (with small modifications) of the data and track the changes over time, e.g., path ranking algorithms [1], graph kernels [2], distributions of properties/classes [3], even KG embeddings. It would be interesting to see a discussion in this direction and maybe implement such approaches as baselines, e.g., using existing KG embeddings it is easily to calculate the coherency of the classes in the ontology based on the vectors of the instances of the given class; increased coherency after an update might indicate improvement of the data, while decreased coherency might indicate errors in the data or outliers. Same metric could be used to monitor the changes on instance level, predicate level or class level.

- As mentioned in the related work section, there is a handful of approaches that are used to address the same problem (e.g. [20-23], however the authors give a short description of those approaches without providing detailed comparison between their approach and the related approaches. To strengthen the claims of the approach, a proper theoretical and empirical evaluation comparison between the proposed approach and the related approaches must be conducted.

- While monitoring the evolution of a knowledge base is very important, it would be nice to have a discussion or demonstration how much a given application in the real-world is affected by such a knowledge evolution. And how would such an approach be used to circumvent possible problems, and improve the quality of the application, i.e., how would such a solution be deployed along a real-world application and can we measure the impact of the solution?

* Minor comments:

- To improve the readability of the paper, it might be useful to add a figure of the architecture of the approach pipeline.
- The experimental evaluation of the approach cannot be considered as a contribution, thus it should be removed from the contribution claims list.
- How do you define "a very regular knowledge base"
- Some of the figures have two captions, i.e., one in the figure itself, and one under the figure.
- Tables and figures should be on the same page where they are referenced on, e.g. Table 1, Fig 1 and 2.
- Table 8 is not referenced at all.

- Section 1: However, a knowledge base may not, in practice, conformS exactly..." -> "conform"
- Section 2: "The standard code table is a specific code table which contains all and only singletons." -> "all and only singletons"???
- Section 3: "Bags of transactions"-> list/set of transactions would be better.
- Section 3.3: "all directors have directed at least a movie..." -> "all directors have directed at least ONE movie..."
- Footnote 3: "...more such their usage"???
- Section 5.1.: "Figure 2 shows the timeline of such an evaluation." -> I believe it should be evolution instead of evaluation.
- Section 5.2.: "but we have not to neglect"-> "but we SHOULD not neglect.

[1] Lao, N., Cohen, W.W.: Relational retrieval using a combination of path constrained random walks. Mach. Learn. 81(1), 53–67 (Oct 2010)
[2] de Vries, Gerben Klaas Dirk, and Steven de Rooij. "Substructure counting graph kernels for machine learning from RDF data." Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015): 71-84.
[3] André Melo, Heiko Paulheim: Synthesizing Knowledge Graphs for Link and Type Prediction Benchmarking. ESWC (1) 2017: 136-15

Review #2
By Jedrzej Potoniec submitted on 25/Apr/2019
Major Revision
Review Comment:

The paper presents an interesting approach to measure the evolution in time of an RDF graph, based purely on the data contained in the graph and not on other artifacts, e.g., an ontology. The approach is based on frequent pattern mining and using the patterns for compressing the graph. Most of the contribution of the paper is presented in Section 4, where methods for measuring structural similarity are presented.

I find the idea and approach noteworthy and original, but the experimental evaluation is not very convincing and the quality of writing is quite poor. Detailed remarks are presented below.

* It seems to me that there are three names for the same thing: “semantic web repository”, “base”, “knowledge base”. Please select one and stick with it throughout the whole paper.
* “Semantic Web” is possibly not the best keyword when submitting to the Semantic Web Journal.

Section 1:
* Page 1: A movie application will indeed expect some structure in the graph, but this is completely different problem: the application must have some assumptions, which should be formalized (e.g., as SHACL constrains) and used to test a data source before it is used. Few sentences later the authors remark that the vocabulary should be seen as a statistical distribution, but it only underlines the problem: if 80% of movies have a release date, then the remaining 20% do not have it and, when processing them, the functionality of the movie application is degraded.
* Page 2, left column, lines 19-32: I find it confusing to consider “updates” and “snapshots” as local and global view of the same thing. To me, a snapshot is a description of a fixed state, while an update is a description of a process, a transition between two fixed states. Now both of these can be defined on different granularity levels. In particular, in my opinion, Definition 5 presents an update as two local snapshots: before the update and after the update.
* Page 2, right column, lines 13-15: “our proposal can discover those implicit categories”. I find this to be a clear reference to the area of ontology learning, particularly to finding new subclasses. Please discuss works related to this topic, e.g., Jedrzej Potoniec, Agnieszka Lawrynowicz:
Combining Ontology Class Expression Generation with Mathematical Modeling for Ontology Learning. AAAI 2015: 4198-4199.
* Page 2, right column, lines 25-26: “consolidated versions of a KB (snapshots)” → “snapshots (consolidated versions of a KB)”

Section 2.1:
* Page 3, left column, lines 23-24: “supp(…) = 3, which means that”. It should be the other way around: “{a, b} is supported by …, which means supp(...)=3”

Section 2.2:
* “Cover and Usage”: This paragraph is unclear and relies too much on intuition of the reader. There should be formal definitions of the introduced concepts as well. Moreover it is not clear for me why a greedy approach to covering is used.
* “Code Length”: L(code(X)) is defined as a logarithm, which is a real number. It is thus hard to expect that there will be an actual code of such length. It is also not immediately obvious in what sense this is a correct definition. Consider a database of 65 transactions: 64 transactions {a} and 1 transaction {b}. usage({a})=64, usage({b})=1 and, respectively and assuming binary logarithms, L(code({a}))=-log(64/65)=0.02, L(code({b}))=-log(1/65)=6.02. Neither of these lengths makes sense to me, as I can neither encode {a} using 0 bits, nor need to use 6 bits to encode {b}. I admit I may be mistaken in my reasoning here, but if so, this also should act as an indicator that perhaps this section needs some extending.
* L(CT|D) “D” is irrelevant here, as length of the code table does not depend on the database it was derived from. It is also not immediately clear for me why there is L(code_SCT(X)) in the sum. If I understand correctly, it is possible for X\in CT to contain more than one element (|X|>1), and then by definition of SCT X does not belong to SCT and code_SCT(X) does not exist.
* Table 1: using colors as a sole way to encode different codes is nice as long as your color vision is right and your medium supports color. This is particularly true for SCT, where all the codes are black and it seems that pairs {a}/{b} and {d}/{e} share the same code, whereas if I understand correctly, they share the same code length, but not the same code. Please provide additional way of coding, which will behave well while printed in black and white.

Section 3.1:
* Page 4, right column, line 29: “triples” → “properties”
* Definition 3: at this point it is unclear how the sets C and P are exactly defined. The main difference is between “intensional” definition C={c|(c, rdf:type, rdfs:Class)\in B} and “extensional” definition C={c|(*,rdf:type,c)\in B}. This becomes clear only after reading Definition 4.
* This is a side remark and I will not require it addressed in the revised version: expressiveness of Definition 3 is very limited. For example, adding additional birth dates to a person will remain unnoticed. I would suggest, as a future work, using more expressive language here or, possibly, even a pattern mining approach to define features. For example of such an approach, consisting of feature construction by pattern mining and propositionalisation, see: Agnieszka Lawrynowicz, Jedrzej Potoniec: Pattern Based Feature Construction in Semantic Data Mining. Int. J. Semantic Web Inf. Syst. 10(1): 27-65 (2014)
* Definition 4: P(I_PCB) – I think “P” is an undefined symbol here.
* Page 5, left column, line 25: “in related to” → “is related to”

Section 3.2:
* Definition 5: AR(u) should be formally defined.

Section 3.3:
* I find abbreviating the names confusing and prefer using full names instead.
* The running example with its X, Y and Z is not very user friendly. Perhaps it could be a description of a real movie?
* Page 6, left column, line 27: “have yet” → “are yet”

Section 4:
* The names of subsections are very strange. If I guess correctly they are abbreviations and stand for, e.g,, “Measuring Structural Similarity with Patterns Through Compression”. Please do not do that.

Section 4.1:
* English in the first sentence is particularly strange.
* Definition 7: I find it surprising that similarity is (a) not symmetric; (b) decreasing with objects becoming more and more similar. Please consider renaming it.
* Line 31: “Typically” → Why typicaly? What makes this typical? What are non-typical examples?

Section 4.2:
* Please provide appropriate equations when discussing Laplace smoothing.
* This section introduces the word “codify” to denote encoding using a code table. I find this very surprising and I think that “encode” would be a much better choice (of course, this is relevant for all the other uses of “codify” in the paper as well)
* Please provide more details for calculating L(q’|CT_1): expand the sums and preferably compute the numeric value (the same goes for other calculations in the running examples)

Section 4.3:
* Page 8, left column, line 44: “the number of bits” → explain why is so.
* Page 8, right column, lines 5-12: “To compensate...” → please explain in more details the rationale and show that it has some desirable properties. Currently it looks very ad-hoc.
* Page 8, right column, line 15: “it can be proved” → Please provide the proof.
* Page 8, right column, lines 29, 30: “status of the knowledge base” → “state of the knowledge base”
* Page 8, right column, line 43-45: I think that whenever a singleton is an argument of the code function, it should be code_SCT and not code_CT

Section 5.1:
* Page 9, right column, lines 25, 26: “assymetric” → “asymmetric”
* Page 9, right column, lines 41-47: Why is this relevant? It does not seem to be used anywhere in the paper.

Section 5.2:
* I do not understand Figure 2 and currently I do not think it improves the paper. Please describe it more clearly or remove it.
* Page 10, left column, lines 16-17: This would require some sort of aggregation, which is not discussed here.
* Page 10, right column, lines 20-23: The scenario with different ETL pipelines seems to me quite forced.
* Page 10, right column, lines 31-45: I think that you should disentangle interpretation and administrator actions. If I understand correctly, the interpretation is objective/task independent: a positive value indicates a converging structure, while a negative value indicates a diverging structure. The administrator action is, indeed, subjective/task dependent.
* Page 10, right column, lines 48-51: How the administrator is supposed to do the comparison?

Section 6:
* There should be clearly stated and highlighted hypotheses/research questions for all the experiments. Otherwise it is impossible to judge whether the experiment actually makes any sense.

Section 6.2:
* I find most of this section uninteresting. LUBM is fine for benchmarking, but of little use otherwise. In particular, discussing behavior of measures on a synthetic dataset is quite pointless, as this does not reflect the expected behavior in the real-world scenario.

Section 6.2.2:
* Page 12, left column, new definition for sim: This definition uses a new function “ratio”. I assume this is the same as “L%” defined earlier in the paper. If it so, then this is a theoretical property of the measure and you should discuss it earlier in the paper, provide a proof and remove Section 6.2.2 entirely.

Section 6.2.3:
* The hypothesis should be changed to theorem and proved: what exactly is a disrupting update should be defined and then, from this, you should show that this implies that the delta will be positive. (The same goes for a restoring update.) Currently, you try to simulate them, but the simulation is flawed and, in the last paragraph, you arrive at conclusion that sometimes a disrupting update actually improves the situation.
* Similarly, you experimentally show that the higher probability of disruption, the wider the range of deltas. This is some kind of monotonicity and, again, should be proved: if an update X is more disruptive than Y, then delta(X)>=delta(Y).

Section 6.3.1:
* Page 13, right column, lines 15, 16: As far as I know, dbp: namespace is not part of the DBpedia ontology. It represents raw results of information extraction from Wikipedia. I think you would be better off not using this namespace: the graphs would be smaller and more regular.
* Page 13, right column, line 38: “compression ratio between 0.260 and 0.300” → but compression ratio for 2016-10 is 0.306. Also, I wonder if compression ratio is the same as “L%”?
* Table 6 and the text describing it in the left column of page 14: honestly, I do not clearly see the two clusters as described, e.g., sim(3.6|3.9)>sim(2016-10|2015-10)>sim(2016-10|3.9). If this definition of clusters is the result of some algorithm, then it must be clearly stated.
* Page 14, left column, last line: “shows scalability of our approach” → I appreciate that you were able to perform the comparison in 5 minutes, but you fail to mention that, in order to perform the comparison, you spent overall 216 hours computing the code tables. Also, seeing that you had to extend the deadline for computing the code tables for 2015-10 and 2016-10, I suspect that scalability could be an issue.
* Page 14, right column and Figure 6: It seems to me that you claim that the complexity is O(#NoSingletonCodes*#CodifiedTransactions). Would it be possible to explain why is that instead of proving it by line fitting?
* It seems to me that Table 8 is not referenced anywhere in the text. Also, I am a bit surprised that the number of non-singletons is the lowest for 2014. Why is that?

Section 6.3.3:
* Page 15, right column, lines 39-43: I disagree with the conclusion that there exists a source using an outdated schema. It may be as well a sign of data evolution. If I understand correctly, for this to happen, it is enough that there was a group of resources with little description in 2014. Most of them got described in 2015 and the updates fits 2015, but a small number remained underdescribed and fits 2014 schema better.
* Page 16, left column: “punctual”, “raise awareness” are probably not the best wordings in this context.
* Figure 7 is quite large, but not particularly interesting.

Section 7:
* The problem considered in the paper is not new and this I would rather see this section in the beginning of the paper, explaining the novelty of the approach.
* This section is mostly concern with methods for assessing the evolution of the data. However, as the proposed approach is rooted in pattern mining, I find this section lacking in references, even to works published in the same journal it was submitted to, e.g.,
* Page 17, right column, line 39: „obtention” is probably not quite the right word there.

Review #3
By Ilaria Tiddi submitted on 11/May/2019
Major Revision
Review Comment:

The paper tackles the problem of RDF dataset quality over time and presents an approach for assessing it using the evolution of the vocabulary used in the graph. The evolution is measured in terms of a sequences of insert/delete operations at a different granularity (updates or snapshots). The question becomes how to compare updates to snapshots, i.e. identifying the outdates parts that alters the structural pattern of a newer snapshot. Two new measures at dataset- and update-level of granularity are presented along with snapshot-specific one, presented by the authors in [8]. Both measures are based on frequent itemset mining, and in particular to the Minimum Description Length (MDL) principle which allows the discovery of altered structural patterns. Experimentation performed on different snapshots and updates of both a synthetic dataset as well as DBpedia show the scalability of the approach even over time. The authors also introduce a methodology to apply these measures and assess the evolution of a graph.

I think the paper is very interesting and mature for a journal publication. I have some concerns, expressed below, which I think should be tackled in a new version of the paper.

+ well written
+ interesting problem and contribution
+ sound approach and evaluation

- lack of a clear example
- evaluation could be extended on at least another real-world dataset
- discussion on the type of evolution patterns identified would be a plus

### originality
The paper is original in what two structural measures are presented to deal with comparisons between updates and snapshots. Few related work in the same area, but the contribution is original and the proposed methodology for dataset evolution assessment is valuable for our community.

In general, I would have really appreciated a deeper analysis around the evolution patterns observed by the authors. Which type of changes have the authors identified with their data? How do patterns look like? If (according to Sect 7) "it is trivial to extract the patterns in a more human-readable format", why not showing them to the reader? Also, from the way I understood the approach, I am not entirely clear whether it would be better to talk about vocabulary usage evolution or rather schema usage evolution? Which dynamics characterise the evolution patterns across times, and why? This is mentioned as part of the next steps to undertake, but a minimal speculation on the results would be interesting and give additional value to the paper.
I would be very curious to know if there are different "divergence patterns" depending of the type/domain of a dataset, for instance, and if techniques need to be adapted. The only way to know would be to try to assess different types of RDF datasets (different = different domains). See my point below on expanding the evaluation.

### significance of the results
Results are sound and significant. The evaluation is well thought and nicely structured, held on a synthesic data set as well as DBpedia for a real use-case scenario. Being a journal paper, evaluating on at least another existing dataset would have been of great added value.

I am also wondering if the approaches mentioned in Section 7 (eg 20, 21, 22) could not be partly compared, at least on the synthetic dataset (even if their goal is different than the current paper). This would significantly strengthen the paper.

### quality of writing

Nicely written, the paper is clear and has a very easy-to-follow structure.
I have a big problem with the lack of a clear real-world example throughout the paper, which would help better following the narrative. There is a number of obscure sections throughout Section 2 - 4 (Table1, Section 2.1, Section 4.2) and I think a real-work example picked from the authors’ experience would definitely improve the clarity of the paper. I assume a non-expert reader can struggle in understanding the more formal bits. Perhaps using the example of section 3.3 since the beginning?
Some things in Section 2 go too quick too, it seems like the authors want to rush towards the following section. In Section 2.1, for example : "there exists a lot of algorithms to extract the frequent itemsets from a transactional database (for instance [12–14])" — why not to give a shot description of them? Same for the MDL principle described in Section 2.2?

Minor remarks :

follow the evolution of a knowledge base by following the evolution >> rephrase
open semantic data >> open data, structured data
will have some of its functionalities become unavailable >> some of its functionalities might be affected
a knowledge base may not, in practice, conforms >> a knowledge base may not, in practice, conform
line 38, page 1 : makes it possible >> make possible
Section 3
Complexity of frequent pattern mining approaches in related to >> Complexity of frequent pattern mining approaches is related to
Section 3.2
What does "distant parts of the graph" means precisely?
Section 4.1
but not the converse >> but not the inverse (explained? perhaps represented?)
Section 4.3
p 9, line 20 would have changed completely its signs >> their signs
line 20 deviate >> deviates
Section 5.1
line 25/26 assymmetric >> asymmetric
Section 5.2
line 16 left diverge >> diverges
line 29 left >> Effect assessment (and everywhere else)
Section 6.2.3
line 18 right to this effect >> to this end
line 20 DBP >> DBpedia
Section 7
line 25 , right : consequencies >> consequences
something wrong with line spacing in page 17, right
References :
Why are they not ordered?