Review Comment:
[References like "at page XX line YY column (left|right)" will be shortened using the pattern pXXlYY(l|r).]
Summary
=======
This article tackles the problem of data quality for RDF datasets loaded into a decentralized system that relies on blockchain technology. The authors name their system DCQE. The article is structured in 5 sections. The authors first present the general context of their study in the Introduction, citing related articles dealing with data quality for RDF datasets. They then describe their quality evaluation model in Section2 before presenting their system in Section3. Finally, they report some experiments in Section4 before concluding in Section5.
Nowadays, the evaluation of RDF data quality is a hot topic, since more and more RDF datasets are available and prone to change (dynamicity of the triples, which can often be updated). This article proposes a novel approach to RDF data quality evaluation in the context of a decentralized architecture, using the blockchain to record transaction information and quality evaluation results.
Major Comments
==============
I have several *major* concerns:
I/ The paper is not self-contained.
> Indeed, it is hard to follow for readers who do not have prior knowledge of the Semantic Web, decentralized architectures, and blockchain technologies. I admit that the Semantic Web Journal mainly addresses Semantic Web-oriented readers; nonetheless, key concepts could be recalled briefly, e.g. RDF and SPARQL (which are so far not cited in the paper)… More generally, a preliminary/background section is missing to recap what RDF, SPARQL, and the decentralized strategy are, and *most importantly* to present the blockchain technology, which is not yet obvious to everybody.
II/ There are oddities in the formulæ presented in Section2.
> My concern mainly deals with formula (2) and, as a consequence, with all the formulæ that involve it. The authors define the "number of subject average attributes" as VP=1-SPO/DS, where SPO and DS are respectively the "number of triples" and the "number of unique subjects". My problem with this formula is that its values lie in ]-∞;0], which then leads to strange results (e.g. QRDF could be lower than zero)… Indeed, let us consider the two extreme cases:
- the dataset contains k triples with k different subjects, thus SPO/DS=k/k=1 and then VP=0;
- the dataset contains k triples which *all* have the same subject, thus SPO/DS=k/1=k and then VP=1-k, which tends to -∞ as k goes to infinity…
As a consequence, the following sentence "The larger the value, the more data sets use the triples to describe the subject." (see p2l17r) is false!
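The two extreme cases above can be checked mechanically. A minimal sketch (the function name `vp` is mine; SPO and DS follow the paper's definitions):

```python
def vp(spo, ds):
    """Formula (2) as defined in the paper: VP = 1 - SPO/DS,
    where spo is the number of triples and ds the number of
    unique subjects (so spo >= ds >= 1)."""
    return 1 - spo / ds

k = 1000
# Case 1: k triples with k distinct subjects -> VP = 0 (the maximum).
assert vp(k, k) == 0
# Case 2: k triples all sharing one subject -> VP = 1 - k,
# i.e. -999 here, and unbounded below as k grows.
assert vp(k, 1) == 1 - k
```

So VP never exceeds 0, contradicting the "larger value" interpretation quoted above.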
III/ There is no Related Work section!
> The only paragraph briefly discussing some previous studies is located in the Introduction. In my opinion, topics such as Data Quality and Decentralized Systems (for RDF or not) should be presented in a dedicated section to properly position the novel aspects of the current study with respect to existing research.
IV/ The system is not presented completely.
> The section describing DCQE suffers, in my opinion, from a lack of details. First of all, it would be interesting to have access to the code of the system if it is open source; if it is private, I would have liked a justification. Indeed, access to the system's code would allow reviewers to look at the project and to test it (even to reproduce the experiments, see V/ in this regard).
> In addition, the system description is too high-level and should be more detailed, for instance through several examples explained step by step.
> Finally, I also have some specific remarks such as:
- What query language is used? I suppose SPARQL?
- What decentralized system is used? Is it, for instance, IPFS?
- The authors do not seem to take monetary aspects into consideration because the "price factor is different in different systems"; it would have been great to have a comparison between several of them…
V/ The experimental section must, in my opinion, be redone completely.
> First of all, the experiments are not reproducible by readers of the paper, which to me is really problematic. (This remark should be considered together with the remarks dealing with the sources of DCQE, see IV/ above.) For example, the authors declare "The experimental data sets use the ArchiveHub data set." without providing a citation, a reference, or a footnote; I typed this into my favorite search engine and was not able to find any relevant website… Along the same lines, I was not able to understand properly how the test protocol was set up, for instance how the updates are done. The authors say they use "100 queries"; even though the query language is not specified here, it would also be interesting to be able to see those queries.
> Second, the authors test their system using a dataset of 431,088 triples, which in my opinion is not enough to uncover potential performance bottlenecks, since they ran their experiments on a computer with "16GB 2133MHz LPDDR3 memory".
> Third, using only one computer to test a *decentralized* system is, in my opinion, a weakness, especially since the authors do not provide a fair description of how the decentralized system works.
VI/ The quality of the paper should be improved.
> Some sentences are hard to follow and I could find some typos; in addition, some figures are not referenced in the text, and others are hard to read because of their font size…
Minor Comments
==============
Please find here some minor remarks (in comparison with my six major ones above):
Section1
* The Introduction could be better motivated by describing a use case or an example.
* More generally, the claims of the paper could be stated more clearly in the Introduction.
Section2
* This Section suffers from a lack of examples during the description of the concepts, which would help readers understand what is happening.
* The first paragraph of Section2 describes the restriction on which the paper will next focus. I am a bit disappointed that at the end of the discussion/paper, the "node service quality" is not considered again, even briefly.
* In p2l23r, what is the "RDF medical report": is it something new, or could the authors provide a citation for this concept?
* In p2l24r, a "certain number", what could be an estimation?
* In p2l51r, "data updates" and "modifications" should be better introduced and formalized.
* In table1, maybe the third column could be removed.
* In (4), what happens if it is always the same triple which is modified, does it mean that Verifiability goes to +∞?
* In (6), what is "Eachother" since it hasn't been introduced?
Section3
* "DCQE" -> what does it stand for? Why such a name?
* I do not see the purpose of Fig1.b in the rest of the paper's development…
* The title of Fig.2 does not perfectly match the content, since "Blank Node" does not appear in the figure itself.
* More generally, it is confusing to use "Blank Node" to name a "temporarily creating node" (see p5l33l), since a concept of blank nodes already exists in the RDF specifications.
* I could not find in the text a reference to Fig.3.
* In p5l38r, what are those "previous studies" the authors are referring to?
* Section3.3 would require an example to make the understanding easier.
Section4
* The authors seem to use SPARQL as a query language; however, I could not find any occurrence of the term "SPARQL" in the article, and the canonical citation associated with it is missing too.
* The authors have set k1=k2=k3=k4=1; why such a choice?
* In table3, what are the "masters"?
* In table3, what does the 6th column, which has no title, represent?
* In p8l35r, "each node generates its own RDF entity record table": it would have been interesting to see examples of such record tables.
* In p9l13l, "Different systems" -> which ones?
* In p9l23l, the authors say that "51 percent of attacks can still be launched"; I think they meant that a 51%-attack can still be launched. This also means it would be appreciated to have a reference pointing to a description of this attack, since it is really specific to the blockchain area.
* In Section4.2, Fig.8 is in the end not really described.
Section5
* In p10l18l, the authors mention "new dimensions"; it would be interesting if they provided some of them.
* In p9l38r, what is typically the size of those "backups"?
References
* [3] does not give the journal, conference, or book where it was accepted
* [13] has some problems with the encoding of special letters
Typos.
======
Here are some typos I found:
- title: "A RDF Dataset"->"An RDF Dataset"
- p1l51r: "de centralized"->"decentralized"
- p7: Fig.7 "hash0-3"->"hash1-0"?
- p7l35r: is it "physical" or "medical"?
- p7l46r: "verify the verifiability"
- From p8l39r to p8l48r: this paragraph is hard to follow.
- p9l28l, "there are 5 common attributes": no, there are 4 according to the line before, i.e. 9, 5, 31 and 7.
- p9l29r: "Descrip"?