Review Comment:
The paper presents a framework that aims to guide a user from the URIs/queries they have as input to a collection of relevant datasets and a mechanism to query them. The work builds on (at least) two existing publications and adds the dataset similarity index (ReLOD) to the picture. Each step is evaluated, and the code and tools are available online.
Overall evaluation: I find the results presented in the paper valuable, but the presentation flaws lead me to ask for a major revision. Details follow in my comments below on the motivation of the research, the presentation, and smaller, more detailed points.
Motivation/rationale:
I come from the practical side of things: my current team has several commercial products in a natural science domain, running on top of triple stores, and we regularly face questions like "how can I find a dataset similar to ...? also talking about ...?" for biological, chemical, medical, and bibliographic data. And I never even hoped to get any sort of automated answer to the question "Which of the data sets contains the most valuable results?" that you pose in Scenario 1 of the introduction.
Therefore, I am excited by the direction this paper is taking, although I couldn't find many answers while playing with the tools linked in the paper. In the cases where I did find related datasets, it was difficult to make any use of them, as the only information I got was a link to a file with a cryptic name (e.g. b63446ad7d9f8960762c50a5a3492120.hdt), and I could not figure out where to get any quantification of the relatedness. Just saying "these 2 datasets are related" is still useful, but surfacing the explanation would help a lot. Perhaps the short time I spent playing with the tool is to blame, or the low coverage for some of the LD types.
The notion of dataset similarity seems a useful construct, but in reality URIs are not often shared among actually similar datasets describing the same things. I would be happy to see some use cases demonstrating the usefulness of the chosen approach to defining dataset similarity, as well as usability studies or any applications of the approach (which is not meant to be purely theoretical, right?).
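To make my concern concrete: if similarity is driven by URI overlap, two datasets that describe the same entities under their own namespaces can score near zero. A minimal sketch, assuming a simple Jaccard measure over URI sets (the names, example URIs, and the metric itself are my illustration, not the paper's actual implementation):

```python
def jaccard(uris_a, uris_b):
    """Jaccard similarity of two URI sets: |A intersect B| / |A union B|."""
    a, b = set(uris_a), set(uris_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two hypothetical datasets describing the same drugs, each minting
# its own entity URIs in its own namespace:
dataset_a = {
    "http://example.org/a/drug/aspirin",
    "http://example.org/a/drug/ibuprofen",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
}
dataset_b = {
    "http://example.org/b/compound/aspirin",
    "http://example.org/b/compound/ibuprofen",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
}

# Only the shared vocabulary URI overlaps, so the score is low (0.2)
# even though the datasets cover the same entities.
print(jaccard(dataset_a, dataset_b))
```

This is exactly why real use cases matter: they would show whether URI-level overlap captures the relatedness practitioners actually care about, or whether property/class-level matching has to carry the weight.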
Presentation and structure:
This is the weak side of this paper. It is a very bumpy read :) Especially the introduction, but all other sections also desperately need proofreading. There are too many unnecessary commas and "which" constructions (the favorite being "in which", whose meaning often remains unclear), missing verbs, pronouns, and prepositions, and some expressions that are not really English (e.g. "the first" instead of "firstly"). Some sentences give the impression of a literal translation from another language. So please proofread the paper; the current state makes otherwise interesting results very difficult to comprehend.
The related work is thorough, though in parts lengthy and not very concrete (e.g. page 7, column 1).
The role of the formalization and the unproved theorem in Section 3 remains unclear. Some parts (such as the file structure) belong in the tool documentation rather than the paper.
The evaluation section contains pages copied from the two papers this work is based on, [10] and [11]; the WimuQ part is especially lengthy. I don't mind overlapping papers at all, but frankly I don't see why it is done in this specific case. The big and very interesting contribution of this paper is ReLOD. Perhaps it would make more sense to just summarize the previously published research and explain the tables and graphs in Section 4.4 a bit more, ideally with examples and links to use cases. More details to help readers understand the tables and figures (e.g. Table 7) would also be beneficial.
================
Smaller comments:
p1 l38 "Those datasets represent now the well known as Web of Data, which represents..." - lost in English
p2 l7: "we created called "Where is my URI?" - created what? or remove "called"
p2 l14 "we show how to integrate and querying LOD datasets" - grammar
p3 l7 "ReLOD, which is the extension..." - main sentence not found
p3 l16 "Concerning the LOD-cloud dataset..." - same
p4 l12 "In this section we will present the state of the art relate to Identfying datasets" - grammar
p6 l44 "The approach was build to cluster entities, not LOD datasets, which we cannot use the same concepts here" - what does it mean?
p7 l9 "described on the paper[64]. In which discuss the identity crisis" - here and in many places before, "which" is used where it shouldn't be
p9 l43 "index of LOD datasets, in which involves" - no "in"
p16 l30 "While Ch queries often 30 require higher number of distributed datasets in order to compute the final resultset of the queries." - not a sentence, what's the meaning?
Tables 4-5: why are some datasets NOT similar while one is contained in another (e.g. agrinepaldata and vivosearch)? DB sizes in Table 3 could also be useful
Table 6: the number of exact/similar properties is extremely low; why is that?
p18 l6 "Thus, an example of application could with a case when..." - something wrong with this sentence
p18 l16 " that one dataset can enrich each other with complementing information from another dataset." - again, grammar
p20 l44 "Where DsPropMatch on the Table 7 refers to the number of properties/classes the datasets share among each other." - grammar: why "where"? What do you want to say here? Also "in Table", not "on the Table".
p20 l48 "the quality of the datasets should be considerate a important phase." - you mean "should be considered"? And in which sense can "quality" be "a phase"? Do you mean quality evaluation?
Why is Table 1 referenced only on page 20, after Tables 2 to 7? What is the logic?
Table 16: how did you do this comparison, how was the gold standard created, etc.? This is a very interesting part (as it evaluates the novel results of this paper), but it is explained in a cryptic way.
page 22, paragraph starting at line 48: you already said this on page 14.