Review Comment:
Overall evaluation
Select your choice from the options below and write its number below.
== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
== -1 weak reject
== -2 reject
== -3 strong reject
-1
Reviewer's confidence
Select your choice from the options below and write its number below.
== 5 (expert)
== 4 (high)
== 3 (medium)
== 2 (low)
== 1 (none)
4
Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
5
Novelty
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
5
Technical quality
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
Evaluation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 not present
2
Clarity and presentation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
2
Review
This paper aims to look at how linked data is used in practice. It does this by analysing queries entered into the YASGUI query editor. This seems to be novel in that measurements are taken on the client side, using the query tool, rather than by looking at server logs. The main claim for novelty though is that current studies have focused on analyses of the availability and structure of linked data, while this study focuses on the usage of linked data and the kinds of queries that users pose to the Web of data. The authors claim that their data collection approach is data-set independent and that this acts as "an observational lens". They use the analogy of a search engine such as Google or Yahoo to illustrate what they mean by "observational lens" - these search engines process queries on distributed information - ultimately they indicate what information users want from the World Wide Web. The hope is that YASGUI could become this single entry point for the Web of data and thus provide unique insights into what users/applications want from the Web of data.
On the whole the paper seems to make a novel contribution. It presents some interesting ideas and an interesting approach, or methodology, that could be part of a toolbox of techniques for measuring the usage of the Web of data and the kinds of queries that users want to pose to the Web of data. While the authors aim to solve some problems with existing analyses, I think that instead they present an approach that is complementary to existing analyses.
In terms of actual content, I would say that the paper is mainly a mixture of a system description and a methodology for data gathering and analysis, with a baseline set of results and a simple analysis of these results. The paper contains a detailed description of YASGUI, its features and a comparison to similar tools. It contains a detailed introduction and related work section, which does a good job of positioning the work relative to the body of existing work. In terms of balance, the results and analysis section constitutes just under half of the paper. While this part of the paper presents some interesting observations, it feels like it is lacking in rigour and it could do with some improvement (more details below). If the authors could address some of these concerns then I think that the strengths and weaknesses of the work will be clear and the paper will be fit for publication.
Some of the wording used in the paper (particularly in the introduction, which is where I first noticed it) could make the reader feel like the authors are claiming a little bit too much. For example, "This provides deep insight in how we interact with Linked Data". It's not clear whether this is just flowery language, but I think that the authors need to be more careful in taking into consideration threats to external validity. In particular, those caused by the effects of the YASGUI tool, the population of YASGUI users and the kinds of queries they pose in the tool. The usage data will obviously provide a deep insight into how the users of YASGUI use it, the kinds of queries that they pose when using it, and the endpoints that they use. However, this obviously isn't the same thing as the apparent claim here, which seems much more general. The authors do touch on this but they could be clearer about it. The authors also mention some other results that the YASGUI usage data can provide, namely:
- which part of the Linked Data cloud is actually used
- what part is open and accessible
- the complexity of man-made queries
- the most commonly used namespaces
Again I think the authors need to be clear about the scope of the conclusions that can be drawn. I suppose that there are two points here: (1) The authors should make it clear that these results only apply to the class of queries posed by the YASGUI user population, and (2) The authors should ideally provide some details of what they believe the biasing effects of using this particular GUI are - i.e. how it affects the queries posed, how it affects which datasets (parts of the linked data cloud) are used etc.
In Section 2 (Related Work) the authors provide a detailed motivation for their work in order to justify it and to differentiate it from previous work. I think that they do a reasonable job of this. It seems like there are three main points: (1) Current studies focus on structural analyses and quality assurance of published data, or they focus on endpoint availability. In either case, current studies don't provide an insight into which data is actually used; (2) There are problems with current query tools that prevent them from being appropriate for a study that examines query logs. In particular, current query tools don't provide enjoyable user experiences, they aren't usable, and they therefore don't encourage end users to use them, so they don't facilitate an analysis of user behaviour. Current tools also bias users to specific endpoints; (3) Current query log analysis is limited to a handful of endpoints and isn't generalisable to the whole Web of data. The authors aim to point out that an analysis using YASGUI could solve these problems, as it can reveal which datasets are queried, it provides an enjoyable experience (encouraging people to use it and therefore provide lots of data), and it isn't limited to specific datasets. The message comes across loud and clear. However, while it is interesting, I do think that the "Interfaces to Linked Data" subsection is overly long. The authors could just briefly mention the main points. Also, since this whole section is quite long, it would be good if the authors could summarise (in a tabular way perhaps) the main points that need to be addressed and how their work addresses them. Any space saved here could be used to discuss threats to external validity in more depth.
The Methods, Results and Analysis sections could be improved quite a lot. I've put some more detailed comments below (in the Minor Comments at the end). However, some things in particular need clarifying. When you say "2,947 unique views", are these just views of the YASGUI webpage or are they 2,974 unique users *who actually submitted* queries? How many users actually submitted queries, and what was the breakdown of the number of queries submitted per user (ideally in terms of percentiles, e.g. 50% of users submitted 2 queries or less, 20% submitted 3 queries or less, etc.)? How many of these users only submitted the query that is present by default in the YASGUI webpage? Do you filter out this query at all? (It's a SELECT query, and this is by far the most common type of query.) How many submitted some number of queries that indicates that they were actually querying and using data rather than just trying out the very nice UI? Does this figure of 2,947 represent all of the users (or page views), or just the 64% who allowed some form of logging? These are important figures that enable the reader to understand some of the data and how well it might generalise, but they are not present. Ultimately, important user data is missing, which means that it is hard to put the results into context and assess their significance.
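To make the requested per-user breakdown concrete, here is a minimal sketch in Python of the kind of summary I have in mind. It assumes the log can be reduced to (user id, query string) pairs and that the editor's default query is known as a string. The function names and the log layout are my own illustration, not anything taken from the paper or from YASGUI.

```python
from collections import Counter


def queries_per_user(log, default_query=None):
    """Count submitted queries per user.

    `log` is an iterable of (user_id, query_text) pairs (assumed layout).
    If `default_query` is given, submissions identical to it are not counted,
    but the user is still registered, possibly with a count of 0.
    """
    counts = Counter()
    for user_id, query in log:
        counts.setdefault(user_id, 0)  # register every user that appears in the log
        if default_query is not None and query.strip() == default_query.strip():
            continue  # skip the query that is pre-filled in the editor
        counts[user_id] += 1
    return counts


def cumulative_share(counts, thresholds):
    """For each threshold t, return the fraction of users with at most t queries."""
    total = len(counts)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for c in counts.values() if c <= t) / total
        for t in thresholds
    }
```

Reporting `cumulative_share(counts, [0, 1, 2, 5, 10])` both with and without the default query filtered out would already answer most of the questions above.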
With regards to the query analysis, you don't say anything about the size of the triple patterns. Also, and I don't know how feasible this is, but it would be great if you could boil the queries down into classes of isomorphic graphs and then present the most common ones. At the moment it's impossible to tell what kind of graph patterns appear for example, and this analysis might provide some insight. See Samantha Bail's thesis, which includes work on justification isomorphism for some ideas.
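As a rough illustration of the grouping I am suggesting, the following Python sketch treats a query's basic graph pattern as a list of (subject, predicate, object) string triples, with variables written as `?name`, and maps patterns that differ only in variable names to the same canonical form. This is a simplification of my own: it is not the justification isomorphism from the thesis mentioned above, it keeps constants (IRIs, literals) as they are, and it tries every renaming, so it is only practical for patterns with a handful of variables.

```python
from collections import Counter
from itertools import permutations


def canonical_pattern(triples):
    """Return a form of the pattern that is identical for all variable renamings.

    Brute force: try every assignment of ?v0, ?v1, ... to the variables and
    keep the lexicographically smallest sorted tuple of triples.
    """
    variables = sorted({term for triple in triples for term in triple
                        if term.startswith("?")})
    names = [f"?v{i}" for i in range(len(variables))]
    best = None
    for perm in permutations(names):
        mapping = dict(zip(variables, perm))
        renamed = tuple(sorted(
            tuple(mapping.get(term, term) for term in triple)
            for triple in triples
        ))
        if best is None or renamed < best:
            best = renamed
    return best


def shape_classes(patterns):
    """Group patterns by canonical form, most frequent class first."""
    return Counter(canonical_pattern(p) for p in patterns).most_common()
```

The size of each pattern (its number of triple patterns) comes for free as `len(canonical_pattern(p))`, which would also cover the first point above.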
In terms of presentation, most of the space is taken in introductory material, related work descriptions and a system description of the YASGUI tool (7 pages, results are 6 pages, and conclusions and references 2 pages). While these sections are interesting to read, the analysis section feels a bit thin on the ground in comparison to the rest of the paper. Moreover, it would be good if the authors could tie the results and analysis that are presented to their original goals.
The conclusion summarises the main results of the paper and I'm pleased that the authors begin by noting that the results are biased. However, as stated above, they could have more of a discussion about this, and I still think some of the claims in the conclusion are inappropriate and exaggerated. For example, "This gives unprecedented insight into how we actually use the Linked Data cloud, and what part of the linked data cloud we use". It doesn't really - it gives an insight into the parts of the linked data cloud used by whoever the 2,974 YASGUI users are (other tools might produce different results, not to mention the use of the Web of data by other applications). Furthermore, there aren't enough details to assess who these users are or what they were trying to accomplish with YASGUI (whether they were just playing with it or trying to get some serious data analysis done). As the results stand, they are only applicable to this tool and only applicable to the users who actually used this tool. The authors should perhaps tone down some of the presentation so that it is clear that they are presenting YASGUI and the data obtained from this tool and not results on the Web of data in general. As part of this, I would also say that a more specific title for the paper is needed, one that at least mentions YASGUI and makes clear that the data is specific to this tool.
The authors finish off the conclusion with, "This paper introduces a tool, dataset and methodology that increase[s] our knowledge of the use of Linked Data". Again, I think this is too strong, but you are right, you do present a tool, data set and methodology and this is a valuable contribution - stick to this line, don't exaggerate the results and make the limitations clear, and I think that you'll have a decent paper.
Minor comments (should be addressed if possible):
The capitalisation in some of the references is not correct. For example, reference [17] sparql -> SPARQL, or gui -> GUI.
In the related work section, the first paragraph mentions the depiction of the linked data cloud and dates the latest version at November 2011. It's my understanding that a newer version has, very recently, been produced (perhaps since the authors wrote the paper). The authors should double check this.
The last sentence at the end of first subsection in Section 2 ("In short, our knowledge of what Linked Data, and how much resides where is incomplete") doesn't really make sense.
First sentence in "Interfaces to Linked Data" ("The most expressive use of Linked Data is done through query interfaces for the SPARQL query language,") doesn't really make sense. What do you mean by "most expressive use"?
In the subsection "Usage of Linked Data", the authors write, "when the technology was first standardized in the 1990s". Wasn't most of the Semantic Web technology standardised in the 2000s?
Section 3 is a bit jumbled and could be structured more nicely. Perhaps break it up into Materials, Methods, Results, Analysis etc. The current Analysis says how the results will be analysed rather than being a true results analysis section. Try to separate out the results from your interpretation of the results.
You could be more precise about the results. For example, state the exact dates of data collection. State the number of users along with the percentage, e.g. 1709 users (58%) allowed full logging.
In section 4, 45.323 looks funny to me. I'd prefer, 45,323 (i.e. commas rather than dots, but this is personal preference of course).
How do you estimate 25,000 queries were executed through YASGUI without your knowledge?
Figure 3 is very pretty, but it's hard to read and interpret. From looking at it I do get the impression that there are a lot of datasets registered at the datahub that are not used in queries submitted by users of YASGUI, but it's only really reading the text that this becomes clear. I don't think the figure actually adds anything beyond the description in the text.
It seems helpful that you provide a little bit of context (i.e. the number of Semantic Web Dog Food visitors). It would be even more useful if you could provide a few more numbers here of different endpoints.