Feeling the Pulse of Linked Data

Tracking #: 727-1937

Authors: 
Laurens Rietveld
Rinke Hoekstra

Responsible editor: 
Guest Editors EKAW 2014 Schlobach Janowicz

Submission type: 
Conference Style
Abstract: 
Existing studies of Linked Data focus on the availability of data rather than its use in practice. The number of query logs available is very much restricted to a small number of datasets. This paper proposes to track Linked Data usage at the client side. We use YASGUI, a feature rich web-based query editor, as a measuring device for interactions with the Linked Data Cloud. It enables us to determine what part of the Linked Data Cloud is actually used, what part is open or closed, the efficiency and complexity of queries, the tasks they are used for, and how these results relate to commonly used dataset statistics.
Tags: 
Reviewed

Decision/Status: 
[EKAW] conference only accept

Solicited Reviews:
Review #1
Anonymous submitted on 25/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:

Overall evaluation
Select your choice from the options below and write its number below.

== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
== -1 weak reject
== -2 reject
== -3 strong reject

-1

Reviewer's confidence
Select your choice from the options below and write its number below.

== 5 (expert)
== 4 (high)
== 3 (medium)
== 2 (low)
== 1 (none)

4

Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.

== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

5

Novelty
Select your choice from the options below and write its number below.

== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

5

Technical quality
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

Evaluation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 not present

2

Clarity and presentation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

2

Review

This paper aims to look at how linked data is used in practice. It does this by analysing queries entered into the YASGUI query editor. This seems to be novel in that measurements are taken on the client side, using the query tool, rather than by looking at server logs. The main claim for novelty, though, is that current studies have focused on analyses of the availability and structure of linked data, while this study focuses on the usage of linked data and the kinds of queries that users pose to the Web of data. The authors claim that their data collection approach is dataset-independent and that this acts as "an observational lens". They use the analogy of a search engine such as Google or Yahoo to illustrate what they mean by "observational lens": these search engines process queries on distributed information, and ultimately they indicate what information users want from the World Wide Web. The hope is that YASGUI could become this single entry point for the Web of data and thus provide unique insights into what users/applications want from the Web of data.

On the whole the paper seems to make a novel contribution. It presents some interesting ideas and an interesting approach, or methodology, that could be part of a toolbox of techniques for measuring the usage of the Web of data and the kinds of queries that users want to pose to the Web of data. While the authors aim to solve some problems with existing analyses, I think that instead they present an approach that is complementary to existing analyses.

In terms of actual content, I would say that the paper is mainly a mixture of a system description and a methodology for data gathering and analysis, with a baseline set of results and a simple analysis of these results. The paper contains a detailed description of YASGUI, its features and a comparison to similar tools. It contains a detailed introduction and related work section, which does a good job of positioning the work relative to the body of existing work. In terms of balance, the results and analysis section constitutes just under half of the paper. While this part of the paper presents some interesting observations, it feels like it is lacking in rigour and it could do with some improvement (more details below). If the authors could address some of these concerns then I think that the strengths and weaknesses of the work will be clear and the paper will be fit for publication.

Some of the wording used in the paper (particularly in the introduction, which is where I first noticed it) could make the reader feel like the authors are claiming a little bit too much. For example, "This provides deep insight in how we interact with Linked Data". It's not clear whether this is just flowery language, but I think that the authors need to be more careful in taking into consideration threats to external validity. In particular, those caused by the effects of the YASGUI tool, the population of YASGUI users and the kinds of queries they pose in the tool. The usage data will obviously provide a deep insight into how the users of YASGUI use it, the kinds of queries that they pose when using it, and the endpoints that they use. However, this obviously isn't the same thing as the apparent claim here, which seems much more general. The authors do touch on this but they could be clearer about it. The authors also mention some other results that the YASGUI usage data can provide, namely:

- which part of the Linked Data cloud is actually used
- what part is open and accessible
- the complexity of man-made queries
- the most commonly used namespaces

Again I think the authors need to be clear about the scope of the conclusions that can be drawn. I suppose that there are two points here: (1) The authors should make it clear that these results only apply to the class of queries posed by the YASGUI user population, and (2) The authors should ideally provide some details of what they believe the biasing effects of using this particular GUI are - i.e. how it affects the queries posed, how it affects which datasets (parts of the linked data cloud) are used etc.

In Section 2 (Related Work) the authors provide a detailed motivation for their work in order to justify it and to differentiate it from previous work. I think that they do a reasonable job of this. It seems like there are three main points: (1) Current studies focus on structural analyses and quality assurance of published data, or they focus on endpoint availability. In either case, current studies don't provide an insight into which data is actually used; (2) There are problems with current query tools that prevent them from being appropriate for a study that examines query logs. In particular, current query tools don't provide enjoyable user experiences and aren't usable, so they don't encourage end users to use them and therefore don't facilitate an analysis of user behaviour. Current tools also bias users towards specific endpoints; (3) Current query log analysis is limited to a handful of endpoints and isn't generalisable to the whole Web of data. The authors aim to point out that an analysis using YASGUI could solve these problems, as it can reveal which datasets are queried, it provides an enjoyable experience (encouraging people to use it and therefore provide lots of data), and it isn't limited to specific datasets. The message comes across loud and clear. However, while it is interesting, I do think that the "Interfaces to Linked Data" subsection is overly long. The authors could just briefly mention the main points. Also, since this whole section is quite long, it would be good if the authors could summarise (in a tabular way perhaps) the main points that need to be addressed and how their work addresses them. Any space saved here could be used to discuss threats to external validity in more depth.

The Methods, Results and Analysis sections could be improved quite a lot. I've put some more detailed comments below (in the Minor Comments at the end). However, some things in particular need clarifying. When you say "2,947 unique views", are these just views of the YASGUI webpage or are they 2,947 unique users *who actually submitted* queries? How many users actually submitted queries, and what was the breakdown of the number of queries submitted per user (ideally in terms of percentiles, e.g. 50% of users submitted 2 queries or fewer, 20% submitted 3 queries or fewer, etc.)? How many of these users only submitted the query that is present by default in the YASGUI webpage? Do you filter out this query at all? (It's a SELECT query, and this is by far the most common type of query.) How many submitted a number of queries that indicates that they were actually querying and using data rather than just trying out the very nice UI? Does this figure of 2,947 represent all of the users (or page views), or just the 64% who allowed some form of logging? These are important figures that enable the reader to understand some of the data and how well it might generalise, but they are not present. Ultimately, important user data is missing, which means that it is hard to put the results into context and assess their significance.
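To make the requested per-user breakdown concrete, here is a minimal sketch in Python. It assumes a hypothetical log of (user_id, query) records and a hypothetical default query string; neither reflects YASGUI's actual logging format, and the records are invented for illustration only.

```python
from collections import Counter, defaultdict
from statistics import median

# Assumed placeholder for the query pre-filled in the editor (hypothetical).
DEFAULT_QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"

def queries_per_user(log):
    """Count submitted queries per user from (user_id, query) records."""
    return Counter(user_id for user_id, _ in log)

def share_with_at_most(counts, k):
    """Fraction of users who submitted k queries or fewer."""
    values = list(counts.values())
    return sum(1 for v in values if v <= k) / len(values)

def default_only_users(log, default_query=DEFAULT_QUERY):
    """Users whose every submitted query equals the default query."""
    seen = defaultdict(set)
    for user_id, query in log:
        seen[user_id].add(query.strip())
    return {u for u, qs in seen.items() if qs == {default_query}}

# Hypothetical log records, invented for illustration only.
log = [
    ("u1", DEFAULT_QUERY),
    ("u2", DEFAULT_QUERY),
    ("u2", "ASK { ?s a ?c }"),
    ("u3", "SELECT ?s WHERE { ?s a ?c }"),
    ("u3", "SELECT ?c WHERE { ?s a ?c }"),
    ("u3", "SELECT ?p WHERE { ?s ?p ?o }"),
    ("u3", "DESCRIBE ?s WHERE { ?s a ?c }"),
]

counts = queries_per_user(log)
median_queries = median(counts.values())
share_two_or_fewer = share_with_at_most(counts, 2)
only_default = default_only_users(log)
```

Reporting figures of this kind (the median, a few cumulative shares, and the number of default-only users) alongside the total would let the reader judge how many users were doing real work with the tool.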

With regard to the query analysis, you don't say anything about the size of the triple patterns. Also, and I don't know how feasible this is, it would be great if you could boil the queries down into classes of isomorphic graphs and then present the most common ones. At the moment it's impossible to tell what kinds of graph patterns appear, for example, and this analysis might provide some insight. See Samantha Bail's thesis, which includes work on justification isomorphism, for some ideas.
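One possible way to group queries by shape is sketched below in Python. It assumes each query has already been reduced to a list of (subject, predicate, object) triple patterns, with variables written with a leading "?"; that representation, the function names, and the example queries are all hypothetical. Two pattern sets are treated as isomorphic when one can be obtained from the other by renaming variables, with constants kept as they are. The canonical form is found by brute force over all variable renamings, so it is only practical for the small patterns typical of hand-written queries.

```python
from collections import defaultdict
from itertools import permutations

def is_var(term):
    """A term is a variable if it starts with '?' (assumed convention)."""
    return term.startswith("?")

def canonical_form(patterns):
    """Smallest sorted tuple of triples over all renamings of the variables."""
    variables = sorted({t for triple in patterns for t in triple if is_var(t)})
    best = None
    for perm in permutations(range(len(variables))):
        rename = {v: f"?v{i}" for v, i in zip(variables, perm)}
        candidate = tuple(sorted(
            tuple(rename.get(t, t) for t in triple) for triple in patterns
        ))
        if best is None or candidate < best:
            best = candidate
    return best

def group_by_shape(queries):
    """Map each canonical form to the names of the queries that share it."""
    groups = defaultdict(list)
    for name, patterns in queries.items():
        groups[canonical_form(patterns)].append(name)
    return dict(groups)

# Hypothetical triple patterns, invented for illustration only.
queries = {
    "q1": [("?s", "rdf:type", "?c"), ("?s", "rdfs:label", "?l")],
    "q2": [("?x", "rdfs:label", "?name"), ("?x", "rdf:type", "?t")],
    "q3": [("?a", "rdf:type", "?b"), ("?b", "rdfs:label", "?l")],
}
groups = group_by_shape(queries)
class_sizes = sorted(len(names) for names in groups.values())
```

Here q1 and q2 fall into the same class (a star around one subject), while q3 forms its own class (a chain through the type). Whether constants should also be abstracted away is a design choice the authors would need to make.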

In terms of presentation, most of the space is taken in introductory material, related work descriptions and a system description of the YASGUI tool (7 pages, results are 6 pages, and conclusions and references 2 pages). While these sections are interesting to read, the analysis section feels a bit thin on the ground in comparison to the rest of the paper. Moreover, it would be good if the authors could tie the results and analysis that are presented to their original goals.

The conclusion summarises the main results of the paper and I'm pleased that the authors begin by noting that the results are biased. However, as stated above, they could have more of a discussion about this, and I still think some of the claims in the conclusion are inappropriate and exaggerated. For example, "This gives unprecedented insight into how we actually use the Linked Data cloud, and what part of the linked data cloud we use". It doesn't really - it gives an insight into the parts of the linked data cloud used by whoever the 2,947 YASGUI users are (other tools might produce different results, not to mention the use of the Web of data by other applications). Furthermore, there aren't enough details to assess who these users are or what they were trying to accomplish with YASGUI (whether they were just playing with it or trying to get some serious data analysis done). As the results stand, they are only applicable to this tool and only applicable to the users who actually used this tool. The authors should perhaps tone down some of the presentation so that it is clear that they are presenting YASGUI and the data obtained from this tool and not results on the Web of data in general. As part of this, I would also say that a more specific title for the paper is needed that at least mentions YASGUI and that the data is specific to this tool.

The authors finish off the conclusion with, "This paper introduces a tool, dataset and methodology that increase[s] our knowledge of the use of Linked Data". Again, I think this is too strong, but you are right, you do present a tool, dataset and methodology, and this is a valuable contribution - stick to this line, don't exaggerate the results and make the limitations clear, and I think that you'll have a decent paper.

Minor comments (should be addressed if possible):

The capitalisation in some of the references is not correct. For example, reference [17] sparql -> SPARQL, or gui -> GUI.

In the related work section, the first paragraph mentions the depiction of the linked data cloud and dates the latest version at November 2011. It's my understanding that a newer version has, very recently, been produced (perhaps since the authors wrote the paper). The authors should double check this.

The last sentence at the end of first subsection in Section 2 ("In short, our knowledge of what Linked Data, and how much resides where is incomplete") doesn't really make sense.

First sentence in "Interfaces to Linked Data" ("The most expressive use of Linked Data is done through query interfaces for the SPARQL query language,") doesn't really make sense. What do you mean by "most expressive use"?

In the subsection "Usage of Linked Data", the authors write, "when the technology was first standardized in the 1990s". Wasn't most of the Semantic Web technology standardised in the 2000s?

Section 3 is a bit jumbled and could be structured more nicely. Perhaps break it up into Materials, Methods, Results, Analysis etc. The current Analysis says how the results will be analysed rather than being a true results analysis section. Try to separate out the results from your interpretation of the results.

You could be more precise about the results. For example, state the exact dates of data collection. State the number of users along with the percentage, e.g. 1709 users (58%) allowed full logging.

In Section 4, 45.323 looks funny to me. I'd prefer 45,323 (i.e. commas rather than dots, but this is personal preference of course).

How do you estimate 25,000 queries were executed through YASGUI without your knowledge?

Figure 3 is very pretty, but it's hard to read and interpret. From looking at it I do get the impression that there are a lot of datasets registered at the datahub that are not used in queries submitted by users of YASGUI, but it's only really reading the text that this becomes clear. I don't think the figure actually adds anything beyond the description in the text.

It is helpful that you provide a little bit of context (i.e. the number of Semantic Web Dog Food visitors). It would be even more useful if you could provide a few more such numbers for different endpoints.

Review #2
Anonymous submitted on 26/Aug/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation
Select your choice from the options below and write its number below.
1
== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
== -1 weak reject
== -2 reject
== -3 strong reject

Reviewer's confidence
Select your choice from the options below and write its number below.
3
== 5 (expert)
== 4 (high)
== 3 (medium)
== 2 (low)
== 1 (none)

Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.
4
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

Novelty
Select your choice from the options below and write its number below.
4
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

Technical quality
3
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

Evaluation
4
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 not present

Clarity and presentation
5
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

Review
This paper provides a novel method to capture the usage of Linked Data. The novelty of the paper comes from the fact that the authors propose using their tool to collect usage data on the client side, so that a wider variety of data is collected. Since the other methods discussed in the related work section mainly concentrate on collecting data at the endpoints, the data they collect are mainly restricted to how certain datasets are queried. The authors of this paper propose that client-side data collection would give a much wider perspective, which I agree with.

The other contribution of this paper is the analysis of the data that has been collected so far. Although I am not entirely convinced that the number of users and queries is large enough to give a representative picture, the data collected so far seems to be interesting. The analysis of the complexity of queries is particularly interesting, as the authors use it to provide some insights into how automatic query optimization can be done.

The paper also reports on the system YASGUI, a feature-rich SPARQL editor, which is the tool used to collect the usage data. This part of the paper is more engineering and can't be categorized as research. The research contribution of this paper is the method of gathering and analysing the data.

The related work is very well covered and the paper is well written in general.

I have a comment about the analysis section: the namespace usage counts should not take into consideration the RDF and RDF Schema namespaces. It is obvious that these namespaces would occur more frequently in the queries, as they are used to define the RDF vocabulary, so finding them at the top of the ranking does not add any information.
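A minimal sketch of such a filtered count in Python is given below. As a simplification it counts namespaces declared in PREFIX lines (once per query) rather than namespaces actually used in the query body, and the example queries are hypothetical; it is not the authors' analysis code.

```python
import re
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
CORE_NAMESPACES = frozenset({RDF_NS, RDFS_NS})

# Matches declarations such as: PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX_RE = re.compile(r"PREFIX\s+([\w-]*):\s*<([^>]+)>", re.IGNORECASE)

def namespace_counts(queries, exclude=frozenset()):
    """Count in how many queries each declared namespace appears."""
    counts = Counter()
    for query in queries:
        declared = {iri for _, iri in PREFIX_RE.findall(query)}
        counts.update(ns for ns in declared if ns not in exclude)
    return counts

# Hypothetical queries, invented for illustration only.
queries = [
    f"PREFIX rdf: <{RDF_NS}> PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
    "SELECT ?p WHERE { ?p rdf:type foaf:Person }",
    f"PREFIX rdfs: <{RDFS_NS}> PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
    "SELECT ?l WHERE { ?p foaf:knows ?q . ?q rdfs:label ?l }",
    "PREFIX dbo: <http://dbpedia.org/ontology/> "
    "SELECT ?c WHERE { ?c dbo:country ?x }",
]

all_counts = namespace_counts(queries)
filtered_counts = namespace_counts(queries, exclude=CORE_NAMESPACES)
```

Comparing the full and the filtered ranking side by side would show which domain vocabularies dominate once the core RDF vocabulary is set aside.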

I think that as the number of users of the editor grows, a more comprehensive data analysis can be done. In my opinion this paper is worth publishing in the conference.

Review #3
Anonymous submitted on 01/Sep/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation
Select your choice from the options below and write its number below.

== 2 accept

Reviewer's confidence
Select your choice from the options below and write its number below.

== 2 (low)

Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.

== 3 fair

Novelty
Select your choice from the options below and write its number below.

== 4 good

Technical quality
Select your choice from the options below and write its number below.

== 3 fair

Evaluation
Select your choice from the options below and write its number below.

== 3 fair

Clarity and presentation
Select your choice from the options below and write its number below.

== 4 good

Review

This paper explores the topic of how Linked Data is used in practice. While existing research primarily focuses on the availability and accessibility of Linked Data, the authors of this paper aim to investigate which types of queries are being executed as well as how much and what types of data are being accessed. The recently developed client-side SPARQL query editor, YASGUI, is the window through which the data is collected and analyzed.

Overall the paper gives a good overview of the current state of SPARQL clients, listing a number of the more prominent clients as well as their features. A significant amount of this paper is devoted to discussing the YASGUI client itself though. Since the novelty of YASGUI was previously introduced in another paper, less focus should be spent on the actual client and more on the new scientific contributions of the tool (that were not discussed previously).

In the methodology section the authors claim that 'we can only sketch a reliable picture of the Linked Data cloud... if we tap in to the interaction ... at the client side.' It is true that approaching the Linked Data cloud from the client side will offer insight into how much of the data is actually accessible and how much can be reached, but this does not speak to the 'machine queries' that are also of considerable value. Approaching this problem from a user-client-side perspective is useful for understanding how individuals interact with the data but not for 'sketching a reliable picture of the Linked Data cloud.' My main issue here is with the strength of the wording and the implication of the statement.

I have some privacy concerns over the use of Google Analytics for this research. Please consider stating whether or not the participants who 'opted in' to sharing their data were also made aware that their search queries would be fully accessible to Google, a commercial entity. Perhaps Piwik or some other privately hosted option would be more suitable. Additional details on why users were given the option to 'opt out' rather than 'opt in' would be informative here as well.

While on the topic of analytics, it would have been useful to see some results of the analytics from a demographic perspective. Were properties such as 'user location,' 'operating system type', etc. recorded? It would be interesting to examine correlations between specific user properties and query properties. Additionally, the authors state the number of total queries and the number of unique queries. More statistical information such as the median number of unique queries per user would be useful.

One quick note. Figure 3 is very hard to read. Perhaps a link to a larger figure would be useful.

Aside from the comments mentioned above, the paper introduces an interesting aspect of Linked Data, one that deserves considerably more exploration from the community as a whole. The authors give a good introduction to this need and present some interesting preliminary results.