Review Comment:
This article presents a dataset with metadata about datasets in the educational domain. This dataset has been developed in the context of the LinkedUp initiative. The authors describe how the dataset was created, how it has been / can be used, and how the dataset can be updated in the future.
The dataset by itself is interesting, and the quality is good, although important URLs don't dereference. It is useful for dataset discovery, because it lists the main partitions (i.e., used types and properties) of datasets. However, the approach is not specific to educational data, so the restriction to this scope seems artificial. My main issue with the article is that the description is long, vague, and sometimes irrelevant or inaccurate. In general, the information density is too low. I therefore recommend either making the description more accurate and to the point, or shortening the article to the minimum of 5 pages.
Below, I will first discuss the three focus points of the journal regarding Linked Dataset Descriptions (quality of the dataset, usefulness, clarity and completeness). Finally, I will detail some issues I found in the article.
(1) Quality of the dataset
Overall, the quality of properties and values in the dataset is high. It uses common ontologies (such as VoID) in the correct way.
The only serious problem I encountered is dereferenceability.
The URLs used to identify datasets do not dereference, for example:
- http://data.linkededucation.org/linkedup/dataset/data-southampton-ac-uk
- http://data.linkededucation.org/linkedup/dataset/ege-university-linked-o...
The URLs which identify dataset partitions are also not dereferenceable, for example:
- http://data.linkededucation.org/linkedup/dataset/nobelprizes/cp/04b45caa...
- http://data.linkededucation.org/linkedup/dataset/nobelprizes/cp/cc2430bc...
(which should be Nobel Prize Awards and Categories).
Why can't those dataset partitions have URLs that lead to data about them? For example:
- http://data.nobelprize.org/sparql?query=CONSTRUCT%20WHERE%20%7B%20%3Fx%2...
- http://data.nobelprize.org/sparql?query=CONSTRUCT%20WHERE%20%7B%20%3Fx%2...
The effort to generate such URLs is identical (if not less), but the result is much more useful for clients, and in line with the Linked Data principles.
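To illustrate how little effort such URLs take, here is a minimal sketch that builds a dereferenceable partition URL by percent-encoding a CONSTRUCT query against the dataset's own SPARQL endpoint. The function names, the class URI, and the exact triple pattern are my own assumptions for illustration (the example URLs above are truncated, so I do not claim this reproduces them exactly).

```python
from urllib.parse import quote, unquote


def partition_url(endpoint: str, class_uri: str) -> str:
    """Build a URL that returns the triples typing instances of class_uri.

    Assumption: a class partition is described by the pattern `?x a <class>`.
    """
    query = f"CONSTRUCT WHERE {{ ?x a <{class_uri}> }}"
    # safe="" forces every reserved character (spaces, braces, ?, <, >, /, :)
    # to be percent-encoded so the query survives as a single parameter value.
    return f"{endpoint}?query={quote(query, safe='')}"


def query_of(url: str) -> str:
    """Recover the readable SPARQL query from a URL built by partition_url."""
    return unquote(url.split("?query=", 1)[1])


# Hypothetical class URI, used only to show the shape of the result.
url = partition_url("http://data.nobelprize.org/sparql",
                    "http://example.org/terms/Award")
```

A client that follows such a URL receives data about the partition directly, instead of an identifier that leads nowhere.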
Finally, keywords etc. are attached as human-readable labels instead of machine-interpretable concepts. It would be worthwhile to invest in machine-interpretable keywords.
(2) Usefulness (or potential usefulness)
The main purpose of this data is to find datasets; i.e., it does not expose data that was not available before, but rather serves as a guide into existing data. As such, it is useful for dataset discovery by automated (and, through the website, also manual) processes. The authors also hint at use for federation, but one then wonders whether federation-specific approaches (dataset summarization) would not be more appropriate. Actually, I'm curious whether the authors have tried such summarization algorithms and whether they give similar or better results than the currently used services.
The scope of this dataset is restricted to datasets explicitly related to education, and the authors claim that “extending it would […] decrease the value of the dataset, making it less appropriate for discovery.” I fail to see why this would be the case, since all of the discussed techniques, and the resulting dataset, are independent of the education domain. In fact, if “aiiso:School” in Query 3 is replaced by, let's say, “example:Business” or “example:Car”, the mechanism would work equally well. Therefore, I disagree with the authors that the dataset's scope would influence its usefulness.
Furthermore, I disagree that use cases such as data discovery and access federation would be “critical in areas such as education, where very disparate and scarce data are available from many different sources.” It is not necessary to limit their applicability to education scenarios; i.e., this need is not education-specific, and thus not a specific advantage of this dataset.
(3) Clarity and completeness of the descriptions
The description is the weak point of this article. It is written from a very LinkedUp-centric point of view, and few attempts are made to relate the topic to the reader, which is important for any article. A major problem is the focus on the “what”, i.e., descriptions of what steps were taken, as opposed to the “why” and “how” of the decisions. In some parts, the description is frustratingly vague (“an external service”) or unnecessarily long. I would recommend a much more reader-oriented approach that enables readers to actually use the dataset or the techniques behind it. I.e., the description of the dataset should be an invitation with concrete pointers for usage. At the moment, it sticks too much to (incomplete) details that are not helpful to the reader.
As far as the other criteria for Linked Dataset Descriptions are concerned, the following are missing:
- metrics and statistics on external and internal connectivity
- growth (partly)
- examples and critical discussion of typical knowledge modeling patterns used
- known shortcomings of the dataset (partly)
Below are details on issues I found in the article.
Section 1
- Does better cohesion really lead to better reusability, and how are both defined?
- How is “explicit relevance to learning” assessed?
- Why is SPARQL endpoint accessibility a requirement? As your reference [5] indicates, SPARQL endpoints are not the most reliable sources of data. Why wouldn't a data dump be sufficient, and why are data dumps not linked in the dataset? The “Web standards” argument doesn't cut it here, because data dumps and Linked Data documents are (plain) Web standards as well. Given [5], I would also strongly doubt that it really “facilitates the building of applications that draw from several of these datasets.”
Section 2
- “The primary aim […] was to support participants […]” => to what extent did you succeed?
- Reference [5] is currently the demo article of a corresponding main track article; in this context, the main track article (“SPARQL Web-Querying Infrastructure: Ready for Action?”) seems more relevant.
- How exactly is the small graph built?
- When probing with the exact same query every time, how do you know the result is not cached? The endpoint might be down for useful queries.
- How did you choose Query 2 and why is it a good match?
- The references to “external services” are rather frustrating. The fact that they are external is irrelevant; if you mention them, readers need to know what they do, how, and why.
- The links and mappings are insufficiently described. Furthermore, the claim that they enable federated queries is not grounded. This claim is then weakened by the (correct) statement that endpoint stability is an issue (and it remains unknown, unfortunately, to what extent the “hope” is justified).
- Query 3 is supposed to show how something is simple, but the query itself is actually quite complex. Are users supposed to come up with this themselves?
- What are the Prod and KIS datasets?
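Regarding the caching concern raised above for the probe query: one possible way to make each probe distinct is sketched below. It prepends a SPARQL comment carrying a fresh nonce, so a cache keyed on the query string should not return a stored answer. The endpoint URL, the ASK pattern, and the function names are assumptions of mine, not the authors' setup, and a cache that normalizes queries before lookup would not be fooled by this.

```python
import uuid
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def probe_url(endpoint: str, nonce: Optional[str] = None) -> str:
    """Build an availability probe URL whose query text differs on every call."""
    nonce = nonce or uuid.uuid4().hex
    # '#' starts a comment in SPARQL, so the nonce changes the query string
    # without changing what the query asks.
    query = f"# probe {nonce}\nASK {{ ?s ?p ?o }}"
    return f"{endpoint}?{urlencode({'query': query})}"


def probe_query(url: str) -> str:
    """Decode the SPARQL text carried by a probe URL."""
    return parse_qs(urlsplit(url).query)["query"][0]


# Hypothetical endpoint, for illustration only.
first = probe_url("http://example.org/sparql")
second = probe_url("http://example.org/sparql")
```

Even with such a probe, a positive answer only shows that the endpoint responds to a trivial query, which is why the choice of probe query deserves justification in the article.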
Section 3
- Please quantify “regular basis”.
- Figure 4 has little usefulness besides detailing VoID. Perhaps 3 and 4 are better represented as an example listing?
- The part about mapping is extremely vague. “Existing link/mapping” => Why? How? What? Where?
- Suggesting new mappings => How?
- Manually creating mappings => How? Where?
- Please detail the references to external services, and justify their use.
- How will you make the infrastructure easier to maintain?
Section 4
- As mentioned above, I don't agree with anything in the approach being specific to education, except the choice of datasets.
- Also, why would an extension decrease the value?
- How would you measure the quality of the summarized data?
Miscellaneous
Typo: “a rational for” => “a rationale for”
Typo: “the summary graph are” => “the summary graphs are”
Typo: “it as a limited” => “it has a limited”
Spelling: Various references have capitalization problems, including words such as TEL, SPARQL (x2), and URI.
Once the above issues are addressed, or if the paper is shortened to circumvent them, I think it can be a worthy addition to this special issue.