Review Comment:
______ In a nutshell
Despite being a nice and interesting exercise meant to provide recommendations and to support libraries in curating their data, the scientific contribution of the article is not clear, and major revisions would be needed to better appreciate the authors’ work. Indeed, at the end of the paper I still fail to grasp the “So what?”. The following are my doubts:
- In p.2, l.24-28, the statement “The results of this study could then be used to reproduce and extend the methodology and the ShEx definitions, as well as to identify candidate datasets for reuse in innovative and creative ways.” is not enough to justify the usefulness and the potential impact of the work done. Indeed, at the end of the paper it is not clear yet whether you are proposing (1) a methodology (2) some source code available for extensions (3) an estimate of the current state of library data quality. This is possibly the most important aspect to clarify since page 1.
- To what extent is your approach tailored to library data? Did you find any aspect (a specific problem in library data, some common characteristics of the data, some insight from the results of the benchmark) that you can claim is peculiar to this type of data? If not, why should the methodology be restricted to this type of data only? The fact that you have a case study on this type of data does not entail that it is meant for these data only. Please provide more information to justify the relevance of your work.
- It is not clear whether it is possible (and what level of effort is required) to reuse your work on another library dataset. Since ShEx schemas are designed for bespoke datasets, it would be good to understand which candidate datasets can benefit from your work as-is, and to have an estimate of how much work would be needed to apply it to other library datasets. This is fundamental to estimate the potential impact of the work.
- The authors claim they are proposing a methodology. However, this is poorly described (only the steps of the authors’ own procedure are addressed, and only one method - ShEx - is presented as the main pillar of the methodology). Moreover, descriptions of the methodology are often mixed with aspects related to the specific case study it is based on (the selected library datasets). The results of the case study are poorly described, and the discussion addresses partial information relevant to the case only, while it does not discuss the methodology. It is hard to evaluate the paper as is, since the authors’ claims cannot be demonstrated or generalised. If the goal is to validate the methodology, I’d rather focus on identifying the aspects of the methodology that you want to validate, not the quality of a few, subjectively chosen, datasets. I may also argue that two highly curated datasets, selected on the basis of a benchmark described in a non-open-access article, are not enough to validate the methodology. The authors need to significantly improve this aspect in order to claim usefulness and to allow an easier review.
______ Terminological aspects
There are some terminological ambiguities and incorrect uses of terms that puzzle me.
“quality”: In p.2 the authors mention (as examples) inaccuracy, inconsistency, and incompleteness. These terms may refer either to contents (e.g. inaccuracy and incompleteness of content data) or to the data themselves (e.g. ontological or syntactic inconsistency). The authors refer to their prior work for definitions. I had to read that article to clarify the boundaries of your work, but this is not convenient for future readers.
In the Related work, references to library data validation approaches are missing (see below for more comments). Suggested actions:
- expand the Related work with prior work on assessing data quality in libraries. I suggest having a quick look at the related work cited in [1]
- give details on the dimensions you want to address by means of ShEx and motivate them clearly. You must add a quick reminder (in the Related work section) of the results of the benchmark and the definitions of the criteria. I must stress the need to provide a clear motivation (citing your prior work as a basis for this is not enough)
- change “quality” everywhere with “data quality”.
- Once you define the dimensions you are interested in, please address these in the discussion of your results.
“GLAM”, DL and libraries. These are three different things; please use the terms appropriately. The first is an umbrella term for cultural institutions. You deal with library data only, so there is no need to emphasise (7 times) other institutions. Likewise, Digital Libraries != libraries: the former may not be the same as the latter (e.g. the DL of NASA is not a library). Please use “library data” whenever applicable. You may also consider avoiding the use of DL at all.
______ Comments on specific sections
Introduction:
- p.1 l. 44-46, unclear and possibly irrelevant statement
- p.1 everywhere, the “explorations” the authors mention wrt GLAM and SW have been going on for ~15 years. These are no longer experiments, but running services and established collaborations. I propose changing the verb whenever possible and giving fair credit to the work done by cultural institutions.
- Why is mentioning Labs relevant to this paper? It’s misleading and does not add useful information for the reader.
Related work:
- The Related work should address relevant research on the same topic so as to position *this* work in the state of the art. The section, instead, first includes a list of ontologies (some of which are not addressed later in your work), datasets (are these potential candidates for reusing your work?) and an example of data linking (why?). Secondly, it presents two examples of software solutions for LOD validation. I think the long description of ShEx-related software is not needed. I’d stress more the last paragraphs of 2.2 (“The use and application of ShEx…” and “With regard to DLs…”).
- Fig. 1 (which should be a listing, not a figure) is not particularly informative in the context of related work and could be removed. Similarly, the exemplar queries do not add much to the discussion. Consider removing them to help the text flow better.
- The conclusion drawn on Stardog ICV doesn’t sound right in this context (i.e. “These efforts are mostly concentrated on the evaluation of repositories focused on general knowledge rather than specific domains such as cultural heritage or literature.”). What is the difference with ShEx? Plus, you don’t focus on “cultural heritage or literature” either (libraries are neither of those).
- Again, the term data quality is used vaguely here with respect to the scope of the article. Here it would be nice, along with a clear definition of the dimensions you are interested in, to make clear *why* ShEx is better (and not, for example, Stardog). See also the comment on the Methodology section.
- I believe the first paragraph on SW is not needed, given the venue of the article.
- CIDOC is a standard for museums, not libraries, therefore if your scope is libraries consider mentioning FRBRoo instead.
- “In this sense” repeated twice
- p.4 l.34, considering the aforementioned terminological mismatch, you may want to refer to “LOD published by libraries” (DLs are not agents and do not publish)
- p.4 l.36, “This paper is based on the previously published benchmarking…”: you need to make clear how this paper reuses and/or extends the prior one.
Methodology:
- The authors say “We have selected ShEx in this methodology since it has become very popular in the research community. In addition, ShEx enables reproducibility allowing researchers to improve the definitions. Moreover, several tools have recently been developed to automate the validation process.” This is not sufficient, since there are many other ways to perform reproducible validation using popular methods. Please extend.
- Here you mention 4 steps in your approach and then give some motivations for the usage of one specific method. A methodology cannot be reduced to the description of an approach and one method. Moreover, the distinction between the methodology as a general framework and its instantiation in the case study is not clear. These two things should be distinguished and separately described in the article (since the authors claim they are validating the methodology). In fact, selecting datasets is an aspect relevant to the case study but not to the general methodology. Likewise, licensing is a problem that may arise in the implementation, but it has nothing to do with the methodology (I may want to apply it to my own data, which have a restrictive license for others to reuse). Please completely revise this section and move paragraphs from the methodology to section 4, where you should document the steps of the methodology as applied to the cases.
- In 3.2 and 3.3 you don’t really explain how you identify the resources and apply the rules on the basis of your approach. How do you know when you are dealing with a Work or a Person? Do you manually select classes and properties for each schema? Do you use regexes, triple patterns, etc. to detect URIs identifying individuals of those classes? You should list here the approaches that you identified (which are currently described in the execution).
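To make the question above concrete, the two detection strategies I allude to (selecting subjects by rdf:type triple patterns vs. by URI naming conventions) could each be sketched in a few lines. The snippet below is a hypothetical illustration, not the authors’ method: the example IRIs, the `/work/` naming convention, and the use of plain regex matching over N-Triples lines (instead of a real RDF parser) are all assumptions made for brevity.

```python
import re

# Hypothetical sample data; these IRIs are illustrative only.
NTRIPLES = """\
<http://example.org/agent/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
<http://example.org/work/7> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/CreativeWork> .
<http://example.org/agent/1> <http://schema.org/name> "Ada Lovelace" .
"""

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TRIPLE_RE = re.compile(r"<([^>]+)> <([^>]+)> <([^>]+)> \.")

def by_triple_pattern(ntriples, class_iri):
    """Strategy 1: collect subjects of (s, rdf:type, class_iri) triples."""
    subjects = set()
    for line in ntriples.splitlines():
        m = TRIPLE_RE.match(line)
        if m and m.group(2) == RDF_TYPE and m.group(3) == class_iri:
            subjects.add(m.group(1))
    return subjects

def by_uri_pattern(ntriples, pattern):
    """Strategy 2: collect subjects whose IRI matches a naming convention."""
    subjects = set()
    for line in ntriples.splitlines():
        m = re.match(r"<([^>]+)>", line)
        if m and re.search(pattern, m.group(1)):
            subjects.add(m.group(1))
    return subjects

persons = by_triple_pattern(NTRIPLES, "http://schema.org/Person")  # typed lookup
works = by_uri_pattern(NTRIPLES, r"/work/")  # naming-convention lookup
```

Strategy 1 only works when the data are explicitly typed; strategy 2 depends on each dataset’s URI policy, which is exactly why the paper should state which of the two (or what mix) was used for each case study.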
Assessing:
- The sections dedicated to the validation of the two datasets actually address the steps of the methodology, and only a few exemplar SPARQL queries are provided. No detailed information on the results is given, either here or in the following section.
- There is no discussion! Only a few limitations and workarounds faced during execution are addressed (and these should be described earlier, when describing the execution), and no conclusions are presented, either on the assessed data quality of the case study or on the general limitations and benefits of the methodology. This section should be massively extended to provide the reader with enough information on the following aspects:
- the results of the data quality assessment.
- What is wrong with Library data and why ShEx can help
- why the proposed methodology works well (at this point of the paper, the goal is not yet clear)
- a clear motivation for reusing your methodology in other contexts
- the limitations of both the methodology and the case studies, rather than the limitations of the data sources only.
______ Typos and language:
There are many repetitions, ambiguous terms, and generic sentences that the authors would immediately spot if they carefully proofread the text. I’d suggest using synonyms whenever possible (e.g. for explor*, exploit*), which would help the narrative flow. A few typos and suggestions follow:
- Please change all the figures into listings.
- p.1 l.42, innovate -> innovative
- p.1 l.37, “collaborative edition” sounds odd, maybe “collaborative effort”
- p. 2 l.22, data-quality -> data quality
- p.3 l.37, approach -> example
All the best,
Marilena Daquino
[1] https://doi.org/10.1002/asi.24301