A Shape Expression approach for assessing the quality of Linked Open Data in digital libraries

Tracking #: 2615-3829

Gustavo Candela
Pilar Escobar
María Dolores Sáez
Manuel Marco-Such

Responsible editor: 
Special Issue Cultural Heritage 2021

Submission type: 
Full Paper
Cultural heritage institutions are exploring Semantic Web technologies to publish and enrich their catalogs. Several initiatives such as Labs are based on the reuse of the materials published by cultural heritage institutions in innovative and creative ways. In this sense, quality has become a crucial aspect when identifying and reusing a dataset for research. In this article, we propose a methodology to create Shape Expressions definitions to validate LOD datasets published by digital libraries. This methodology is then applied to two use cases based on datasets published by relevant GLAM institutions. It intends to encourage institutions to use ShEx to validate LOD datasets as well as to promote the reuse of LOD made openly available by digital libraries.

Major Revision

Solicited Reviews:
Review #1
By Marilena Daquino submitted on 20/Feb/2021
Major Revision
Review Comment:

______ In a nutshell
Despite being a nice and interesting exercise meant to provide recommendations and support libraries in curating their data, the scientific contribution of the article is not clear, and major revisions would be needed to better appreciate the authors’ work. Indeed, at the end of the paper I still fail to grasp the “So what?”. The following are my doubts:
- In p.2, l.24-28, the statement “The results of this study could then be used to reproduce and extend the methodology and the ShEx definitions, as well as to identify candidate datasets for reuse in innovative and creative ways.” is not enough to justify the usefulness and the potential impact of the work done. Indeed, at the end of the paper it is still not clear whether you are proposing (1) a methodology, (2) some source code available for extensions, or (3) an estimate of the current state of library data quality. This is possibly the most important aspect to clarify, starting from page 1.
- To what extent is your approach tailored to library data? Did you find any aspect (a specific problem in library data, some common characteristics of data, some insight from results of the benchmark) that you can claim is peculiar to this type of data? If not, why should the methodology be restricted to this type of data only? The fact that you have a case study on this type of data does not entail that it is meant for these data only. Please provide more information to justify the relevance of your work.
- It is not clear whether it is possible (and what level of effort is required) to reuse your work on another library dataset. Since ShEx definitions are designed for bespoke datasets and schemas, it would be good to understand which candidate datasets can benefit from your work as-is and to have an estimate of how much work would be needed to apply it to other library datasets. This is fundamental to estimate the potential impact of the work.
- Authors claim they are proposing a methodology. However, this is poorly described (only the steps of the authors’ procedure are addressed, and only one method - ShEx - is presented as the main pillar of the methodology). Moreover, descriptions of the methodology are often mixed with aspects related to the specific case study it is based on (the selected library datasets). Results of the case study are poorly described, and the discussion addresses partial information relevant to the case only, while it does not discuss the methodology. It’s hard to evaluate the paper as is, since the authors’ claims cannot be demonstrated or generalised. If the goal is to validate the methodology, I’d rather focus on identifying the aspects of the methodology that you want to validate, not the quality of a few, subjectively chosen, datasets. I may also argue that two highly curated datasets, selected on the basis of a benchmark described in a non-open-access article, are not enough to validate the methodology. Authors need to significantly improve this aspect in order to claim usefulness and to allow an easier review.

______ Terminological aspects
There are some terminological ambiguities and incorrect uses of terms that puzzle me.

“quality”: In p. 2 authors mention (as an example) inaccuracy, inconsistency, and incompleteness. These terms may refer either to contents (e.g. inaccuracy and incompleteness of content data) or to the data themselves (e.g. ontology or syntactic inconsistency). Authors refer to their prior work for definitions. I had to read that article to clarify the boundaries of your work, but this is not convenient for future readers.
In the Related work, references to library data validation approaches are missing (see below for more comments). Suggested actions:
- expand the Related work with prior work on assessing data quality in libraries. I suggest having a quick look at the related work cited in [1]
- give details on the dimensions you want to address by means of ShEx and motivate them clearly. You must add a quick reminder (in the related work section) of the results of the benchmark and the definitions of the criteria. I must stress the need to provide a clear motivation (citing your prior work as a basis for this is not enough)
- replace “quality” everywhere with “data quality”.
- Once you define the dimensions you are interested in, please address these in the discussion of your results.

“GLAM”, DL and libraries. These are three different things; please use the terms appropriately. The first is an umbrella term for cultural institutions. You deal with library data only, so there is no need to emphasise other institutions (7 times). Likewise, Digital Libraries != libraries. The former may not be the same as the latter (e.g. the DL of NASA is not a library). Please use Library data whenever applicable. You may also consider avoiding the use of DL altogether.

______ Comments on specific sections
- p.1 l. 44-46, unclear and possibly irrelevant statement
- p.1 everywhere, the “explorations” the authors mention wrt GLAM and SW have been going on for ~15 years. These are no longer experiments, but running services and established collaborations. I propose to change the verb whenever possible and to give fair credit to the work done by cultural institutions.
- Why is mentioning Labs relevant to this paper? It’s misleading and does not add useful information for the reader.

Related work:
- The Related work should address relevant research on the same topic so as to position *this* work in the state of the art. The section, instead, includes first a list of ontologies (some of which are not addressed later in your work), datasets (are these potential candidates for reusing your work?) and an example of data linking (why?). Secondly, it presents two examples of software solutions for LOD validation. I think the long description of ShEx-related software is not needed. I’d put more stress on the last paragraphs of 2.2 (“The use and application of ShEx…” and “With regard to DLs….”).
- Fig.1 (which should be a listing not a figure) is not particularly informative in the context of related works and could be removed. Similarly, exemplar queries do not add much to the discussion. Consider removing them to help the text flow better.
- The conclusion made on Stardog ICV doesn’t sound right in this context (i.e. “These efforts are mostly concentrated on the evaluation of repositories focused on general knowledge rather than specific domains such as cultural heritage or literature.”). What is the difference in ShEx? Plus, you don’t focus on “cultural heritage or literature” either (libraries are neither of those).
- Again, the term data quality is used vaguely here with respect to the scope of the article. Along with a clear definition of the dimensions you are interested in, it would be nice to make clear *why* ShEx is better (and not Stardog, for example). See also the comment on Section Methodology.
- I believe the first paragraph on SW is not needed given the venue of the article.
- CIDOC is a standard for museums, not libraries, therefore if your scope is libraries consider mentioning FRBRoo instead.
- “In this sense” repeated twice
- p.4 l.34: considering the aforementioned terminological mismatch, you may want to refer to “LOD published by Libraries” (DLs are not agents and do not publish)
- p.4 l.36 “This paper is based on the previously published benchmarking…” you need to make clear how this paper reuses and/or extends the prior one.

- Authors say “We have selected ShEx in this methodology since it has become very popular in the research community. In addition, ShEx enables reproducibility allowing researchers to improve the definitions. Moreover, several tools have recently been developed to automate the validation process.” This is not sufficient, since there are many other ways to perform reproducible validation using popular methods. Please extend.
- Here you mention 4 steps in your approach and then you give some motivations for the usage of one specific method. A methodology cannot be reduced to the description of an approach and one method. Moreover, the distinction between the methodology as a general framework and the instantiation of the methodology in the case study is not clear. These two things should be distinguished and separately described in the article (since authors claim they are validating the methodology). In fact, selecting datasets is an aspect relevant to the case study but not to the general methodology. Likewise, licensing is a problem that may arise in the implementation, but it has nothing to do with the methodology (I may want to apply it to my own data that have a restrictive license for others to reuse). Please completely revise this section and move paragraphs from the methodology to section 4, in which the steps of the methodology as applied to the cases should be documented.
- in 3.2 and 3.3 you don’t really explain how you can identify the resources and apply the rules on the basis of your approach. How do you know when you are dealing with a Work or a Person? Do you manually select classes and properties for each schema? Do you use regex, triple patterns, etc. to detect URIs identifying individuals of those classes? You should list here the approaches that you identified (which are currently described in the execution).

- The sections dedicated to the validation of the two datasets actually address the steps of the methodology, and only a few exemplar SPARQL queries are provided. No detailed information on results is given, either here or in the following section.
- There is no discussion! Only a few limitations and approaches faced in the execution are addressed (these should be described earlier, when describing the execution) and no conclusions are presented, either on the assessed data quality of the case study or on the general limitations and benefits of the methodology. This section should be massively extended to provide the reader with enough information on the following aspects:
- the results of the data quality assessment.
- What is wrong with Library data and why ShEx can help
- Why does the methodology proposed work well (at this point of the paper, the goal is not clear yet)
- a clear motivation for reusing your methodology in other contexts
- the limitations of both the methodology and the case studies, rather than the limitations of the data sources only.

______ Typos and language:
There are many repetitions, ambiguous terms, and generic sentences that authors would immediately spot on a careful proofread. I’d suggest using synonyms whenever possible (e.g. for explor*, exploit*), which would help the narrative flow. A few typos and suggestions:
- Please change all the figures into listings.
- p.1 l.42, innovate -> innovative
- p.1 l.37, “collaborative edition” sounds odd, maybe “collaborative effort”
- p. 2 l.22, data-quality -> data quality
- p.3 l.37, approach -> example

All the best,
Marilena Daquino

[1] https://doi.org/10.1002/asi.24301

Review #2
By Katherine Thornton submitted on 22/Feb/2021
Major Revision
Review Comment:

The paper "A Shape Expression approach for assessing the quality of Linked Open Data in digital libraries" is a strong candidate for publication in the Special Issue Cultural Heritage 2021 once revisions can be made.

The manuscript is original in that it is the first discussion of validating bibliographic data in RDF using ShEx. The significance of the results is not yet clear to me because the example schemas use cardinalities that effectively permit 'zero or more' occurrences for all triple patterns listed, so that everything validates. The results would be more persuasive if the schemas were written so that only correctly formatted data would be found to conform, and sample data were then tested against those more restrictive schemas. The quality of the writing could be improved with copy editing to remove many grammatical errors.

The importance of this paper is that it addresses a practical application of semantic web technologies to a real-life workflow issue of validation of bibliographic data in RDF.

The usefulness of this paper is high in that the online validation examples are practical for others to consult and see in action. However, the schemas used for validation have cardinalities assigned that accept 'zero or more' instances of each triple pattern, so the schema is designed to accept any sample data as valid. A schema with fewer cardinalities of 'zero or more' would be more instructive.
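To illustrate the cardinality concern, the following is a minimal Python sketch, not the authors' schemas and not a ShEx implementation. The `conforms` function, the constraint format, and the `dc:title` / `dc:creator` sample data are all hypothetical, chosen only to show why a shape made entirely of 'zero or more' constraints cannot reject any record.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical constraint format (an assumption for illustration):
# predicate -> (min_count, max_count), where max_count=None means unbounded.
Shape = Dict[str, Tuple[int, Optional[int]]]
Node = Dict[str, List[str]]


def conforms(node: Node, shape: Shape) -> bool:
    """Return True if each constrained predicate occurs within its bounds.

    Predicates not mentioned in the shape are ignored, which mimics an
    open (non-CLOSED) shape.
    """
    for predicate, (min_count, max_count) in shape.items():
        count = len(node.get(predicate, []))
        if count < min_count:
            return False
        if max_count is not None and count > max_count:
            return False
    return True


# Every constraint is 'zero or more' (the `*` cardinality in ShEx).
permissive: Shape = {"dc:title": (0, None), "dc:creator": (0, None)}

# Exactly one title (the ShEx default) and one or more creators (`+`).
strict: Shape = {"dc:title": (1, 1), "dc:creator": (1, None)}

empty_record: Node = {}
complete_record: Node = {"dc:title": ["Don Quijote"], "dc:creator": ["ex:cervantes"]}

print(conforms(empty_record, permissive))   # True: nothing can fail
print(conforms(empty_record, strict))       # False: title and creator missing
print(conforms(complete_record, strict))    # True
```

Under the permissive shape even an empty record conforms, so a passing result carries no information about data quality; only the stricter shape separates complete records from incomplete ones.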

The relevance of this paper is very high because many libraries are interested in converting some of their bibliographic data to RDF and are looking for useful tooling.

The stability of the validation workflow depends on an external tool, the ShEx2 Simple Online Validator. This tool has been available on the web for several years; if it remains available, the example manifests and schemas will continue to be working examples.

I would favorably rate the impact of this paper if the authors add 1-2 paragraphs about how they wrote the schemas, detailing how they tested them, and if they provide alternate schemas with cardinalities other than 'zero or more' for each of the triple expressions.

I think many readers would benefit from additional text in the paper that indicates where to find the manifests and how to examine and interpret the results of validation in the ShEx2 Simple Online Validator. Currently some readers may miss the footnote and could miss out on the opportunity to see the manifests and schemas in action.

I think readers would benefit from a discussion by the authors of how librarians could use conformance results in their workflows. Being explicit about these potential benefits would strengthen the paper.

page 1
column 1

Line 33 consider replacing 'material' with 'material formats' or 'material types'

Line 38 consider replacing 'use' with 'uses'

Line 41 this sentence is missing some words, consider replacing 'innovate' with 'innovative'

Line 45 I do not understand the final sentence of this paragraph, further clarification will make this stronger

column 2
line 37 consider replacing 'edition' with 'editing'

line 40 it is not clear to me what collective collection means, further clarification will make this stronger

line 48 consider removing 'the' before Semantic Web technologies

line 50 the verb 'hindering' does not agree with the subject of this sentence

page 2
column 1
line 22 data-quality should not be hyphenated
line 31, the abbreviation DL can be used to stand in for digital library
line 40 consider adding 'and' after the comma
line 51 this sentence is not yet clear, consider revising it

column 2
line 10 the phrase 'providing a list of instructions' is not clear, consider expanding on your point here
line 32 the verb 'requires' does not agree with the subject of your sentence

page 3
line 3 the verb 'enhance' does not agree with the subject of this sentence
line 18 the verb 'run' does not make a complete sentence
line 29 the verb 'have' does not agree with the subject of this sentence
line 37 this sentence has multiple verbs that are not in agreement with one another
line 45 consider adding 'schemas' after ShEx
line 49 I don't think the parentheses around 'subject, predicate, object' are necessary

column 2
line 10 implementations should be plural
line 12 consider changing 'include' to 'includes'
line 33 including Shape Expressions (ShEx) is only necessary the first time the abbreviation is introduced

page 4
column 1
line 32 consider removing 'a' from 'such a software'
line 33 consider changing 'none' to 'no one' and consider changing 'has been carried out' to 'has carried out', consider removing 'the' before LOD
line 48 remove hyphen from 'data-quality'

line 31 consider changing 'ShEx rules' to 'ShEx schema'
line 40 this sentence is unclear

page 5
column 1
line 37 I'm not sure if 'resources' is the correct word here.

column 2
line 14 consider changing 'ShEx rules' to 'ShEx schema'
line 40 consider changing 'it' to 'they'
line 49 consider changing 'consists on' to 'consists of'

page 7
column 2
line 19 consider changing 'SPARQL sevices' to 'a SPARQL endpoint'

page 8
column 1
line 29 consider providing the English label for this Wikidata property

column 2
line 22 consider providing the English label for this Wikidata property

Review #3
By Jouni Tuominen submitted on 07/Mar/2021
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors present a compact, focused experiment on applying ShEx validation to digital libraries' datasets to foster data re-use, with two exemplifying use cases on datasets provided by two individual libraries. The presented methodology is a quite straightforward application of ShEx. From a purely technological perspective, the originality and significance of the contribution are not particularly high, but the paper would be relevant especially for researchers and practitioners working with (linked) data in digital libraries in GLAM institutions.

In the Related work / Validating LOD section, another shape-based validation language, SHACL, should be discussed: brief comparison with ShEx, and motivation for why ShEx was chosen.

The Results section should be expanded with more details: did you find any issues in the datasets during the validation, and if so, how many violations, etc.?

Generally, the language is readable, but there are some issues, and thus proofreading by a native speaker is encouraged.

Minor/language comments:

Page 1: "Galleries, Libraries, Archives and Museums (GLAM) institutions" -> "Galleries, Libraries, Archives and Museums (GLAM institutions)"

Page 2: "In addition, SPARQL provides a query language for RDF providing a list of instructions [16]." - This could be further elaborated.

Page 2: "Several major libraries (e.g., OCLC, British Library, National Library of France, publishers, and library catalog vendors)" -> "Several major libraries (e.g., OCLC, British Library, National Library of France), publishers, and library catalog vendors"

Page 3: e.g. "For instance, the type of a RDF node, literal datatype, XML String and numeric facets and enumeration of value sets." - This is not a complete sentence.

Page 3: "Another approach is based on Europeana and multilinguality describing the measures defined and providing initial interpretations of the results [37]." - This should be further elaborated: what does "measures defined" mean? What kind of interpretation?

Page 7: "The ShEx definitions have been made grouped by DL in a manifest file (see Figure 3)." -> "The ShEx definitions are grouped by DL in a manifest file (see Table 3)."

Page 8: "foaf" -> "FOAF"

Page 8-9: "404 HTTP error (Request-URI Too Large)" -> "414 HTTP error (URI Too Long)" (though some HTTP server implementations may use, e.g., the HTTP 404.14 substatus code for this)

"Research Libraries UK, A manifesto for the digital shift in research libraries, 2020, [Online; accessed 20-October-2020]." - Provide the URL.

"G. Candela, P. Escobar, R.C. Carrasco and M. Marco-Such, Evaluating the quality of linked open data in digital libraries" - Provide the year.