The Web of Data radiography: Is current Open Data good enough for Linked Data?

Tracking #: 2530-3744

Authors: 
Jhon Francined Herrera-Cubides
Paulo Alonso Gaona-García
Salvador Sanchez-Alonso

Responsible editor: 
Armin Haller

Submission type: 
Full Paper
Abstract: 
Open Data has been improving both publishing platforms and the consumers-oriented process over the years, providing better Openness policies and transparency. Although organizations have tried to open their data, the enrichment of their resources through the Web of Data has been decreasing. Linked Data has been suffering from notable difficulties in different stages of their life-cycle, becoming over the years less and less attractive to users. According to that, it decided to explore how the lack of some Opening requirements affect the decline of the Web of Data. This paper presents a Web of Data radiography, analyzing the Governmental domain as a case study. The results indicate that is necessary to enhance the data opening process to improve resource enrichment on the Web, as well as to have better datasets. These changes would have a positive influence on the overall use of the model. Given the magnitude of the problems identified, it is believed that the Web of Data model would inevitably lose the interest it aroused at the beginning if not addressed immediately by these problems. Besides, its use would be restricted to a few particular niches or would even disappear altogether.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Sebastian Neumaier submitted on 27/Aug/2020
Suggestion:
Reject
Review Comment:

The paper “The Web of Data radiography: Is current Open Data good enough for Linked Data?” presents an analysis and assessment of the state of 40 CKAN data portals. The paper reports several quality issues such as outdated datasets, insufficient licensing, and missing metadata, and concludes that Open Data should be rather provided as Linked Data.

The review considers the originality and significance of the work:

(1) originality:
In terms of originality I am not convinced by the presented methods and analyses in the paper.
I am (co-)author of the following papers:
[1] Neumaier, Sebastian, Jürgen Umbrich, and Axel Polleres. "Automated quality assessment of metadata across open data portals." Journal of Data and Information Quality (JDIQ) 8.1 (2016): 1-29.
[2] Kubler, Sylvain, et al. "Comparison of metadata quality in open data portals using the Analytic Hierarchy Process." Government Information Quarterly 35.1 (2018): 13-29.
[3] Neumaier, Sebastian, Jürgen Umbrich, and Axel Polleres. "Lifting data portals to the web of data." LDOW@ WWW. 2017.
[4] Neumaier, Sebastian, and Jürgen Umbrich. "Measures for assessing the data freshness in Open Data portals." 2016 2nd International Conference on Open and Big Data (OBD). IEEE, 2016.
[5] Neumaier, Sebastian, and Axel Polleres. "Enabling spatio-temporal search in open data." Journal of Web Semantics 55 (2019): 21-36.

The reported findings (regarding formats, licensing, metadata availability, etc.) and the selected portals match the results reported in [1,2,4]. I am certainly interested in further studies; however, given these works and the other related literature discussed in our papers, I am missing a critical differentiation from, and discussion of, this related work. I am therefore also missing a clear, novel contribution that would justify publishing the paper in the SWJ: Are there any new insights? Why is your approach different/better/improved? Can you give a critical discussion of the other works? Compared to other analyses, did Open Data change/improve/stagnate?

(2) significance of the results:
Certainly, I encourage further research on quality assessments and improvements of current Open Data. However, given the lack of novelty of the study and the missing critical discussion of related works, I do not see the paper as ready for publication in the SWJ.

Review #2
Anonymous submitted on 27/Sep/2020
Suggestion:
Reject
Review Comment:

Originality


The paper seeks to examine a sample of open data in the government domain, available on the public Web, and determine whether it could be made available and reusable as linked open data. It analyzes a set of open data in terms of licenses, format, timeliness of updates, domains, provenance, and overall data quality.

The authors make numerous assumptions without providing empirical evidence or citing canonical sources. One repeated claim is that the Web of data is declining. Taken at face value, this is not accurate, and it is not substantiated by the authors. Yes, there are issues with types of data, quality, provenance, timeliness, and whether it is FAIR, but had the authors clearly bounded their domain (government is far too broad), the paper might have been compelling and credible. As it was, it was neither.

The research questions (addressed below) included an assessment of the challenges of linking resources using linked open data principles. Had there been clarity and scope, such as federal government scientific organizations in [country | region], or local councils publishing administrative data, then the approach of examining data format, licensing, and last update might have been novel and useful. Unfortunately, the authors collected a hodgepodge of information resources, and the paper did not compellingly identify original findings.

Significance of the results / Contribution 

The key premise is to analyze 1-3 star data as indicative of the robustness or quality of the Web of data. This runs contrary to Berners-Lee’s vision of the linked open data cloud, namely 5 star linked open data. Berners-Lee said 1-4 star data are stepping stones to the real benefits of good quality 5 star data. There is considerable scholarly literature since the original 2007 statement on LOD that should have been referenced and built on.

While the paper draws on existing literature, it reviews and assembles insights in a way that does not contribute to a useful assessment of available and reusable linked open data. The flaws are in the methodology as well as in the conceptualization of how and where LOD is most useful for public sector data.

Validity /Analytical Approach 

Major concerns:

Academic writing requires specificity and bounded claims. If the authors bounded their claims and perhaps suggested a specific domain, e.g., government linked open data [on the environment] in [South America], or some other bounded claim that is provable, that would have been more credible.

The authors make numerous unsubstantiated claims. For example, the authors suggest that organizations (without defining type or jurisdiction), have tried to open their data but the Web of data has been decreasing. There is no proof offered that the Web of data is decreasing and one familiar with metrics would argue to the contrary. The authors state the Web of data is in a ‘state of decline’, again with no proof to that claim.

The methodology used a sampling of open data on CKAN, an open data portal. It has no bearing on linked open data per se. There was an overall confusion between data management platforms versus portals (catalogues).

The paper bounced between open data and linked open data, mixing and matching citations. It was unclear whether the authors misunderstand the clear distinction or had trouble with their bibliography management. Either way, it made it challenging to read and follow their arguments.

Finally, if one does not have a background in political science or law, it would be better to steer away from normative statements related to public sector norms and motivations. For example, the authors write, “openness and transparency are mandatory for the public sector…” Well, in liberal democracies, yes, that is a desirable characteristic. For countries, however, that do not espouse transparency and accountability, openness is neither mandatory nor even desirable. This is a perennial issue with technical people involved in open data. You’re in good company, but it doesn’t make it appropriate for scholarly writing. Write what you know about and don’t make unsubstantiated claims.

Quality of writing

English is a challenging second language to learn and to write scholarly research in. The authors are encouraged to review the basics of when to capitalize words. There were random and copious capitalized words for non-proper nouns, including open data and linked data. See https://www.grammarly.com/blog/capitalization-rules/

Free and subscription-based tools such as Grammarly would help with the drafting of future academic text.

The term “radiography” is used three times in the paper: in the title, in the abstract, and to describe the challenges facing the Web of data. It was never defined. It is an unusual use of the word, so if incorporating a term as an analogy, it is critical to describe the analogy. Also, grammatically the title would be ‘A Web of Data radiography: Is current open data good enough for linked data?’.

There were many issues related to citations. First, canonical sources including linked data criteria and guidelines were missing.

Canonical sources that one might expect in a paper on linked open data were missing, while less canonical sources were cited. In the future, you may wish to review:

- T. Berners-Lee, Linked data, 2006, https://www.w3.org/DesignIssues/LinkedData.html
- W3C, Best Practices for Publishing Linked Data, 2014, https://www.w3.org/TR/ld-bp/
- W3C, Data on the Web Best Practices, 2017, https://www.w3.org/TR/dwbp/
- Hyland, B. & Wood, D., 2011, The Joy of Data - A Cookbook for Publishing Linked Government Data on the Web, https://doi.org/10.1007/978-1-4614-1767-5_1

You may also find the following research useful & relevant:
- Janssen, Charalabidis & Zuiderwijk, Benefits, Adoption Barriers and Myths of Open Data and Open Government, https://doi.org/10.1080/10580530.2012.716740
- Kalampokis, Tambouris & Tarabanis, On publishing linked government data, 2013, https://doi.org/10.1145/2491845.2491869

This paper's citations often seemed random, and the text did not explain why they were selected. Numerous erroneous and misplaced citations made the text challenging to read. For example, the citations for the research approach were to a 2018 paper on open data quality metrics for a Barcelona open data portal (Abella et al 2018), a W3C Editor's Draft (2019), and a 2018 paper with five citations (Ullah, I., Khusro, S., Ullah, A., & Naeem, M. (2018), An Overview of the Current State of Linked and Open Data in Cataloging), with no explanation as to why these three sources were cited.

In an academic paper for one of the leading semantic web journals, it is expected that authors have a very clear grasp of why linked open data is recommended for use by governments. A simple description such as ‘linked open data is published by government entities to provide human- and machine-readable data to encourage interoperability and reuse’ would have demonstrated that the authors understand why some government organizations go to the effort of modeling, publishing and maintaining open government data as LOD.

There were some good aspects of this paper that I hope the authors will consider expanding in the future. The analysis of the results section was the strongest part of the paper. A future paper extending the analysis to which licenses are used with published linked open data, the frequency of updates to linked open data, domains (including an assessment of link rot), and the clarity and analysis of provenance (a huge concern for reuse!!) would be a useful contribution. Also, case studies within Spanish-speaking government entities would be helpful. The scholarly community would be enriched by good research on emerging and successful LOD projects in geographies beyond North America and Europe.

Thank you for your submission.

Review #3
Anonymous submitted on 12/Oct/2020
Suggestion:
Major Revision
Review Comment:

Summary
The article highlights the inherent challenges in producing linked open data from open data published on data platforms. It specifically investigates the accessibility and reusability of the datasets published on CKAN platforms by analysing 217,778 datasets from 40 randomly selected CKAN platforms. Availability was measured in terms of the type of licenses, update frequency, associated domains and provenance. Reusability was measured based on the number of linked resources found in the datasets. The identified pathologies in these two dimensions (availability and reusability) were outlined.
Comments

The article tackles an important problem: why linked open data resources are not growing despite the proliferation of open data platforms. The empirical approach adopted is also nice. However, given that the authors scoped their work to exclude organisational and non-technical issues, no proper context is provided for the conclusions reached. Having said this, there are some merits in the empirical findings.

Below I note some points that limit the potential contribution of the article from both a research and a practice perspective.

1) On the methodology adopted, I think the choice of random sampling, while convenient, may not allow a deep analysis of the problem. Given that these portals differ significantly in maturity, it would have been more interesting to consider a stratified sampling strategy covering high-, intermediate- and low-maturity catalogues based on established maturity benchmarks (open data barometer, etc.). This would allow the researchers to determine the pattern of issues associated with different categories of portals. Having said this, it is still possible to structure the analysis, for instance, based on the number of datasets published (as a proxy for maturity) against the identified problems in the areas of availability and reuse.
2) There is a need to provide more clarity about how the datasets were selected within a given data catalogue. It would also have been useful to compute the percentage of URLs or URIs against the total number of datasets.
3) The authors should also elaborate on how the linked resources in the datasets were counted. Are they unique counts?
4) A table of the anomalies and what to do to address them could be provided.
5) Examining a topic like linked data without considering organisational aspects may have limited practical value. Do the organisations associated with the catalogues have a sufficient level of linked data competence? Are there friendly linked data tools that could be integrated with data platforms such as CKAN to make the production of linked data easy for publishers? Etc.
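
To make points 1) to 3) concrete, the following is a minimal Python sketch of what is meant; it is not the authors' method. It assumes that the number of datasets per portal is used as the maturity proxy, and the thresholds (1,000 and 10,000 datasets), the function names, and the simple URL pattern are all hypothetical choices made only for illustration.

```python
import random
import re
from collections import defaultdict

# Deliberately simple, hypothetical pattern; real URI detection would need more care.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def maturity_stratum(dataset_count, low=1000, high=10000):
    """Bucket a portal by dataset count (an assumed proxy for maturity)."""
    if dataset_count < low:
        return "low"
    if dataset_count < high:
        return "intermediate"
    return "high"


def stratified_sample(portals, per_stratum, seed=0):
    """Sample up to `per_stratum` portals from each maturity stratum.

    `portals` maps a portal name to its number of published datasets.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for name, count in sorted(portals.items()):
        strata[maturity_stratum(count)].append(name)
    return {
        stratum: sorted(rng.sample(names, min(per_stratum, len(names))))
        for stratum, names in strata.items()
    }


def link_statistics(datasets):
    """Report total and unique link counts, and the share of datasets with links.

    `datasets` is a list of strings, e.g. serialized dataset metadata.
    """
    total_links = 0
    unique_links = set()
    datasets_with_links = 0
    for text in datasets:
        found = URL_PATTERN.findall(text)
        total_links += len(found)
        unique_links.update(found)
        if found:
            datasets_with_links += 1
    percent = 100.0 * datasets_with_links / len(datasets) if datasets else 0.0
    return {
        "total_links": total_links,
        "unique_links": len(unique_links),
        "percent_with_links": percent,
    }
```

Reporting the total and the unique counts side by side would answer point 3) directly, and the percentage relates the link counts to the total number of datasets as requested in point 2).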

Overall, there is very little novelty in the approach and results of the work. Most of the conclusions are not surprising and are well known. There is a need for deeper analysis of both the availability and the reuse dimensions. For instance, counting URLs or URIs is not enough; a deeper characterization of the datasets in terms of the vocabularies used and the links to other external datasets and resources would give a truer picture of reuse. In the area of availability, some of the data quality issues mentioned in the introduction section are related to the availability of the data. A lot of published data is practically unusable, for example due to a lack of metadata.
For the article to make significant contributions on this topic, more work as indicated in the comments above needs to be done.

Other minor issues:
What does “Web of Data radiography” mean? Do the authors mean they aim to “x-ray” the Web of Data? Several language issues significantly limit the readability of the paper or obscure the meaning intended by the authors. For example, in the abstract, the authors wrote:
“According to that, it decided to explore how the lack of some Opening requirements affect the decline of the Web of Data”. Subsequently, they wrote: “Given the magnitude of the problems identified, it is believed that the Web of Data model would inevitably lose the interest it aroused at the beginning if not addressed immediately by these problems.”