Using the W3C Generating RDF from Tabular Data on the Web Recommendation to manage small Wikidata datasets

Tracking #: 2810-4024

Authors: 
Steve Baskauf
Jessica K. Baskauf

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper

Abstract:
The W3C Generating RDF from Tabular Data on the Web Recommendation provides a mechanism for mapping CSV-formatted data to any RDF graph model. Since the Wikibase data model used by Wikidata can be expressed as RDF, this Recommendation can be used to document tabular snapshots of parts of the Wikidata knowledge graph in a simple form that is easy for humans and applications to read. Those snapshots can be used to document how subgraphs of Wikidata have changed over time and can be compared with the current state of Wikidata using its Query Service to detect vandalism and value added through community contributions.
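As an illustration of the comparison workflow the abstract describes, the following sketch (not taken from the paper; the file name, column layout, and property P571 are hypothetical) checks values recorded in a CSV snapshot against the current state of Wikidata through the public Query Service:

    import csv
    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    def current_values(qid, prop):
        # Ask the Wikidata Query Service for the current truthy values of one property.
        query = f"SELECT ?value WHERE {{ wd:{qid} wdt:{prop} ?value . }}"
        response = requests.get(ENDPOINT,
                                params={"query": query, "format": "json"},
                                headers={"User-Agent": "snapshot-comparison-example/0.1"})
        return {b["value"]["value"] for b in response.json()["results"]["bindings"]}

    # Hypothetical snapshot file with columns "qid" and "P571" (inception date).
    with open("snapshot.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            live = current_values(row["qid"], "P571")
            if row["P571"] not in live:
                print(f"{row['qid']}: snapshot value {row['P571']} differs from current {live}")

Rows that differ from the live graph are then candidates for either vandalism or legitimate community additions, which is the distinction the snapshots are meant to help document.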
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Jakub Klimek submitted on 08/Jul/2021
Suggestion:
Minor Revision
Review Comment:

After the revision, the paper seems more clearly focused, and with the new Related work section, it also seems better placed in the wider context of tabular representations of subsets of Wikidata. The authors also clarified that in the GLAM community, CSV is very well established even outside the community of IT experts. My concern regarding the usability of the approach when manually editing a large spreadsheet (> 10 columns), including statement UUIDs, still stands. However, this opinion may be biased, as I am not a member of the GLAM community and I generally dislike using spreadsheet editors.

I have a few minor issues:

1. Figure 1 is a raster graphic. I would recommend representing it as vector graphics for better printing and viewing on a high-resolution monitor. The original is actually vector graphics, so it just needs to be handled properly.
2. In Figure 2, there is a component labeled "Wikidata relational database". I would argue that this is a bit misleading, as the data in Wikidata is stored in Wikibase, the same storage used for documents in Wikipedia, and it therefore lacks relational characteristics. Whether or not Wikibase uses an underlying relational database is unimportant, as one does not access that database directly.
3. RDF/Turtle => RDF Turtle (see https://www.w3.org/TR/turtle/)
4. I still miss syntax highlighting of the included Python scripts, SPARQL queries, and JSON snippets.

Review #2
By John Samuel submitted on 23/Jul/2021
Suggestion:
Accept
Review Comment:

I thank the authors for considering my review comments and significantly modifying the article, especially the Related Work section. They have added a detailed discussion of ShEx, QuickStatements, and OpenRefine, and they have compared their approach to these tools. Thanks to the updated title, the authors have also clarified the scope of their work: small Wikidata datasets. The introduction section now explains the potential users in detail and the novelty of the work, especially the proposed use of CSV files. Thanks to Figure 2, readers can also get an overview of the proposed approach. As per my previous remarks, the authors have added links to the Wikidata pages on Schemas and Property Constraints in the subsection ‘Versioning and monitoring RDF datasets’, which may further help researchers who wish to know which features are already available on Wikidata.

Some minor corrections in Appendix B:
1. https://heardlibrary.github.io/digital-scholarship/script/wiki-data/wiki... doesn’t work. I think the correct link is https://heardlibrary.github.io/digital-scholarship/script/wikidata/wikid.... Note that the correct link has no hyphen in "wikidata".
2. https://github.com/HeardLibrary/linkeddata/blob/master/vanderbot/wikidat... doesn’t work. I think the correct link is https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/wikida.... Note the hyphen in linked-data.

Review #3
By Andra Waagmeester submitted on 26/Jul/2021
Suggestion:
Minor Revision
Review Comment:

The authors describe a method to store/extract/describe subsets of Wikidata using CSV. Given the continuous growth of Wikidata, to the extent that it becomes hard to consume Wikidata in its entirety, methods like the one described here are highly valuable.

"simple".
The authors make extensive use of the word "simple" without arguing what "simple" entails. This is particularly important where they argue that CSV is a simple format. I disagree here. CSV (or any related tabular file format) is not simple, but rather convenient due to the widespread use of applications like Microsoft Excel. The format itself, particularly in the context of data ingestion, can be problematic for various reasons. There is the issue that software used to generate CSV files applies automatic data conversion rules [1]. My main concern with CSV files is that a CSV file is basically a two-dimensional frame that is often used to map or model multidimensional data. This leads to users applying creative modelling tricks to capture multiple axes in a single cell (e.g., various delimiter characters in addition to the comma). The authors addressed this by aligning a JSON format with the CSV format, which to me does not seem so simple.
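To make the JSON-to-CSV alignment concrete, here is a minimal sketch (illustrative only, not taken from the paper; the file names, column names, and properties P571 and P131 are hypothetical) of a CSV on the Web metadata document, built as a Python dictionary and written to disk:

    import json

    # Hypothetical flat CSV "items.csv" with columns: qid, inception, located_in
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "items.csv",
        "tableSchema": {
            # Each row describes the Wikidata item identified in the "qid" column.
            "aboutUrl": "http://www.wikidata.org/entity/{qid}",
            "columns": [
                {"name": "qid", "suppressOutput": True},
                {"name": "inception",
                 "propertyUrl": "http://www.wikidata.org/prop/direct/P571",
                 "datatype": "date"},
                {"name": "located_in",
                 "propertyUrl": "http://www.wikidata.org/prop/direct/P131",
                 "valueUrl": "http://www.wikidata.org/entity/{located_in}"}
            ]
        }
    }

    with open("items.csv-metadata.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)

A csv2rdf processor combines this metadata with the flat CSV to emit the corresponding triples, so the extra structure lives in the JSON description rather than in delimiter tricks inside the cells.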

However, I am convinced that, given the widespread use of software applications that generate CSV files, CSV - with all its deficiencies - can still be a valuable user interface to knowledge graphs like Wikidata. Hence, CSV is not simple, only widely used, and as such complex, since the user needs to be aware of its deficiencies.

The first sentence, "Because of its availability and ease of use, Wikidata has become one of the widely used open knowledge graphs", nicely touches on this already. If Wikidata itself is already easy, why do we need a solution "to document tabular snapshots of parts of the Wikidata knowledge graph in a simple form that is easy for humans and applications to read"? I think we do, but not because CSV is easy or simple.

"non-professionals"
On page 2 the authors argue that the system is "... extremely simple and easily used by non-professionals". What are non-professionals? I assume the authors mean people without a computer science degree. I would write it as such, since non-computer scientists can be professionals too.

2. Related work
This section misses Wikibase Universal Bot [2], which by its own description is also aimed at users who do not want to learn how to write bots.

3. Applying the Generating RDF from Tabular Data on the Web Recommendation to the Wikibase model.

"Wikidata descriptions are handled similarly to labels except that the property schema:description replaces rdfs:label."

rdfs:label exists in the Wikidata RDF schema. However, labels are also rendered as schema:name and skos:prefLabel.

schema:description does indeed exist in Wikidata, but it is not a rendering of rdfs:label; it captures the description, which exists alongside the rdfs:label labels.
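The point is easy to verify against the Query Service; the sketch below (illustrative only; Q42 is simply a convenient example item) lists the English label and description predicates for one item, showing the same label value under rdfs:label, schema:name, and skos:prefLabel, with schema:description carrying the separate description field:

    import requests

    QUERY = """
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX schema: <http://schema.org/>
    SELECT ?predicate ?value WHERE {
      wd:Q42 ?predicate ?value .
      FILTER(?predicate IN (rdfs:label, schema:name, skos:prefLabel, schema:description))
      FILTER(LANG(?value) = "en")
    }
    """

    response = requests.get("https://query.wikidata.org/sparql",
                            params={"query": QUERY, "format": "json"},
                            headers={"User-Agent": "label-predicate-example/0.1"})
    for binding in response.json()["results"]["bindings"]:
        print(binding["predicate"]["value"], "->", binding["value"]["value"])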

[1] https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0185207
[2] https://github.com/dcodings/Wikibase_Universal_Bot

Review #4
By Tom Baker submitted on 30/Jul/2021
Suggestion:
Accept
Review Comment:

Paper 2810-4024 (Using the W3C Generating RDF from Tabular Data on the Web Recommendation to manage small Wikidata datasets), a revision of 2659-3873, significantly expands on the earlier version, growing from a main body of 9 pages (14 pages total) with 19 references and 14 footnotes to a main body of 15 pages (25 pages total) with 35 references and 29 footnotes. Most of this additional material addresses shortcomings identified in the first round of reviews - notably, the lack of a substantial section on related work. The authors have also added a diagram showing how the described components fit into a workflow, and screenshots of a GUI tool for constructing the JSON metadata used to drive CSV2RDF transforms.

The authors justify and explain these additions and changes in a lengthy and comprehensive cover letter, and I am satisfied that they have addressed the concerns raised in the first round. This has resulted in a stronger paper.

The paper makes it clear that the described approach is best suited to monitoring and updating small subsets of Wikidata "items of interest" (roughly, up to the number of items in a small art gallery) by users and small organizations with limited technical expertise. Users can limit not only the set of items, but also the set of properties, references, and qualifiers used to describe those items. I buy the argument that such users find it easier to spot and fill gaps in the data by scanning rows and columns in a spreadsheet, and by copy-pasting, than by trying to work with JSON-LD or Turtle.

The paper argues convincingly that existing tools -- QuickStatements and OpenRefine for manipulating and uploading tabular data to Wikidata; Quit Store and OSTRICH for archiving RDF graphs -- are less well-suited for use by technically non-expert users who need only to track a small number of items.

Some of the limitations of their approach are actually limitations of the Tabular Data on the Web specification (CSV2RDF): lack of support for language tags, for generating more than one triple from one column, and for multiple values.

The most serious technical objection in the first round of reviews, in my reading, was the lack of consideration of statement ranks, as ranks affect the materialization of truthy statements and thus the resulting graphs that will be compared with federated queries. In response, the authors argue that ranks are rarely used by novice users. Ranks could in principle be accommodated by adding columns to the CSV, though that may not be worth the extra complexity. They note that QuickStatements likewise does not support ranks.

Previous reviews point out the lack of evaluation or user feedback. The authors point to workshops, blog posts, videos, and "anecdotal feedback". ("In two workshops, non-programmers were able to learn about, set up, and use the system to write data to Wikidata in less than an hour.") I personally do not think that the lack of a formal user experience study should be grounds for rejecting the paper.

One significant (but fixable) flaw: Appendix A uses the 'rdf-tabular' tool to emit RDF/Turtle, but when I execute the given command using the rdf-tabular Ruby gem version 3.1.15, it raises the error "No writer found for ttl". Specifying one of the supported formats, such as N-Triples, does work. If the paper is accepted for publication, this should be addressed.

A reference or footnote should be added for Darwin Core Archives.