Review Comment:
The submission provides a survey of works addressing owl:sameAs links, and other issues relating to identity, on the Web of Data. After a general introduction and overview of the identity problem, the core of the paper delves into surveying: analyses of sameAs and identity links on the Web of Data, proposals of new (weaker) forms of identity links, services that manage and allow for looking-up same-as links, and methods for detecting erroneous identity links. The paper concludes with an overview of the work done, and some open challenges for future research.
Having worked on systems that both produce/clean and consume sameAs links, I am convinced that this topic is of fundamental importance for the Web of Data, and deserves much more attention than it has received. The sameAs topic underlies the Web of Data's "capacity" for collaborative, decentralised data integration, which is a sort of fundamental "assumption" of the Web of Data's vision, but as discussed in this paper, this capacity is not something that can be taken for granted, but rather requires work to understand and advance. While tangential and important topics relating to entity matching, link prediction, etc., do receive a lot of attention in the context of a handful of datasets, the nature of the topic changes when considered at the scope of the Web of Data. By drawing together the works in the area, I thus find this submission to be a useful contribution. Perhaps the greatest praise I can offer is that in reading the survey, despite not having worked in the area for some time now, it started giving me ideas of potential topics that would be interesting to return to and look into in more detail.
There are, however, some issues with the paper that I think could be improved.
(A) As a minor but (for me) critical issue, I take exception with the phrasing of the "sameAs problem", as highlighted by the title, and a similar claim in the abstract that "identity in the Web of Data is broken" (it is not clear what that is supposed to mean). I think this view perhaps has its roots in the paper by Halpin et al., and while there are certainly issues and flaws with sameAs, I think they have become overstated, and I think this is the wrong foot to put forward at the beginning of this survey.
When we worked on SWSE, every so often we would run a new crawl, index the raw data, and then put up a search interface to "sanity check" the data, searching for ourselves, people like Tim Berners-Lee, countries, etc., to quickly make sure we had hit the sources we would expect to have been included. Every search would yield 10+ different results referring to the same entity in different sources. Afterwards we would then run sameAs-reasoning to "consolidate" the entities in the index (based on sameAs links, inverse-functional properties, etc.) and reload the interface. The difference was "night and day": now the results for an entity would feature data drawn together from 2 sources, 5 sources, 50 sources. This was somehow something very fundamental in terms of the Web of Data vision, and it was possible because of the sameAs links, inverse-functional properties, etc., which were/are extremely useful! Of course there were problems when we ran any form of reasoning, which frustrated us sometimes, but these problems are an inevitable part of the Web of Data, and we worked on ways to address those problems (some of which are discussed in the present submission).
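To make the consolidation step concrete for readers unfamiliar with it, it can be sketched as a union-find over the asserted links. This is a minimal illustration only; it assumes the sameAs links have already been extracted as pairs of IRIs, and the IRI names are made up:

```python
# Minimal sketch of sameAs consolidation: group IRIs connected by
# (transitive, symmetric) owl:sameAs links using union-find.
# Input: a list of (iri1, iri2) pairs asserted to be owl:sameAs.
# Output: groups of IRIs to be merged into a single entity in the index.

def consolidate(same_as_pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in same_as_pairs:
        union(a, b)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

With hypothetical links `("ex:a", "db:a")`, `("db:a", "fb:a")` and `("ex:x", "db:x")`, this yields one group of three IRIs and one of two, which is exactly the "drawing together" of sources described above.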
I think after the Halpin et al. work (which was of course a very important paper), somehow people turned against sameAs, and it became the canonical example of data quality issues on the Web of Data. A weird sort of stigma took hold when discussing sameAs. This in turn (I believe) contributed to people "avoiding" sameAs, using other weaker properties like skos:closeMatch or simply not computing or publishing the links any more.
Put bluntly, as someone who has worked a lot with consuming content from the Web of Data, I would much rather have people publishing sameAs links with strong semantics that are sometimes wrong than publishing closeMatch links with no semantics that are not even wrong. So, I think sameAs links are something we should be encouraging people to publish, to contribute, to work on: I think they are an opportunity for the community, not a problem! Of course the issues need to be discussed in detail, but I personally think that putting a different foot forward in this paper, with a more positive title, and revising the abstract, would be more constructive, particularly as the article aims to become a reference work. This strange "stigma" that has come to surround sameAs is not constructive in my opinion.
(B) The authors choose not to go into detail on topics such as link discovery, entity matching, etc., which I think is justified in a way, as these techniques have a more "local" scope and have their own dedicated surveys. But through their omission, the paper is somehow not self-contained, or perhaps gives an incomplete picture. Put another way, in the current submission (an introduction to research surrounding sameAs links), these sameAs links seemingly appear out of nowhere. In order to give a full treatment of the topic, and to understand why there might be quality issues, I feel that the authors must discuss the origin of these links. This can be quite high-level, but should give the reader at least a good overview of where these links are coming from. A possible categorisation of ways to produce sameAs links would be: manual (e.g., adding links to Wikidata by hand); heuristic rules (e.g., Silk); machine learning (e.g., LIMES); OWL inference (e.g., what we did in [2]).
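To illustrate the heuristic-rule category, a toy sketch loosely in the style of a Silk linkage rule is given below; the similarity measure, threshold, and data are all illustrative assumptions, not any actual published configuration:

```python
# Toy illustration of a heuristic linkage rule: two resources are linked
# with owl:sameAs when their labels are similar enough. The similarity
# measure and the 0.9 threshold are illustrative only.
from difflib import SequenceMatcher

def label_similarity(a, b):
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def discover_links(source, target, threshold=0.9):
    """source/target: dicts mapping IRI -> label (hypothetical input)."""
    links = []
    for s_iri, s_label in source.items():
        for t_iri, t_label in target.items():
            if label_similarity(s_label, t_label) >= threshold:
                links.append((s_iri, "owl:sameAs", t_iri))
    return links
```

The point such an example would make in the survey is that the quality of the resulting links depends directly on choices like the threshold and the compared properties, which is one origin of erroneous sameAs links.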
(C) I think some of the sections tend towards being a "wall of text", listing off work after work with a brief summary of some results of interest. The result is sometimes not very compelling to read. I think a notable example of this is Section 3, and in particular Sections 3.2 and 3.3, which are basically one monolithic paragraph, and which are rather discouraging to have to face quite early in the paper. Contrasting that, each section ends with rather nice discussion that returns to the bigger picture, and some later sections feature tables that help to get an overview of the relevant works. As a general comment, I would ask the authors to break up these monolithic paragraphs, and also to look at more opportunities to add more summary tables in further sections. The authors should at least revise Sections 3.2 and 3.3, and add a table for Section 3 that describes the papers analysing sameAs links, with the number of triples, the number of sources, a summary of the type of analysis, etc. There may be opportunities to do likewise in other sections too.
(D) As a survey, one might expect something along the lines of a systematic survey methodology in terms of how the scope was defined, how keywords to capture that scope were defined, how papers were searched and filtered, etc. This is useful to give a more formal idea of how complete the survey is (or if a paper is missing, why it might be missing). The current submission does not offer this. On the other hand I found the references provided to be quite complete. Perhaps, however, the authors could think about adding some details on how the papers discussed were found (even if informal), or maybe even applying a more formal methodology to check for additional papers.
--------------------------------------------------
Reflecting on the explicit review criteria for surveys:
- (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
I think that for the most part the submission succeeds in this, but think that issues (A) and (B) would be key to address in this context!
- (2) How comprehensive and how balanced is the presentation and coverage.
Relating to issue (B), one missing aspect is a high-level discussion on where these sameAs links come from in the first place. Also I think a survey methodology, as mentioned in (D), would have helped to understand this aspect better.
- (3) Readability and clarity of the presentation.
In general I find the paper to be readable, with some exceptions (particularly section 3). I think addressing issue (C) would help to improve this aspect of the paper. Also I list some minor comments at the end that, if addressed, may hopefully improve readability and clarity further.
- (4) Importance of the covered material to the broader Semantic Web community.
I'm convinced that the topic is of importance.
In summary, I find the paper to be a useful summary of works on a very important topic that deserves more attention from the community, but that could be improved in certain aspects. My recommendation is for a minor revision in which I ask the authors to address issues (A), (B), (C) and to also perhaps consider (D). I also provide a list of minor comments below for the authors to revise.
--------------------------------------------------
MINOR COMMENTS:
* General
- Particularly given that this is a survey, I urge the authors to change the style in which they refer to papers in order to avoid using reference numbers as nouns in the discussion. By this I mean phrases such as "The approach proposed by [27]", "the authors of [16]", "a recent analysis by [5]". I know this style is (all too) common, but phrases like "The approach proposed by twenty seven" do not make sense when read, and require mental gymnastics to parse. Please change these phrases to simply use the author names: "The approach proposed by Melo [27]", "Ding et al. [16]", "a recent analysis by Raad et al. [5]". This is compatible with the journal style. Not only is this more readable, but it makes it easier to differentiate the same paper(s) mentioned in different parts of the paper, to remember papers, and is also more didactic in terms of associating different authors to different lines of research. In order to make life easy, depending on the reference style, it should be possible to use a macro in LaTeX to generate a citation like "Ding et al. [16]" directly (often using something like \citet{dingetal2020}).
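For instance, with the natbib package the suggested style can be generated directly (the bibliography key here is illustrative):

```latex
% In the preamble (numeric mode shown):
\usepackage[numbers]{natbib}

% In the text:
\citet{dingetal2020} analyse ...  % renders as: Ding et al. [16] analyse ...
\citep{dingetal2020}              % renders as: [16]
```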
- Tables: right align all numeric columns.
- Fix bad boxes; e.g., Section 6.4: the quality of owl:sameAs.
- In some cases work by the same authors is referred to in the first person plural ("we") and in other cases in the third person plural ("they"). It would be good to be more consistent on this.
* Title
- "on the Web of Data" sounds more natural to my ear than "in the Web of Data".
* Section 1
- DLs go beyond first order logics with features like transitive closure. Maybe rather write "which are based on decidable 2-variable fragments of first order logic".
- "The idea is [that] by"
- "First family of works ..." Rephrase.
- "Finally, last family of works ..." Rephrase.
- "in the Web of Data" Again I would suggest "on" rather than "in".
* Section 2
- "criticisms have been levelled against it" Which criticisms? By whom? (With references)
* Section 3
- "several studies into investigating" -> "several studies into" or "several studies investigating"
- "these studies that analyse[]"
- "[different aspects of how owl:sameAs is used], either by analysing its use"
- "number of owl:sameAs [statements] linking"
- "a recent study [18] ha[s] analysed"
- "of [a] single central resource"
- "other type[s] of analyses"
- "This problem ha[s] motivated"
- "there [are] several approaches"
* Section 4:
- "relate most [of] the concepts"
- Section 4.1: What about including owl:equivalentClass and owl:equivalentProperty? These would perhaps seem to fit here?
- "Leibniz's law[, meaning that]"
- "inferred [for] its identical"
- "approaches [being] unclear"
- "that [they do] not require existing"
* Section 5:
- "such type[s] of services"
- Table 2: What is the difference between partitions and equivalence classes? This was not clear to me from the discussion. Perhaps it could be removed? (My guess is that equivalence class here refers to sameAs relations only, but the term "equivalence class" more generally refers to an abstract mathematical object: an element of a partition of a set formed by an (abstract) equivalence relation.)
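For reference, the standard definitions my comment alludes to, against which the table's terminology could be checked:

```latex
% An equivalence relation \sim on a set S induces, for each x \in S,
% the equivalence class
%   [x] = \{\, y \in S : y \sim x \,\},
% and the set of all such classes is the partition induced by \sim:
%   S/{\sim} = \{\, [x] : x \in S \,\}.
```

Under these definitions each equivalence class is by definition an element of the (unique) induced partition, so it is unclear what two distinct columns would be counting.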
- "represent[s] an equivalence class"
* Section 6:
- "that resulted [in] 1.3M"
- "containing 34.4M owl:sameAs [links]" Also, how do we get 34.4M links from combining 3.4M links and 22.4M links (34.4M > 3.4M + 22.4M). Is this a typo perhaps?
- "statements[, m]eaning that on average"
- "link [has] caused"
- "as a mean[s] to handle"
- "This approach[] hypothesises"
- "Around half of the [approaches presented here] have"
- "in the DBTropes dataset)[, h]ence[] indicating"
* Section 7:
- "DL reasoner" I'm slightly confused by this as I don't know of any DL reasoner applied to the sorts of data that the paper introduces (which are generally not DL-compliant). I would just say "reasoner" or "OWL reasoner" if you prefer to be more specific.
- "Web of Data lack[s] [] ontological axioms"
- "inconsistencies, hence[] suggesting that"
- "[cannot] be presumed"
- "also the aspect that ... were still unknown, until recently [5]". This is a strange claim; perhaps it is underspecified? These sorts of issues have been known about since at least 2010, but even earlier really (the earliest example I can think of is the FOAF-a-matic producing a mbox_sha1sum value for empty emails, on an inverse-functional property, basically causing all people who used the tool without giving an email to be inferred to be pairwise sameAs). Revise the claim.
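The FOAF-a-matic pitfall just mentioned can be reproduced in a couple of lines: every person who leaves the email field empty receives the same foaf:mbox_sha1sum value, and since that property is inverse-functional, a reasoner would infer all of them to be sameAs (the hashing scheme below follows the usual sha1-of-mailto-IRI convention):

```python
# Demonstration of the mbox_sha1sum pitfall: an empty email yields the
# same hash for everyone, and foaf:mbox_sha1sum is inverse-functional,
# so a reasoner would merge all such people into one individual.
import hashlib

def mbox_sha1sum(email):
    return hashlib.sha1(("mailto:" + email).encode("ascii")).hexdigest()

alice = mbox_sha1sum("")  # Alice left the email field empty
bob   = mbox_sha1sum("")  # Bob did too
assert alice == bob       # identical IFP values => inferred owl:sameAs
```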