The sameAs Problem: A Survey on Identity Management in the Web of Data

Tracking #: 2430-3644

Authors: 
Joe Raad
Nathalie Pernelle
Fatiha Sais
Wouter Beek
Frank van Harmelen

Responsible editor: 
Harald Sack

Submission type: 
Survey Article
Abstract: 
In a decentralized global knowledge space such as the Web of Data, the owl:sameAs predicate is an essential ingredient. It allows parties to independently mint names, while at the same time ensuring that these parties are able to connect and complete each other's data. Since the manual creation of these links is expensive in large-scale contexts such as the Web of Data, identity links are often created automatically, with a chance of error. With several works having already proven that identity in the Web of Data is broken, we investigate in this survey the approaches tackling this "sameAs problem", with a focus on (i) conducted studies and analyses of the identity use in the Web of Data, (ii) approaches proposing alternatives for owl:sameAs, (iii) approaches proposing identity management services, and (iv) ones focusing on detecting erroneous identity statements.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 23/Apr/2020
Suggestion:
Major Revision
Review Comment:

The paper investigates the use of the owl:sameAs predicate within the semantic web domain, highlighting current issues and methods that have been investigated to solve the problem of erroneous links.

First the authors illustrate the identity problem, pointing out good examples for non-experts. Then, they report interlinking, graph structure, and quality analyses that have been performed on the owl:sameAs predicate. For each analysis a good number of references has been listed. A whole section is dedicated to the introduction of possible alternative predicates that might be used. Centralized identity management systems are also presented and a discussion of their use is reported. In Section 6, the methods used to detect erroneous links are finally described.

1 - Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
I am sorry to say that the paper is hard to read, and it might be difficult for a reader to get a general overview of the state-of-the-art solutions and open challenges. After improvements, the paper might provide a good guide for the reader who is looking for information on the use of owl:sameAs.

----
2 - How comprehensive and how balanced is the presentation and coverage.
The paper seems comprehensive. There is a lack of tables/images that can help the reader to compare the discussed issues and methodologies.

---
3 - Readability and clarity of the presentation.
I found the paper difficult to read because many details of the reviewed methods are missing or not well described.

----
4 - Importance of the covered material to the broader Semantic Web community.
The paper deals with an interesting issue and can foster future works to solve the issues of linking by owl:sameAs. It reports interesting articles that can foster future research.

I see that there is a lot of space for improvements. These are my main remarks.

Section 3.3 Discussion:
I have not found any discussion in this section. It continues to list articles and reports once more the structure of the next sections of the paper. Please provide a detailed discussion about the various analyses. Moreover, there is an introduction of possible methods to solve the detected issues, but they are discussed later in the paper. This creates confusion for the reviewer.

Section 4:
This section reports possible alternative predicates that can be used instead of owl:sameAs; however, it does not discuss the guidelines for using them, or why people failed to use them in the past. Which design aspects must be considered when choosing among them? Why is owl:sameAs used even if other predicates are more appropriate?

In addition, it is reported that these predicates lack a semantic definition. What has been done within the community to address this issue? The literature review should also include this aspect.

Section 6:
Many methods are only briefly described, and it is hard to grasp the methodologies used and to compare them. The authors are encouraged to provide more details about the reviewed works. I found it very hard to find differences between some methodologies, and I would suggest adding tables/figures to help the reader understand the characteristics of the discussed approaches.

In addition, I do not see the need to describe precision, recall and accuracy, which are well-known metrics within the community. Their description does not add information on the addressed topic. I would suggest that the authors give less space to their explanation.

Finally, the section does not satisfy the reader's expectations, since the causes that led the methods to fail are either not described or over-simplified.

Table 3.
I would suggest putting the measure values in clearly separated columns, since they are currently hard to compare.

Conclusion and Discussion section:
Finally, I consider a good survey article to not only be a good summary of state-of-the-art methods but also highlight their limitations, providing challenges and future research goals. Therefore, I would expect some more concrete suggestions instead of combining existing approaches. I think authors should extend the preliminary discussion that they have started with more insights, key observations, and some first ideas for the future works.

Minor remarks:
Footnote 2: please add the correct reference

The authors often use the adjective "recent" to introduce some works (for example, "in a recent work [7]..."), even though many of these articles are more than 10 years old.

Typos:
pag. 5 "This problem have motivated several"
pag. 17 "an exception to this are"

Review #2
By Aidan Hogan submitted on 30/Apr/2020
Suggestion:
Minor Revision
Review Comment:

The submission provides a survey of works addressing owl:sameAs links, and other issues relating to identity, on the Web of Data. After a general introduction and overview of the identity problem, the core of the paper delves into surveying: analyses of sameAs and identity links on the Web of Data, proposals of new (weaker) forms of identity links, services that manage and allow for looking-up same-as links, and methods for detecting erroneous identity links. The paper concludes with an overview of the work done, and some open challenges for future research.

Having worked on systems that both produce/clean and consume sameAs links, I am convinced that this topic is of fundamental importance for the Web of Data, and deserves much more attention than it has received. The sameAs topic underlies the Web of Data's "capacity" for collaborative, decentralised data integration, which is a sort of fundamental "assumption" of the Web of Data's vision, but as discussed in this paper, this capacity is not something that can be taken for granted, but rather requires work to understand and advance. While tangential and important topics relating to entity matching, link prediction, etc., do receive a lot of attention in the context of a handful of datasets, the nature of the topic changes when considered at the scope of the Web of Data. By drawing together the works in the area, I thus find this submission to be a useful contribution. Perhaps the greatest praise I can offer is that in reading the survey, despite not having worked in the area for some time now, it started giving me ideas of potential topics that would be interesting to return to and look into in more detail.

There are, however, some issues with the paper that I think could be improved.

(A) As a minor but (for me) critical issue, I take exception to the phrasing of the "sameAs problem", as highlighted by the title, and a similar claim in the abstract that "identity in the Web of Data is broken" (it is not clear what that is supposed to mean). I think this view perhaps has its roots in the paper by Halpin et al., and while there are certainly issues and flaws with sameAs, I think they have become overstated, and I think this is the wrong foot to put forward at the beginning of this survey.

When we worked on SWSE, every so often we would run a new crawl, index the raw data, and then put up a search interface to "sanity check" the data, searching for ourselves, people like Tim Berners-Lee, countries, etc., to quickly make sure we had hit the sources we would expect to have been included. Every search would yield 10+ different results referring to the same entity in different sources. Afterwards we would then run sameAs-reasoning to "consolidate" the entities in the index (based on sameAs links, inverse-functional properties, etc.) and reload the interface. The difference was "night and day": now the results for an entity would feature data drawn together from 2 sources, 5 sources, 50 sources. This was somehow something very fundamental in terms of the Web of Data vision, and it was possible because of the sameAs links, inverse-functional properties, etc., which were/are extremely useful! Of course there were problems when we ran any form of reasoning, which frustrated us sometimes, but these problems are an inevitable part of the Web of Data, and we worked on ways to address those problems (some of which are discussed in the present submission).
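
The consolidation step described above can be pictured with a minimal sketch. It assumes the sameAs links are already available as pairs of identifiers and groups them into equivalence classes with a union-find structure. The function name `consolidate` and the `ex…:` identifiers are hypothetical, and this is not the SWSE implementation, which per the text also drew on inverse-functional properties.

```python
from collections import defaultdict


def consolidate(same_as_links):
    """Group identifiers into equivalence classes from (a, b) sameAs pairs."""
    parent = {}

    def find(x):
        # Register unseen identifiers as their own representative.
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = defaultdict(set)
    for node in list(parent):
        classes[find(node)].add(node)
    # Sorted output makes the result deterministic.
    return sorted(sorted(members) for members in classes.values())


# Hypothetical identifiers standing in for URIs from different sources.
links = [
    ("ex1:TimBL", "ex2:timbl"),
    ("ex2:timbl", "ex3:Tim_Berners-Lee"),
    ("ex1:Ireland", "ex4:IE"),
]
entity_classes = consolidate(links)
```

With these links, the three identifiers for the same person end up in one class and the two country identifiers in another, so a search interface could present one merged result per class instead of one result per source.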

I think after the Halpin et al. work (which was of course a very important paper), somehow people turned against sameAs, and it became the canonical example of data quality issues on the Web of Data. A weird sort of stigma took hold when discussing sameAs. This in turn (I believe) contributed to people "avoiding" sameAs, using other weaker properties like skos:closeMatch or simply not computing or publishing the links any more.

Put bluntly, as someone who has worked a lot with consuming content from the Web of Data, I would much rather have people publishing sameAs links with strong semantics that are sometimes wrong than publishing closeMatch links with no semantics that are not even wrong. So, I think sameAs links are something we should be encouraging people to publish, to contribute, to work on: I think they are an opportunity for the community, not a problem! Of course the issues need to be discussed in detail, but I personally think that putting a different foot forward in this paper, with a more positive title, and revising the abstract, would be more constructive, particularly as the article aims to become a reference work. This strange "stigma" that has come to surround sameAs is not constructive in my opinion.

(B) The authors choose not to go into detail on topics such as link discovery, entity matching, etc., which I think is justified in a way, as these techniques have a more "local" scope and have their own dedicated surveys. But through their omission, the paper is somehow not self-contained, or perhaps gives an incomplete picture. Put another way, in the current submission (an introduction to research surrounding sameAs links), these sameAs links seemingly appear out of nowhere. In order to give a full treatment of the topic, and to understand why there might be quality issues, I feel that the authors must discuss the origin of these links. This can be quite high-level, but should give the reader at least a good overview of where these links are coming from. A possible categorisation of ways to produce sameAs-links would be: manual (e.g., adding links to Wikidata by hand); heuristic rules (e.g., Silk); machine learning (e.g., LIMES); OWL inference (e.g., what we did in [2]).

(C) I think some of the sections tend towards being a "wall of text", listing off work after work with a brief summary of some results of interest. The result is sometimes not very compelling to read. I think a notable example of this is Section 3, and in particular Sections 3.2 and 3.3, which are basically one monolithic paragraph, and which are rather discouraging to have to face quite early in the paper. Contrasting that, each section ends with a rather nice discussion that returns to the bigger picture, and some later sections feature tables that help to get an overview of the relevant works. As a general comment, I would ask the authors to break up these monolithic paragraphs, and also to look at more opportunities to add more summary tables in further sections. The authors should at least revise Sections 3.2 and 3.3, and add a table for Section 3 that describes the papers analysing sameAs links, with the number of triples, the number of sources, a summary of the type of analysis, etc. There may be opportunities to do likewise in other sections too.

(D) As a survey, one might expect something along the lines of a systematic survey methodology in terms of how the scope was defined, how keywords to capture that scope were defined, how papers were searched and filtered, etc. This is useful to give a more formal idea of how complete the survey is (or if a paper is missing, why it might be missing). The current submission does not offer this. On the other hand I found the references provided to be quite complete. Perhaps, however, the authors could think about adding some details on how the papers discussed were found (even if informal), or maybe even applying a more formal methodology to check for additional papers.

--------------------------------------------------

Reflecting on the explicit review criteria for surveys:

- (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
I think that for the most part the submission succeeds in this, but think that issues (A) and (B) would be key to address in this context!

- (2) How comprehensive and how balanced is the presentation and coverage.
Relating to issue (B), one missing aspect is a high-level discussion on where these sameAs links come from in the first place. Also I think a survey methodology, as mentioned in (D), would have helped to understand this aspect better.

- (3) Readability and clarity of the presentation.
In general I find the paper to be readable, with some exceptions (particularly section 3). I think addressing issue (C) would help to improve this aspect of the paper. Also I list some minor comments at the end that, if addressed, may hopefully improve readability and clarity further.

- (4) Importance of the covered material to the broader Semantic Web community.
I'm convinced that the topic is of importance.

In summary, I find the paper to be a useful summary of works on a very important topic that deserves more attention from the community, but that could be improved in certain aspects. My recommendation is for a minor revision in which I ask the authors to address issues (A), (B), (C) and to also perhaps consider (D). I also provide a list of minor comments below for the authors to revise.

--------------------------------------------------

MINOR COMMENTS:

* General
- Particularly given that this is a survey, I urge the authors to change the style in which they refer to papers in order to avoid using reference numbers as nouns in the discussion. By this I mean phrases such as "The approach proposed by [27]", "the authors of [16]", "a recent analysis by [5]". I know this style is (all too) common, but phrases like "The approach proposed by twenty seven" do not make sense when read, and require mental gymnastics to parse. Please change these phrases to simply use the author names: "The approach proposed by Melo [27]", "Ding et al. [16]", "a recent analysis by Raad et al. [5]". This is compatible with the journal style. Not only is this more readable, but it makes it easier to differentiate the same paper(s) mentioned in different parts of the paper, to remember papers, and is also more didactic in terms of associating different authors to different lines of research. In order to make life easy, depending on the reference style, it should be possible to use a macro in LaTeX to generate a citation like "Ding et al. [16]" directly (often using something like \citet{dingetal2020}).
- Tables: right align all numeric columns.
- Fix bad boxes; e.g., Section 6.4: the quality of owl:sameAs.
- In some cases work by the same authors is referred to in the first person plural ("we") and in other cases in the third person plural ("they"). It would be good to be more consistent on this.

* Title
- "on the Web of Data" sounds more natural to my ear than "in the Web of Data".

* Section 1
- DLs go beyond first order logics with features like transitive closure. Maybe rather write "which are based on decidable 2-variable fragments of first order logic".
- "The idea is [that] by"
- "First family of works ..." Rephrase.
- "Finally, last family of works ..." Rephrase.
- "in the Web of Data" Again I would suggest "on" rather than "in".

* Section 2
- "criticisms have been levelled against it" Which criticisms? By whom? (With references)

* Section 3
- "several studies into investigating" -> "several studies into" or "several studies investigating"
- "these studies that analyse[]"
- "[different aspects of how owl:sameAs is used], either by analysing its use"
- "number of owl:sameAs [statements] linking"
- "a recent study [18] ha[s] analysed"
- "of [a] single central resource"
- "other type[s] of analyses"
- "This problem ha[s] motivated"
- "there [are] several approaches"

* Section 4:
- "relate most [of] the concepts"
- Section 4.1: What about including owl:equivalentClass and owl:equivalentProperty? These would perhaps seem to fit here?
- "Leibniz's law[, meaning that]"
- "inferred [for] its identical"
- "approaches [being] unclear"
- "that [they do] not require existing"

* Section 5:
- "such type[s] of services"
- Table 2: What is the difference between partitions and equivalence classes? This was not clear to me from the discussion. Perhaps it could be removed? (My guess is that equivalence class here refers to sameAs relations only, but the term "equivalence class" more generally refers to an abstract mathematical object: an element of a partition of a set formed by an (abstract) equivalence relation.)
- "represent[s] an equivalence class"

* Section 6:
- "that resulted [in] 1.3M"
- "containing 34.4M owl:sameAs [links]" Also, how do we get 34.4M links from combining 3.4M links and 22.4M links (34.4M > 3.4M + 22.4M). Is this a typo perhaps?
- "statements[, m]eaning that on average"
- "link [has] caused"
- "as a mean[s] to handle"
- "This approach[] hypothesises"
- "Around half of the [approaches presented here] have"
- "in the DBTropes dataset)[, h]ence[] indicating"

* Section 7:
- "DL reasoner" I'm slightly confused by this as I don't know of any DL reasoner applied to the sorts of data that the paper introduces (which are generally not DL-compliant). I would just say "reasoner" or "OWL reasoner" if you prefer to be more specific.
- "Web of Data lack[s] [] ontological axioms"
- "inconsistencies, hence[] suggesting that"
- "[cannot] be presumed"
- "also the aspect that ... were still unknown, until recently [5]". This is a strange claim; perhaps it is underspecified? These sorts of issues have been known about since at least 2010, but even earlier really (the earliest example I can think of is the FOAF-a-matic producing a mbox_sha1sum value for empty emails, on an inverse-functional property, basically causing all people who used the tool but did not give an email to be inferred to be pair-wise same-as). Revise the claim.
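
The inverse-functional-property effect mentioned in the last comment can be illustrated with a minimal sketch. It assumes triples are plain `(subject, predicate, object)` tuples; the function name `infer_same_as`, the `ex:` identifiers, and the placeholder hash strings are hypothetical and do not reproduce the actual values emitted by the tool.

```python
from collections import defaultdict
from itertools import combinations


def infer_same_as(triples, inverse_functional):
    """Infer sameAs pairs for subjects sharing a value of an inverse-functional property."""
    subjects_by_key = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_key[(p, o)].add(s)

    pairs = set()
    for subjects in subjects_by_key.values():
        # Every pair of subjects with the same (property, value) is inferred identical.
        pairs.update(combinations(sorted(subjects), 2))
    return sorted(pairs)


# Placeholder values: three profiles share the hash of an empty email,
# one profile has a genuine, distinct hash.
EMPTY_HASH = "hash-of-empty-email"  # hypothetical stand-in value
triples = [
    ("ex:alice", "foaf:mbox_sha1sum", EMPTY_HASH),
    ("ex:bob", "foaf:mbox_sha1sum", EMPTY_HASH),
    ("ex:carol", "foaf:mbox_sha1sum", EMPTY_HASH),
    ("ex:dave", "foaf:mbox_sha1sum", "hash-of-daves-email"),
]
inferred = infer_same_as(triples, {"foaf:mbox_sha1sum"})
```

Here the three profiles with the shared placeholder value are inferred to be pair-wise identical, while the fourth is unaffected. With n such profiles the rule yields n(n-1)/2 erroneous pairs, which is why a single degenerate value can pollute a large equivalence class.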

Review #3
By Enrico Daga submitted on 28/Sep/2020
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

***

The article is an introduction to the sameAs problem that surveys relevant work related to the problem and presents it generally in a readable and clear way. However, there are some important issues that should be better addressed. In what follows I report on these, following the flow of the article, not the relevance or the importance of each one of them.

Identity management as a notion evokes the problem of users' identity on the Web (WebID, etc…). I think it would be more accurate to refer to the sameAs problem as identity links management instead (I was going to write entity linking, but that creates another problem!).

In ontology engineering this problem is a part of the ontology alignment one, but I understand we are not talking here about ontologies but linked open datasets. However, the distinction is blurred, and in many ways the survey has the demerit of not tackling at all this somewhat embarrassing issue: the definition of identity following the semantics of owl:sameAs is grounded in description logic and it is very restrictive (Leibniz's Law, Section 2). It was understood very early that a restrictive definition is not suitable for the LOD (citation [4]), and we cannot honestly say that data publishers imply logical equality when they use owl:sameAs, since they only respond to a recommended good practice. Why is this important? Because only by considering this premise do we understand why, for example, alternative solutions are proposed and used (seeAlso, SKOS, etc…). People using these are doing it exactly for the reason that they are more accurate *because* they are less specified and restrictive. My overall impression is that these issues are underestimated in the SW community, and by transitivity in this survey, which ends up syntactically sound but somehow "with blinders".
For example, it doesn't seem to consider the pragmatics of owl:sameAs, besides the semantics as specified in the technical documentation (the OWL specification). The authors are very well aware of how problematic the notion of identity is (Sections 2.1 and 2.2) but they don't consider it in discussing the literature. For example, the conclusions are just dismissive of alternative approaches, because "they lack semantics" (e.g. in Section 7).

The article is essentially scoped within the research of the Semantic Web community. However, the Web of Data is populated and used by other communities as well, which have different backgrounds and data cultures (e.g. library domain, humanities, bio-medicine, etc.). I was wondering whether the linking requirements of those can be reduced to owl:sameAs. Can we make the point on identity linking on the Web of Data without considering domain-specific requirements? How does the literature in those domains consider owl:sameAs? I think a survey on these issues should not exclude non-SW literature, in some way.

Section 2.2 states that the problematic notion of identity is standardized in OWL2. This is not totally accurate, as OWL2 only specifies the restrictive, logic-based notion of identity. One way of solving the problem is to expand the section by surveying, from the literature, the *tasks* that would benefit from a good management of identity links. Besides, Section 2.2 does not have a conclusion (e.g. list of requirements for identity linking). Different tasks may have different requirements (e.g. SPARQL querying vs OWL reasoning). It is known that query resolution through graph traversal is badly affected by link directionality [A]. Also, quality requirements coming from DL reasoning are different and may include elements of Trust [B].

In Section 3.1 the article reports on the finding (from a previous study) that "the majority of datasets have incoming links, whilst far fewer datasets have outgoing links, indicating that a relatively small number of datasets is linking to a relatively large amount of them". I found this claim counter-intuitive and surprising. Digging into [15], I found that the notion of dataset is inferred from that of namespace (a doubtful approximation but a necessary one, I understand from [15]). This contradicts the well-known topology of the LOD, where many SPARQL endpoints or files (a better notion of dataset) link to a few big entity hubs (DBpedia, Wikidata, Geonames, etc…). DBpedia necessarily has more incoming owl:sameAs links than outgoing ones, but data.open.ac.uk will most likely have more outgoing links. I expect there are more datasets similar to data.open.ac.uk on the LOD than ones similar to DBpedia. If I am wrong, what is the pragmatic impact of this finding on owl:sameAs management?

Section 3.2 touches on the interesting point of the graph structure, but without referring to the LOD graph. The first statement is that there are a few central entities connected to a large number of peripheral ones. However, this is true only in the inferred graph, since the links actually run the other way round. A more accurate way of saying this is that there are many peripheral entities linked to a few central ones on the LOD. I suspect that most of the studies on the matter just assume that this is not an important issue, since owl:sameAs is supposed to be symmetric and reflexive. However, how does the directionality of links affect the bottom-up semantics of sameAs? How does the topology of the LOD affect the collection and management of sameAs links?

At the end of Section 3.3: "there is several approaches that" -> "there are several approaches that"

I have a problem with Section 4.1, as I think that the alternative approaches to identity cannot be dismissed by just saying that they are less formal than the one of OWL. The semantics of many of those terms is indeed weaker than that of owl:sameAs, but this may be precisely what users need. I would say that the semantics of rdfs:seeAlso is perfectly respected in the pragmatics, while I certainly cannot say the same for owl:sameAs. What can we conclude with regard to identity link management? Is it enough to manage 1 identity link type? (I think the survey reports several findings that suggest that the answer may be no).

Also, there are other ways of expressing and managing identity that work very well on the Web and that are not discussed: ORCID, DOIs, and the publication of gazetteers for providing well-curated concept lists in the humanities.

I think that the problem with how alternative approaches to the specification of identity links are discussed is a clue for a bigger one. If the task the survey is concerned about is DL reasoning, then the section should be stashed, as those identity links are not applicable to OWL. However, if the survey is about identity and linking, then the presentation should be scoped within specific requirements (query, discovery, reasoning, referencing, linking as in providing a link for humans to another dataset, etc…). Those requirements should provide the backbone for the discussion and conclusions section, which now seems focused only on how important it is to link entities at scale in general. It would be much more useful to see this big problem discussed in the light of concrete tasks and problems that affect the users of the Web of Data.

Minor: I did not understand why there is a Notes appendix on the last page.

[A] Hartig, Olaf, Christian Bizer, and Johann-Christoph Freytag. "Executing SPARQL queries over the web of linked data." In International Semantic Web Conference, pp. 293-309. Springer, Berlin, Heidelberg, 2009.
[B] Bonatti, Piero A., Aidan Hogan, Axel Polleres, and Luigi Sauro. "Robust and scalable linked data reasoning incorporating provenance and trust annotations." Journal of Web Semantics 9, no. 2 (2011): 165-201.

Review #4
Anonymous submitted on 13/Nov/2020
Suggestion:
Minor Revision
Review Comment:

=== I review the paper according to the journal's criteria:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The paper gives a good overview on research on identity statements on the Web.

(2) How comprehensive and how balanced is the presentation and coverage.

The paper is very comprehensive. Sometimes, the paper is a bit biased toward the authors' own investigations. For instance, I doubt that the authors were the first to discover awkward owl:sameAs examples (2nd paragraph on page 2).

I could imagine that the paper may be beneficially extended into three directions:

* Computational issues
A naive forward-chaining reasoning approach on a larger dataset with a considerable amount of owl:sameAs statements increases the size of the dataset considerably due to the properties of identity. To address this, the authors of reference 19, for instance, applied specialised processing. Another relevant paper could be Calvanese et al., "Ontology-based Integration of Cross-linked Datasets", ISWC 2015.

* Not open data
The paper focusses a lot on data publicly available on the web. With the adoption of semantic technologies in industry, I wonder if there is something to say in that direction.

* Recent data sets
Many of the analyses or their supporting datasets in the paper are at least 3, and mostly 5 or more, years old. This is probably because, during that time, analysing Linked Data was something a lot of researchers did. Still, if there is a chance to get more recent results, the authors should seize it.
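
On the computational issue raised in the first of the three directions above, a minimal sketch can show why naive materialisation grows quickly. It assumes sameAs statements are plain pairs and closes them under reflexivity, symmetry, and transitivity by brute force. The function name `materialised_same_as` and the `ex:e…` identifiers are hypothetical, and the loop is deliberately naive rather than a model of any reviewed system.

```python
def materialised_same_as(links):
    """Naively close sameAs pairs under reflexivity, symmetry, and transitivity."""
    facts = set()
    for a, b in links:
        # Reflexive and symmetric statements for each asserted link.
        facts |= {(a, b), (b, a), (a, a), (b, b)}

    changed = True
    while changed:
        changed = False
        snapshot = list(facts)
        for a, b in snapshot:
            for c, d in snapshot:
                # Transitivity: (a sameAs b) and (b sameAs d) give (a sameAs d).
                if b == c and (a, d) not in facts:
                    facts.add((a, d))
                    changed = True
    return facts


# A chain of 10 hypothetical identifiers connected by 9 asserted links.
chain = [(f"ex:e{i}", f"ex:e{i + 1}") for i in range(9)]
closure = materialised_same_as(chain)
statement_count = len(closure)  # 100 statements from 9 asserted links
```

For an equivalence class of n identifiers, the closed relation holds n² statements, and every property assertion about one member is additionally copied to all n members. This quadratic growth is one reason specialised processing, such as keeping a single representative per class, tends to be preferred over naive forward chaining.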

(3) Readability and clarity of the presentation.

The text is well-written.

(4) Importance of the covered material to the broader Semantic Web community.

Yes!

==== Other comments:
* Linked Open Data principles (page 1). Those are the Linked Data principles, see footnote 1. The focus on Linked *open* Data in other places of the paper should be revisited, as the paper often means Linked Data which, in my opinion, does not need to be open.
* Title: ...Identity management... -> That term may confuse people who think of access control; see the Wikipedia article on Identity Management.
* Discussion after 5.2: Centralised naming services like DOI and PURL etc. are successful, aren't they?
* 5.1 namespaces: are you talking about the IANA-registered URI *schemes* (not "namespaces")?
* "Finally, it has now been broadly acknowledged" -> I think this has been acknowledged for quite some time