Multilingual Linked Open Data Patterns

Tracking #: 406-1519

Authors: 
Jose Emilio Labra Gayo
Dimitris Kontokostas
Sören Auer

Responsible editor: 
Guest editors Multilingual LOD 2012 JS

Submission type: 
Survey Article
Abstract: 
The increasing publication of linked data makes the vision of the semantic web a probable reality. Although it may seem that the web of data is inherently multilingual, data usually contains labels, comments, descriptions, etc. that depend on the natural language used. When linked data appears in a multilingual setting, it is a challenge to publish and consume it. This paper presents a survey of patterns and best practices to publish Multilingual Linked Data and identifies some issues that should be taken into account. As a use case, the paper describes the patterns employed in the DBpedia Internationalization project.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By John McCrae submitted on 04/Feb/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes several patterns for publishing multilingual linked data, and is generally well-written and would serve as a good introduction to people who are not familiar with linked data.

The first set of patterns describes the naming of URIs. The authors give a good overview of some of the issues of using IRIs. While it is certainly useful to provide readable URIs, it is important to note that most users interact with linked data through HTML interfaces (e.g., DBpedia's Pubby) or specialist tools (e.g., Protégé for OWL ontologies), and as such the use of labelling makes the need to rely on readable URIs less important.

The second set of patterns, on dereferencing, I found to be the weakest. The use of content negotiation for language is very useful for HTML documents but seems questionable for a linked data site, where users are not expected to view the data directly (but are likely to use an HTML interface). As such, the main argument for language-based dereferencing seems to be saving network overhead, which seems weak: the costs could easily be offset, even on mobile devices, by specialized APIs, rather than by burdening data providers with extra content-negotiation issues.
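For readers unfamiliar with the mechanism under discussion, language content negotiation can be illustrated with a minimal sketch. It is not taken from the paper: the function names `parse_accept_language` and `negotiate_language`, the fallback language, and the simplified matching rule (exact tag first, then primary subtag) are all assumptions made for illustration.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (language, q) pairs, best first.

    Hypothetical helper: handles only the common "tag;q=value" form.
    """
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        lang, _, params = part.partition(";")
        q = 1.0  # a tag without an explicit q-value has quality 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((lang.strip().lower(), q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate_language(header, available, default="en"):
    """Return the best language in `available` for the header, else `default`."""
    for lang, q in parse_accept_language(header):
        if q <= 0:
            continue
        primary = lang.split("-")[0]
        # Try the full tag first (e.g. "el-gr"), then its primary subtag ("el").
        for candidate in (lang, primary):
            if candidate in available:
                return candidate
    return default


if __name__ == "__main__":
    # A client preferring Spanish, served by a site with English and Greek data.
    print(negotiate_language("es-ES,es;q=0.9,en;q=0.8", {"en", "el"}))
```

A server applying this pattern would return only the labels in the selected language, which is the network saving weighed above against the extra burden on data providers.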

The sections on labelling and linking cover the available options very well. The section on re-use is also well written, but the "localize existing vocabularies" section does not describe an important point: how these new localizations could be found by consumers of the original resource.

The use of DBpedia as a use case seems apt, as DBpedia has already met many of the challenges of multilingual linked data covered in this paper.

Minor errors:
p1. "0,7%" (should be 0.7%)
p3. "IRI supported" (should be support)
p12. "W3c" (should be W3C)

Review #2
By Jorge Gracia submitted on 07/Feb/2013
Suggestion:
Minor Revision
Review Comment:

This paper presents a survey of patterns to publish Multilingual Linked Data (MLD) on the Web. First, the problems of internationalisation on the Web and multilingualism in LD are introduced. Then, several patterns are proposed for naming, dereferencing, labelling, long descriptions, linking, and reusing linked data in a multilingual environment. The patterns are described in a comprehensive way by providing a description, context (with examples), discussion, and pointers to related information in the literature. The article finishes by describing DBpedia as a use case of some of these patterns.

This work is relevant and timely for the community. The paper is very clear and very well structured and covers the problem well. The position of the authors is not to give a set of best practices but a survey of patterns, so that each MLD creator can choose the ones that best fit their needs. This is a good point and is consistent with the general discourse of the paper. Nevertheless, the abstract states that best practices for publishing MLD will be presented in the paper, which contradicts what Section 1 says about best practices (this can be fixed by simply omitting that reference in the abstract).

Here are some detailed comments that I hope help to improve the quality of the paper:
- The authors refer to multilingualism in Linked Open Data frequently. I wonder which parts of the discussion also apply to Linked Data in general (not only to LOD). This could be briefly mentioned somewhere in the paper.
- I would reformulate the first sentence in Section 2. It reads as if there were no language barriers in the current WWW.
- Citations or footnotes describing technical concepts (like SPARQL, RDF, ASCII, etc.) should be introduced the first time that they are mentioned in the paper.
- In the introduction it is said that 4.7% of the non-information resources employ one language tag. The notion of "non-information resources" should be explained at that point.
- In Section 2.2, the sentence "Although IRI supported increases incrementally..." has to be rephrased for clarity. It would be good to add some examples of concrete technologies (RDF or SPARQL specifications?) that support IRIs and others that do not.
- In Section 2.2, I would explain "homograph attacks" a little more (one or two sentences).
- In Section 3, I would say that numeric data are "language neutral" rather than "intrinsically multilingual". Although, strictly speaking, this is also arguable, as numeric systems are culturally dependent (see http://en.wikipedia.org/wiki/Armenian_numerals for instance).
- The penultimate paragraph in Section 3 adds very little and could be omitted.
- In the last paragraph of Section 3, the term "localized" should be explained (for readers not familiar with that expression).
- At the beginning of Section 4.1 (and later in the paper) they use the term "URI schemes", but I think not in its most commonly used meaning (see the RFC 3986 specification for URIs). See examples of URI schemes at http://www.iana.org/assignments/uri-schemes.html. The term "URI scheme" should not be overloaded and, for instance, the sentence "the first step in a linked data development lifecycle is to design good URI schemes" could safely be changed to "... to design good URIs".
- In Section 4.1.1 they mention "local names". The term should be defined beforehand.
- Regarding Sections 4.1.1 and 4.1.2, it has to be further clarified what the authors understand by opaque/descriptive URIs: is it the whole URI? Or is it just the local name?
- In 4.1.2, the sentence "Using opaque URIs may help to separate the concept from its different labels" is a bit confusing. It would be good to say something more about this.
- The last sentence of Section 4.2.1 is syntactically ambiguous: it is unclear which one is the "above mentioned functionality" (language or different representation of content).
- In 4.3 it is said that "Labels could be considered as units of textual information." I would remove that sentence to avoid wrong interpretations (like interpreting labels as words or as lexical entries).
- In Section 4.3.2, about the "multilingual labels" pattern, the authors wrote that "this pattern can be applied when labels have information in some natural language." I guess it should be "...several natural languages".
- In Section 4.4.1 ("divide long descriptions" pattern), they state that shorter descriptions benefit localisation, but this is not clear to me. In fact, SMT systems typically work better with longer texts (which provide more context to disambiguate the meaning).
- In 4.4.2 ("lexical description" pattern), the authors write: "Using this pattern, we can describe the lexical content of longer descriptions". Actually, this pattern can also be applied to short labels, if richer lexical information is needed. The lemon model has to be briefly introduced before Example 11, to make this and other examples easier to understand. Also, Example 11 could be rewritten in terms of lemon only, substituting rdfs:label by lemon:writtenRep, for instance:
:University a lemon:LexicalEntry ; lemon:form [ lemon:writtenRep "University"@en].
- In the discussion of 4.4.2, they state that providing lexical metadata for a resource supports fully automated software agents. An example is needed here to illustrate this.
- Example 13 has to be reviewed: first, two separate URIs are introduced to represent Armenia, but then one of them changes when they are linked with sameAs.
- Section 4.5.3 ("add linguistic metadata") seems to overlap with 4.4.2. I think that the differences have to be emphasised. For instance, why not include lemon-based metadata here as well? Further, I would mention SKOS-XL, which reifies the class Label so that assertions in RDF can be made about labels.
- In 4.5.3, in addition to Lexvo, more references (and maybe a comparison) could be added to resources that can provide URIs to represent languages, such as id.loc.gov or http://www.lingvoj.org. See this interesting thread: http://lists.w3.org/Archives/Public/public-lod/2012Feb/0073.html
- English is correct in general but has to be reviewed for typos. Some examples:
In abstract: "data usually contains labels" -> "data usually contain labels".
In Section 1: in the last paragraph, the sentence starting "Section 5 describes..." lacks a connector ("and"?) before "we describe..."
In Section 2: "Browsers supporting punycode automatically and convert the IRI to its punycode representation." Delete “and”?
Section 4: "Not all textual information attached to resources are labels and in fact," -> another comma before "in fact" would make the sentence clearer. Actually the whole sentence is very long and could be split in two.
Section 4.5.2: "...hence must be used careful" -> "...carefully".
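
The punycode conversion and the "homograph attacks" mentioned in the comments on Section 2 can be illustrated with a minimal sketch. It is not taken from the paper: the function name `iri_to_uri` and the example hosts are hypothetical, Python's built-in `idna` codec (which implements the older IDNA 2003 rules) stands in for whatever a browser actually does, and any userinfo part of the IRI is dropped for simplicity.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Map an IRI to an ASCII-only URI (simplified, illustrative only).

    The host is converted to punycode; the path, query and fragment are
    percent-encoded as UTF-8. Existing percent-escapes are left untouched.
    """
    parts = urlsplit(iri)
    host = parts.hostname or ""
    ascii_host = host.encode("idna").decode("ascii")
    if parts.port is not None:
        ascii_host = f"{ascii_host}:{parts.port}"
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, ascii_host, path, query, fragment))


if __name__ == "__main__":
    # Hypothetical IRI with a non-ASCII host and a Greek local name.
    print(iri_to_uri("http://bücher.example/Ελλάδα"))

    # Homograph case: the first host starts with CYRILLIC SMALL LETTER A
    # (U+0430), which renders like a Latin "a" but yields a different host.
    print(iri_to_uri("http://\u0430pple.example/"))
    print(iri_to_uri("http://apple.example/"))
```

The last two calls print different hosts even though the two IRIs look alike on screen, which is the core of a homograph attack and the reason some browsers display the punycode form instead of the IRI.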