The Apertium Bilingual Dictionaries on the Web of Data

Tracking #: 1194-2406

Authors: 
Jorge Gracia
Marta Villegas
Asunción Gómez-Pérez
Núria Bel

Responsible editor: 
Philipp Cimiano

Submission type: 
Dataset Description
Abstract: 
Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared translation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats and accessible through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web to be consumed by other semantic enabled resources in a direct manner, based on standard languages and query means. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource Description Framework) and how their data have been made accessible on the Web as linked data. As result, all the converted dictionaries (many of them covering under-resourced languages) are connected among them and can be easily traversed from one to another to obtain, for instance, translations between language pairs not originally connected in any of the original dictionaries.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Roberto Navigli submitted on 23/Nov/2015
Suggestion:
Minor Revision
Review Comment:

Making available numerous bilingual dictionaries is vital for the creation of the linguistic linked data (LLD) cloud. This paper goes in this direction and presents the publication of Apertium on the Web as (linguistic) linked data. The representation model is based on lemon, including the use of its lemon translation module extension. The use of the lemon RDF model makes the contribution strong. The RDF generation methodology is very clear, with careful URI design, linking, and publication in a Virtuoso triple store with a Pubby interface.

The paper is an important contribution to the creation of a rich LLD cloud. The dataset is of high quality. A clear example is provided (e.g. in Figure 2 we see an excerpt of the RDF triple set). The dataset is clearly useful, thanks to connecting languages, some of which are resource-poor.

Besides the RDF resource creation part, which is the core of the paper and is very clearly described, I liked Section 5, where an exploration of the graph is discussed. I particularly liked the discussion of using a third pivot language to construct or enrich a bilingual dictionary, using the one-time inverse consultation (OTIC) algorithm. Experiments on English, Catalan and Spanish are presented, where high precision but relatively low recall can be achieved.

The paper needs some proofreading for minor language issues: "as result" → "as a result", "type of languages resources" → "type of language resources", "associated to" → "associated with", "provide(s) it to the community", "different process(es)".

The cardinality operator should be applied to the numerator of Formula 1.
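
For reference, a hedged sketch of the pivot-based score presumably intended, with the cardinality operator applied to the intersection in the numerator; the notation T_p(·) for the set of pivot-language translations of an entry is illustrative and not necessarily the paper's:

\[ \mathrm{score}(t \mid s) \;=\; \frac{2 \times \lvert T_p(s) \cap T_p(t) \rvert}{\lvert T_p(s) \rvert + \lvert T_p(t) \rvert} \]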

Review #2
By Sebastian Hellmann submitted on 08/Dec/2015
Suggestion:
Minor Revision
Review Comment:

Disclaimer: this review was written together with Bettina Klimek.

The present paper "The Apertium Bilingual Dictionaries on the Web of Data" describes a Linked Data dataset based on the Apertium family of bilingual dictionaries. The aim of the paper lies in the presentation of the development and usage of the Apertium RDF dataset as a Linked Data conversion of 22 Apertium LMF-in-XML bilingual dictionaries. Overall, the paper is understandable and well written. Furthermore, the following **positive aspects** can be enumerated:

* for the dataset URLs, a dedicated website, a SPARQL endpoint and a web portal allowing human-readable search of translations are provided and working
* a direct download and access of the datasets is enabled and working via an external repository (datahub.io)
* the data has always been available when accessed
* SPARQL query results are provided in Linked Data and various other formats
* the data is integrated into the LLOD cloud and linked to other datasets; the number of internal and external links is given and correct
* metadata stating the creator, license and source of the data is explicitly declared in RDF
* the creation method described follows Linked Data dataset creation best practices and standards
* the usage of the data is fully described
* added value has been proposed by using Apertium RDF as a multilingual dataset in contrast to the bilingual source data

In the following, aspects of the paper are discussed which should be **majorly revised**:

*1. vocabulary use*
For the conversion of the Apertium dictionaries a representation model has been presented which consists of the two well-established *lemon* and LexInfo vocabularies and the less-established vocabulary of the *lemon* translation module. A detailed investigation of the representation model with regard to the aim of the paper reveals that it is highly appropriate to describe lexical translations of two or more languages. However, from what has been practically undertaken it seems that the model has not been fully used, which is due to the information in the source data. As the authors state correctly, “translations occur between specific meanings of the words” (cf. p.4), but the underlying LMF model does not provide specific meanings. Rather, the provided Sense IDs (cf. example on p.2) state only two orthographic representations of words in two different languages which are supposed to share the same meaning. Nevertheless, there is neither explicit information about the content of the meaning given nor a relation stating the semantic similarity between the words. As a consequence, in the Apertium RDF (without any external links) there are no meanings provided according to the proposed usage of the *lemon* translation model (as described in J. Gracia et al., “Enabling language resources to expose translations as linked data on the web.”, 2014). Thus, the lemon:LexicalSense resources do not point to an ontological entity via lemon:reference but rather serve as a kind of placeholder. What is more, the only properties of tr:Translation used in the data are tr:translationSource and tr:translationTarget, hence omitting the also-described tr:context and the especially important tr:translationCategory properties. This rather insufficient usage of the available vocabulary reveals that the Apertium RDF datasets are a mere transformation from LMF XML to RDF, adopting the flaws of the original dataset. The authors are advised to critically discuss the points just mentioned and to justify their actual vocabulary usage.

*2. Usage of Apertium RDF as multilingual language resource*
In the paper the authors introduce an additional value of Apertium RDF in contrast to the original Apertium bilingual dictionaries, in that the Linked Data transformation results in a (potentially) multilingual dataset. However, the quality of the obtained indirect translations between languages, which are traversed via a pivot language, is not convincing. It is comprehensible that the One Time Inverse Consultation method has been chosen to propose a way of identifying correct indirect translation candidates, given that the data does not contain explicit sense references. Nonetheless, an enrichment with such references has been undertaken by adding BabelSynset resources to the lexical senses, which enables a more straightforward approach to creating further direct multilingual translation links. The fact that many translations are linked to BabelSynsets which are identical for both the translation target and the translation source enables the introduction of the translation categories. That means that for each translation resource which fulfils this condition, a translational equivalent relation could have been stated, e.g.:

apertium:tranSetEN-ES/bench_banco-n-en-sense-banco_bench-n-es-sense-trans
    a tr:Translation ;
    tr:translationSource apertium:tranSetEN-ES/bench_banco-n-en-sense ;
    tr:translationTarget apertium:tranSetEN-ES/banco_bench-n-es-sense ;
    **tr:translationCategory trcat:directEquivalent .**

With this information, correct and true indirect translations are obtainable without any measurement or threshold filter. This can be shown by taking the same example as proposed. The task was to find the correct Catalan translation for the Spanish word “banco” by using English as pivot language (cf. p.8). What is known by traversing the two direct ES-EN and CA-EN Apertium RDF graphs is that “banco”ES has the two direct translations “bank” and “bench” in English, and “bank”EN has the two direct translations “banc” and “riba” in Catalan; also known is that “bench”EN translates directly to “banc” in Catalan, resulting in altogether five translation pairs. Looking up the lexical senses of those pairs reveals that for each translation (except for the translation of “bank”EN and “riba”CA, which has no BabelSynset) the lexical senses of both the translation source and the target point to the same BabelSynset, meaning that they are direct translation equivalents. The manual compilation of the bilingual dictionaries adds to the quality of these sense references. Under the assumption that BabelSynsets are senses and therefore language-independent concepts, it holds true that each translation that shares the same underlying concept is a direct translation of expressions in two different languages. Consequently, the correct translation of “banco”ES into Catalan can only be the one which shares the same BabelSynset(s) with an English word as the Spanish word “banco” does. As the query showed, the translation pairs “banc”CA-”bank”EN and “banco”ES-“bank”EN share the same BabelSynset (), and “banco”ES-”bench”EN and “bench”EN-”banc”CA share the same BabelSynset (). This means that the concept defined in the former resource is encoded in the three expressions “bank”EN, “banco”ES and “banc”CA, and the concept defined in the latter resource is encoded in the three expressions “bench”EN, “banco”ES and “banc”CA. That there are two concepts involved is not problematic. All this declares is that there are two concepts which are encoded with the same expression in Catalan and Spanish but with two different expressions in English. What matters is that the same BabelSynset is shared between at least one EN-ES and one EN-CA translation pair of “banco”ES, which applies to “banc”CA in this case. This is the same result as calculated by the authors with the OTIC method, but with higher precision. This synonymy investigation and matching could have been undertaken by the authors. Of course, this is only a reliable method under the prerequisite that the links to the BabelSynsets are correct. Since the authors omit an explanation of how these links have been created and of what quality the linkings are, a judgement on the correctness of the created links cannot be made. The authors are advised to add this missing information.
Overall, the current state of multilinguality in the Apertium RDF dataset carries a degree of uncertainty about the obtained translations, which reduces the quality of the multilingual data presented. Also, use of the calculated multilingual translations by third parties in machine translation could not be demonstrated.
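
To make the two approaches concrete, the following is a minimal, purely illustrative Python sketch over toy data mirroring the banco/bank/bench/banc/riba example above; the OTIC score form, the pivot dictionaries and the placeholder BabelSynset identifiers are assumptions for illustration, not taken from the paper or the dataset.

# Illustrative sketch only: toy data mirroring the banco/bank/bench/banc/riba example.
# The OTIC score form below is an assumption, not the paper's Formula 1.

es_en = {"banco": {"bank", "bench"}}                   # direct ES->EN translations
en_ca = {"bank": {"banc", "riba"}, "bench": {"banc"}}  # direct EN->CA translations

# Hypothetical sense links: (word, lang) -> set of BabelSynset ids (placeholders).
synsets = {
    ("banco", "es"): {"bn:S1", "bn:S2"},
    ("bank", "en"): {"bn:S1"},
    ("bench", "en"): {"bn:S2"},
    ("banc", "ca"): {"bn:S1", "bn:S2"},
    ("riba", "ca"): set(),   # no BabelSynset, as noted above
}

def otic_candidates(source_word):
    """Pivot-based (ES->EN->CA) candidates with an OTIC-style score."""
    pivots_src = es_en.get(source_word, set())
    scores = {}
    for pivot in pivots_src:
        for cand in en_ca.get(pivot, set()):
            # pivots reachable back from the candidate (inverse consultation)
            pivots_cand = {p for p, targets in en_ca.items() if cand in targets}
            overlap = pivots_src & pivots_cand
            scores[cand] = 2 * len(overlap) / (len(pivots_src) + len(pivots_cand))
    return scores

def synset_filtered_candidates(source_word):
    """Keep only candidates whose senses share a BabelSynset with the source."""
    src_syn = synsets.get((source_word, "es"), set())
    kept = set()
    for pivot in es_en.get(source_word, set()):
        for cand in en_ca.get(pivot, set()):
            if src_syn & synsets.get((cand, "ca"), set()):
                kept.add(cand)
    return kept

print(otic_candidates("banco"))            # e.g. {'banc': 1.0, 'riba': 0.666...}
print(synset_filtered_candidates("banco")) # {'banc'}

Under these assumptions, the OTIC scoring still lists “riba” as a lower-scored candidate, whereas the shared-BabelSynset filter keeps only “banc”.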

In addition to these two major issues, the following **minor** points also require **revision**:

*a) version information*
- Given that future work shall include an investigation of the quality of the dataset and future changes/extensions might occur, the authors are advised to add a version number to the files.

*b) http://linguistic.linkeddata.es/def/translation/lemonTranslation.owl contains Turtle, not RDF/XML*
i: curl -L -H "Accept: text/turtle" http://purl.org/net/translation redirects to the .owl file
ii: better to use owl:subClassOf than rdfs:subClassOf

*c) wrong content-type header*
- should be text/turtle instead of application/x-turtle, cf. http://www.w3.org/TR/turtle/#sec-mime
curl -I -H "Accept: text/turtle" -L http://linguistic.linkeddata.es/id/apertium/tranSetEN-ES

HTTP/1.1 303 See Other
Date: Tue, 08 Dec 2015 11:54:57 GMT
Server: Apache-Coyote/1.1
Vary: Accept,User-Agent,Accept-Encoding
Location: http://linguistic.linkeddata.es/data/id/apertium/tranSetEN-ES
Content-Type: text/plain
Content-Length: 114
Via: 1.1 linguistic.linkeddata.es

HTTP/1.1 200 OK
Date: Tue, 08 Dec 2015 11:55:02 GMT
Server: Apache-Coyote/1.1
Vary: Accept
Content-Type: application/x-turtle
Content-Length: 5451525
Via: 1.1 linguistic.linkeddata.es

*d) curl -I -L http://linguistic.linkeddata.es/id/apertium/tranSetEN-ES redirects to Location:http://linguistic.linkeddata.es/page/id/apertium/tranSetEN-ES*
- why /page/id and not just /page?

*e) wrong link redirects to lemon*
- the *lemon* vocabulary links used in the data (e.g. lemon:LexicalSense here: http://linguistic.linkeddata.es/page/id/apertium/tranSetEN-ES/bank_bankD...) redirect to a page resulting in a 404 error. All lemon URIs should be checked so that http://lemon-model.net/lemon#Form is used instead of http://www.lemon-model.net/lemon#Form (note the www. subdomain); see the check sketch below.
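
A minimal illustrative sketch of such a check, assuming Python with the requests library and only the two namespace variants mentioned above:

# Illustrative check: dereference both lemon namespace variants and report the HTTP status.
import requests

for uri in ("http://lemon-model.net/lemon#Form",
            "http://www.lemon-model.net/lemon#Form"):
    # Fragments are not sent to the server, so strip them before the request.
    resp = requests.head(uri.split("#")[0], allow_redirects=True, timeout=10)
    print(uri, "->", resp.status_code)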

*f) on section 5.2*
- The calculation with the equation on page 8, using the examples in Fig. 4, results in a score of 0.66 for "riba"@ca and not 0.5 (a worked sketch follows this list). If the given score results are based on information not shown in Fig. 4, this should be explicitly stated or the necessary information added to the figure.
- With regard to the table at http://figshare.com/download/file/2201205/1, precision for threshold = 1 ranges from 61% to 83%.
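
For reference, a worked instance under the assumption of an OTIC-style score of the form 2·|overlap| / (|pivot translations of the source| + |pivot translations of the candidate|), using only the translation pairs listed in the example above (which may not be the full content of Fig. 4):

\[ \mathrm{score}(\text{“riba”} \mid \text{“banco”}) = \frac{2 \times \lvert \{\text{bank}\} \rvert}{\lvert \{\text{bank}, \text{bench}\} \rvert + \lvert \{\text{bank}\} \rvert} = \frac{2}{3} \approx 0.66 \]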

*g) http://linguistic.linkeddata.es/def/translation-categories is not well-formed Turtle*
- it mixes qnames with angle brackets: rdfs:label "direct equivalent"@en ;

*h) An update of the Virtuoso version which enables SPARQL 1.1 would be appreciated.*

*i) Orthography and mode of expression*
- Linked Data is a proper noun and should be written capitalized.
- The whole paper should be checked for minor mistakes, e.g. “As result of..” → “As a result of..” (p.1 and p.6), “..described in the remainder if this section” → “..described in the remainder of this section” (p.4).

**Summary:**
Overall, the Apertium RDF dataset presented in this paper is of reasonable quality and provides references to all resources involved as well as to the evaluation results. Further, the data conforms to the [five star rating for Linked Open Data](http://www.w3.org/DesignIssues/LinkedData.html) and is a valuable addition of linguistic resources, including currently underrepresented languages, to the Web of Data. The representation model is well chosen and sufficiently explained. Also, the RDF generation process is clearly described and follows W3C best practices and standards for creating multilingual Linked Open Data. With regard to the vocabulary choice and the described Linked Data generation method, the Apertium RDF dataset can, therefore, be seen as a showcase for other linguistic Linked Data datasets. The aim of the authors to convert the initial Apertium bilingual dictionaries into RDF has been fulfilled. However, with regard to the actual usage of the vocabulary and the unified graph as a multilingual dataset extension, the full potential of the Semantic Web technologies described in the paper is not exploited. With regard to the [Linked Data vocabulary use rating](http://www.semantic-web-journal.net/system/files/swj653.pdf), the applied *lemon* translation module achieves four out of five stars due to missing links pointing to the dataset. In order to raise the usability and the quality of the Apertium RDF dataset, it is proposed to clearly state the known shortcomings of the original Apertium data and to extend the current Linked Data with triples identifying translation equivalents. Further, the generation method of the BabelSynset links and an evaluation of their quality should be included. Additionally, a sense matching as proposed could also be undertaken and its results evaluated in comparison to the OTIC method already applied (facultative).

Therefore, the paper is rated as a minor reject and the authors are strongly encouraged to revise the paper according to the major and minor critical aspects outlined. Having done that, it is more likely that third parties will recognize, and thus make proper use of, the Apertium RDF dataset.

Review #3
Anonymous submitted on 27/Feb/2016
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description paper to provide details about the used vocabularies; ideally using the 5 star rating provided here .

The article describes the publication as Linked Open Data of a set of bilingual dictionaries which were originally compiled as part of the Apertium project.
As the authors point out, the resource will be of great use in making available translations of lexical items in a range of languages, amongst which are also a good number of less-resourced languages. Publishing the dataset as linked data also facilitates the linking together of language pairs not originally included in the original dataset, allowing, in effect, the creation of new bilingual resources. The problems caused by differences in polysemy across languages, a possible defect of the dataset if it is to be used to create new translation pairs, are also dealt with through the use of the OTIC method.

The article gives a full explanation of the names and URIs used in the resource: the authors aimed to preserve the original identifiers in the LMF version of the dictionaries as far as possible, and ISA action recommendations were used to pattern the URIs. The conversion of the data is also described in detail, e.g., the use of OpenRefine. In addition, the scripts used in the work have been made openly available. The overall methodology for converting the source data into a linked dataset is also described, and the main vocabularies used in the dataset are given. The paper also describes a novel extension to the lemon model (which was the overall basis for the conversion) for handling such translations, which has been included in the OntoLex-Lemon model. This will be of great benefit for those who need to represent multilingual lexical resources featuring translations as RDF, as lemon is currently accepted as a de facto standard for representing lexical resources in RDF. There are also a number of helpful examples clearly presented in the paper.

The article includes a table explaining the different language pairs covered, along with the number of triples and translations present. The number of links between the Apertium LOD dataset and LexInfo and BabelNet is described. The dataset is available on datahub, as well as via a SPARQL endpoint and a dedicated portal. The authors also point out that the datasets are available under a GNU General Public License.

As the authors explain, the newness of the resource means that at present the amount of third-party usage is limited. However, they point out that BabelNet is exploring the use of the resource to improve its translations.

Overall, the paper is very clear and gives a sufficiently detailed description of the resource, from the provenance of the data to the methodology used to derive it, and its publication. Given that it is a new resource, it is clear that third-party reuse has not yet taken off, but it is also clear that the LLOD version of the Apertium dictionaries will make an extremely useful contribution to the linguistic linked open data cloud.