Evoke: exploring and extending lexicographic resources using a linked data approach

Tracking #: 2690-3904

Authors: 
Sander Stolk

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Tool/System Report
Abstract: 
Lexicographic resources such as thesauri contain a wealth of information for research, but their published forms not uncommonly limit the ways in which users can interact with them. The web application Evoke offers users functionality for viewing, navigating, extending, and analysing content of topical thesauri. Its use of linked data mechanisms and a novel architecture (relying on the use of data catalogues, internet browser storage, annotation of URIs) addresses one of the more intractable problems in modern lexicography: allowing users to engage more fully with published lexicographic content without them infringing on licenses or requiring additional hosting. Users of Evoke can engage with lexicographic content by annotating, adding tags, and building custom queries without the need to have full, unlimited access to the entire dataset of the lexicographic resource. Evoke is one of the first applications that provides a user-friendly interface for working with Linguistic Linked Data resources, opening these resources up to users who may not have advanced knowledge of linked data and RDF technologies. Students as well as established researchers have confirmed the user interface and underlying architecture of Evoke to be of value in answering novel research questions relying on the data of these lexical treasure troves -- complemented by their own.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Armando Stellato submitted on 30/Mar/2021
Suggestion:
Reject
Review Comment:

The author introduces a web application for facilitating browsing and annotation of topical thesauri expressed through the OntoLex-Lemon vocabularies. Among its characteristics: creation of personal annotations stored within the browser, a UI mediating access to online lexical resources, dedicated APIs for accessing resources for which the publisher does not want to openly share content. It is claimed that students as well as established researchers have confirmed the usefulness of such a tool.

I think the main limitations of this contribution lie in the stage of maturity of the platform (this is substantially a “report on tools and systems” article and, as such, the analysis of the platform is part of the review) and in the shallow description of its peculiar characteristics provided by the article.

Besides the availability of source code (actually non-availability, https://github.com/ssstolk/evoke is empty), the tool doesn’t seem to follow a download-and-install policy, rather offering a centralized service that people can access and interact with. This is obviously a possibility, but then I see two major flaws, especially in consideration of reviewing this contribution as a system paper:
* I tested the online version and it doesn’t allow users to add new thesauri
* There’s a mention of available thesauri, but there is only one: the Thesaurus of Old English
Concerning quality, importance and impact of the described tool, the article provides no evidence. It is only claimed that students as well as established researchers have confirmed the usefulness of such a tool, but no official information is provided besides a reference to a yet-to-be-published workshop paper which, in turn, will be further extended in a local journal.

In the related works, other relevant projects that could be cited are ISA2 actions VocBench 3: https://ec.europa.eu/isa2/solutions/vocbench3 and PMKI (https://ec.europa.eu/isa2/actions/overcoming-language-barriers).
Furthermore, I see a lack of explicitly mentioned (and compared) platforms. Specifically related to the OntoLex standard, I could cite at least:

* The aforementioned VocBench 3 (http://vocbench.uniroma2.it/)
Stellato, A., Fiorelli, M., Turbati, A., Lorenzetti, T., van Gemert, W., Dechandon, D., Laaboudi-Spoiden, C., Gerencsér, A., Waniart, A., Costetchi, E., Keizer, J.: VocBench 3: A collaborative Semantic Web editor for ontologies, thesauri and lexicons. Semantic Web 11(5), 855-881 (Aug 2020)

* LexO-lite (https://github.com/andreabellandi/LexO-lite)
Bellandi, A., Giovannetti, E., Piccini, S., & Weingart, A. (2017). Developing LexO: A Collaborative Editor of Multilingual Lexica and Termino-ontological Resources in the Humanities. Proceedings of Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017), co-located with the 12th International Conference on Computational Semantics (IWCS), 19 September 2017 Montpellier.

It is worth noting that both platforms offer an abstraction of OntoLex over RDF (which is claimed to be a distinguishing and unique feature of Evoke).

As observed from the online demo/tutorial linked in the article, the main browsing mechanism is by search and then by exploration of the hierarchy. I find, however, that having to choose between the view of the resources and the hierarchy which, additionally, is limited to one node and its direct children (plus the visualization of the linear path from the root to the node) is rather inconvenient. The feeling is like being stuck in a small place and never getting the overall view. The author claims that the “Development of the features necessary for exploring such content has been informed by feedback from both researchers and students (gathered through workshops, courses, and questionnaires) since the release of the first prototype of Evoke in 2018”. As my personal preferences differ a lot from this kind of view (but I’m one person) and the feedback I got (same: surveys, workshops and – most important – community-open mailing lists) about similar systems is definitely in line with my perspective, I am curious to know whether there is any questionnaire that could support this choice (if so, I suggest adding it to the paper) or whether this aspect has never been covered specifically: possibly, it has always been treated with a different granularity and there has never been any occasion to analyze this particular aspect.

TECHNICAL NOTES:
Section 3.1 mentions access to datasets through SPARQL or API calls. As this is a matter of compliance, it would be interesting to know which APIs have been implemented, whether the system can be dynamically extended with connectors to further APIs, etc. This is only clarified in section 3.3 (it is Evoke’s own API), but it could have been made clear earlier instead of leaving many vague mentions of APIs to be explained later. Furthermore, I assume that this dedicated Evoke API has to be adopted by the resource publishers, which is a strong requirement. The topic could then have been elaborated further, for instance by mentioning existing open APIs (LDP, Hydra, GraphQL, etc.) and clarifying whether Evoke’s API relies on them or is built entirely on a core technology/protocol/paradigm (e.g. REST, or SOAP Web Services). The article does not provide any qualitative hint about it, except for pointing, in a footnote, to: http://evoke.ullet.net/api (which, incidentally, is a dead link).

About storage of information: in a world that is progressively going cloud-based, locally-stored annotations are hardly an advantage. The pros described by the author (mainly, a lightweight application) would make more sense if the entire Evoke were a local system (e.g. a browser plugin) for accessing data on the web. Once a web application is developed, if the feature “add your own thesaurus” is active (thus requiring the possibility to gather contributions from an entire community), supporting management of users and their own data within the platform should be a no-brainer. This does not exclude the possibility to save your own data. If an objection to system-stored data is that the size of the data for each user is a problem, then storing it inside the browser should be an even more serious one.

MINOR REMARKS and TYPOs:

“By holding down the mouse” (p. 3). I’m not a native speaker, but I guess this synecdoche is not common in English and it should be “By holding down the mouse button”

s. 3.4, p. 5: “adheres to the Web Annotation standard...”. Please move citation 21 directly after that

Review #2
By Jorge Gracia submitted on 26/Apr/2021
Suggestion:
Major Revision
Review Comment:

This short paper is a tool/system description paper that introduces Evoke, a new tool that provides a web-based user interface for linguistic linked data resources modelled with OntoLex-Lemon. This work focuses on topical thesauri in particular, and the tool provides mechanisms for viewing, navigating, extending, and analysing their content. Users can formulate queries to the tool and can extend the data with annotations and potentially with links to other datasets. The main aim of this work is to allow users to engage more fully with published lexicographic content without them infringing on licenses or requiring additional hosting.

The article is clear and well written. It is well structured and easy to read and follow. This work is very timely and relevant and could have an impact on the adoption of linguistic linked data (LLD) techniques by the specific communities of terminologists and lexicographers. Further, it might serve as inspiration to develop similar tools (or future releases of this one) that attract other end user communities as well. In fact, although linked data is a mature field (as is its LLD subfield), there is a shortage of proper user interfaces that bridge the gap between final users and Semantic Web experts and developers, and this work constitutes a decisive step in that direction. The tool is accessible online, along with a demo. However, the GitHub repository (https://github.com/ssstolk/evoke) seems to be empty at the time of writing this review.

Despite the interest of the approach and the developed tool, there are a number of issues to be addressed in order to further increase the quality of this submission as a journal publication.

The main issue of this work is the lack of a user-centered evaluation. The paper's hypothesis is that the use of LLD techniques through the Evoke platform can largely benefit both lexicographic data publishers and users in their daily work and in attaining their research goals. However, this has not been confirmed through empirical evidence. Ideally, a study should be conducted with users in which clear metrics are defined and reported: for instance, measuring the time reduction for some common tasks, gauging user experience through a survey, or measuring any other aspect that might be relevant to validate the hypothesis.

Another hypothesis of this paper is that Evoke can bridge licensing barriers when users interact with lexicographic works available online. Although some qualitative justification is provided in the paper, it would greatly benefit from concrete examples (e.g., resource X cannot be freely accessed because of its license Y, but Evoke can channel the user query Z without violating copyright because of ...).

I also miss a detailed comparison with other frameworks devoted to offering proper user interfaces for linguistic linked data, such as VocBench (http://vocbench.uniroma2.it/) by Armando Stellato and his team at University of Rome, or LexO, developed by Andrea Bellandi and colleagues at ILC in Pisa. See:

* A. Stellato et al. "VocBench 3: A collaborative Semantic Web editor for ontologies, thesauri and lexicons", Semantic Web, vol. 11, no. 5, pp. 855-881, 2020

* A. Stellato et al. "VocBench: A Web Application for Collaborative Development of Multilingual Thesauri", In Proc. of ESWC 2015, Lecture Notes in Computer Science, 9088, 38-53, Springer International Publishing, 2015

* A. Bellandi and E. Giovannetti. "Involving Lexicographers in the LLOD Cloud with LexO, an Easy-to-use Editor of Lemon Lexical Resources". In Proc. of the 7th Workshop on Linked Data in Linguistics (LDL-2020), pp 70-74, May 2020

* A. Bellandi et al. "Developing LexO: A Collaborative Editor of Multilingual Lexica and Termino-ontological Resources in the Humanities," in Proceedings of Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017)

According to the SWJ guidelines, impact should be justified for system papers. In that regard, more metrics indicating impact and adoption of the tool should be provided besides its use on the TOE use case. Impact can be proved, for instance, with the number of downloads of the tool, or the number of unique visitors to the web service. If there were any other project, initiative, institution, researcher, using Evoke they should be also reported. Also, current plans to enhance the uptake of Evoke could help to demonstrate potential impact.

I think that the paper would largely benefit from a figure with an architecture outline.

In section 3.2 it is stated that Evoke assumes OntoLex-Lemon as data model with its adaptation to topical thesauri (through 'lemon tree'). It is unclear, though, whether it supports general lemon lexicons (not only topical thesauri) as well as dictionaries represented with the OntoLex module for lexicography (lexicog). If not, it would be good to know about future plans to support lexicog, if any.

To make a stronger case for the benefits of linked data for lexicographic work, I would recommend consulting the work by Julia Bosque-Gil, for instance:

* J. Bosque-Gil et al. "Linked data in lexicography". Kernerman Dictionary News, 19–24. July 2016

Other minor remarks:

- I would define some notions such as "topical thesaurus" and "linguistic linked data" the very first time they appear in the introduction, for readers not so familiarised with them.

- Some spacing problems need to be fixed. For instance in section 2 "... COST Action Nexus Linguarum (2019-23).Tooling... " -> " COST Action NexusLinguarum (2019-23). Tooling" or "...tools LingHub[10], which offers " -> "...tools LingHub [10], which offers... ", "...The Historical Thesaurus of English[8],..." -> "The Historical Thesaurus of English [8],"

- The use of opening/closing single quotation marks (') needs reviewing.

- The example in Listing 1 needs a more detailed explanation.

- Figure 4 possibly needs a more detailed caption. What is the sense information specific to riddle47?

- In section 4 it is stated that "Examples of research done within this project are linking up words (or word senses, rather) from Old Frisian and Old Dutch to the thesaurus taxonomy." But the data sources (Old Frisian, Old Dutch) are not cited.

- English is good but would benefit from a final check. For instance in section 3: "...an international standard specifically for expressing datasets..." -> "...an international standard specifically developed/designed/created for expressing datasets..." ; "allowing them to share it in the manner of their choosing" -> "... choice"

Review #3
Anonymous submitted on 24/May/2021
Suggestion:
Major Revision
Review Comment:

The paper clearly fits in the proposed section ('Tools and Systems Report') since it presents a client-based user interface to browse through thesauri resources available as LOD endpoints. The core of the paper provides a precise description of the technical setting, in particular the relation to the back ends (typically SPARQL endpoints), the features in the interface and the necessary storing (of annotations) and linking mechanisms to make the whole environment work.
Still, the technical content is encapsulated within an argument justifying the approach used by the implementer that is clearly not well articulated and even at times misleading. The main issues are the following ones:
* The title relates to lexicographic resources at large, but the actual tool only tackles the very specific genre of thesauri.
* The introduction refers to a whole series of issues with existing dictionaries which is very difficult to follow, because one cannot identify whether this relates to the paper format (“physical, rigid structure of paper editions”), existing digital implementations (which ones?), publishers’ policies (licenses) or user-side specificities (annotations). At the beginning of the second page, when one reads “The two issues”, I must say I could not see what was referred to.
* It is then very disturbing that the mentioned issues are not used right afterwards in the core part of the paper to show in which way the proposed system answers them, and one has to wait until section 3.2 to see some references to the issues of interoperability or licensing.
* The paper also lacks a real comparison with other possible approaches. Section 2 (related work) is exclusively oriented towards the LLOD domain, but the notion of thesauri has been tackled at large either in the realm of knowledge systems (à la SKOS) or in relation to terminology management systems. At some point, this lack of explicit comparison creates a feeling of wrongly targeted argumentation, as when it is stated that the environment brings “an increased level of interoperability” while nothing is said about what the comparison point is.
* Finally, whereas the issue of annotating content seems an important aspect of the work, hardly anything is said concerning the actual types of annotation mechanisms that have been implemented. In the same way, I would have liked to see more details on how to make sure that heterogeneous endpoints could implement licensing information in a comparable way. Does there exist a standard vocabulary for this?
I would suggest that the paper refrains from any kind of generic statement and only puts forward the couple of issues it wants to demonstrate to be solving.

Finally, the paper addresses the impact of the tool through the description of a collaborative use case of exploring the Thesaurus of Old English. As such the impact remains within the project’s own range of influence.

Corrections and typos:
- P1 C1 L35 “Manfred Görlach” => “Manfred Görlach and colleagues”
- P1 C1 L40 “part of speech” => “parts of speech”
- P1 C1 L47 “within” => “within it”
- P2 C2 L15/16 delete one of “offers grants”
- P3 C1 L26 “explore” => “explore them”