Orbis: Explainable Benchmarking of Information Extraction Tasks

Tracking #: 2877-4091

Authors: 
Adrian M.P. Brasoveanu
Albert Weichselbraun
Roger Waldvogel
Fabian Odoni
Lyndon Nixon

Responsible editor: 
Anna Lisa Gentile

Submission type: 
Full Paper
Abstract: 
Competitive benchmarking of information extraction methods has considerably advanced the state of the art in this field. Nevertheless, methodological support for explainable benchmarking, which provides researchers with feedback on the strengths and weaknesses of their methods and guidance for their development efforts, is very limited. Although aggregated metrics such as F1 and accuracy support the comparison of annotators, they do not help in explaining annotator performance. This work addresses the need for explainability by presenting Orbis, a powerful and extensible explainable evaluation framework that supports drill-down analysis, multiple annotation tasks and resource versioning. It therefore actively aids developers in better understanding evaluation results and identifying shortcomings in their systems. Orbis currently supports four information extraction tasks: content extraction, named entity recognition, named entity linking and slot filling. This article introduces a unified formal framework for evaluating these tasks, presents Orbis’ architecture, and illustrates how it (i) creates simple, concise visualizations that enable visual benchmarking, (ii) supports different visual classification schemas for evaluation results, (iii) aids error analysis, and (iv) enhances the interpretability, reproducibility and explainability of evaluations by adhering to the FAIR principles and by using lenses that make implicit factors impacting evaluation results, such as tasks, entity classes, annotation rules and the target knowledge graph, more explicit.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Filip Ilievski submitted on 25/Mar/2022
Suggestion:
Major Revision
Review Comment:

This paper describes a framework for "explainable" benchmarking, which extends prior benchmarking systems with more information extraction (IE) tasks, version control, and visualization tools.

I find the goal of shedding light on information extraction evaluation to be very valuable. Similarly, evaluating IE tasks jointly is a good idea, especially for the tasks chosen in this work, which are compositional (SF>NEL>NER>CE). The framework's focus on versioning and on visualization of the system and gold predictions would intuitively help tool developers debug and understand the behavior of their system as a function of the different benchmarks, tasks, and KG versions. The related work coverage is good.

There are two main challenges that prevent me from suggesting acceptance at this stage.

1) Presentation - The paper should be better structured to make explicit its contributions, and how these contributions are justified by the proposed framework and its evaluation. The background section is surprisingly long and combines background information with decisions made in the Orbis system (e.g., in 3.1.2) and with an in-depth discussion of challenges in evaluating some tasks (3.3). Meanwhile, Section 3 does not follow a consistent logic - the subsection on the entity linking task is very detailed, whereas the subsection on NER is much shorter. In some places, writing seems misplaced (e.g., the mention evaluation in 3.3.6 seems to belong in 3.2). Conversely, the Orbis section, which is arguably the key contribution of this work, is much shorter and starts by presenting irrelevant information, such as the dark and standard viewing modes of the tool. It is problematic that the set of functionalities of the Orbis system is never described clearly (it is only buried inside the prose) or illustrated with a schema. Section 5 discusses explainability features, but it is vague on whether and how these work. This is likely to leave the reader understanding only a subset of the system's functionalities.

2) Evaluation - The second major issue with this paper is that it falls short on evaluation. Section 6 discusses the impact of Orbis on tool development, but all of this is nominal and vague, using terms like "helped a lot" and "A smaller number of lenses is preferred to a higher number, as nobody has enough time to examine too many views". The paper critically needs a formal evaluation, perhaps in the form of a user study on tool development that measures the extent to which the nominal claims hold in practice. This user study could compare the effectiveness/efficiency of tool development with Orbis vs. the other tools in Table 8.

In addition, the paper would really benefit from a set of use cases that motivate different aspects of the tool and show how the Orbis tool fulfills them, ideally in comparison to other tools that don't.

Other remarks:
* Another issue is the terminology used. The term 'explainable' is overloaded nowadays, and my initial expectation was that the paper provides explanations for the decisions made by the human and the automated annotators. This is entirely outside the scope of this paper. What is meant by 'explanation' is much weaker - the framework provides an analysis of the system predictions along several dimensions, such as entity types.
* The related work section should include a comparison to Orbis. This could mean moving the tool comparison from later in the paper into Section 2.
* The columns in Table 8 should be explicitly described.
* In 5.2 - what does it mean that an error is caused by a KG or by a dataset? This sounds weird to me.
* Please be precise in the writing, e.g., : "In contrast ..., the development of Orbis was also guided by the desire to provide potent means for understanding evaluation results which in turn ..."
* What is "virtual query"? What does it mean to be "solely based on the correctness of the outcome"?
* In the definition in 3.4.1 - why are the number of entities and the number of strings both denoted by i?
* Section 3 could be shortened to focus on the relevant pieces. Much of the discussion is already present in prior work, e.g., Ling et al. (2015) and van Erp et al. (2016).
* What are candidate mentions that do not refer to a named entity (sec 3.3.1), and why are they described under NEL and not NER?

Review #2
By Emir Muñoz submitted on 12/Apr/2022
Suggestion:
Accept
Review Comment:

This paper introduces Orbis, an extensible explainable benchmarking framework, which supports users in four Information Extraction tasks:
- Content Extraction (CE),
- Named Entity Recognition (NER),
- Named Entity Linking (NEL), and
- Slot Filling (SF).

The authors build upon previous work published at SEMANTiCS 2018, ASIS&T 2019, RANLP 2019, and WI-IAT 2020, adding substantial new material to this manuscript to support the tasks mentioned above. Orbis is an extensible framework with a UI to help researchers benchmark different IE tasks.

This contribution provides the right level of detail to understand the evaluation approaches for each of the IE tasks and how an explanation could be provided for the evaluations. The rest of the paper proceeds by stating the problem for each task, defining the metrics and the benchmarking issues involved, and comparing Orbis against other state-of-the-art frameworks.

To my taste, Section 4 focuses more on the functionalities of Orbis than on the architecture. I’d recommend the authors provide a bit more detail on the actual software architecture.

Finally, in general, the manuscript is very well written and provides enough information about Orbis to potential users looking to benchmark IE tasks. I enjoyed reading it; however, I have the following minor comments that I’m sure the authors can address without the need for another round.

- Abstract: “the target knowledge graph more explicit” -> I’m not sure what the authors mean by making the KG more explicit. Could you revise this sentence?
- P1: “the web” -> “the Web”
- P1: “these tasks (e.g., named” -> this parenthesis is never closed
- P2: There are missing references for DBpedia and Wikidata when they are first mentioned.
- P2: “Past competitions have been instrumental” -> could you mention some of these competitions for completeness of the text?
- Throughout the document, there are many missing commas, especially in sentences containing “which”. Check the grammatical rules.
- P3: “black box evaluation” -> “black-box evaluation” missing the dash
- P3: “DS errors” -> the abbreviation is not introduced yet, but only in section 5.2 P10
- P4: Explainability is mentioned as key in this paper, but I’d like to see a sentence with the take of the authors on how they understand this concept.
- P4: “Content extraction addresses this issue by identifying and extracting relevant content from web sources in a form suitable for the subsequent processing steps.” -> I’m not sure why this description is limited to only web sources and not generic sources as that’s how it’s used later.
- P4: Equation (3) denominator is T^e_i, but I believe you meant T^r_i referring to the extracted tokens.
- P5: There are many schema prefixes used in the text, but no definition of them. I’d recommend either adding a table somewhere or providing a link like https://prefix.cc
- P6: There seems to be the assumption that all KGs are published as RDF. This is not fully correct. Could you clarify that in the text?
- P7: “per:parents has a cardinality of two and accepts entities of type Person” -> there seems to be an underlying assumption here that the KG is complete; otherwise, the cardinality should be up to two, with zero, one, or two as valid values. Could you clarify that?
- P7: Equation (12) for the F1 score is wrong and misses a factor of 2 (see the reference formula after this list).
- P8: “Orbis’ architecture was developed around the idea of flexibility.” -> there are several mentions of Orbis’ architecture but this is not provided in the document. I believe an architecture diagram could help the reader’s understanding.
- P8: “using pip or other Python installation methods” -> this is the first time that a programming language is mentioned. It would make more sense to describe the programming language(s) used to implement Orbis earlier. Also, it would be relevant to mention which version of Python is required; at least mention the minimum supported version, for reproducibility.
- P8: Table (1), could you mention the relevant references for each one of the datasets?
- P14: Figure (4), the contrast between the black text and the dark blue highlight isn’t great. It will get worse if the paper is printed in B&W. I’d recommend checking other colours.
- P14: Table (7), the column separation for the three categories is not very clear. I'd recommend adding some (vertical) division lines between them.
- P17: the third paragraph consists of a single sentence. I’d recommend breaking that paragraph down into at least two sentences.
- P17: the URL https://epoch-project.eu is not well written.
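
For reference regarding the Equation (12) comment above: the standard F1 score is the harmonic mean of precision and recall (denoted P and R here), and the harmonic mean is where the missing factor of 2 comes from. This is a minimal restatement of the textbook definition, not the paper's own notation:

$F_1 = \frac{2 \cdot P \cdot R}{P + R} = \left( \frac{P^{-1} + R^{-1}}{2} \right)^{-1}$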

Review #3
By Alessandro Russo submitted on 15/Jun/2022
Suggestion:
Major Revision
Review Comment:

In this paper the authors present Orbis, a system and framework for evaluating and benchmarking information extraction pipelines on textual data, with a focus on content extraction, named entity recognition and linking, and slot filling. The paper addresses a relevant problem and concrete needs and use cases, and the overall work is sufficiently motivated and framed in the context of the state of the art. Overall, the paper is well structured and generally well written. The tool and related material are available on GitHub and meet all the requirements for the submission. However, I would have expected a publicly available instance of the system (at least for demo/illustrative purposes).

As a "full paper" (as this work is submitted), this work does not include strong research results and contributions. On the other side, as a "Reports on tools and systems" it would probably not yet meet the maturity and impact requirements. Nevertheless, the work has a good potential and can be improved through a careful revision.

This work clearly builds on and extends the authors' previous work. While a previous paper is only briefly mentioned in the submitted work (sec 6.1, close to the end of the paper), I believe this should be explicitly mentioned in the introduction, clearly stating how the previous work was extended and what the new, original contributions of this submission are with respect to the previous papers.

Although the related work presented in section 2 does not explicitly include a comparison between the mentioned tools and Orbis, such a comparison is provided in section 5.1. It would be useful for the reader to add a pointer to it in section 2 (i.e., mention at the end of sec. 2 that a detailed comparison with other tools is provided in sec. 5.1). Otherwise, as it is, the related work section reads as a description of other approaches without any hint or statement on how they compare to your framework.

The attempt made in section 3 to formalise and systematise the definitions of the four information extraction tasks and related metrics is valuable and provides the background to understand the challenges that Orbis aims at addressing. However, my overall impression is that there is an "overuse" of notation, in particular in terms of subscripts/superscripts that do not help readability and are not used consistently. Most of the formulae introduced in section 3 serve no function in the rest of the paper (they are neither used nor referenced elsewhere) or represent well-known metrics or measures (precision, recall, F1, etc.). I therefore suggest carefully reviewing section 3 (as well as any other part of the paper that may be impacted) so that the notation used is consistent and correct, also taking into account the following comments and suggestions.

* In all definitions and formulae, review the usage of the "i" subscript. In some cases it seems to refer to the i-th document (document d_i in the definition of CE; page 4, second column, line 11) and then appears in all subsequent "entities"; my impression is that you can simply refer to a document "d" and remove all those "i"s. In section 3.2 basically everything has an "i", so the meaning is unclear: either I misunderstand the symbols, or the same "i" cannot be used to index the string extracted from the document, the surface form of the entities, the entity type, and the variables for the start/end position. Similar observations hold for the definition in section 3.3.1, where "i" seems to refer to the i-th entity in a KG but is also used for the surface form (in a document d, thus no longer d_i) and for the variables for the start/end position of the mention.

* Make explicit the intended meaning of other superscripts used: in section 3.1, it seems that "r" stands for "relevant" and "g" for "gold"; in section 3.3.6 my understanding is that "c" stands for "corpus" and "s" for "system" (although this is not clear when those symbols are first introduced).

* In formula (3), the denominator should be |T_i^r|

* Are formulae (6) and (7) really useful?

* Section 3.2 on named entity recognition does not go beyond the problem statement. What about evaluation metrics, as discussed for the other tasks? Are there any benchmarking issues or other challenges, as discussed for NEL?

* In general, where possible, providing an example for each of the definitions would be useful and would improve the understandability.

The discussion of the architecture in section 4 should be reviewed and improved, maybe starting from the authors' previous work ("Odoni et al., On the Importance of Drill-Down Analysis for Assessing Gold Standards and Named Entity Linking Performance, SEMANTiCS 2018 / Procedia Computer Science 137:33-42, 2018"). In that work, for example, you have a clear figure with an architectural diagram, which is missing in your submission and would greatly improve the understandability of the system. In particular, section 4.1 mixes different perspectives (functional pipeline, implementation details, details on the visual interface) without any clear structure. For example: how do the two viewing modes (standard and dark, mentioned at page 8, first column, lines 17-18) relate to the pipeline? Why is this detail concerning the UI mentioned there? Similarly, after presenting the three stages, the possibility to install external packages using Python is described... but you never even mention that the system is implemented in Python. I suggest that you restructure the section so that design choices, the architectural model (with a diagram) and implementation details are clearly presented in a consistent way.

In the discussion in section 5, the clear task-oriented structure (CE, NER, NEL, SF) of section 3 is a bit lost. The focus seems to be mainly on NEL: error analysis and lenses basically target this task only. I understand from section 3 that the main challenges are related to NEL, but it is unclear if and how the system can concretely help in the case of CE, NER and SF, beyond having side-by-side views of the gold standard and the annotators' results (which per se is still valuable). Section 5.1 and Table 3 on the FAIR principles are not very convincing and additional details are needed. In particular:
- Which community-based, open vocabularies are used wrt I2 in the "Interoperable" section (page 11, line 21)?
- What are the TAC standards mentioned for R1 in the "Reusable" section (line 24)?
- Concerning R1.3, I do not understand the meaning of "Covers a superset of domain-relevant data".
- I'm also a bit surprised to see JSON mentioned as a "formal, accessible, shared, and broadly applicable language for knowledge representation" (I1).
Having some IDs and using HTTP and JSON is not enough to truly meet the FAIR principles.
Please carefully rethink your adherence to the FAIR principles and, if you believe you really implement them, provide a clear and convincing explanation for each of them.

Although evaluating a system like Orbis is not easy, the paper does not include any evaluation at all. The paper includes several statements on the added value of the system (e.g., "These tools aid experts in quickly identifying shortcomings within their methods and in addressing them."; "Orbis significantly lowers the effort required to perform drill-down analysis which in turn enable researchers to locate a problem in algorithms, machine learning components, gold standards and data sources more quickly, leading to a more efficient allocation of research efforts and developer resources."), but there is no evidence supporting them. Who has been using Orbis beyond the authors? Have you collected any feedback from those "experts"? Have you performed any kind of evaluation (even a simple System Usability Scale questionnaire)? The impact section (6.1) does not help in addressing these questions. While it is intuitively true that the system can help in addressing some issues, strong or vague statements ("significantly lowers the effort...", "quickly identifying...", "more efficient allocation...") should be supported by measurable criteria or other kinds of evidence. Do you at least have a plan for its uptake and systematic evaluation?

---

Typos and minor comments

- page 1, second column, line 44: the parenthesis before "e.g." is not closed (but it can be replaced by a comma)
- page 3, second column, line 39: "DS errors" - DS as abbreviation for dataset is first mentioned on page 10
- page 8, second column, line 49: "wrapper" --> "wrappers"
- Table 3: "Accesible" --> "Accessible"
- page 17, first column, line 47: the project's url