LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain

Tracking #: 2816-4030

Authors: 
Vasile Pais
Maria Mitrofan
Carol Luca Gasan
Alexandru Ianov
Corvin Ghiță
Vlad Silviu Coneschi
Andrei Onuț

Responsible editor: 
Harald Sack

Submission type: 
Dataset Description
Abstract: 
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time expressions and legal resources mentioned in legal documents. Furthermore, GeoNames identifiers are provided for location entities where linking was possible. The resource is available in multiple formats, including span-based, token-based and RDF. The Linked Open Data version, in RDF Turtle format, is available both for download and for querying via a SPARQL endpoint.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Jul/2021
Suggestion:
Major Revision
Review Comment:

The manuscript “LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain” is a data description paper. It describes the LegalNERo dataset, which is a manually annotated corpus for named entity recognition in the Romanian legal domain. The dataset is made available through Zenodo and also through a SPARQL endpoint hosted on a server of the research centre in which the dataset was developed. The dataset itself is available in a number of different formats including BRAT, RDF Turtle and CoNLL-U Plus.

The paper follows the typical structure of a dataset paper (Introduction, Related Work, Annotation Process, Corpus Description, Using the RDF Version, Corpus Usage, Conclusions). All in all, the descriptions are sufficient to enable re-use and adaptation of the dataset.

There are, however, a number of issues with this data description paper.

First and foremost, while the title of the paper implies a certain relationship to the legal domain, neither the paper nor the dataset specifically tackles any aspect of NER in the legal domain beyond adding one entity category, “legal resources” (or “legal document references”), to the typical person/organisation/location inventory. In the Related Work section the authors cite and acknowledge the many entity types that are specific to the legal domain, but they decided not to introduce any of these specific types or categories themselves (except, as mentioned, “legal document references”). This decision needs to be explained and motivated, since the paper, and especially the dataset, claims a focus on the legal domain that does not exist beyond the source documents, which are indeed legal texts. The size of the dataset is also rather small (370 documents with 265k tokens, 8k sentences and a total of 54k annotated tokens).

In various parts of the paper, “time” is mentioned as an entity type. Time expressions are simply time expressions, not named entities. Since explicit time expressions can be recognised easily with a handful of regular expressions (much like currency expressions), they are often implemented in NER tools, but it is simply incorrect to call them “named entities”.
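As a minimal illustration of this point (a sketch, not taken from the paper under review), a couple of regular expressions already capture many explicit Romanian date expressions:

```python
import re

# Illustrative patterns for explicit Romanian date expressions,
# e.g. "12 octombrie 2001" or "23.10.2001".
MONTHS = ("ianuarie|februarie|martie|aprilie|mai|iunie|iulie|"
          "august|septembrie|octombrie|noiembrie|decembrie")
DATE_PATTERNS = [
    re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b", re.IGNORECASE),
    re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
]

def find_time_expressions(text):
    """Return all substrings matched by the date patterns."""
    return [m.group(0) for p in DATE_PATTERNS for m in p.finditer(text)]

print(find_time_expressions(
    "Legea nr. 544 din 12 octombrie 2001, publicată la 23.10.2001"))
# -> ['12 octombrie 2001', '23.10.2001']
```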

Some additional comments:

Page 1, line 27: I suggest changing “international project” to “EU project”.

Page 1, line 29: please explain “comparable” in the context of this paper (or delete it)

Page 1, line 30: “7 languages” should be written as “seven languages”. There are other sentences in which single-digit numbers are written as digits – these should all be spelled out as words (see, among others, page 2, lines 10 and 20).

Page 1, lines 38/39: “All these annotations were realised using automatic processes.” My understanding of the paper is that all annotations were performed by human annotators.

Page 1, 2, lines 45 ff.: The relevance of this paragraph in Section 1 is unclear. It should probably be moved into Section 2.

Page 2, lines 21 ff.: Section 2 is missing in the summary paragraph.

Page 2, line 23 ff.: Please include a link to the annotation guidelines or include the annotation guidelines in the dataset on Zenodo.

Page 3, lines 1/2: I don’t understand why hiding certain information about the annotation process helps with the computation of the inter-annotator agreement.

Page 3, lines 15 and 32: It’s “Cohen’s Kappa” (not Coehn’s Kappa)

Page 3, line 17: The Cohen’s Kappa of 0.87 is surprisingly low given that the annotation task is so simple. The revised value (0.89) is still low, so where exactly do the actual disagreements occur?
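One way to locate such disagreements (a hedged sketch, not from the paper; the label set mirrors the paper’s five classes, the data is toy data) is a per-label confusion matrix between an annotator pair:

```python
from sklearn.metrics import confusion_matrix

LABELS = ["O", "PER", "LOC", "ORG", "TIME", "LEGAL"]

# Toy token-level labels from two annotators over the same tokens.
annotator_a = ["O", "LEGAL", "LEGAL", "ORG", "LOC", "TIME", "O"]
annotator_b = ["O", "LEGAL", "O",     "ORG", "ORG", "TIME", "O"]

# Rows: annotator A's labels; columns: annotator B's labels.
cm = confusion_matrix(annotator_a, annotator_b, labels=LABELS)
for label, row in zip(LABELS, cm):
    print(f"{label:>6}: {row}")
```

Off-diagonal cells then point directly at the confusable classes (here LEGAL vs. O and LOC vs. ORG).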

Page 6: In Section 6 the development of various NER models is mentioned but the evaluation of these models is missing.

Finally, the paper needs a thorough round of revision; there are many typos, missing words, etc.

Review #2
By Enrico Francesconi submitted on 14/Jul/2021
Suggestion:
Major Revision
Review Comment:

This paper presents LegalNERo: a manually annotated corpus for named entity recognition in the Romanian legal domain.
Annotation was performed by five human annotators, under the supervision of two senior researchers.
Five classes are considered: person (PER), location (LOC), organization (ORG), time (TIME) and legal document references (LEGAL).
Each annotator was given instructions on how to annotate the documents.

Each annotator was assigned 100 documents; 30 of these were also shared with two other annotators. This was hidden from the annotators during the process but allowed the authors to later compute inter-annotator agreement.
Actual annotation was handled using the BRAT annotation tool.
Inter-annotator agreement between each pair of annotators is assessed using Cohen’s Kappa. This made it possible to detect recurring mistakes by some of the annotators.
A tool to reconcile differing annotations was created.
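As an aside (a sketch under assumed BIO token labels, not code from the paper), pairwise Cohen’s Kappa over the shared documents can be computed directly from the two annotators’ token-level label sequences, e.g. with scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Token-level BIO labels from two annotators for the same tokens
# (toy data; the real corpus uses PER/LOC/ORG/TIME/LEGAL spans).
annotator_a = ["O", "B-LEGAL", "I-LEGAL", "O", "B-LOC", "O"]
annotator_b = ["O", "B-LEGAL", "I-LEGAL", "O", "O",     "O"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")
```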

UDPipe was run on the text files for automatic operations such as tokenization, lemmatization, part-of-speech tagging and dependency parsing; the resulting files are in CoNLL-U format.
The initial annotations (BRAT and CoNLL-U Plus) were converted to RDF, suited to applications exploiting linked data.
Location entities are resolved to real-world places using geographical databases such as GeoNames, while legal references are described using the European Legislation Identifier (ELI).
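For orientation only, a minimal sketch of that processing step via the ufal.udpipe Python bindings; the Romanian model file name is an assumption, since the paper does not name the exact model:

```python
from ufal.udpipe import Model, Pipeline, ProcessingError

# Assumed model file; any Romanian UD model for UDPipe would do.
model = Model.load("romanian-rrt-ud-2.5-191206.udpipe")

# Tokenize raw text, tag and parse with the model defaults, and
# emit CoNLL-U (lemmas, UPOS tags, dependency relations).
pipeline = Pipeline(model, "tokenize",
                    Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
error = ProcessingError()
print(pipeline.process("Legea nr. 544 din 12 octombrie 2001.", error))
```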

Statistics on the entities extracted are given.

An ontology for distributing the extracted data as linked data has been developed: it is based on the NLP Interchange Format (NIF), on POWLA (which specifies the "document layers" containing the actual annotations), and on the NERD (Named Entity Recognition and Disambiguation) ontology.
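To make the pattern concrete, here is a hedged rdflib sketch of a single linked annotation combining a NIF string, a POWLA parent link and an ITS-RDF link to GeoNames (the URIs are hypothetical and the exact classes and properties of the LegalNERo ontology may differ):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

NIF = Namespace("http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")
POWLA = Namespace("http://purl.org/powla/powla.owl#")

g = Graph()
for prefix, ns in [("nif", NIF), ("itsrdf", ITSRDF), ("powla", POWLA)]:
    g.bind(prefix, ns)

# Hypothetical URIs: a mention of "București" inside a document layer.
doc = URIRef("http://example.org/legalnero/doc1")
mention = URIRef("http://example.org/legalnero/doc1#char=120,129")

g.add((mention, RDF.type, NIF.String))
g.add((mention, NIF.anchorOf, Literal("București")))
g.add((mention, POWLA.hasParent, doc))
# 683506 is Bucharest's GeoNames identifier.
g.add((mention, ITSRDF.taIdentRef, URIRef("https://sws.geonames.org/683506/")))

print(g.serialize(format="turtle"))
```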

While the NER procedure uses available tools, the core contribution of the paper seems to be the definition of the previously mentioned ontology. However, the paper fails to provide details of this ontology (design choices, relevant relations between classes) beyond the schema reported in Fig. 1. Such a description is needed for dataset reuse. Moreover, the description of the mapping criteria between the RDF format derived from the CoNLL annotation and the ontology for the linked data representation seems to be missing.

Review #3
Anonymous submitted on 14/Mar/2022
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies, ideally using the 5-star rating provided here. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

Summary:
The paper presents LegalNERo, a manually annotated corpus for named entity recognition in the Romanian legal domain. The authors consider classical entity types such as organizations, persons, locations and time expressions, together with legal references to documents such as laws, government decisions, orders, etc. The corpus, consisting of 370 documents, was manually annotated by five human annotators. Each annotator was assigned 100 documents, out of which 30 were shared with two other annotators. Inter-annotator agreement (IAA) between each pair of annotators was calculated using Cohen’s Kappa. The corpus is made available as raw text, span-based annotations, token-based annotations, and linked data in RDF.
The idea of NER in the legal domain is interesting; however, there are certain concerns:

1. The paper lacks a running example, which reduces the readability and understandability of the proposed work. I would recommend that the authors provide an excerpt from an original document and show the annotated named entities in the text (probably with the corresponding English translation of the sentences and the annotated named entities, because the original documents are in Romanian).
2. The same or a different example should be used to explain the span-based and token-based annotations.
3. Since the authors provide the raw corpus as well, it would be recommended to compare the accuracy, precision and recall of annotations produced by existing NER tools. This would further help to motivate the need for manual annotation of a legal corpus.
4. The RDF linked data needs further explanation along with Fig. 1. The domain and range of the RDF schema are not explained, and the different classes are not explained well. For example, what does powla:Node denote? Give an example that also explains its properties hasParent, next, etc. I would therefore highly recommend adding a section to the paper explaining each class, the properties, and the hierarchy of the classes. Furthermore, the SPARQL endpoint link provided on the webpage does not work, although the link with the example SPARQL queries works. The authors might need to cross-check that.
5. Moreover, the end goal of having the data in the form of Linked Data is missing. How can this data be used? What kind of interesting insights can we get from the data? The SPARQL queries provided are basic ones and fail to give any insight into the data; a sketch of a slightly more informative query follows this list.
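For illustration, a sketch of the kind of query that would give more insight (the endpoint URL is hypothetical and the vocabulary assumed; adapt both to the actual LegalNERo schema): ranking GeoNames locations by how often they are mentioned across the corpus.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint URL; substitute the actual LegalNERo endpoint.
sparql = SPARQLWrapper("http://example.org/legalnero/sparql")
sparql.setQuery("""
PREFIX nif:    <http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#>
PREFIX itsrdf: <http://www.w3.org/2005/11/its/rdf#>

SELECT ?geo (COUNT(?mention) AS ?mentions) WHERE {
  ?mention a nif:String ;
           itsrdf:taIdentRef ?geo .
  FILTER(STRSTARTS(STR(?geo), "https://sws.geonames.org/"))
}
GROUP BY ?geo
ORDER BY DESC(?mentions)
LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["geo"]["value"], row["mentions"]["value"])
```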