Paving the Way for Enriched Metadata of Linguistic Linked Data

Tracking #: 2884-4098

Authors: 
Maria Pia di Buono
Hugo Gonçalo Oliveira
Verginica Barbu Mititelu
Blerina Spahiu
Gennaro Nolano

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Full Paper

Abstract: 
The need for reusable, interoperable, and interlinked linguistic resources in Natural Language Processing downstream tasks is demonstrated by the increasing efforts to develop standards and metadata suitable for representing several layers of information. Nevertheless, despite these efforts, full compatibility of metadata in linguistic resource production is still far from being achieved. Access to resources observing these standards is hindered by (i) lacking or incomplete information, (ii) inconsistent ways of encoding their metadata, and (iii) lack of maintenance. In this paper, we offer a quantitative and qualitative analysis of the descriptive metadata and availability of resources in two main metadata repositories: the LOD Cloud and Annohub. Furthermore, we introduce a metadata enrichment, which aims at improving resource information, and a metadata alignment to the META-SHARE ontology, suitable for easing the accessibility and interoperability of such resources.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Frank Abromeit submitted on 10/Oct/2021
Suggestion:
Major Revision
Review Comment:

With the revision of the paper, the authors present the results of the metadata alignment process for the language resource metadata from the Linked Open Data Cloud (https://lod-cloud.net) and Annohub (https://annohub.linguistik.de/) as a CSV file available at
https://github.com/unior-nlp-research-group/melld.

I evaluated the data in the CSV file against the latest Annohub RDF data dump (see link above).
First of all, I have to report that the authors apparently built their analysis on an outdated version of Annohub.
(The latest Annohub data was released 10/2020 and can be obtained at https://annohub.linguistik.de/.)

MELLD.csv evaluation
====================
The MELLD.csv file includes 666 records, of which 461 are taken from Annohub and 205 from the
https://www.lod-cloud.net website. My evaluation shows that the metadata in the MELLD CSV file is clearly a subset of the latest Annohub data, since 74 datasets are completely missing.
Otherwise, the language metadata is identical, with very few changes (0.06%).
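For reference, a minimal sketch of the kind of comparison performed here (not the exact script I used; the MELLD column names, the use of dct:language, and matching records by title are assumptions):

```python
# Sketch only: compare language assignments in MELLD.csv against the
# Annohub RDF dump. Column names ("title", "language") and the use of
# dct:language, as well as matching records by title, are assumptions.
import pandas as pd
from rdflib import Graph
from rdflib.namespace import DCTERMS

melld = pd.read_csv("MELLD.csv")

g = Graph()
g.parse("annohub-dataset.ttl", format="turtle")  # latest Annohub dump

# collect the set of languages per Annohub record
annohub = {}
for record, lang in g.subject_objects(DCTERMS.language):
    annohub.setdefault(str(record), set()).add(str(lang))

for _, row in melld.iterrows():
    langs = {l.strip() for l in str(row["language"]).split(";")}
    # ... match the row to an Annohub record (here: by title, since MELLD
    # has no back-links) and report languages missing on either side
```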

Summary
=======
Annohub records in the latest RDF dump: 535
Annohub records in MELLD.csv: 461

1) Identical languages for 451 records
2) Differences in languages for only 10 records
3) Of the 10 Annohub datasets in MELLD that have a different language assignment:
- 8 times a single language was missing in Annohub
- 1 time a single language was missing in MELLD
- 1 time two languages were missing in MELLD
4) The total number of languages in Annohub records that appear in MELLD is 13080. Therefore, the rate of differing language metadata is only about 0.06 percent.
5) Four Annohub records in MELLD appear under a different name in the latest Annohub release.
6) For all 461 Annohub datasets in MELLD an ORCID identifier could be assigned.
More detailed results can be found in appendices A and B.

Similarly, I compared the data from https://www.lod-cloud.net with the data in MELLD. I identified essentially 6 metadata
types that are not present in the original lod-cloud data.

1) Language (e.g. Basque), (displayed in the HTML at https://lod-cloud.net,
but not included in the lod-cloud JSON export)
2) ORCID id
3) LCRsubclass (e.g. lexicons and dictionaries)
4) META-SHARE property (float value, e.g. 813.0)
5) distributionLocation/comment (available yes/no)
6) accessibleThroughQuery (available yes/no)

On the other hand, some metadata from lod-cloud was pruned, for example keyword information, but also several other pieces of information, such as citation info, which I could not find in the CSV.

In general, the data in the CSV file is sparse, since only 43% (8000/18648) of all possible attributes
are filled with values.

In the data model of MELLD I found several issues. Foremost, I regard the absence of references to the original
metadata records of lod-cloud and Annohub as a fatal error. Following Linked Data principles, I suggest linking
entries in the MELLD dataset to the specific resource entries in the catalogs they refer to, like Annohub, lod-cloud,
Linghub, etc., because these provide much other useful metadata. Also, the language information is not
represented as a URL or ISO code, but simply as plain text. Finally, a complex datatype for modelling the
(size/amount) of a resource is used. In the respective column, triple counts were encoded, but I suspect it
is used for file sizes as well (as the name suggests).
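To illustrate the last two points, a minimal sketch of what I would expect instead; all IRIs below are hypothetical placeholders:

```python
# Sketch: link each MELLD entry back to its source catalog record and use a
# language IRI (here Lexvo/ISO 639-3) instead of the plain-text name "Czech".
# The MELLD namespace, entry IRI and source record IRI are hypothetical.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import DCTERMS

MELLD = Namespace("https://example.org/melld/")
entry = MELLD["aspac-swedish-czech"]

g = Graph()
# back-link to the original catalog record (Annohub, lod-cloud, Linghub, ...)
g.add((entry, DCTERMS.source,
       URIRef("https://example.org/annohub/record/aspac-swedish-czech")))
# language as an IRI instead of the literal "Czech"
g.add((entry, DCTERMS.language,
       URIRef("http://lexvo.org/id/iso639-3/ces")))
print(g.serialize(format="turtle"))
```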

Conclusion
==========
Despite the additions, like adding ORCID iDs and checking SPARQL endpoints / the availability of datasets, etc.,
most of the metadata in MELLD is simply a copy of already existing metadata from Annohub and lod-cloud.
The benefit of the resulting dataset is therefore questionable. Also, I would not regard the approach described in the paper as best practice. Instead of creating a compilation (of metadata), I would rather like to see the existing metadata from https://www.lod-cloud.net converted to RDF. This would allow linking its metadata (to Annohub) and other LLOD datasets, but also querying it via SPARQL.
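As a sketch of the conversion I have in mind (the export URL and the JSON field names are assumptions based on the public dump and should be verified):

```python
# Sketch of the suggested conversion: lod-cloud JSON export -> RDF (DCAT),
# which would make the metadata linkable and queryable via SPARQL. The
# export URL and the JSON field names are assumptions.
import json
import urllib.request

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

with urllib.request.urlopen("https://lod-cloud.net/lod-data.json") as f:
    datasets = json.load(f)

LODC = Namespace("https://lod-cloud.net/dataset/")
g = Graph()
for key, meta in datasets.items():
    ds = LODC[key]
    g.add((ds, RDF.type, DCAT.Dataset))
    if meta.get("title"):
        g.add((ds, DCTERMS.title, Literal(meta["title"])))
    if meta.get("website"):
        g.add((ds, DCAT.landingPage, URIRef(meta["website"])))

g.serialize("lod-cloud.ttl", format="turtle")
```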

Overall, I found the paper to be very informative. The analysis of the LOD (LLOD) cloud provides valuable insights
about available linguistic Linked Data resources. In particular, it reveals shortcomings, such as underrepresented
languages or the problem of the unavailability of resources due to broken links or unavailable services.

Formal issues:
===============
p3, footnote 24, link is not available
https://ckan.org/datahub/

p5, right column 24
"The main reason for choosing these repositories" (over which other repositories?)
A short overview of other available language resource providers might be useful,
for example http://www.meta-share.org/.

p4, right column 26, check spelling
However the attempts, none of the approaches was able to correct and ...

p6, footnote 39, check spelling
There were only 133 resources ...

p6, right column 7, check spelling
Annohub also comes with tools for type of resources, language and annotation model detection from the resource content
and even encodes metadata in linked data format ->
Annohub also comes with tools for resource type, language and annotation model detection and represents all generated
metadata as RDF.

p7, right, column 37
Information on annotation models, languages and resource types is encoded in dedicated RDF properties
in the Annohub metadata. In rare cases some information gain can be achieved by harvesting the description info,
for example if an appropriate OLiA annotation model is not available for a certain tagset,
e.g. for "Interset interlingua for morphosyntactic tagsets", as described in the example.

p9, left, column 27, check spelling

p10, left, column 28, check spelling
In addition to this, in some cases, when available, the resources content has also been considered ...

p11, table 1, The following metadata is included in Annohub, but marked as not available in the table:
1. metadataRecordIdentifier : obviously each RDF resource record is identified by its unique URL.
Although this is not an explicit property, it should be mentioned in the table.
2. ontology : for each Annohub resource the URL of a used annotation scheme is encoded with a dedicated
RDF property. Its value is the URL of the ontology used for annotations, e.g.
http://svn.code.sf.net/p/olia/code/trunk/owl/stable/suc-link.rdf.
3. size/amount : dct:bytesSize
4. contactEmail : vcard:hasEmail
5. downloadLocation : dcat:accessURL
6. accessibleThroughQuery : as a remark, as of now, none of the Annohub resources has a designated
SPARQL endpoint.
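These can be verified directly against the dump, e.g. with a query like the following sketch (property IRIs as named above; dct:bytesSize in particular should be double-checked, since DCAT defines dcat:byteSize):

```python
# Sketch: list occurrences of the properties mentioned above in the dump.
from rdflib import Graph

g = Graph()
g.parse("annohub-dataset.ttl", format="turtle")

q = """
PREFIX dct:   <http://purl.org/dc/terms/>
PREFIX dcat:  <http://www.w3.org/ns/dcat#>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
SELECT ?record ?p ?o WHERE {
  ?record ?p ?o .
  FILTER (?p IN (dct:bytesSize, vcard:hasEmail, dcat:accessURL))
}
"""
for row in g.query(q):
    print(row.record, row.p, row.o)
```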

Appendix A
Records in Annohub and MELLD.csv with different language assignment:
1 title :ASPAC – Swedish-Lower Sorbian (2017-10-16); ASPAC – svenska-lågsorbiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Lower Sorbian] error
notMelld : []

2 title :ASPAC – Swedish-Czech (2017-10-16); ASPAC – svenska-tjeckiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Czech] error
notMelld : []

3 title :ASPAC – Swedish-Macedonian (2017-10-16); ASPAC – svenska-makedonska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Macedonian] error
notMelld : []

4 title :EMEA
#languages in Melld 1
#languages in Annohub 2
notInAnnohub : []
notMelld : [German] error

5 title :ASPAC – Swedish-Greek (2017-10-16); ASPAC – svenska-grekiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Modern Greek (1453-)] error
notMelld : []

6 title :ASPAC – Swedish-Molise Slavic (2017-10-16); ASPAC – svenska-moliseslaviska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Molise Slavic] error
notMelld : []

7 title :ASPAC – Swedish-English (2017-10-16); ASPAC – svenska-engelska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ English] error
notMelld : []

8 title :ASPAC – Swedish-Bulgarian (2017-10-16); ASPAC – svenska-bulgariska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Swedish] error
notMelld : []

9 title :ASPAC – Swedish-Croatian (2017-10-16); ASPAC – svenska-kroatiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Croatian] error
notMelld : []

10 title :Freedict RDF dictionary Afrikaans-English
#languages in Melld 1
#languages in Annohub 2
notInAnnohub : [Modern Greek (1453-)] ok
notMelld : [English, Afrikaans] error

Appendix B
Records in MELLD that appear under a different name in the latest Annohub release:

1) DBnary - Wiktionary as Linguistic Linked Open Data (English Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (English Edition w. Morphology)

2) DBnary - Wiktionary as Linguistic Linked Open Data (Serbo-Croatian Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (Serbo-Croatian Edition w. Morphology)

3) DBnary - Wiktionary as Linguistic Linked Open Data (German Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (German Edition w. Morphology)

4) DBnary - Wiktionary as Linguistic Linked Open Data (French Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (French Edition w. Morphology)

Review #2
By Manuel Fiorelli submitted on 21/Oct/2021
Suggestion:
Minor Revision
Review Comment:

After reading the cover letter and the revised manuscript, I am generally satisfied with how the authors have addressed my concerns about the original submission. Nonetheless, I still have some remarks that might require another minor revision of the paper.

The authors – answering a question of mine – claimed that their findings generally confirm the results of related work, and that they tried to underline what was done to solve some of the shared issues. I think that this claim should be stated more explicitly in the manuscript.

Concerning the reproducibility of the work, I still cannot find the reference to the version of AnnoHub used in the research. Looking at the AnnoHub website, I could only find this link https://annohub.linguistik.de/archive/2020-03-30/annohub-dataset.zip (in addition to this “versionless” address http://annohub.linguistik.de/annohub-dataset.zip).

I am pleased by the additional details on the automatic enrichment procedure. Still, the authors did not provide the source code. If this is intentional, I think that the authors should state it explicitly in the manuscript.

Moreover, I was wondering whether the output of the automatic enrichment has been validated/refined by humans. In particular, I do not know how the authors addressed the (potential) ambiguity of names when looking up ORCID. Actually, the authors mentioned a manual enrichment step, just to add missing information.
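To illustrate the ambiguity issue (the authors' actual procedure is not published; this sketch merely shows how many candidates a plain name search against the public ORCID API can return):

```python
# Sketch only: how ambiguous a plain name lookup against the public ORCID
# search API can be. This is not the authors' enrichment procedure.
import json
import urllib.parse
import urllib.request

def orcid_hits(given_names: str, family_name: str) -> int:
    query = urllib.parse.quote(
        f'given-names:{given_names} AND family-name:{family_name}')
    req = urllib.request.Request(
        f"https://pub.orcid.org/v3.0/search/?q={query}",
        headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("num-found", 0)

# A common name returns many candidate iDs, so disambiguation needs extra
# evidence (affiliation, co-authors) or manual validation.
print(orcid_hits("John", "Smith"))
```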

Concerning metadata alignment/mapping, I am satisfied by the table just added to the manuscript to make the alignment explicit. Still, I have noticed that downloadLocation is not mapped. I am sure that the LOD Cloud provides the addresses of data dumps, and probably AnnoHub does as well. ORCID is not mapped either, although in the section on metadata enrichment the authors explain how to derive it from author/contact names.

My biggest concern is about the produced enriched metadata and the modelling decisions, or rather the lack of clarity about the latter. The authors have added the URL of the resource, which however is only available as CSV. I was expecting one or more RDF files. Moreover, it is difficult for me to understand the (implicit) RDF representation of the data using the META-SHARE ontology.
The header of some columns contains prefixed names (e.g. dct:source), which can be mapped to actual URIs with an educated guess of the namespace associated with the prefix. The header of some columns contains an unprefixed name (e.g. LCRsubclass, language), making it difficult to determine the actual URI (e.g. LCRsubclass should belong to the META-SHARE ontology, but language could belong to DC or DCTERMS). If LCRsubclass seems to belong to the META-SHARE ontology, it is not clear why ms:resourceName has been explicitly prefixed (assuming that ms identifies the namespace of the META-SHARE ontology).
Language is identified by name, but depending on the actual property (DC vs DCTERMS) this may or may not be acceptable. In my opinion, language codes or language IRIs are more suitable for a catalog, while human-readable names should be provided as a rendering option in a user-facing interface to the catalog.
The authors should look at lingualityType to differentiate between monolingual, bilingual, multilingual, etc. resources (http://www.meta-share.org/ontologies/meta-share/meta-share-ontology.owl/...), instead of using multilingual as a value for the language column.
Using the accessibleThroughQuery and accessibleThroughQuery/comment fields is problematic in RDF. What is the value of the property? If it is a literal value, then it is not possible to add a comment. Otherwise, if it is a resource, the authors should show its structure more explicitly.
In fact, accessibleThroughQuery is not a property in the META-SHARE ontology but an instance to be used as a value for the property distributionForm.
Another problem with (the CSV distribution of) MELLD is related to the contact name and contactEmail columns: each column may have multiple values, and it is not clear how the correspondence between items in the two lists is made explicit. Looking at the dataset “Universal Dependencies Treebank Indonesian”, the number of contact emails is less than the number of contact names – something that could be explained if a contact has more than one address or an address is not bound to any contact name. In an RDF graph, I would have expected each “contact” to be a resource with a property “name” and “email”, or something like that.

In my opinion, the authors should make it explicit how the CSV table can be mapped to an RDF graph that uses the META-SHARE ontology and other vocabularies.
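For instance, a minimal sketch (not the authors' implementation; the ms: namespace and the local names contactPerson and distribution are my guesses and must be checked against the ontology) of how one row might look:

```python
# Sketch, not the authors' implementation: one possible RDF rendering of a
# MELLD row, addressing the points above. The ms: namespace and the local
# names contactPerson/distribution are assumptions; the entry IRI and
# contact details are hypothetical.
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

MS = Namespace(
    "http://www.meta-share.org/ontologies/meta-share/meta-share-ontology.owl#")
VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")

g = Graph()
res = URIRef("https://example.org/melld/ud-indonesian")

g.add((res, MS.resourceName,
       Literal("Universal Dependencies Treebank Indonesian")))
# lingualityType instead of "multilingual" in the language column
g.add((res, MS.lingualityType, MS.monolingual))

# each contact is a resource bundling a name and an email address
contact = BNode()
g.add((res, MS.contactPerson, contact))
g.add((contact, FOAF.name, Literal("Jane Doe")))
g.add((contact, VCARD.hasEmail, URIRef("mailto:jane.doe@example.org")))

# accessibleThroughQuery as an instance used as value of distributionForm
dist = BNode()
g.add((res, MS.distribution, dist))
g.add((dist, MS.distributionForm, MS.accessibleThroughQuery))

print(g.serialize(format="turtle"))
```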

Perhaps, the authors could compare their modelling decision against what is discussed in this reference:

Cimiano, P., Chiarcos, C., McCrae, J. P., & Gracia, J. (2020). Modelling Metadata of Language Resources. In Linguistic Linked Data (pp. 123-135). Springer, Cham.

Additional remarks.

The reference to LexInfo seems wrong to me and it should be replaced with this one:
Cimiano, P., Buitelaar, P., McCrae, J., & Sintek, M. (2011). LexInfo: A declarative model for the lexicon-ontology interface. Journal of Web Semantics, 9(1), 29-51. doi: 10.1016/j.websem.2010.11.001

I noticed the addition of Eclipse RDF4J, but the authors should consider explaining more clearly that both Jena and RDF4J are not specific triple stores but rather Java frameworks for RDF data processing that supply default implementations for data persistence. RDF4J, for example, provides two triple store implementations (i.e. the memory store and the native store), complemented by third-party triple stores compliant with this framework (e.g. Ontotext GraphDB).

Discussing Linked Data, the authors still use the word “principles” for things that, to my knowledge, are more commonly referred to as rules.

It seems to me that the authors forgot to revise the description of AnnoHub.

I am satisfied by the improvements related to Table 6. The authors should also mention that they ignored the actual LCRSubclass instances defined by META-SHARE itself (http://www.meta-share.org/ontologies/meta-share/meta-share-ontology.owl/...).

Concerning Table 8, the authors should make it explicit that the table only refers to the linguistic subset of the LOD cloud (if I am right).

Review #3
By Sebastian Hellmann submitted on 05/Nov/2021
Suggestion:
Accept
Review Comment:

Dear authors,

I checked all the comments from last review and they were addressed well, so I have no objections against publishing this paper as is.

In particular, the relation between Linked Data and FAIR is well described on page 2, lines 12ff. We recently published a position paper that focuses largely on a technical vision: https://dl.acm.org/doi/10.1145/3442442.3451364 https://svn.aksw.org/papers/2021/sci-k_fair-linked-data/public.pdf
I think that you are more concise when you write that LD shares the FAIR goal (and has also achieved it to some degree), but the metadata is lacking. I thought that this was well described in your new revision.

Some minor comments:
- The GitHub repo doesn't have a license. CC-BY should be fine. You could also add the reference to this paper there, once it gets accepted.

Review #4
Anonymous submitted on 21/Nov/2021
Suggestion:
Minor Revision
Review Comment:

The paper analyses the current state of metadata for linguistic linked data resources. It considers two resources: LOD Cloud and Annohub. The paper analyses in detail the status of metadata in these resources and provides an enriched dataset in which the metadata of these resources has been aligned/mapped to the META-SHARE ontology, considering in particular the following metadata elements: Classification, Usability, Accessibility, Quality. The authors then analyze the resulting enriched metadata quantitatively, noting in particular that Accessibility metadata is particularly problematic, as in many cases the information is not available or the URLs no longer resolve.

Overall, the paper provides a clear and original contribution. An analysis of the state of metadata is an important research contribution, as this analysis clearly reveals areas of metadata quality and coverage that need to be critically improved for the linguistic linked data cloud to be used in practice to find and retrieve relevant resources. In this sense I regard the results also as significant.

Most of the reviewers generally acknowledge that the paper has improved. In particular,
they agree on the originality and significance of the work. The quality of writing needs to be improved; the process of producing the final version should be closely shepherded by one of the journal's editors.

The following minor issues remain and need to be addressed for the camera-ready version:

1) As noted by one of the reviewers, the dataset provided by the authors is not based on the most recent version of Annohub and does not consider all data in Annohub. This is to be regarded as a minor issue, as the reviewer does not criticize the methodology itself, but the recency of the data. While keeping the methodology the same, this issue can be fixed by the authors for the camera-ready version.

2) One reviewer has noted that it is not clear in which sense the dataset provided by the authors really extends the existing Annohub dataset. This should be clarified as part of the final version.

3) The reviewers have criticized that the dataset is not available as RDF. It should be provided as RDF with the appropriate metadata.

4) There is (still) some lack of clarity on the methodology. While the reviewers acknowledge that the authors have provided more details, e.g. in Table 1, there is a need to be more explicit in describing how exactly the mapping was done and whether it was done manually or semi-automatically. This information could be added in a separate table for each original data source, with a comment explaining how the mapping was done. The categories of mapping strategies defined in the table should be included in the text.

5) The style and grammar of the paper should be improved. The authors should contact the editors for shepherding this process.

One thing this meta-reviewer does not understand is why there are "N/A"s in Table 3 for the LOD Cloud. Apart from the fact that this is not clear, one wonders to what extent the column for LOD Cloud makes sense here, given that it is generally "not applicable".

Stylistic issues:

Page 1: the use of "howbeit" seems odd. I have never encountered "howbeit" in scientific papers. My research revealed that the word is valid, but its use is archaic. I would advise against using it.

Page 4: Despite the ever growing ... the majority of metadata such as rdfs:comment...
=> not clear what this means: first, these elements are RDF properties and not metadata; second, it is not clear what it means that the majority is never used. This statement needs to be more precise.

However the attempts, none of the approaches was able to correct => "However the attempts" is weird as a clause; suggested: "In spite of attempts to ..., ..."

several different datasets show => "different" already implies several, I think, so this is redundant

Page 5

Semantic Health Car => Semantic Health Car*e*

META-SHARE has since been mapped => since when/what ?

Page 6

Annohub also comes with tools for type of resources, ... => unclear what a tool for "type of resources" is; suggested rephrasing: comes with tools for the detection of the type of resources, language, etc.

"completed with missing information, as and when the case" => not clear what "as and when the case" means, please delete

Page 9:

so that a user might now => so that a user might know

Page 18:

the status of datasets SPARQL .. => grammatically odd

with values such as "not available", "available" or "empty": if these are (3) possible values, they should be put in quotes IMHO. It is not clear what the difference is between "not available" and "empty"; this could be clarified.