Cultural Heritage Information Retrieval: Data Modelling and Applications

Babak Ranjgar
Abolghasem Sadeghi-Niaraki
Maryam Shakeri

Special Issue Cultural Heritage 2019

Survey Article
Knowledge organization and the development of better information retrieval techniques have been of great importance since a very early period in human history. The need for such systems has grown with the advent of digitization and the web era. Computer systems and the web have offered easier retrieval of information in almost no time. However, as the amount of data increased, these systems were not able to work well in terms of accuracy and precision of retrieval. The Semantic Web concept was introduced to overcome this issue by converting the web of documents to a web of data. Semantic Web technologies make data machine-understandable so that information retrieval can be more precise and accurate. The Cultural Heritage (CH) community is one of the first domains to adopt Semantic Web recommendations and technologies, which can provide interoperability between various organizations by creating a shared understanding in the community. The data in the CH domain vary widely in type and format. Also, many organizations and experts from various fields interact through different processes within this community. Because of these needs, the CH community adopted Semantic Web technologies step by step during its evolution, for better knowledge management and a uniform understanding among the community. In this paper, we present this process from its initial steps to the latest developments in CH information retrieval. By making data machine-readable, the Semantic Web enables a wide set of opportunities to develop smart applications based on rich CH information, in addition to better information retrieval. We also review intelligent applications and services developed in the CH domain after the establishment of semantic data models and Knowledge Organization Systems.
Review #1
By Carlo Meghini submitted on 30/Sep/2019
Review Comment:

General considerations:

The article is extensive in covering information discovery (the term “discovery” is to be preferred over “retrieval”, which has a very precise connotation in the scientific IT community) in the CH domain. However, it has two main drawbacks:

(1) much of the material is dated and of no interest for researchers, PhD students, or practitioners willing to acquire knowledge of the field from a scientific or technical point of view. As argued below, such material is going to have only an historical interest, but in this sense, I believe the paper is out of scope.
(2) a survey paper should provide a solid account of the outstanding open problems the scientific community is facing at present. Likewise, a technical reader should find in the paper the challenges that need to be faced when designing and implementing a system for information discovery in the CH field. Both are missing from the paper.

As such, I think the paper is not suitable for publication in the Special Issue.

Since the authors have gone a long way in presenting the historical developments of information modelling and discovery in the CH domain, I would advise them to look for a publication venue interested in historical accounts.

Detailed comments:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

For the reasons explained above, the paper is unlikely to attract any interest from researchers, PhD students, or practitioners who are interested in getting started on the covered topic. These people are not interested in knowing the historical developments, but rather the open scientific or technical problems and the potential directions of development.

(2) How comprehensive and how balanced is the presentation and coverage.

The presentation achieves balance by providing exhaustive coverage of the topic. While this would be a requirement for a paper aimed at a historical account of the subject matter, it is unnecessary in a survey paper. As a result, the paper is very long and would most surely prove hard to read for the targeted audience (researchers, PhD students, or practitioners).

(3) Readability and clarity of the presentation.

The paper is readable and clear.

(4) Importance of the covered material to the broader Semantic Web community.

Scarce or none at all. The paper recaps the relevant Semantic Web standards in the CH domain, describing their rationale and features. This treatment of the subject adds nothing new to what is largely well-known. The Semantic Web community would be interested in knowing limitations of current languages and technologies in their application to the CH domain, what is missing or why current approaches are not fully adequate to the challenges posed by CH.

Review #2
Anonymous submitted on 07/Oct/2019
Review Comment:

This paper provides a rich and comprehensive overview of existing models, tools and technologies available nowadays for the study of Cultural Heritage. The content is complete and well balanced, very well structured and clear. All the topics are treated in a complete way and provide a valid entry point for each of the technologies described, and sufficient documentation for many of their aspects.

A special focus is placed on resources related to the Semantic Web and especially the Linked Open Data paradigm, which provides the latest and most effective technology for sharing data on the Web. It certainly constitutes an introduction of great interest to readers who intend to start their research activity in the Digital Heritage world.

It should be noted, however, that the paper may not be of great interest to experts in this sector, because the survey is limited to reporting the existing technologies and tools and does not provide many application examples, especially new ones, which would have been of great interest to show these tools in action. The way technologies are applied is sometimes almost as essential as the description of the technologies themselves. I recommend that the authors extend this topic further if the paper is revised.

Overall, however, it is a good paper, clear, fluent and well written. The bibliography is complete and exhaustive, certainly interesting even for those who already have experience in these subjects.

Review #3
Anonymous submitted on 08/Oct/2019
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article'. It provides a valuable source of information, referring to state-of-the-art standards used in Cultural Heritage Information Retrieval.

Comments on survey article's dimensions:
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic:
Yes. This is a good dictionary, definitely useful for a reader to start in the domain and to have an overview of what exists.

(2) How comprehensive and how balanced is the presentation and coverage.
The coverage is large, although I cannot say if it is exhaustive. A lot of standards and reference works are listed and described. However, a clear synthesis that would help a novice choose among them is missing. The tables that are meant to provide this synthesis should be presented before the details, and they should include additional information such as where to find the models or tools, their creation date, and whether they are still maintained (activity).

(3) Readability and clarity of the presentation.
The article overall is well written, but really hard to read. The main reason is a dictionary-like presentation, with sometimes very long paragraphs (multiple columns). Probably bullet lists should be used instead. In almost all sections, the reader gets easily lost in all the listed items.

(4) Importance of the covered material to the broader Semantic Web community.
This is the first time I have read a survey on the Semantic Web applied to Cultural Heritage. The coverage seems large and might be an opportunity for the Semantic Web community to learn about the specific problems of CH Information Retrieval, how SW solves these problems and what is missing. However, due to the way the paper is written, identifying what is missing and the next steps required is difficult.

General Conclusion:
The article could be accepted under strong revisions, taking into account the comments and in particular revising the text and structure to increase readability.

Here are detailed comments:
-p2 top of 1st column: "(...)for or basically their question is a semantic one." => ???
-p3 top of 2nd column: "without realizing them(...)" => ???
-Section 3: a conclusion for each category (sub-section) is missing
-p6 top of 1st column: should be rewritten. RDF is independent of XML, and there are different syntaxes that can be used to represent RDF data (e.g. RDF/XML, N-Triples, RDFa, Turtle). XML (RDF/XML) is only one of these syntaxes.
-p7: the table should be put in a correct place (no big blank space...). I suggest putting it before the details given for each KOS: first the synthesis, and then the details. Additionally, several things are missing in the table: year of creation, link to the KOS (where is it available), is the KOS still maintained? (activity), ....
Those remarks can be taken into account also for all the tables summarizing SOTA items in the whole document.
-p9 "3.2 Data integration in metadata level": "at the" instead?
-p9 middle of 1st paragraph: "(...) and cannot be put aside easily." => why do you say this? can you explain?
-p9 footnotes number to revise
-p9 paragraphs are too big: consider making one per item or coherent group of items. 2nd column is badly written and hardly understandable: consider revision. End of the column: "(...) a demonstrator was developed at the end, which offered semantics results for the users." => what do you mean by "semantics results"?
-p10 end of 1st column: "Metadata is constructed with a human processing point of view (...) not appropriate for automated tools to infer (...)" => OK, but this is not true for ontological metadata, which are machine-processable by definition.
-p10 end of 1st paragraph: "There is an attempt to understand the concepts (...)" => This is not necessarily true because there are generic domain-independent ontologies.
-p11 "is-a" rather than "isA", and "relation", not "hierarchy"
-p11 1st column, 2nd paragraph: too many things in one single paragraph. + refer to section 3.4 where you first introduce RDF. + you can also refer to OWL2.
-p11 2nd column, beg. of section 3.4: 1st paragraph unclear, consider revision. "Besides the functions mentioned above (...)" => an example is needed.
-p11 2nd column: CIDOC-CRM is usually the name given. I am not sure CRM is really used, and since it is a generic name, I would not use it for the CIDOC-CRM: CIDOC alone would have more meaning. So please do not say it is also called CRM...
CIDOC-CRM is not really a formal ontology, but rather a "reference ontology for interchange of heritage information (ISO 21127:2014)". You can also refer to the OWL2-DL version (which is formal):
-p12-p13: paragraphs really too big: not readable...
-p14: revise page organisation with the table...
-p14, 3.4.1: "(...)EDM and CRM are the two most dominant(...)": you have a lot of references for CIDOC-CRM, but only one for EDM. This does not reflect the popularity of EDM...
-p15: revise page organisation (reading order of column parts, with the table...)
-p15, end of 1st column: "(...)cannot be put aside easily." => again? Why do you say this? Also I am not sure this is correct from an English point of view.
-p15 2nd column. This part is interesting, but we would like to know if there are tools to map from one model to the other, and if they can be used jointly since they seem to be complementary. The question behind this is "how to join / integrate the two worlds of CRM and EDM, to avoid having data silos and allow machine reasoning on everything?"
-p16 end of section 3: "This is achieved if the ontology(...)" => Why? Give some evidence. By saying this, you say implicitly that CIDOC-CRM is better than EDM, although even if one is more complete than the other, the two can probably be complementary...
-p16 section 4 title: "geo-spatial" instead of "spatio-temporal" (?)
-p16-p17: again, the paper would gain in readability by using bullet lists, or any other way to avoid very big paragraphs. There is sometimes no apparent order in the way items are presented. In particular, section 4 starts with GIS and the like about geospatial models, then jumps to 3D models, and then comes back to geospatial standards with CityGML and 3DGIS. We understand only at the end of the section why the part on 3D models is there... => this should be reorganized.
-p17, end of 2nd column: "Since CIDOC CRM is an ISO standard(...)" => Why is this part about CIDOC-CRM here? Is this a cut/paste error?
-p19: "(...)we are going to discuss" => use "we discuss" instead.
-p19, middle of column 2: "There is an important point to note (...)" => not clear, rewrite.
-p22 section 5.2.2: Description Logic is not mandatory to allow reasoning...
-p23: 3D models are first presented in section 4, so maybe section 5.2.3 should be somehow merged
-p23, end of 2nd column, p24: Use of "ubicomp"... Not everyone knows what this is, so you should introduce the term.
-p24, column 2, 2nd paragraph: "(...)AR is quite close to the concept of ubiquitous computing(...)" => Putting AR in a section called "context-aware applications" is questionable. It should rather be put in a specific section dedicated to AR/VR.
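
As a small illustration of the p6 comment above (RDF is a data model, independent of any one syntax), the following Python sketch writes the same triple in N-Triples and in Turtle. It is not taken from the paper: the example IRI, the title literal, the prefix table and the helper names (escape_literal, to_ntriples, shorten, to_turtle) are all assumptions made for illustration. It handles only IRI subjects and predicates with a plain string literal as object, and it does not validate prefixed local names.

```python
# Illustrative sketch: one RDF statement written in two concrete syntaxes.
# All IRIs, the literal and the helper names are hypothetical examples.

PREFIXES = {
    "ex": "http://example.org/",
    "dcterms": "http://purl.org/dc/terms/",
}

# The abstract triple: (subject IRI, predicate IRI, plain literal object).
TRIPLE = (
    "http://example.org/painting1",
    "http://purl.org/dc/terms/title",
    "Example painting",
)


def escape_literal(text):
    """Escape backslashes and double quotes (other escapes are omitted here)."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(subject, predicate, literal):
    """Write the triple as one N-Triples line with full IRIs."""
    return f'<{subject}> <{predicate}> "{escape_literal(literal)}" .'


def shorten(iri, prefixes):
    """Return prefix:local when a declared namespace matches, else <iri>."""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return f"<{iri}>"


def to_turtle(subject, predicate, literal, prefixes):
    """Write the same triple as Turtle, using @prefix declarations."""
    header = [f"@prefix {p}: <{ns}> ." for p, ns in prefixes.items()]
    body = (
        f"{shorten(subject, prefixes)} {shorten(predicate, prefixes)} "
        f'"{escape_literal(literal)}" .'
    )
    return "\n".join(header + ["", body])


if __name__ == "__main__":
    print(to_ntriples(*TRIPLE))
    print(to_turtle(*TRIPLE, PREFIXES))
```

Both outputs describe the same statement; only the serialization differs. RDF/XML and RDFa would be further serializations of that same triple.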