Pattern-based design applied to cultural heritage knowledge graphs

Tracking #: 2517-3731

Authors: 
Valentina Anita Carriero
Aldo Gangemi
Maria Letizia Mancinelli
Andrea Giovanni Nuzzolese
Valentina Presutti
Chiara Veninata

Responsible editor: 
Special Issue Cultural Heritage 2019

Submission type: 
Full Paper
Abstract: 
Ontology Design Patterns (ODPs) have become an established and recognised practice for guaranteeing good quality ontology engineering. There are several ODP repositories where ODPs are shared as well as ontology design methodologies recommending their reuse. Performing rigorous testing is recommended as well for supporting ontology maintenance and validating the resulting resource against its motivating requirements. Nevertheless, it is less than straightforward to find guidelines on how to apply such methodologies for developing domain-specific knowledge graphs. ArCo is the knowledge graph of Italian Cultural Heritage and has been developed by using eXtreme Design (XD), an ODP- and test-driven methodology. During its development, XD has been adapted to the need of the CH domain e.g. gathering requirements from an open, diverse community of consumers, a new ODP has been defined and many have been specialised to address specific CH requirements. This paper presents ArCo and describes how to apply XD to the development and validation of a CH knowledge graph, also detailing the (intellectual) process implemented for matching the encountered modelling problems to ODPs. Relevant contributions also include a novel web tool for supporting unit-testing of knowledge graphs, a rigorous evaluation of ArCo, and a discussion of methodological lessons learned during ArCo development.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 18/Jul/2020
Suggestion:
Minor Revision
Review Comment:

I'm almost satisfied with the submitted version of the paper, which substantially improves on the previous one. Specifically, the authors have solved the issues raised in the first review. I also appreciate the restructuring of the paper, the modifications, and the additions introduced.
In my first review, I observed the following:
------------------------------------------------------------------------------------------------------------------------------------------
My major concern regards the proposed definition of eXtreme Design (XD). The relationship between XD and XP is not clear at all. How XD is inspired by XP? What does XD inherit from XP? What does not? In which point XD is different from XP? In which does not? Why XD is necessary and why XP is not enough in the context of ontology design? The authors should answer those questions in a dedicated section that illustrates clearly the relationships between XP and XD. Moreover, there is no reference on how to apply XD to the design phase of an ontology, which represents the main difference between the development process of an ontology-based software and an entity-relation (ER) database-oriented software. In fact, the authors limit themselves to present the ontology patterns adopted in the implementation of ArCo without mentioning how to identify them: applying well-known patterns is a standard practice in software engineering and does not depend on the paradigm adopted. Without such clarification the reader has the feeling that XD is simply an à la "Spiral" approach (as it seems to be confirmed in Section 4.6) and that the “test driven” approach involves only the implementation (coding) of the ontology and not its design process.
-------------------------------------------------------------------------------------------------------------------------------------------

In Section 3, the authors clarify most of the aspects of my observation. However, I am still not convinced by the justification of XD, since it appears to be Extreme Programming applied to the ontology design process. Moreover, the authors assert:
-Page 5, lines 21-30. Where XP diminishes the value of careful design, this is exactly where XD has its main focus.
Extreme Programming does not diminish the value of careful design; quite the opposite: it is intended to produce high-quality software and to reduce its cost through short development cycles. An accurate design phase is a mandatory step of XP (see, for instance, Extreme Programming Installed, Jeffries, Anderson, Hendrickson).
In my opinion, the authors may use the term XP in place of XD, describing how it has been applied to ArCo. The paper may be accepted for publication as it is, but I suggest that the authors consider my last concern.

Review #2
Anonymous submitted on 28/Jul/2020
Suggestion:
Accept
Review Comment:

Globally, the paper has been greatly improved and the authors addressed most of my criticisms; thanks for that. Still, I think that the paper better fits the category:

"Descriptions of ontologies: short papers describing ontology modeling and creation efforts. The descriptions should be brief and pointed, indicating the design principles, methodologies applied at creation, comparison with other ontologies on the same topic, and pointers to existing applications or use-case experiment"

because the great majority of results and modeling/methodological choices have already been analyzed in published works. However, this is not a short paper, and by shortening it the reader would not understand the main contributions of the ArCo project. In my view, this complete and exhaustive analysis merits publication even though it does not contain substantial new results, but I leave the final word to the editors.

The paper would benefit from at least a further effort to clarify some remaining critical and unclear points.

(1) The semantics adopted for the conceptual graphs needs to be clarified and made explicit. For instance, it is not clear how a relation between two classes translates into OWL axioms. Suppose A --(REL)--> B. Does this simply translate into $\exists REL.\top \sqsubseteq A$ and $\top \sqsubseteq \forall REL.B$ (in DL) or $REL(x,y) \to A(x) \land B(y)$ (in FOL), or is there something more/different? A note on that would help the reader.

(2) Even though the authors added some clarifications about the notion of situation, it would still be very helpful to have an explicit comparison with the W3C ontology patterns for representing n-ary relations in RDF and OWL, see https://www.w3.org/TR/swbp-n-aryRelations/.

(3) In Section 4.3 the authors try to explain their idea of introducing some redundancies. More specifically, some binary relations that are in principle detectable via complex queries are also directly introduced in the vocabulary of the theory. First, I would appreciate a clarification of the reason: is it a matter of computational efficiency? Second, how do the authors ensure that the two redundant mechanisms (the information they represent) are (is) not discrepant?
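
To make point (1) concrete, the candidate readings of an edge A --(REL)--> B can be set side by side. The first line is the global domain/range reading given above; the other two are local (class-scoped) restrictions, listed here only as illustrative alternatives and not as what the paper adopts:

```latex
\begin{align*}
\text{global domain and range:} \quad & \exists REL.\top \sqsubseteq A, \qquad \top \sqsubseteq \forall REL.B \\
\text{local universal restriction:} \quad & A \sqsubseteq \forall REL.B \\
\text{local existential restriction:} \quad & A \sqsubseteq \exists REL.B
\end{align*}
```

Under the global reading, any REL assertion types its subject as A and its object as B. Under the local readings, nothing is inferred about REL when it is used outside A, and only the existential one requires every instance of A to have a REL-successor in B.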

(p.3) "By formalising the semantics of cultural properties"
- ??

(p.3) "a formal evaluation of ArCo ontologies based both ..."
- based ON both ...

(p.3) "Section 4 provide"
- provideS

(p.3) "Section 8 discusses relevant related work and Section 7 summarises the lessons learned from the experience of developing ArCo..."
- Why do you talk about Section 8 before talking about Section 7? One could also avoid Section 8 and introduce Section 8.1 at the end of Section 5 and Section 8.2 at the end of Section 3 (or 6). In addition, Sections 5.4 and 5.5 concern methodological, more than conceptual, aspects. Maybe they can be moved.

(p.10) "Requirements coming from user stories, as well those extracted from ICCD standards, are translated into Competency Questions. All CQs, and related SPARQL queries, that so far guided ArCo KG design and testing are available online."
- I'm wondering what happens when a substantial change in the ontology (maybe also concerning quite high-level concepts) is necessary. In addition to taking into account how to migrate the data, one also needs to consider the new translations of CQs (and maybe also the CQs themselves). The authors do not discuss this aspect.

(p.11) "Table 1 lists some representative CQs for each module"
- It sounds strange to me that no CQ involves more than one module.

(p.13) "and to (ii) D&S [11, 34] distinctions for second-order entities"
- I'm not sure I understand what "second-order entities" means here.

(p.13) "For example, the Uffizi in Florence can be categorised as a Building (physical object), a Museum (a social object), and a relative Location (a spatial region), with experts understanding the Uffizi as a complex entity, whose heterogeneous features are not supposed to be analysed into three different categories, since they emerge out of co-predication [35]."
- I don't see in which sense the DUL generalization is better or different from a logical disjunction (of the three categories of Building, Museum, and Location). If the generalization corresponds to a disjunction, by categorizing Uffizi under this generalization the inferential power of the theory is quite low. If I'm not wrong, DOLCE adopts a multiplicativist approach, i.e., in the example, it would probably assume the existence of three different, but related, entities (a physical object, a social object, and a spatial region). This would increase the information present in the theory and its inferential power.

(p.15) "ArCo situations do not commit to the distinction between objects and events as applied in DOLCE (and CIDOC CRM), since a situation is defined as the occurrence of a description as observed, diagnosed, aggregated, invented, etc."
- First, the term "occurrence" is often considered synonymous with "event" or "perdurant". Second, as far as I understand, DOLCE does not commit to a realist view of perdurants, i.e., DOLCE-perdurants may depend on cognitive processes such as observation, diagnosis, invention, etc.

(p.15) "The constructive stance prioritises the dependence of an event-like entity on its framing, so that a requirement for ontology design can be directly matched by a frame."
- A clarification on that would be interesting and useful.

(p.15) "having physical size, constitution, unique qualities, or authorship are factual situations for a cultural property c, while establishing constitution via Carbon-14 or attributing authorship for c are interpretation situations."
- Physical size, constitution, etc. are measured by means of physical instruments with given tolerances and resolutions and are prone to possible misuse (or maybe are just the reports of experts). It seems to me that if there is a difference with respect to Carbon-14 techniques, this difference is mainly qualitative; maybe Carbon-14 measurements are less reliable.

(p.17) "As it describes a real-word object, it can be defined as an information object"
- First, "real-world"? Second, I don't see the implication.

(p.21) "A cultural property can be involved in many different situations during its life: it can be commissioned, bought or obtained, used "
- It is hard for me to accept that a commissioned cultural property exists even before being realized (or maybe without ever being realized).

(p.22) Fig. 13. In the RDFS label of data:NumismaticProperty/1400019640 the authors included the attributed authors, date, etc. Are these labels dynamically produced on the basis of the information in the KB?

(p.39) "a knowledge graph of Italian Cultural Heritage (CH)"
- The acronym CH has already been introduced. Use ICH or just Italian CH.

Review #3
Anonymous submitted on 29/Sep/2020
Suggestion:
Minor Revision
Review Comment:

The paper reports and reflects on the experience of the authors in applying the eXtreme Design (XD) method to develop ArCo, a knowledge graph on the cultural heritage (CH) domain. ArCo has been developed in a homonymous project in cooperation with the Italian Central Institute for Catalogue and Documentation.

The paper discusses how XD was tailored for the ArCo project and how these adjustments helped the authors overcome some of the method's limitations, such as the lack of test automation tools and the lack of guidelines for defining an ontology's architecture. A big chunk of the paper presents how certain ontology patterns were chosen to meet some of the requirements defined for the ArCo ontology, while discussing the respective implications.

This revised version of the manuscript has addressed the most relevant concerns identified in the first round of reviews. The structure of the paper makes a lot more sense now and many overlooked issues have now been addressed, such as the ontological foundations of the ArCo ontology, the contextualization of the wide number of metrics provided in the evaluation section, and the clear analysis and comparison with related work. Additionally, it is a lot easier to understand the ODP selection and application rationale, since all patterns are now presented alongside examples and showcased in a graphical format. All that being said, I must also recognize that the paper is a little tough to digest, particularly because of its length (43 pages with double columns make quite a lengthy paper).

My general assessment is that this is a very good paper for practitioners, as it discusses a lot of methodological aspects often overlooked/hidden in papers that describe an ontology or a knowledge graph. It is even more useful for institutions that are considering or starting to embark on a linked data journey, who are potential "customers" of the ArCo ontology.

My main criticism regards the use of the two DOLCE variants, DOLCE UltraLite and DOLCE-Zero, as the foundation for the ArCo ontology, as they go in the opposite direction to what the state-of-the-art foundational ontologies advocate (e.g. BFO, UFO, [original] DOLCE). DOLCE-Zero, in particular, simply throws away useful ontological distinctions by creating union classes of core disjoint classes. Moreover, the argument in favor of co-predication is weak and based on the assumption that DBpedia has a good ontology design, which it simply does not. While most discussions and lessons learned in this paper are very useful for practitioners, I think this aspect is actually a disservice. It seems that the authors mix the discussion on conceptual ontological concerns with the practical implementation limitations of using RDF/OWL.

Please find my arguments for this criticism and other related issues below.

Page 13, Line 30. “For example, the Uffizi in Florence can be categorized as a Building (physical object), a Museum (a social object), and a relative Location (a spatial region)”. I strongly disagree with the authors’ argument for supporting co-predication, particularly with the provided example. The Uffizi as an organization is not the same entity as the Uffizi as a building. This is a typical example of systematic polysemy, i.e. a word being used with different (but often interconnected) meanings. Good ontological modeling advocates exactly the opposite, i.e. disentangling the different concepts collapsed into a single term. It surprises me quite a lot that the authors see these distinctions but intentionally choose to merge them. The argument that DBpedia would generate millions of inconsistencies is not an argument in favor of their modeling choice, but one that DBpedia is not modeled properly. In sum, the authors started with a strong foundational approach with DOLCE, but are throwing its value out of the window with DOLCE-Zero. If ambiguous definitions and little ontological commitment are what the authors are looking for, why use a foundational ontology in the first place? Please note that I realize that one of the authors of this paper is one of the creators of DOLCE, which only makes this approach even more astonishing to me.

Page 14, Line 7. I find the authors’ strategy of using core:Situation to capture dynamically changing information that needs to be time-indexed perfectly reasonable and useful. However, their explanation of the ontological nature of their notion of situation is quite confusing and should certainly be revisited. First, the argument based on the same cognitive structure of n-ary relations and events (and event types, actions…) is a little cryptic. I mean, how can an event and an event type have the same cognitive structure? For one, the former is an individual and the latter is a type. Is the authors’ proposal to treat both an event and an event type as a situation? But how can an event type be a situation? Second, core:Situation is said to be equivalent to d0:Eventuality, which in turn is the union of dul:Event and dul:EventType. It seems that the authors use core:Situation to mean some sort of stative event, i.e. a kind of static event that happens throughout a certain period. For instance, the state in which I’m a father (e.g. playing the role of Father), or the state in which the Mona Lisa is located at the Louvre. Thus, I fail to see why the authors made core:Situation equivalent to the union of dul:Event and dul:EventType, instead of simply making it equivalent to dul:Event. The second paragraph in section 4.4 is further evidence for my point here. All the concepts mentioned, namely E4 Period, E5 Event, E3 Condition State, and E2 Temporal Entity seem good candidates to specialize dul:Event, but certainly not dul:EventType.

Page 15, Line 15. “... ArCo situations do not commit to the distinction between objects and events as applied in DOLCE”. I’m not sure what is meant by this phrase, but if the authors mean that they do not want to make the distinction between events and objects (or perdurants and endurants; occurrents and continuants), why have they picked DOLCE in the first place? I’m not convinced by the argumentation in the paper that it makes sense to pick a foundational ontology that adheres to a 3D view of the world and then adopt this constructivist approach (which seems similar to a 4D stance).

Page 15, Line 39. “Once an entity is recognized as being part of cultural heritage, it never stops being a :CulturalProperty. For example, a commissioned artwork is not an instance of ArCo’s :CulturalProperty, unless or until it is officially recognized as such. Hence, according to the definition by [47], being a cultural property is an essential characteristic of all instances of :CulturalProperty”. When Guarino and Welty [47] say that a rigid property is an essential property to all its instances, they mean at every possible point in time, not only after the individual instantiates the property for the first time. It is not like the addOnly constraint that existed in a previous version of UML. If things need to be recognized as cultural heritage, then they must necessarily not be so at a certain point in time, which makes the property anti-rigid. Take the example of a photograph P (given on page 16). It is essential for P to be a photograph from the moment it comes into existence to the moment it is destroyed. Thus being a photograph is a rigid property. However, no photograph would come into existence already as a cultural property because, as the authors explain, it requires external recognition. Thus, being a cultural property is an accidental property of P, which makes it an anti-rigid property. If the authors want to implement CulturalProperty in their OWL model as a rigid class, that is a design choice, but it does not make the concept of Cultural Property rigid.
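
The distinction can be stated compactly in the modal notation commonly used for OntoClean. This is a sketch: $\phi$ stands for the property under analysis, and $\Box$ is assumed here to range over all possible worlds and times.

```latex
\begin{align*}
\text{rigid:} \quad & \forall x \, \big( \phi(x) \rightarrow \Box \, \phi(x) \big) \\
\text{anti-rigid:} \quad & \forall x \, \big( \phi(x) \rightarrow \neg \Box \, \phi(x) \big)
\end{align*}
```

With $\phi$ read as "is a cultural property", the photograph P satisfies $\phi$ only after its recognition, so $\Box \, \phi(P)$ fails and the rigidity condition does not hold for P.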

============

Please find some minor comments below, like typos and simple suggestions for improving the manuscript.

Page 1, Line 36. "lessons learned during ArCo development" => ArCo's development

Page 2, Line 16. "a resource that contributes to this vision by..." It is not clear which vision the authors are referring to. Do they mean the recent trend of cultural institutions publishing open data?

Page 2, Line 25. "liked data projects" => linked data projects

Page 2, Line 30, Left. Please consider using a bullet list to improve legibility.

Page 3, Line 31. "Section 8 discusses relevant related work and Section 7 summarises the lessons learned..." Out of order.

Page 4, Line 36. "The quality of the database..." Quality in which sense? Good design, accuracy, completeness? The same vague term is used at the beginning of section 2.2.

Page 4, Line 25. "a PDF document that contains, as shown in Figure 2: a table listing..." => as shown in Figure 2, a table

Page 4, Line 46. "For each of the 30 typologies..." => the 30 types

Page 6, Line 29. “Their adoption guarantees a high level of the overall ontology quality, and favor its re-usability [21]”. Claiming that their adoption guarantees quality is too strong (and not proven). The cited paper indicates a positive correlation between the use of ODPs and some aspects of ontology quality.

Page 6, Line 45. “A very recent and promising contribution to fill this gap is CoModIDE [23]…”. Has this tool been used in the ArCo project? It seems like something that would make sense. However, if it hasn’t, I don’t see the point of mentioning its details.

Page 6, Line 11. “Experiments have proved its positive impact on ontology engineering and ontology quality [8, 23].” This claim is also exaggerated. How about “indicated”, “suggested”, or “demonstrated”?

Page 6, Line 32. “…each involving one or more teams: a customer team,…”.
- Please consider using bullet points to improve readability.
- Are members of these teams expected to be from the “customer side”?
- Does the XD method say something about people participating in multiple teams? E.g. a person who moves between the customer and the design teams. Is it encouraged, discouraged, forbidden?

Page 7, Fig 4. Not all steps reported in the figure are properly explained in the text. I’m left curious, for instance, about what happens in the project initiation? Is there something, in particular, that is done with the domain experts? The steps that are not covered are project initiation, data production, release, and versioning.

Page 7, Line 28. “An example of simple user story is…”. I would appreciate some further details on what a user story should look like in XD in general, and what it looked like in their project in particular. The example in Fig. 5 is a description of an individual that should be handled by the ontology. It looks very different, however, from how user stories are commonly written in software engineering, namely “As a <role>, I want <goal>, [so that <benefit>]”. In fact, on page 10 we can see a story that is quite close to this template.

Page 8, Line 14. “Testing and Integration”. In software development, developers usually write their own unit tests. In test-driven development, in particular, developers are even encouraged to write tests before coding. Do the authors have anything to report on their experience in this project regarding the separation between designers and testers? Is it actually beneficial? I mean, how do the designers know their model satisfies the requirements if they don’t write tests for it? They could even break previous tests when making changes in the ontology.

Page 8, Line 34. “… make the test positive”. I may be ignorant of this terminology, but why not simply say the test broke or failed?

Page 9, Footnote 26. “All unit tests passed so far.” I don’t get this footnote. What are the authors referring to? That all tests must have passed or that no unit tests were broken during regression testing? By the way, it may be useful to add a sentence to explain what regression tests are for lay readers (i.e. without a computer science background).

Page 9, Line 9. “… and possible additional unit tests, on the whole ontology, after integrating the new piece”. Where would additional tests come from if all the others come from user stories?

Page 9, Line 16. “So far, we have described eXtreme Design (XD) according to [8, 9, 26].” Consider starting a new subsection here. Something like “Limitations of the eXtreme Design methodology”.

Page 9, Line 27. “(i) we opened the process of requirements collection in the style of open-source projects”. Could the authors elaborate on this? I don’t think most readers will know what requirements collection style they are referring to. At least a reference to a publication discussing this style would be appreciated.

Page 9, Line 41. “Furthermore, proposals for improvement and bugs can be submitted GitHub issues”. -> “via GitHub issues” or “as GitHub issues”

Page 10, Line 1. “A story is a non-structured text of maximum 250 characters…” I suggest rephrasing this to make it clear that this is the size adopted in the ArCo project, so it is compliant with what was said before.

Page 13, Line 15. What does D&S stand for?

Page 14, Line 5. “While apparently this is a representation problem, …”. I would change this to “While this is a representation problem, …”

Page 14, Section 4.3. Isn’t this approach of duplicating information risky? First, it may cause inconsistencies in the data if only part of the data is entered. Second, if a functional property, like locatedAt, is derived twice from two situations that identify distinct locations of a cultural property at t1 and t2, would we generate a contradiction? Or are all the constraints for object properties removed? I recommend that the authors further motivate this strategy and argue why it was a good solution for their project.
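
The risk can be illustrated with a minimal sketch in plain Python. Everything in it is hypothetical and is not taken from ArCo: the sample data, the tuple layout of a situation, and the helper names `derive_shortcuts`, `functional_clashes` and `discrepancies` are assumptions made for illustration, and `locatedAt` is assumed to be functional.

```python
from collections import defaultdict

# Hypothetical time-indexed situations (not ArCo data):
# (cultural_property, location, start_year, end_year); end_year None = still ongoing.
SITUATIONS = [
    ("coin1", "MuseumA", 1950, 1990),
    ("coin1", "MuseumB", 1990, None),
]


def derive_shortcuts(situations, current_only=False):
    """Project each situation onto a binary (subject, 'locatedAt', location) triple."""
    triples = set()
    for subject, location, _start, end in situations:
        if current_only and end is not None:
            continue  # skip situations that have already ended
        triples.add((subject, "locatedAt", location))
    return triples


def functional_clashes(triples, predicate):
    """Subjects with more than one distinct value for a predicate assumed to be functional."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            values[s].add(o)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


def discrepancies(asserted, derived):
    """Shortcuts asserted without a backing situation, and derivable shortcuts never asserted."""
    return {"unbacked": asserted - derived, "missing": derived - asserted}


# Deriving a shortcut from every situation gives coin1 two locations.
clashes_all = functional_clashes(derive_shortcuts(SITUATIONS), "locatedAt")
# Deriving it only from ongoing situations leaves a single location.
clashes_current = functional_clashes(
    derive_shortcuts(SITUATIONS, current_only=True), "locatedAt"
)
```

Under OWL semantics, two values for a functional property lead to a contradiction only if the values are known to be different individuals; otherwise a reasoner infers that they denote the same one. Either outcome would be unwanted for two distinct museums, which is why it matters whether the shortcut is restricted to current situations or the functional constraint is dropped.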

Page 14, Line 30. “CIDOC CRM E5 Event, subclass of E4 Period, is defined as …”. What do E4 and E5 before the class names mean?

Page 18, Line 1. Reflection: Did the authors really need to port the concept of a cataloging record to the linked data world? Wouldn’t it be enough to bring in the data it contains? The record itself reflects an old way of keeping track of information about cultural properties.

Page 20, Fig 10a. Show an rdfs:subClassOf arrow from cis:Site to clvapit:Feature to help the reader realize that onSite is a subPropertyOf atLocation (as stated in the text).

Page 21, Fig 12. Isn’t Interpretation missing a relation with the agent who made it (as mentioned in the text)?

Page 22, Fig 13. The coin example made me wonder: Do the authors mean that the specific coin has Calandra Davide as an author or that this is an instance of coin designed by him? For instance, when we say that the Euro coins were designed by Luc Luycx, we mean the different types of coins, not each specific coin that has been produced.

Page 22, Line 23. “… the file format for a digital photograph (e.g. “.gif”, “.jpeg”),…” The authors first state that the technical statuses only apply to physical cultural properties. Then, they provide an example in which a technical status is a digital file format applying to a digital photograph, which is not a physical cultural property.

Page 24, Line 4. "entity is neglected in literature". => in THE literature

Page 25, Fig. 15. Some of the relations used in the example are not shown in the pattern, namely hasTimePeriod, hasImmediatedPreviousSituation, hasImmediatedNextSituation, hasTimePeriodBeforeNextSituation

Page 24, Line 40. "Annotating reused patterns supports the identification of ontology alignments" and "...ODP annotations may ease the process to understand and explore an ontology". Could the authors provide any evidence for these claims? Or at least discuss in this section how this annotation was useful within the ArCo project (e.g. Did it help ontology designers, testers, or users?) Otherwise, if there is nothing to be said about this, maybe the authors could consider removing this section from the paper.

Page 26, Line 36. "...by means of indicators that might suggest quality weaknesses or strength". There is something off with how this was phrased.

Page 27, Line 29. "when missing, over test data generated using Fuseki" Fuseki is a SPARQL server, so how does it generate data? If the authors mean the testers manually created artificial data using Fuseki, I suggest they rephrase this. Otherwise, please provide an alternative URL for Fuseki's data generation feature.

Page 27, Footnote 71. I tried the testalod demo with the default values, but I got an application error. I simply clicked GO! and then TEST! on the COMPETENCY QUESTIONS feature. The same thing happened with the CONSISTENCY test. Maybe it's a Heroku problem?

Page 28, Line 7. "This is of utmost important to assess whether ArCo addresses its intended use, i.e. compliance to expertise." Compliance with expertise is a requirement, not a usage.

Page 29, Line 5. The keywords you listed are in Italian, while ArCo is in English (as far as I could tell), as are CIDOC-CRM and the Europeana Data Model. How was this analysis carried out?

Page 31. Table 3. For the metrics that should be considered relative to another one (e.g. NoR and #Classes), I suggest showing the ratios instead, as it is done for NoC in the text (page 33).

Page 35, Section 7.1. Although I highly appreciate pattern-based methods and tools for ontology engineering, the authors' insistence on the need of annotating an ontology with the used patterns is unjustified. The benefits of such an annotation provided in the paper are either speculative or abstract.

Page 36, Line 7. "Effort are also being made" => Efforts are

Page 38, Line 34. "... changes of the physical location of a cultural property are represented by move events...". This has already been partially explained on the previous page. I suggest merging these two passages.

Page 39, Line 45. "In making this choice, a cultural heritage..." This paragraph is way too big.