An application of Semantic Web Technologies to GDPR compliance of University Processes and Personal Data processing

Tracking #: 2652-3866

Beniamino Di Martino
Pasquale Cantiello
Luigi Colucci Cante
Alfonso Diana
Antonio Esposito
Mariangela Graziano
Michele Mastroianni

Responsible editor: 
Guest Editors ST 4 Data and Algorithmic Governance 2020

Submission type: 
Full Paper
The recent GDPR regulations have had a huge impact on higher education and research institutions, especially where personal data of students or other subjects are involved. This has led to a profound review of administrative processes and research protocols, and to the need for automatic means to verify the conformity of existing processes to current regulations. Many institutions are trying to formalize their internal processes and protocols by using standard formalisms, BPMN being the main formalism adopted. By developing semantic models to enable the annotation of such formally described processes, it is possible to define logical rules that verify their conformity against the GDPR regulations. In this paper, we provide a semantic model for the description of GDPR concepts, together with a semantic meta-model that contains concepts used to describe the structural elements of the analysed BPMNs and to annotate them with concepts from the Domain in which the Business Process itself operates. We then define conformity rules to apply to the annotated BPMN to validate it. A use case, based on processes developed within an Italian University, is described to demonstrate the applicability of the approach.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 24/Feb/2021
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper is about the management of privacy policies and consent, as required by the GDPR. Privacy requirements are formalized with semantic languages and compliance is checked with Prolog inference.

The topic is interesting, due to the number and variety of organizations and companies that have to comply with the GDPR in processing personal data. The reviewer believes that KR&R approaches are the key to the development of effective solutions in this field.

The paper's contribution consists mostly in a compound ontology that links privacy-related concepts with business processes, which are an important aspect in automating compliance checks (when a BPM is adopted). The second contribution is a set of Prolog rules that are used to verify the compliance of organization policies with the GDPR.

The overall impression is that the paper needs more work before being published in a journal; it is still too sketchy in many places, and some important methodological steps are not reported (most notably, validation). Moreover, one relevant competing approach is not discussed, and the language needs some polishing in many places.

1) The paper is too informal and sketchy. A delicate task such as GDPR compliance checking (and the heavy consequences of violations) calls for a rigorous validation, otherwise organizations and companies are not going to trust the framework. Is the reasoning sound and complete? Currently, it is impossible to tell, because the key element (the Prolog rules) is just sketched and the way the rules are constructed is not described at all. Similarly, the axioms of the ontology are not illustrated - the authors provide only three screenshots of Protege. The set of detailed questions below includes some about this aspect.
Similarly, it is not clear how rich the semantic model is, hence which inferences and which compliance checks can be made by the system. The lack of detailed examples further hinders the comprehension of the system's capabilities.

2) The paper does not mention the approach developed in the H2020 project SPECIAL, see the deliverables on .
Some relevant outcomes of this project are: an OWL profile for encoding policies, consent, and an objective fragment of the GDPR; a real-time reasoner for compliance checking; an enforcement architecture based on a big data architecture.
Some relevant papers are:

[1] Piero A. Bonatti: Fast Compliance Checking in an OWL2 Fragment. IJCAI 2018: 1746-1752

[2] Sabrina Kirrane, Javier D. Fernández, Wouter Dullaert, Uros Milosevic, Axel Polleres, Piero A. Bonatti, Rigo Wenning, Olha Drozd, Philip Raschke:
A Scalable Consent, Transparency and Compliance Architecture. ESWC (Satellite Events) 2018: 131-136

[3] Piero A. Bonatti, Luca Ioffredo, Iliana M. Petrova, Luigi Sauro, Ida Sri Rejeki Siahaan:
Real-time reasoning in OWL2 for GDPR compliance. Artif. Intell. 289: 103389 (2020)

The formal compliance checking framework reported in the above papers is an appealing competitor to the formalism proposed in this paper, as far as the representation of regulations and consent, and related compliance checks are concerned. On the other hand, the above papers do not address the integration of privacy policies and compliance checking with a business process model, which is in my opinion the novel contribution of this paper, and should be explained more carefully.

3) English should be extensively revised. The paper also contains many typos.

Detailed comments:

- p.2 line 43: relative works -> related works

- p.2 line 47: inferencial engine -> inference engine

- p.3 lines 16-17: "so is high the probability of find" is not really English.

- p.3 lines 18-19: "most Italian universities use BPMN": reference/evidence needed.

- similarly, I have seen no evidence that BPMN is extensively adopted by the big industrial players whose revenues are based on the processing of personal data, such as - for example - telcos and financial institutions. This fact may hinder the adoption of the proposed framework.

- p.3 line 22: "defined the orchestration": who is the subject of this sentence?

- p.4 line 33: "In the article [16] is proposed a consent ontology": bad English.

- p.7 lines 22-44: you frequently write that a term "belongs" to a class, however those terms look more like classes than instances. Do you mean that they are *subclasses* of the given class?

- lines 32-33: how are the terms representing rights logically connected with the terms that represent the controller's obligations?

- lines 40-42: (English) something is wrong in the description of "Principle".

- The class "Rule" could potentially be quite complicated. Is it just an atomic term here? Does it have subclasses or properties?

- I don't see any reference to sensitive data, which must satisfy special requirements according to the GDPR (cf. Art.9). This is one of the aspects that makes me think that the proposed framework might be incomplete and not sufficiently validated.

- p.8 line 45: how is "Retention" specified in detail? There are at least two approaches, one (intentional) inspired by P3P and one based on the integer datatype.

- p.8 line 49: in "(e.g., biometric, personal, judicial)", biometric is surely personal and judicial can be, too. So "personal" is not an alternative to the other two classes - actually one may argue that it subsumes them.

- line 50: how is severity described/formulated?

- p.12, line 29-30: The phrase: `the blue "1" dot' is not easy to understand at a first reading; similarly for the other `dots' in the following.

- lines 32-37: are the "special place holders" skolem constants? So is this part actually related not to representation, but rather to reasoning with Prolog rules? If so, this part would be more appropriately placed within a description of the reasoning method (that is virtually missing at the moment).

- p.14 lines 6-7: it seems that locations, like "Italy-EU", are individuals. Classes would be more flexible: Italy can be a subclass of EU, cities can be subclasses of Italy, etc, at arbitrary levels of granularity.

- lines 20-22: the phrase "With this conversion ..." is incomprehensible.

- p.20: Something is wrong in the first reference.
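
The comments above on p.8 line 49 and p.14 lines 6-7 both concern subsumption: "personal" subsuming "biometric", and locations modelled as classes rather than individuals. A minimal Python sketch of that idea follows. The class names and the `SUBCLASS_OF` table are illustrative assumptions and are not taken from the authors' ontology.

```python
# Hypothetical subclass table (child -> parent); all names are illustrative only.
SUBCLASS_OF = {
    "Rome": "Italy",
    "Italy": "EU",
    "BiometricData": "PersonalData",
}


def is_subclass_of(cls: str, ancestor: str) -> bool:
    """Return True if cls equals ancestor or reaches it through subclass links."""
    seen = set()
    current = cls
    # Walk up the hierarchy; the seen set guards against accidental cycles.
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = SUBCLASS_OF.get(current)
    return False
```

With locations as classes, a rule written against "EU" would also cover "Italy" and any city placed under it, at whatever granularity the modeller chooses.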

Review #2
By Harshvardhan J. Pandit submitted on 03/Mar/2021
Major Revision
Review Comment:

# Summary
The paper describes a process for representing information regarding University processes using BPMN and mapping it to GDPR concepts through the use of an OWL ontology based on GDPR's ROPA requirements. The data produced using the ontology is then assessed for GDPR compliance using BFN/Prolog. While the paper describes each of these steps in a structured manner, it lacks sufficient demonstration of their relevance in GDPR, does not provide access to resources, and does not have a corresponding evaluation to show the effectiveness of outcomes. I thus recommend that the authors work on the following aspects of the paper and clarify their contributions through more information.

# Related Work
The section goes through numerous existing research works, tools, GDPR-specific ontologies, and information/knowledge extraction and representation methods. It lacks a clear structure or goal, and instead comes across as merely specifying the existence of these methods. It is not clear how each of those cited works (and their associated resources) is relevant to the contributions presented by the paper. I would recommend that the authors instead divide the related work based on relevance to each of their 'steps' in the framework, i.e. ontology for representing ROPA, use of BPMN, and compliance checking. In addition, I would suggest presenting some analysis or relatedness of each work that answers (or pre-empts) questions such as - why did you not use any of the cited ontologies and decided to create your own?

For ontologies, how did you assess them as being related? Given that none of the cited works address ROPAs, did you instead create competency questions or requirements and see if any ontology was reusable or any of the tooling was applicable? E.g. there are a few BPMN based approaches, and also a few compliance checking methods.

When citing GDPRtEXT [18] (disclaimer: am author), the paper fails to mention that it provides a vocabulary of concepts, and instead only mentions its mapping from DPD to GDPR. Similarly, when citing DPV [19] (disclaimer: am author), the paper does not mention what these vocabularies provide or what DPV is exactly.

For BPMN, the paper mentions use of PrOnto to annotate BPMN and use it for evaluation. It is not clarified (here or later) how it relates to the authors' BPMN implementation, why it could not be reused, or what its design/methods were (AFAIK they are not open-source and unavailable). Some relevant works that the authors might find interesting to know are:
Pullonen, P., Tom, J., Matulevičius, R., & Toots, A. (2019). Privacy-enhanced BPMN: Enabling data privacy analysis in business processes models. Software & Systems Modeling.
Tom, J., Sing, E., & Matulevičius, R. (2018). Conceptual Representation of the GDPR: Model and Application Directions. International Conference on Business Informatics Research, 18–28.

Regarding compliance evaluation and constraint checking for GDPR, there is a lack of related work presented. Only superficial information about use of PrOnto (and its basis in deontic logic) is mentioned without delving into the representations or the complexity or the effectiveness. This is also relevant later, as the authors need to justify why they chose BFN/Prolog or how it relates to existing compliance checking methods. Suggested related work to look into this includes:
* work published through the SPECIAL project - Bonatti, P. A., Kirrane, S., Petrova, I. M., & Sauro, L. (2020). Machine Understandable Policies and GDPR Compliance Checking. ArXiv:2001.08930 [Cs]. and Westphal, P., Fernandez, J. D., & Kirrane, S. (2018). SPIRIT: A Semantic Transparency and Compliance Stack. Proceedings of the 14th International Conference on Semantic Systems (SEMANTiCS), 4.
* logic-based compliance checking Satoh, K., Vos, M. D., Padget, J., & Kirrane, S. (2019). Reasoning about Judgement in GDPR Litigation by PROLEG (Demonstration Paper). GDPR Compliance - Theories, Techniques, Tools Workshop of Jurix 2019, 8.
* using ODRL Vos, M. D., Kirrane, S., Padget, J., & Satoh, K. (2019). ODRL policy modelling and compliance checking. 3rd International Joint Conference on Rules and Reasoning (RuleML+RR 2019), 16.
* using SHACL (disclaimer: am author) Pandit, H. J., O’Sullivan, D., & Lewis, D. (2019). Test-driven Approach Towards GDPR Compliance. 15th International Conference on Semantic Systems (SEMANTiCS2019).

Given the reliance on ROPAs for creating ontologies based in GDPR, the following work is also of interest: (disclaimer: am author) Ryan, P., Pandit, H. J., & Brennan, R. (2020). A Common Semantic Model of the GDPR Register of Processing Activities. In S. Villata, J. Harašta, & P. Křemen (Eds.), Frontiers in Artificial Intelligence and Applications. IOS Press.

I would also caution against generic use of "AI techniques", given that they imply a vast amount of processes and techniques, and to instead stick to boring but specific and accurate descriptions.

# Data Protection Ontology
I found it absurd that the citation to Data Protection Ontology [28] is to a DPVCG wiki page! The ontology has publications published which should have been used instead, and it is hosted (somewhere on GitHub last I checked) so its url should have been used. It is also pertinent to know that the authors of that work have since published a comprehensive set of ontologies and KG for compliance checking work based on PrOnto under the DAPRECO project.

## Extension
Can the authors provide access to the ontology? Without it, the only indication of what the ontology contains is through the screenshots e.g. Figure 3 - which are IMO not sufficient to inspect the ontology or determine its validity. For example, I'm quite curious to know what subclasses Consent has - and my guess would be Explicit consent.

Table 1, which presents ROPA concepts is all the way up in P.4 whereas its reference is on P.8. I found this table and its fields inadequate to represent the ROPA requirements as outlined in A.30. The purpose and legal basis should have separate fields - and in addition the separation of consent in the next field is confusing. Surely consent would be considered part of the legal bases used to justify a purpose? A.30 also requires Data Subject category (which is absent), and Personal Data Categories (which is mislabeled under Data Collection). A.30 similarly requires specifying Recipients and their jurisdiction/location - which is distinct from data storage and transfer. This is assuming the Data Controller is the university, though in many cases there could be joint-controllership of data, in which case the ROPA must also specify this information. The authors have mentioned that they utilised the common ROPA template for universities provided by the (an) Italian authority [2] as a reference, and it is possible that the table was adapted from that document. However, the citation does not contain a findable reference to that document, so I would recommend adding that to either the reference or as a footnote. I found this document: which seems to be the right one but hosted on a university's website instead of authority's.

For the additional concepts added to the extension, Personal Data is mentioned even though Bartolini's ontology already contains PersonalData.

## University Domain Ontology
Are the university domain concepts also taken from the guideline document that was used for ROPA?

I'm unclear as to how the term 'Office' identifies the Recipients - perhaps this means the recipient within the university, and through the services used - external parties? Similarly, the 'office' in 'office belong data controller' property also raised the question as to what office this relates to. Regarding liabilities, the concept 'vendor' is unclear - does it mean vendors whose products or services the university uses, e.g. email, recruitment? Either way, this description can be made more clear.

The relationship depicted by LegalPersonData is unclear to me regarding how it relates to GDPR's concepts of Data Controller and Data Subject. The term 'entity' could have been used to capture all three if its purpose is to create a parent class.

The mapping of domain ontology to data protection ontology does not clarify on what basis the mapping was conducted. For example, why was only consent represented as a property (consentGrantedTo), and other legal bases were not (e.g. legitimate interest).

# Automated Verification

The checks mentioned at P.15.L.35 contain questions - where did they originate from? And how do they relate to GDPR requirements? I found it unclear which aspects of GDPR are being verified/checked here, e.g. clauses or specific obligations. It would have been a better evaluation if the use-case was demonstrated in terms of showing its input/output so the reader understands what comes out of the automatic verification process. It is also not clear what happens after the checks - are the results stored back in the KB? Are they presented to the user for fixes with an indication of where the violations lie? Is there any form of log or reporting?
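
To illustrate the kind of input/output demonstration asked for here, the following is a minimal Python sketch of a check that returns a report pointing at the offending tasks. The task fields, the single legal-basis rule, the task names, and the report format are all assumptions made for illustration; they are not the authors' Prolog rules or the checks listed at P.15.L.35.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedTask:
    # Hypothetical annotation fields for one BPMN task.
    task_id: str
    processes_personal_data: bool
    legal_basis: Optional[str] = None


def check_legal_basis(tasks: List[AnnotatedTask]) -> List[str]:
    """Return one message per task that processes personal data
    without a recorded legal basis (an illustrative rule only)."""
    violations = []
    for task in tasks:
        if task.processes_personal_data and not task.legal_basis:
            violations.append(
                f"{task.task_id}: personal data processed without a recorded legal basis"
            )
    return violations


# Example input: three annotated tasks, one of which lacks a legal basis.
tasks = [
    AnnotatedTask("enrol_student", True, "consent"),
    AnnotatedTask("publish_ranking", True),
    AnnotatedTask("print_form", False),
]
report = check_legal_basis(tasks)
```

A report of this shape names the task where the violation lies, which is the information a user would need in order to fix the process.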

# Other
* Language - The paper can utilise editing to ensure there are no typos and that the sentence structure is not discomforting to the reader. A spell check and any reasonable grammar tool would suffice.
* Citations required, or claims made explicit: (page) P2. (line) L24, P3.L18 most Italian Universities use BPMN
* Code in Representation 1 is clipped

Review #3
By Rob Brennan submitted on 21/Mar/2021
Major Revision
Review Comment:

This manuscript addresses a highly relevant topic (GDPR ROPAs and automated validation) and provides some evidence of an implementation, deployment and ontology development to support the design presented. The work has novelty, as ROPAs are not often directly considered and live deployment experiences are to be welcomed. However, the design is not sufficiently well presented to demonstrate the contribution, especially in the area of experimental evaluation of the operation of the system. In addition, the ontology linked to does not follow best practice in terms of ontology engineering, several ontologies are not published, and no source code or demonstrator is available online. Finally, the presentation suffers from a number of minor grammatical mistakes and is not structured to best highlight the main contribution.

Thus a major revision is recommended with the following steps:

1. Fully justify, for example by defining requirements, the use of a GDPR Ontology based on old GDPR draft text and which does not follow ontology engineering best practices (see comment below). There are no details on the development methodology. The ontology also has a series of open questions (below) on how well it actually represents the GDPR domain from a practical point of view (now that we are operating GDPR).

2. Re-structure your Related Work section to provide a more detailed analysis of the works cited to illustrate the gap you are filling. Extend the discussion of semantic BPMN validation.

3. Move section 8 forward in the paper to provide the main contribution earlier on.

4. Provide formal experimental results for the use of your system to back up the claims of usability, efficiency etc. This should profile the types of staff used, how they are trained, how long the annotation process took, how often errors are detected and how much work it is to fix them. Also provide lessons learned based on your practical experiences.

5. In general many diagrams that are not Protege screenshots have text that is too small to read and would benefit from using a formal notation to make them clearer.

Detailed Comments and Questions:

2. L12 "use automatic analysis and reporting tools to extract, from the regulatory text (and related documents), the
data needed and the semantic links between them."

It is unclear why this argument is made since the paper does not use such an ontology. In principles-based legislation like GDPR this may not be realistic, since the regulatory text does not specify what data is needed for accountability, right? Of course, as you mention in your related work, there are several well-established ontologies derived from GDPR already, so generating a new one via ML is not necessary.

2.15 "so is high the probability of find, in the company under investigation, an updated business process description using

While there may be significant interest in BPMN, it would be better to avoid the subjective "high" probability of finding it present in an organisation. Many organisations do not use BPM and even among those that do, there is huge variation in the notations used, e.g. less than 50% of BPM tools used BPMN in 2017:
Matthias Geiger, Simon Harrer, Jörg Lenhard, Guido Wirtz,
BPMN 2.0: The state of support and implementation,
Future Generation Computer Systems, Volume 80,2018, Pages 250-262, ISSN 0167-739X,

L2. 33 "In our approach, we use the Record of Processing Activities as the base document to implement the Domain Ontology
and, since this document is composed by structured text in a standard format, is really simple to use it to build the

Except that extreme variations in the interpretation of what is required for a ROPA have been detected by analysis of the ROPA templates provided by Regulators [Ryan and Brennan Jurix]
An alternative structure for Table 1, ROPA requirements, could be
Art 30(a) Name of the controller
Art 30(a) Contact details of the controller
Art 30(a) The joint controller
Art 30(a) Contact details of the joint controller
Art 30(a) The controller's representative
Art 30(a) Contact details of the representative
Art 30(a) Data protection officer
Art 30(a) Contact details of the data protection officer
Art 30(b) The purposes of the processing
Art 30(c) A description of the categories of data subjects
Art 30(c) The categories of personal data
Art 30(d) The categories of recipients
Art 30(e) Identification of third country or international organisation receiving transfer
Art 30(e) Documentation of suitable safeguards
Art 30(f) The envisaged time limits for erasure of the different categories of data
Art 30(g) A general description of the technical and organisational security

It would be useful to discuss the missing fields, for example the identity of the recipient for data transfers.
Ryan, Paul, Pandit, Harshvardhan J. and Brennan, Rob (2020) A common semantic model of the GDPR register of processing activities. In: 33rd International Conference on Legal Knowledge and Information Systems (JURIX 2020), 9-11 Dec 2020, Prague, Czech Republic (Online). ISBN 978-1-64368-150-4
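
The alternative Art. 30 structure listed above lends itself to a simple completeness check on a ROPA record. The Python sketch below is one hedged way to express it. The field keys are hypothetical, only a subset of the listed items is included, and every included item is treated as mandatory purely for the sake of the example.

```python
# Hypothetical field keys mapped to the Art. 30 items listed above (subset only).
ROPA_FIELDS = {
    "controller_name": "Art 30(a)",
    "controller_contact": "Art 30(a)",
    "purposes": "Art 30(b)",
    "data_subject_categories": "Art 30(c)",
    "personal_data_categories": "Art 30(c)",
    "recipient_categories": "Art 30(d)",
    "third_country_transfers": "Art 30(e)",
    "erasure_time_limits": "Art 30(f)",
    "security_measures": "Art 30(g)",
}


def missing_ropa_fields(record: dict) -> list:
    """Return the keys of expected fields that are absent or empty in record."""
    return [key for key in ROPA_FIELDS if not record.get(key)]


# Example: a record that only names the controller and the purposes.
record = {"controller_name": "Example University", "purposes": "student enrolment"}
missing = missing_ropa_fields(record)
```

Running the same check against the fields of the paper's Table 1 would make the gaps discussed above explicit.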

P3.44 This related work section (3) is just a list and does not critique and identify the gap in existing work, despite generally identifying important papers on semantics and GDPR. It is necessary to identify how your work differs, builds upon or improves the state of the art.
In contrast, there is a huge body of work on semantic BPMN validation and this is discussed very lightly. It is key to see how the authors' work differs, and what the limitations of other work are.

There is also no discussion of accountability and how this is relevant, for example see the ICO Accountability Tracker

P6.7  Surprised that you do not distinguish between a data controller and processor in the relations

P6.26 would like more detail from [24] here to make more understandable

P6.27 Have you any estimate of how much effort it takes to manually semantically annotate a BPMN? This would seem to be very labour intensive... perhaps on the same order to manually evaluating the BPMN for compliance?
In addition, what happens when the BPMN changes (as business processes do)?
For a discussion of the importance of change in the business process lifecycle, for example see
A. Meidan, J.A. García-García, M.J. Escalona, I. Ramos,
A survey on business processes management suites, Computer Standards & Interfaces, Volume 51, 2017, Pages 71-86, ISSN 0920-5489,

L6.36 Why was an ontology based on a draft of GDPR legal concepts used rather than one reflecting the final text?
For example, the W3C page describing the ontology says the links to the GDPR text are incorrect in some cases. The Bitbucket dump of the ontology includes annotation properties (generated by Protege perhaps) that hijack existing ontologies like Dublin Core. No property comments are included in the ontology to make it self-documenting, for example for HTML generation by tools like Widoco. For an OWL2 ontology it is interesting that extensive use has been made of rdfs domain and range statements for properties rather than using more flexible restriction class-based property constraints. is also re-defined in the ontology.
Links to GDPR articles are only captured in comment annotations rather than in a semantic way, for example using the GDPRtEXT ontology.
There is a lack of ontology level metadata normally demanded by best practice that would declare the authors, versioning, licensing etc.

L6.37 "This legal ontology
formalizes data protection norms and models the GDPR main conceptual cores:"
Given the principles-based nature of GDPR accountability is it really possible to fully cover the full scope of potential data protection activities, given that local policies can vary and conformance to those local policies is what is required by GDPR rather than an explicit  set of compliance points or concepts?

L6.34 What was the basis for deciding to use this ontology given that there are a number of GDPR ontologies as laid out in your related work section. It would be good to explain the rationale to the reader.

p7.38 It would probably be useful to add a JointController to this class.
In terms of naming, would Agent or LegalPerson be preferable to "Person" since this class can denote an organisation?

p7.40 PersonalData, would it be useful to extend this ontology to include the concept of Special Category or Sensitive data that GDPR uses? Several existing DP/GDPR ontologies, like DPV, include this concept.

L7.48 It would also be beneficial to explain in detail the requirements for the extensions created and the process you used.

L9. 35 Figure 4. In general there are too many Protege screen shots with insufficient supporting documentation to justify or explain the ontologies created.

L9.40 This section (6) is too brief in terms of defining the ontology development methodology used and the motivating use case or defined requirements. This would be more useful than brief definitions of key classes. Also no link to the ontology source is given so it is impossible to verify its quality.

L11.36 Section 6.2
Again this section gives us some example interlinking properties but it does not  tell us why this is important or evaluate to what extent this process has been successful or tell us about any quality checks or formal processes used to build confidence that it is appropriate.

L12.21 "The annotation of the BPMN through the tool described in section 4 is quite straightforward"
Can you provide any evidence of how long it took? How much training was required? The types of expertise of the annotators? How much preparation was required to deploy the tool? What type of machine it executes on? How difficult it was to secure access to the BPMN from the organisation?
Without this your statement is hard to evaluate.

Similarly in P12.L30-50 you describe the annotation process and how it can be sped up, it would be convincing to provide the details of a formal experiment to quantify the gains observed and qualify them against the training or preparation time so that readers can see the tradeoffs involved.

P14.21 "inferential engine inferences using the Prolog Rules that come from Knowledge Base Rule - Rules that contains
Rules to evaluate the compliance of the BPMN."
It would be useful to explain this process in more detail, for example what are the base rules, what are the types of issues detected, what are the consequences of incomplete annotation, what quality assurance process is necessary/available etc.?

P14.48 Figure 9, Would be preferable to have a formal diagram like a UML behavioral diagram to show the data flows and systems in a less ambiguous way.

P15.1 Section 8
I recommend that this section is moved closer to the front of the paper. It would then motivate many of the previous sections.

P15.25 I think these rules could also explicitly cover Joint Controllers which is becoming an increasingly common structure under GDPR processing rules in practice.

P17.34 "the speedup provided by the
tool may significantly reduce the effort for obtain (and mainly, maintain) GDPR compliance."
No evidence has been provided to back this up

=Typos and Formatting Issues=

Figure 2, text on buttons too small to read

P2.17 typos "so
is high the probability of find"->"so
_there is _a_ high __ probability of find_ing_"
P1.37 typo: soud->sound

P3.37: typo: "data field"->"data fields"

P6.34 typo "called Data  Protection Ontology" -> "called _the_ Data Protection Ontology"
also P6.38

P6.38 typo "is composed by"-> "is composed of"

P6.40 typo "legal bases" -> "legal basis"

p6.44 typo "This class describe" -> "This class describe_s_"

p6.45 typo "as subclass"->"as subclass_es_"

p7.28 typo DataSubjectRigth -> DataSubjectRig_ht_
typo "This class describe"-> "This class describe_s_"

p7.32 typo RigthToNoProfiling->Rig_ht_ToNoProfiling

p7.38 typo DataProtectionOfficier->DataProtectionOffic_er_

L7.48 This section refers to "a custom version of GDPR ontology" whereas the last section talked about the "Data Protection Ontology", it would be good to harmonise the naming used for maximum clarity.

P14.48 Figure 9, text too small to be readable.

P15.25 typo: governative, do you mean governmental?

P18.47 Figure 10 the diagram text is too small to read