Semantic-enabled Architecture for Auditable Privacy-Preserving Data Analysis

Tracking #: 2719-3933

Authors: 
Fajar J. Ekaputra
Andreas Ekelhart
Rudolf Mayer
Tomasz Miksa
Tanja Šarčević
Sotirios Tsepelakis
Laura Waltersdorfer

Responsible editor: 
Guest Editors ST 4 Data and Algorithmic Governance 2020

Submission type: 
Full Paper
Abstract: 
Small and medium-sized organisations face challenges in acquiring, storing and analysing personal data, particularly sensitive data (e.g., data of medical nature), due to data protection regulations, such as the GDPR in the EU, which stipulates high standards in data protection. Consequently, these organisations often refrain from collecting data centrally, which means losing the potential of data analytics and learning from aggregated user data. To enable organisations to leverage the full potential of the collected personal data, two main technical challenges need to be addressed: (i) organisations must preserve the privacy of individual users and honour their consent, while (ii) being able to provide data and algorithmic governance, e.g., in the form of audit trails, to increase trust in the result and support reproducibility of the data analysis tasks performed on the collected data. Such an auditable, privacy-preserving data analysis is currently challenging to achieve, as existing methods and tools only offer partial solutions to this problem, e.g., data representation of audit trails and user consent, automatic checking of usage policies or data anonymisation. To the best of our knowledge, there exists no approach providing an integrated architecture for auditable, privacy-preserving data analysis. To address these gaps, as the main contribution of this paper, we propose the WellFort approach, a semantic-enabled architecture for auditable, privacy-preserving data analysis which provides secure storage for users’ sensitive data with explicit consent, and delivers a trusted, auditable analysis environment for executing data analytic processes in a privacy-preserving manner. Additional contributions include the adaptation of Semantic Web technologies as an integral part of the WellFort architecture, and the demonstration of the approach through a feasibility study with a prototype supporting use cases from the medical domain. 
Our evaluation shows that WellFort enables privacy-preserving analysis of data, and collects sufficient information in an automated way to support its auditability at the same time.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Andre Dekker submitted on 08/Mar/2021
Suggestion:
Accept
Review Comment:

I think this is a mature and well-written paper on using Semantic Web technology to enable auditing and consent management in a privacy-preserving manner.

I have only one relatively minor comment: there are established and upcoming federated data infrastructures in which the analysis is sent to the data, rather than the data being centralised as is done in this paper. The paper says something about this approach, "As a consequence, organisations often refrain from collecting data centrally and, for example, offer applications that analyse data locally instead.", but then incorrectly states, "While this reduces the attack surface, it means losing on the potential of data analytics and learning from the entire collected data, and thereby hinders innovative services and research. As an example, identifying trends over cohorts of users is not possible".

Federated learning, including common additional approaches such as secure multiparty computation and homomorphic encryption, is able to handle almost all analyses in a privacy-preserving manner. This should be better described in the text. I think the main addition of this paper is not the privacy-preserving analysis part but the way the authors handle consent and adaptability using Semantic Web technologies. For me, this is the main merit.

Review #2
By Víctor Rodríguez-Doncel submitted on 02/Apr/2021
Suggestion:
Minor Revision
Review Comment:

SUMMARY
In this paper the authors present a platform based on Semantic Web technologies for auditable and privacy preserving data analysis, which can be useful for companies to carry out studies using personal data, while obeying users' consent and storing provenance data for auditability purposes.

GENERAL APPRAISAL
The proposed platform seems to be an original idea: there are already solutions for auditability checking, for the representation of consent and usage policies, and for privacy-preserving techniques, but no approach can be found that combines these tools into a platform for auditable privacy-preserving data analysis. These results apparently advance the state of the art in this area; however, there are a couple of very relevant issues to be considered:

1) The results in this paper cannot be replicated. SWJ authors are encouraged “... to write their papers and more specifically the evaluation sections in a style and level of detail that enables the replication of their results.” Well, I tried to register at https://wellfort.ifs.tuwien.ac.at/ with no luck. I think a demonstration of the usage of the platform and/or open access to the code of the component implementations would significantly improve the reproducibility of the results.
Also, no data related to Section 6 is disclosed. I think publishing the data (at least the sample CSV tables, if not the RML mappings, a couple of FHIR records or the synthetic data) would have facilitated understanding. There is no git repo to be checked either. Why don’t you publish an ‘electronic material’ web page alongside the paper?

2) The Wellfort PROV-O extension
I could not find the Wellfort PROV-O extension anywhere. In any case, this extension ignores a previous effort with the same intention: GDPRov. Either the authors build on that effort or they provide a good reason not to do so, but I think it cannot be ignored. Visit https://openscience.adaptcentre.ie/ontologies/GDPRov/docs/ontology

3) Paper category. The authors chose “Full Paper”. Well, the paper is full in length, but perhaps the category “Reports on tools and systems” would have fit better. In that case maturity has to be considered, but even if the system is not mature, a good demo would have been acceptable, considering the magnitude of the effort.

I would recommend this paper for publication if issue 1 above is corrected and issue 2 is at least considered.

Other than this, the quality of the writing is good and the paper is well structured. There are a few typos and a few sentences that can be improved, which are described after some more detailed comments.

DETAILED COMMENTS
Section 1 (Introduction) introduces well enough the paper and the area being discussed, however I was expecting to have a clear definition of ‘auditability’ and ‘privacy-preserving systems’. Also, it is not really clear why the authors focus on privacy-preserving data publishing (PPDP) and privacy-preserving data mining outputs (PPDMO) instead of other areas of the privacy-preserving data analysis domain.

In Section 2 (Requirements) the authors specify that the requirements of the system are specified according to privacy standards, however they don’t specify which ones they are referring to. Also, in the given example they introduce two companies which sometimes are referred to as company A and company B and sometimes as company H and company M. This should be addressed to have homogeneity, even in Figure 1.

In Section 3 (Conceptual architecture) the architecture of the system, the components and the processes that occur within the described architecture are well specified. However, the authors state that in the Secure repository component “There are no components in this group allowing to view or analyse data” (P4R32-36), which is confusing since the analysts can have a download link for anonymised data. Also, it is not clear how the data is generated and uploaded from the user’s app to the platform: does it assume that the data from the app comes in a certain format every time? Is it equipped to deal with RDF and non-RDF data? Another unanswered question that comes up later in Section 3.1 is how the metadata is automatically generated (P6L11).

In Section 4 (Semantic-Web Methods for Auditable Privacy-preserving Data Analysis) the semantic technologies used are introduced and specified using example implementations. The permanent URIs do not resolve: access to the meta (http://w3id.org/wellfort/ns/meta#) and id (http://w3id.org/wellfort/id) vocabularies results in a 404 error; however, the meta vocabulary could at least be found at https://wellfort.ifs.tuwien.ac.at/ns/metadata/. In this vocabulary, the authors define :hasDataCategory, although they could have used dpv:hasPersonalDataCategory.
I find a problem with http://w3id.org/wellfort/ns/dpv (already published in “User consent modeling for ensuring transparency and compliance in smart cities”), because DPV classes are massively copied in the ontology (either use import or refer directly to the original URIs, but don't replicate the triples!). This is not relevant for the paper, though.
Is the use of wasGeneratedBy in the figure correct? wasGeneratedBy should be connected to an Activity, not to an Agent.

In Section 4.1, the authors state that “RDFS reasoning is used to expand the dataset selection based on the subsumption relations between data categories.” (P7R41-45), but it is not clear whether the same process is used to reason over purposes, processing categories and so on. The light blue color used in Listing 2 cannot be easily seen. Also, in Section 4.2 the further subsections are introduced but Section 4.2.4 is missing. In Section 4.2.2, where the auditability competency questions are presented, I would also include in the user-centric questions some questions regarding the types of personal data used and the processing activities. Also, in Section 4.2.3, wprv:AnalysisEnvironment and wprv:StudyPurpose are missing from Fig. 4, and wprv:StudyResult is presented in Fig. 4 but is not specified in the text. The wprv:StudyPurpose is introduced as “the context of a study, such as recipient, purpose, processing type and duration.” (P11R4). This can be confusing, as the study purpose is one component that describes the context of the study, so I would rename this entity to something like wprv:StudyContext instead of wprv:StudyPurpose. Also regarding Fig. 4, the illustration does not make clear what the lines connecting the concepts and their respective properties mean, and the properties are too small, which makes them difficult to read. Finally, the formalism presented in 4.1.3 is inadequate, because the meaning of the inclusion in Formula 1 is unclear (concept inclusion?).

In Section 5 (Prototype), the prototype implementation of the platform is further specified and well detailed. Some minor comments are: in Section 5.1 the authors state that “The Experiment Setup Interface, also built with Flask, is the only component the Analyst can interact with”, which is not very clear since the analyst can interact with the Analysis Interface in the Trusted Analysis Environment, so maybe a rephrasing should be done; and, in Section 5.3, the expiry times in Listings 3 and 4 don’t match.

In Section 6 (Evaluation), the platform is well evaluated according to the scenario defined in Section 2. However, the letters in Figures 8 and 9 are too small and Fig. 9 in particular seems to be incomplete and its description does not mention the use case it refers to, in this case UC2. The existing tables in this section also have different formatting and the position of the description of the table varies (for instance Table 6 and 7 have the description below the tables and Tables 4 and 5 have it above). The Audit Box component is also sometimes written differently, i.e., Audit-Box, so the authors should use only one term consistently across the paper. Also, the SPARQL query in Listing 7 seems to use ?sp and ?searchparameters to refer to the same value and the requirement R2.4 presented in Table 10 is not discussed in the text.

In Section 8 (Related Work), the state of the art in Privacy-preserving data analysis, auditability and consent management is discussed. The Synthetic Data Vault approach presented in Section 7.1 should be further explored as it is used by the platform. The objective of the BioSHaRE project is also not well defined in the text. Also, in Section 7.3, it would be worth mentioning GConsent (http://openscience.adaptcentre.ie/ontologies/GConsent/docs/ontology), GDPRov (https://openscience.adaptcentre.ie/ontologies/GDPRov/docs/ontology) and the BPR4GDPR project (https://www.bpr4gdpr.eu/).
Regarding the final Section (Conclusion and Future Work), the authors do a good job of summarizing the main contributions of the paper; however, the first question, “What are the key characteristics of auditable privacy-preserving data analysis systems?”, does not have a clear answer in the conclusion text.
To conclude, I would ask the authors to review the formatting of the references to ensure consistency.

MINOR COMMENTS
Minor comments (L / R - left / right column):
P1R46-49: Thus, it is not... -- confusing sentence, can be improved
P2R34-37: A novelty … -- confusing sentence
P3R5: recommendations
P3R13: relationship
P6R19: loaded in the Analysis Server (should it be Analysis Database?)
P7R20: reasons
P8L13-16: For our approach … -- confusing sentence
P9L23: Furthermore, there are a number
P9L48-51: This is especially …. -- confusing sentence
P9R27: We reuse a similar
P10L12: have developed
P10L28: the methodology on how
P10L38: offer a methodology
P10R17: This subsection concerns with … - confusing sentence
P10R23: 1) user-centric, which … - confusing sentence
P11L30: according with Fig.4 it should be wprv:Upload and not prov:Upload
P11R1-3: confusing sentence
P13L13: the OWL-API and the Hermit reasoner
P13L35: dpv:Identifying
P13L41: The Synthetic
P14R6: The Audit Box
P14R43: by the Secure Repository in Listing 3
P15L1: are captured
P15R1: to allow
P15R43: practical reasons
P15R48: should it be analysis instead of analyst?
P16L35-37: Each biomarker is represented with its code (biomarkerCode) and full name (biomarker).
P16R5: from the data table
P16R42: using the Synthetic
P17L1: confusing sentence
P17R34: initialisation, the platform
P17R44: in case they use
P18L22: Patient table
P18L23: Observation table
P18R37: calculates the mean and standard deviation of the heart rate
P19L31: of the Vitality Index
P19L44: Patient table
P19L45: Observation table
P19R29: in the description of Fig. 11 should be “predictive attributes”
P19R42: The implementation
P20L47-18: since the possibility of re-identifying people from the database based on this information is minute.
P20R36: We selected
P20R42: On these cases, the auditor
P21L39: This query demonstrates
P21R23: using the RDT-extension
P21R26: results shown in
P21R36: heart rate
P21R38: rest of the data
P21R45: certain analysts
P21R47: by one analyst
P22L28: indicator on how
P22L34-35: the dataset does not contain the personal
P22R34-36: R1.9 aims to -- confusing sentence
P23L34-37: Furthermore, only metadata… -- confusing sentence
P23L48: This section discusses
P24R2: Moreover … - confusing sentence
P24R6: Therefore… - confusing sentence
P24R16: Nevertheless… - confusing sentence
P25L3: systems comply
P25L12: current alternatives
P26L32: direction have emerged
P26L35: of the Trusted

Review #3
Anonymous submitted on 04/Jul/2021
Suggestion:
Accept
Review Comment:

This paper presents an approach for auditable, privacy-preserving data management and analysis based on Semantic Web technology, a prototypical implementation in the WellFort system, and an extensive evaluation of the approach (and tool) using 2 real-world use cases from 2 SMEs.

This is a very well-written paper, with an excellent structure and a clear storyline. First, there is a well-studied requirements analysis with realistic personas, which logically leads to the conceptual architecture WellFort is based on. The paper is relevant from a Semantic Web perspective, as the advantages of representations using SW technology are credibly argued. Finally, the evaluation is extensive and adds trust in the usability of the approach and system.

Admittedly, I am not an expert in this specific area of privacy-preserving systems for Data Analysis, so I cannot fully guarantee that all related work is taken into account.

But all in all, I am very impressed by the clarity of the paper and the very timely contribution, so I recommend acceptance of the paper to the journal.

---

P17, Line 34, Column 2: teh -> the


Comments

I request the authors to use the following citation for Data Privacy Vocabulary (DPV) instead of the URL used in the paper: Pandit H.J. et al. (2019) Creating a Vocabulary for Data Privacy. In: Panetto H., Debruyne C., Hepp M., Lewis D., Ardagna C., Meersman R. (eds) On the Move to Meaningful Internet Systems: OTM 2019 Conferences. OTM 2019. Lecture Notes in Computer Science, vol 11877. Springer, Cham. https://doi.org/10.1007/978-3-030-33246-4_44

Additionally, the canonical/persistent URL (PURL) of DPV is its IRI, which is http://w3.org/ns/dpv (instead of https://dpvcg.github.io/dpv/, which is what it currently resolves to). It is important to use the PURL since the target URLs change; in the case of DPV, it will soon point to https://w3c.github.io/dpv/dpv, following the migration of the group to W3C's infrastructure on GitHub.