Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
Review swj1387
1. Originality
This is an excellent paper, yet it lacks validation tests and walkthroughs with end-users; the ontology has not yet been tested. It is thus a bit preliminary, but quite interesting methodologically. I especially liked the practice-discovery approach. The authors write: “We designed a collection of 57 SPARQL queries to extract information from the PrivOnto knowledge base, with the dual objective of (1) answering privacy questions users often have and (2) supporting researchers and regulators in the analysis of privacy policies at scale.” I do believe they are reaching the first objective, but there is no evidence that the second one is achieved. (Any experience with regulators? I don't think so.)
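To make this point concrete: a query of the first kind, answering a question a user often has, can be emulated over a toy version of the annotated corpus. The following Python sketch is purely illustrative; the record fields and values are hypothetical, not the project's actual schema (PrivOnto itself uses OWL and SPARQL).

```python
# Hypothetical miniature of a PrivOnto-style knowledge base: each annotated
# policy segment is reduced to a flat record. All names and values here are
# illustrative, not the project's actual schema.
segments = [
    {"policy": "shop.example", "practice": "collect",
     "data_type": "location", "recipient": "first-party"},
    {"policy": "shop.example", "practice": "share",
     "data_type": "location", "recipient": "third-party"},
    {"policy": "news.example", "practice": "collect",
     "data_type": "email", "recipient": "first-party"},
]

def policies_sharing(data_type, records):
    """Answer a user-style question: which policies declare sharing
    of the given data type with third parties?"""
    return sorted({r["policy"] for r in records
                   if r["practice"] == "share"
                   and r["data_type"] == data_type
                   and r["recipient"] == "third-party"})

print(policies_sharing("location", segments))  # ['shop.example']
```

The second objective (regulators analysing policies at scale) would presumably require aggregating such queries across thousands of policies, which is precisely what remains untested.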
2. Writing
The authors should mention from the beginning that the project and their results are only valid for companies, consumers and end-users under US law. Otherwise, as happened to me, the reader may be misled. The paper is well-written and well-structured. The images should be improved: I suggest substituting the screenshots (difficult to read) with more elaborate figures (e.g. a diagram of the ontology).
My comments should be understood as a contribution from the field of Law & Technology to improve the final version of this article.
3. Significance
(i) I agree with the general scope, but I would refrain from using expressions that could be considered as wishful thinking. E.g. “We demonstrate (!) how semantic technologies (STs) can be a viable and scalable approach to help address some of the problems that affect user privacy: STs can help consumers better understand the implications of their online activities, and support policy regulators in facing the intertwined challenges of preventing privacy abuse and reducing the information asymmetry between consumers and companies”.
These are quite different goals. I am not sure that they can be accomplished at the same time and with the same tool; this should be worked out more carefully. Some more information about consumers’ behaviour and actual practices would help to support these claims. Cf. Lillian Ablon, Paul Heaton, Diana Catherine Lavery and Sasha Romanosky, Consumer Attitudes Toward Data Breach Notifications and Loss of Personal Information, RAND Corporation, 2016.
The authors raise a big problem, but to address this asymmetry some political and economic information on the accessibility of the legal knowledge that consumers and citizens should have could also be added. This concern has been addressed (a) in many national and European projects, see e.g. “Computable models of the law”, Sartor, Casanovas, Casellas et al., http://link.springer.com/chapter/10.1007/978-3-540-85569-9_1, and especially the BEST project; and (b) by some ODR companies that automate dispute resolution, e.g. COGNICOR; cf. the works by Rodríguez-Doncel, Santos, and Casanovas, “A model of Air Transport Passenger Incidents and Rights”, JURIX 2014, and “Ontology-driven legal support-system in air transport passenger domain”, CEUR Workshop Proceedings, 2014.
If the authors acknowledged the obstacles and difficulties of institutionalising semantic solutions, rather than naively assuming that they can simply be overcome by technical means, the general presentation of this article would improve significantly. Between the “relevant patterns of privacy practices” and their legal value something is missing: what does “relevance” (social and legal relevance) mean in this context? Something should be said about it. What theory of relevance is used in the PrivOnto project?
(ii) “This disconnect between Internet users and the practices that apply to their data has led to the assessment that the “notice and choice” legal regime of online privacy is ineffective in the status quo. Additionally, policy regulators—who are tasked with assessing privacy practices and enforcing standards—are unable to assess privacy policies at scale.” This assertion is too strong. Again, this general statement should be nuanced: the literature about privacy, data protection, and security by design is simply ignored. “Data protection by design and by default” are by now legal, not merely scientific, concepts that could (and should) be connected with the general scope of PrivOnto (or at least discussed in the article). A quick look at the latest version of Code (by Lessig, crowdsourced, 2006), and at some recent works by Privacy Commissioners in the UK, Australia and Canada, could be helpful to address this point.
The state of the art in consumer technology is also missing. See: Alireza Faed, An Intelligent Customer Complaint Management System with Application to the Transport and Logistics Industry, Springer 2016; cf. the works by Cristiana Santos, Victor Rodriguez-Doncel, Paulo Novais, and Francisco Pacheco. Check the JURIX (IOS Press) and ICAIL (ACM) conferences, and the AICOL workshops (Springer), for a mixed legal-computer-science perspective based on semantic technologies. For an updated (and more general) approach, cf. the SWJ Special Issue on law and the semantic web (2016: problems encountered, trends, etc.): http://www.semantic-web-journal.net/content/special-issue-semantic-web-l...
(iii) As already said, things are changing quickly from the regulatory point of view in the legal field. E.g. after 4,000 amendments, two different drafts, and four years of discussions, some of them quite harsh, the new EU Regulation on Privacy and Data Protection was approved on April 14th, 2016 by the European Parliament. This sets a new EU legal framework for developing this kind of policy, one that is at odds (if not at loggerheads) with other views. The authors could differentiate technical protocols and ISO standards from policies (by government agencies) and rules with legal content (a. EU Directives and Regulations, b. national statutes, c. government and EU policies) in different legal cultures. It would be worth distinguishing the different consumer laws and privacy conceptions in (i) the US (legal realism), (ii) common-law cultures (UK and Australia), (iii) civil-law cultures (Europe), and (iv) mixed systems (Canada).
This does not impair the content of the annotations, the question-answering method used in this research, the ontology-building process, or the discovery of practice patterns. However, it shapes their cultural context (context of discovery). Perhaps keeping in mind some results presented at specialised workshops (e.g. PrivOn) would help to gain some distance. The main difference between the European GDPR reform and the US lies in the fact that privacy/data protection is considered a fundamental right. See the collective volumes edited by Gutwirth, Leenes, and de Hert (2014, 2015). Whereas EU law considers “privacy” a human right, the US envisages it as a liberty (in the open market) against the state. This leads to many differences in practice. E.g. the Fourth Amendment protects the right to privacy, but it does not cover privacy violations committed by non-governmental actors (e.g. corporations, big technology companies...). Sectoral laws, self-regulation and PETs are the way to handle them. This sets specific contexts (or “ecosystems”) which in fact belong to a different legal privacy system.
(iv) PrivOnto takes into account three functionalities: (a) policy representation (declarative representation of policies in a system); (b) models of interaction (a set of queries that can extract relevant information from the system); (c) policy violation (which formalises the cases when user preferences and data practices collide, leading to consequences that put users’ data at stake). This is a quite promising approach (the second functionality is key to appreciating the innovative aspect of this paper). However, “policy” is never defined, and neither are “law” nor “principles”. Thus, the authors could be implicitly gathering and representing rules of a different nature under the same label (policy).
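The third functionality amounts, as described, to a consistency check between user preferences and declared data practices. A minimal sketch of such a check, with hypothetical names and data (PrivOnto's actual formalisation is in OWL and SPARQL, not shown here):

```python
# Illustrative sketch only: user preferences and declared data practices are
# modelled as simple (practice, data_type) tuples; a "violation" is any
# declared practice that the user has explicitly refused. All names here
# are hypothetical, not PrivOnto's schema.
user_prefs = {
    ("share", "location"): "refuse",
    ("collect", "email"): "allow",
}

declared_practices = [
    ("collect", "email"),
    ("share", "location"),
]

def violations(prefs, practices):
    """Return the declared practices that collide with the user's refusals."""
    return [p for p in practices if prefs.get(p) == "refuse"]

print(violations(user_prefs, declared_practices))  # [('share', 'location')]
```

Note how the sketch already presupposes what the article leaves undefined: whether the "policy" being violated is a company document, a legal rule, or a user preference, which is exactly the ambiguity raised above.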
(v) Why do the authors never compare their approach to other research projects on privacy? They mention related works (PETs, Privacy Enhancing Technologies), but scarcely any in the SW field, where several attempts have already been made to formalise data protection and privacy (including some ontologies, and several criticisms of the use of semantic technologies to attain this goal). Citations of related work are short. The reader would expect some more related research trends, e.g. (a) EU FP7 and H2020 projects, (b) US work (e.g. at Carnegie Mellon too, the semantics of purpose restrictions in the health area by Tschantz and Datta, 2012).
Again: the state of the art and a discussion section are missing (or too limited) and, as a result, the references are a bit outdated (or selected only according to the purposes of PrivOnto). As said, privacy and data protection (the authors do not differentiate them) are hot topics now. Cuenca Grau’s paper was published in 2010 [observation: Cuenca is his name; there are no middle names in Spanish, and he cannot be referred to by his mother's family name, Grau, as it is confusing; use “Cuenca Grau” or “Cuenca-Grau” instead]. Since then, an entire framework of data protection and privacy principles (including ethics) has shifted, and has been settled and discussed extensively in the literature (EU GDPR: 2012-2016).
(vi) “The UPP project integrates machine learning, natural language processing and crowdsourcing to improve the analysis of privacy policies and facilitate the development of more accessible privacy notices by extracting and highlighting those data practices that are most relevant to users.”
OK. This is a mixture of bottom-up and top-down processes, but privacy is a complex issue that requires other kinds of regulatory knowledge, i.e. legal knowledge, to frame it. It is not only a market issue. Perhaps this has not been sufficiently taken into account in the knowledge acquisition process. Please specify the methodological boundaries of the PrivOnto approach, and the role that relevance and compliance (which have not been defined) play in it. By the way, why “crowdsourcing”? From what I’ve seen, no aggregated-information method is used in the project.
(However, it is true that the second functionality, extracting data practice patterns, can be valued as a good research result, and it triggers the final part of the paper, in which consumers' internal inconsistencies are highlighted.)
(vii) “Identifying suitable queries to extract privacy information, is a data intensive task. In the UPP project we address this issue with an extensive data annotation effort conducted by domain experts”.
This is not clear enough, and it is crucial. What kind of legal and political knowledge are the experts bringing in? What is their expertise about? Which content is actually formalised? The knowledge acquisition process should be clarified in more detail, beyond the outline already provided.
The general points are: (i) Transparency, (ii) Data minimisation, (iii) Proportionality, (iv) Purpose limitation, (v) Consent, (vi) Accountability, (vii) Data security, (viii) Rights of access, (ix) Rights of correction, (x) Third-country transfers, (xi) Rights of erasure. But not all of them have been equally considered, because they receive different treatment in different legal cultures, and PrivOnto is no exception. E.g. “rights of erasure” have been officially adopted by the GDPR in Europe, but not in the US (they are not the same as “deletion”); rights can (and will) be enforced under the new EU Regulation, but not in the US; a system of fines (according to the size of the company) will apply in Europe, but it does not make sense in the US; etc. The end-users’ questions (and the extracted patterns) may mirror the type of principles and rules that are implemented in each different system and that consumers may have in mind.
The annotators are US law students; it comes as no surprise that they consider (from what I see in the references) only national-centred regulations under the FIPs framework set through A. Westin’s tradition. The sources have been CalOPPA, COPPA, and the HIPAA Privacy Rule, in the most classical US tradition. In fact, the selection process to build the corpus has been conducted according to these exclusive sources. This is stated clearly: “(…) for uniformity, selection from geographic sectors was restricted to ensure that all privacy policies in the corpus were subject to US legal and regulatory requirements.”
Nevertheless, if I understand the project correctly, as shown by the selection of websites, PrivOnto has a broader and global scope (encompassing customary and commercial international law). This should be better explained. Perhaps a more down-to-earth, realistic description of the context (as a non-homogeneous regulatory context) would help the reader to situate and calibrate the risk scenarios that are emerging from the controversies.
The linguistic frame-analyses carried out by the experts (lawyers) are not refined, or not refined enough, for such a purpose. They adopt the pragmatic point of view of a generic user as an ideal type. The general categories (User Choice/Control; User Access, Edit, & Deletion; Data Retention; Data Security; Policy Change; Do Not Track; International & Specific Audiences) could perhaps be matched with the particular points mentioned above.
(viii) The PrivOnto ontology has been built and populated starting from these previous categories. I found the construction of the annotator and the structure of the ontology very interesting. However, the results could be somewhat “directed” or “guided” by the previous conceptual work. If so, they are the direct result of a pre-established cognitive design (rather than an inductive empirical process).
I tried to take a closer look at the ontology but could not, as it receives only a shallow description, with only the broadest categories in it. Data practices and the differentiation into segments and fragments seem to reflect a practical way to display the content of the 115 policies.
However, why has a more extended use of NLP and data mining been excluded from the extraction and ontology-building process? Why have they not been combined with the manual work described in the article?
The authors write: “Our work required only marginal effort for translating unstructured natural language questions into formal queries, as our frame-based annotation process embedded ‘saliency’ in the corpus of annotations in the form of ontology categories and attributes. For this reason, the ontology-based analysis of privacy policies proposed in this article did not require dealing with the diversity and ambiguity of natural language text. The queries we present in Section 4.2 match by design the privacy questions that domain experts deemed as relevant for policy analysis, and that originated the PrivOnto framework in the first place.”
This can produce a kind of loop, leaning on the previous conceptual work produced by the annotators (legal experts). For at least ten years now, similar tasks of privacy policy analysis have been carried out using NLP techniques (e.g. parsing) precisely to avoid the kind of bias that constructing ontologies from scratch (from annotations) might produce. The authors mention natural language processing from the beginning, but I cannot see a good description of how it has been implemented in the project (there is a loose reference to Gandon’s work). I’d appreciate a discussion, e.g. about the use of terminologies, large term banks, and term bases. A comparison with some works could be fruitful: Rodríguez-Doncel, V., Suárez-Figueroa, M.C., Gómez-Pérez, A., and Poveda, M., “License linked data resources pattern”, in Proc. of the 4th Workshop on Ontology and Semantic Web Patterns (WOP 2013), Sydney, CEUR-WS Vol. 1188, http://ceur-ws.org/Vol-1188/; Gracia, J., Montiel-Ponsoda, E., Vila-Suero, D., and Aguado-de-Cea, G., “Enabling language resources to expose translations as linked data on the web”, in Proc. of the 9th Language Resources and Evaluation Conference (LREC'14), Reykjavik (Iceland), European Language Resources Association (ELRA), May 2014, pp. 409-413; Bosque-Gil, J., Gracia, J., Aguado-de-Cea, G., and Montiel-Ponsoda, E., “Applying the OntoLex Model to a Multilingual Terminological Resource”, in European Semantic Web Conference, pp. 283-294, Springer International Publishing, 2015.
There is another point to be brought into the discussion: there is no mention of the deontic dimension of practices, which has so far attracted the attention of legal ontology builders (legal core and domain ontologies) and W3C policy engineering experts (Rights Expression Languages, R. Iannella). Surely, for the purposes of the article, it is not necessary to describe actions and data practices in terms of permissions, prohibitions and prescriptions. But in the end this is the regular normative form of policies and rights (think of an e-Hohfeld language).
(ix) “The notion of “sociotechnical system”, according to which the interaction between people and technology is a central aspect of our society, can be a useful paradigm to understand privacy in the Digital Era”
I would suggest having a look at the notion of “artificial socio-cognitive systems”, as they have been described by Noriega, Padget, Verhagen, and d'Inverno, "The challenge of artificial socio-cognitive systems." (2014) (and in several Dagstuhl Seminars).