PrivOnto: a Semantic Framework for the Analysis of Privacy Policies

Tracking #: 1387-2599

Authors: 
Alessandro Oltramari
Dhivya Piraviperumal
Florian Schaub
Shomir Wilson
Norman Sadeh
Joel Reidenberg

Responsible editor: 
Guest Editors Linked Data Security Privacy Policy

Submission type: 
Full Paper
Abstract: 
Privacy policies are intended to inform users about the collection and use of their data by websites, mobile apps and other services or appliances they interact with. This also includes informing users about any choices they might have regarding such data practices. However, few users read these often long privacy policies; and those who do have difficulty understanding them, because they are written in convoluted and ambiguous language. A promising approach to help overcome this situation revolves around semi-automatically annotating policies, using combinations of semantic technologies, machine learning and natural language processing to analyze them. In this article, we introduce PrivOnto, a semantic framework to represent annotated privacy policies with an ontology developed in collaboration with privacy experts. PrivOnto has been applied to a corpus of over 23,000 annotated data practices, extracted from a dataset of 115 privacy policies. We designed a collection of 57 SPARQL queries to extract information from the PrivOnto knowledge base, with the dual objective of (1) answering privacy questions users often have and (2) supporting researchers and regulators in the analysis of privacy policies at scale. We present respective findings, after examining the process of developing PrivOnto. Finally, we outline future research and open challenges in using semantic technologies for privacy policy analysis.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Luca Costabello submitted on 30/Jun/2016
Suggestion:
Major Revision
Review Comment:

The paper describes PrivOnto, a novel ontology for modelling the privacy policies of companies and services. The ontology has been used in a knowledge base of data practices extracted from a corpus of policies retrieved from the web. The ~900k triples in the knowledge base have been created by domain experts (law school students), using a web-based annotation tool developed by the authors.
The authors were supported by the same domain experts to define 57 meaningful privacy-related SPARQL queries. Such queries have been used to extract patterns from the knowledge base.

The paper does not target specific Linked Data services, but it makes coherent use of Semantic Web technologies in the domain of data privacy on the web, and is therefore still pertinent to the scope of the call.
The authors make a sufficiently compelling case for their contribution; combing through privacy policies legalese is indeed a relevant problem, and the proliferation of online services just makes things worse.
The work describes an interesting inter-disciplinary approach, with domain experts involved at multiple steps (requirements analysis to design the ontology, manual policy annotation, definition of meaningful SPARQL queries).
The ontology is grounded on a sufficiently comprehensive state of the art, and the ontology design process has been adequately described. PrivOnto supports a time dimension, and the plan to support OWL-Time in the future is appreciated.
The policy corpus is said to include diverse policies, from wide-scope large conglomerates to small companies.
Sec 4.2/4.3 includes some interesting facts on the data retrieved by the SPARQL queries (e.g. it is interesting to see that the purpose of data collection is rarely mentioned in the same fragment, and that the majority of the collected data types are unspecified; it is also interesting that half of the policies only offer a "take-it-or-leave-it" choice).
The paper is well written and adequately structured.

My main concern, though, is that the content of the paper may not be sufficient for such a venue. The authors' contribution looks like work in progress that certainly deserves attention, but that is still at an early stage of development.
Ontology design and knowledge base population are described with sufficient detail, but the paper does not include any knowledge discovery activity - besides running SPARQL queries.

For example, as acknowledged in the conclusion, the next revisions of the paper should discuss in detail how to deal with policy analysis at scale. This is a challenging research question with the potential for a significant contribution to the community (i.e. combining domain-specific NLP techniques at scale with manual annotation by domain experts), but unfortunately the authors haven't gone that far yet.

Scalability-wise, I also wonder how much work will be required to make sure PrivOnto also works in other jurisdictions (i.e. outside the US), besides of course the need for multilingual support to populate the knowledge base.

Conflicting interpretations in Sec 5: although the future roadmap seems interesting, this work should have already been included in the paper to strengthen the authors' contribution. Besides, the authors should perhaps better clarify that "natural language processing and machine learning techniques" will be used in the future, since I have not found such techniques in the paper. As before, this could be an interesting extension of the current paper. Also, adding a discussion about inter-annotator agreement could be valuable.

The authors claim to target a broad audience: regulators, privacy analysts, end-users. Nevertheless, I think this point deserves deeper attention (perhaps multiple user interfaces, with different levels of granularity). True, future work will include query-driven search functionalities on the UPP portal, but briefly hinting at this feature is insufficient. Also, if the ultimate goal of PrivOnto is "helping reduce the complexity of policies" and "bypass convoluted language", shouldn't end users be its ultimate target audience (perhaps with a web app like [4])?
Another extension might focus on targeting software agents instead of a human audience (this might be an additional use case for the SPARQL queries, besides the knowledge base analysis described in sec 4.3).

The description of the ontology could be extended (e.g. finer granularity for all classes and properties), since the ontology is a key contribution of the article. Also, a running example would help.

Unfortunately, I could not find public URLs for the policy corpus, the knowledge base, the PrivOnto ontology, or the collection of SPARQL queries. Disclosing such datasets could be a valuable contribution for the paper and the community.
The ontology should be published on the web with a comprehensive namespace document (have a look at [3] for best practices on publishing ontologies on the web of data). In fig 5 the `privonto2` prefix is associated with `http://www.usableprivacy.org/v3/privonto.owl`, but that URL returns HTTP 404.

In Sec 4.2, I have the feeling that discussing SPARQL response time falls outside the scope of the paper. Besides, the discussion about SPARQL processing time does not take into account the wide body of literature on this topic in the semantic web community (e.g. formal analysis of the SPARQL language [1], empirical studies [2]). For instance, it has been proved and empirically verified that negation in SPARQL queries (e.g. MINUS) adds considerable complexity, as do OPTIONAL patterns.

All in all, this is an interesting, clearly written paper that fits the scope of the call, but it lacks sufficient maturity for publication unless the current draft undergoes major revisions.

Other comments:
+ Fig 1,2 should be replaced with something more readable than a Protege screenshot
+ Fig 4 Typo: "Fuskei" --> "Fuseki"
+ Fig 5: poor readability. Also, a screenshot of Fuseki does not add any real value.
+ Table 1 should include an additional column with the SPARQL query, to provide a few examples to the reader.
+ SPARQL first example in sec 4.2: the nested query is overkill. Also, instead of the string literal "Export" in `privonto:access_type "Export"^^xsd:string.`, better define an entity `privonto:Export` in the PrivOnto vocabulary.
+ SPARQL second example in sec 4.2: lines 1,2 of the graph pattern are disconnected from lines 3,4,5: is that on purpose? Also, instead of the string literal "Merger/Acq", better define an entity `privonto:Merger_Acq` in the PrivOnto vocabulary.
+ SPARQL examples in sec 4.2 use two different prefixes `privonto` and `privonto2`.
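
The last three comments can be illustrated with a minimal sketch. It assumes a single `privonto` prefix bound to the ontology IRI shown in fig 5 (the trailing `#` is an assumption), reuses the `privonto:access_type` property quoted above, and uses the suggested `privonto:Export` entity, which does not exist in the vocabulary as submitted. The variable name `?practice` is hypothetical, and the authors' actual query may need further patterns.

```sparql
# Sketch only: one prefix throughout, no nested query, entity instead of literal.
PREFIX privonto: <http://www.usableprivacy.org/v3/privonto.owl#>

SELECT DISTINCT ?practice
WHERE {
  # Literal form quoted in the review:
  #   ?practice privonto:access_type "Export"^^xsd:string .
  # Entity form suggested in the review (privonto:Export is hypothetical):
  ?practice privonto:access_type privonto:Export .
}
```

With an entity in place of the literal, the access type can carry its own label and documentation in the namespace document, and a misspelt value no longer fails silently as a non-matching string.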

[1] J. Perez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL. In International Semantic Web Conference, volume 4273, pages 30-43. Springer, 2006.
[2] M. A. Gallego, J. D. Fernandez, M. A. Martinez-Prieto, and P. de la Fuente. An empirical study of real-world SPARQL queries. In USEWOD, 2011.
[3] https://www.w3.org/TR/swbp-vocab-pub/
[4] https://tosdr.org/

Review #2
Anonymous submitted on 17/Jul/2016
Suggestion:
Minor Revision
Review Comment:

This paper presents a framework for analysing privacy policies. Policies are semi-automatically annotated with input from experts and analysed to deduce the occurrence of specific privacy categories. Overall, I find aspects of this research insightful, and they may contribute to shifting the state of the art. Below are some points:

- The introduction is vague and lacks a clear rationale for technical choices. Why will a combination of machine learning, NLP and crowdsourcing be the right choice of technique? The whole paper is presented as if there are no issues with utilised techniques that readers should be aware of.

- There is the philosophical argument about whether the analysed documents are privacy policies or simply contract statements specifying how service providers or organisations will utilise data.

- What is the specific problem that is being addressed? While it is true that users struggle with reading privacy policies, this results from different factors: readability, the subjective nature of privacy, etc. It would be good to clearly state the research objectives.

- The description of the PrivOnto knowledge base is well-founded and the subsequent evaluation is revealing in some aspects.

- Ultimately, privacy policies are supposed to be a means for the end user to understand the consequences of information disclosure. While the authors have mentioned PrivOnto's support for regulators and privacy engineers, it is unclear how this can help common users get a better grasp of privacy policies.

- The evaluation as a whole would benefit from a section on threats to the validity of the experiment.

Review #3
By Pompeu Casanovas submitted on 01/Aug/2016
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Review swj1387

1. Originality
This is an excellent paper, yet with no validation tests and walkthroughs with end-users. The ontology has not yet been tested. Thus, it is a bit preliminary, but quite interesting methodologically. I especially liked the practice-discovery approach. The authors write: “We designed a collection of 57 SPARQL queries to extract information from the PrivOnto knowledge base, with the dual objective of (1) answering privacy questions users often have and (2) supporting researchers and regulators in the analysis of privacy policies at scale.” I do believe they are reaching the first objective, but there is no evidence that the second one is achieved. (Any experience with rulers and regulators? I don't think so)

2. Writing

The authors should mention from the outset that the project and its results are only valid for companies, consumers and end-users under US law. Otherwise, as happened to me, the reader could be misled. The paper is well-written and well-structured. Images should be improved. I suggest replacing the screenshots (difficult to read) with more elaborate figures (e.g. a diagram of the ontology).

My comments should be understood as a contribution from the field of Law & Technology to improve the final version of this article.

3. Significance

(i) I agree with the general scope, but I would refrain from using expressions that could be considered as wishful thinking. E.g. “We demonstrate (!) how semantic technologies (STs) can be a viable and scalable approach to help address some of the problems that affect user privacy: STs can help consumers better understand the implications of their online activities, and support policy regulators in facing the intertwined challenges of preventing privacy abuse and reducing the information asymmetry between consumers and companies”.

These are quite different goals. I am not sure that they can be accomplished at the same time and with the same tool. This should be worked out more carefully. Some more information about consumers’ behaviour and actual practices would help to stress them. Cfr. Lillian Ablon, Paul Heaton, Diana Catherine Lavery and Sasha Romanosky, Consumer Attitudes Toward Data Breach Notifications and Loss of Personal Information, RAND Corporation, 2016.

The authors raise a big problem, but perhaps to fix this asymmetry some political and economic information on the accessibility of the legal knowledge that consumers and citizens should have could be also added. This concern has been addressed (a) in many National and European projects. See e.g. “Computable models of the law”, Sartor, Casanovas, Casellas et al. http://link.springer.com/chapter/10.1007/978-3-540-85569-9_1; especially the BEST project, (b) by some ODR companies that automate dispute resolution, e.g. COGNICOR, cfr. the works by Rodríguez-Doncel, Santos, and Casanovas. "A model of Air Transport Passenger Incidents and Rights." JURIX. 2014, and “Ontology-driven legal support-system in air transport passenger domain." CEUR Workshop Proceedings, 2014.

If the authors acknowledged the obstacles and difficulties of institutionalising semantic solutions, rather than naively assuming that they can simply be overcome by technical means, the general presentation of this article would improve significantly. Between the “relevant patterns of privacy practices” and their legal value something is missing: what does “relevance” (social and legal relevance) mean in this context? Something should be said about it. What theory of relevance is used in the PrivOnto project?

(ii) “This disconnect between Internet users and the practices that apply to their data has led to the assessment that the “notice and choice” legal regime of online privacy is ineffective in the status quo. Additionally, policy regulators—who are tasked with assessing privacy practices and enforcing standards—are unable to assess privacy policies at scale.” This is too strong an assertion. Again, this general statement should be nuanced: the literature about privacy, data protection, and security by design is simply ignored. “Data protection by design and by default” are by now not scientific but legal concepts that could (and should) be connected with the general scope of PrivOnto (or at least discussed in the article). A quick look at the last version of Code (by Lessig, crowdsourced, 2006), and at some recent works by Privacy Commissioners in the UK, Australia and Canada could be helpful to address this point.
The state of the art in consumer technology is also missing. See: Alireza Faed, An Intelligent Customer Complaint Management System with Application to the Transport and Logistics Industry, Springer 2016; cfr. the works by Cristiana Santos, Victor Rodriguez-Doncel, Paulo Novais, Francisco Pacheco. Check JURIX (IOS Press) and ICAIL (ACM) Conferences, and AICOL Workshops (Springer) for a mixed legal-computer science perspective based on semantic technologies. For an updated (and more general) approach, cfr. the SWJ Special Issue on law and the semantic web (2016, problems encountered, trends etc.): http://www.semantic-web-journal.net/content/special-issue-semantic-web-l...

(iii) As already said, things are changing quickly from the regulatory point of view in the legal field. E.g. after 4.000 amendments, two different drafts, and four years of discussions, some of them quite harsh, the new EU Regulation on Privacy and Data Protection was approved on April 14th 2016 by the European Parliament. This sets a new EU legal framework to develop this kind of policies that is at odds (if not at loggerheads) with other views. The authors could differentiate technical protocols and ISO standards from policies (by government agencies) and rules with legal content (a. EU Directives and Regulations, b. National statutes, c. Government and EU policies) in different legal cultures. It would be worth distinguishing the different consumer laws and privacy conceptions in (i) US (legal realism) (ii) Common Law cultures (UK and Australia) (iii) civil law cultures (Europe) (iv) mixed (Canada).
This does not impair the content of the annotations, the answer-question method used in this research, the ontology-building process, or the discovery of practice patterns. However, it shapes its cultural context (context of discovery). Perhaps keeping in mind some results presented at specialised workshops (e.g. PrivON) would help to gain some distance. The main difference between the European GDPR reform and the US lies in the fact that Privacy/Data Protection is considered a fundamental right. See the collective volumes edited by Gutwirth, Leenes, and de Hert (2014, 2015). Whereas EU laws consider “privacy” a human right, the US envisages it as a liberty (in the open market) over the state. This leads to many differences in practice. E.g. the Fourth Amendment protects the right to privacy, but it does not cover privacy violations committed by non-governmental actors (e.g. corporations, big technological companies...). Sectorial laws, self-regulation and PETs are the way to handle them. This sets specific contexts (or “ecosystems”) which belong in fact to a different legal privacy system.

(iv) PrivOnto takes into account three functionalities: (a) policy representation (declarative representation of policies in a system); (b) models of interaction (a set of queries that can extract relevant information from the system); (c) policy violation (which formalises the cases when user preferences and data practices collide, leading to consequences that put users’ data at stake). This is quite a promising approach (the second functionality is key to appreciating the innovative aspect of this paper). However, “policy” is never defined, and neither are “law” nor “principles”. Thus, the authors could be implicitly gathering and representing rules of different nature under the same label (policy).

(v) Why do the authors never compare their approach to other research projects on privacy? They mention related works (PETs, Privacy Enhancing Technologies), but scarcely any in the SW field, where several attempts have already been made to formalise data protection and privacy (including some ontologies, and several criticisms of the use of semantic technologies to attain this goal). Quotes of related works are short. The reader would expect some more related research trends. E.g. (a) EU FP7 and H2020 projects, (b) US work (e.g. at Carnegie Mellon too, the semantics of purpose restrictions in the Health area by Tschantz and Datta (2012)).

Again: the state of the art and a discussion section are missing (too limited) and, as a result of that, the references are a bit outdated (or selected only according to the purposes of PrivOnto). As said, privacy and data protection (the authors do not differentiate them) are hot topics now. Cuenca Grau’s paper was published in 2010 [observation: Cuenca is his name, there are no middle names in Spanish, and he cannot be referred by his mother family name, Grau, as it is confusing – use “Cuenca Grau” or “Cuenca-Grau” instead]. Since then, an entire framework of data protection and privacy principles –including ethics- shifted, and has been settled and discussed extensively in the literature (EU GDPR: 2012-2016).

(vi) “The UPP project integrates machine learning, natural language processing and crowdsourcing to improve the analysis of privacy policies and facilitate the development of more accessible privacy notices by extracting and highlighting those data practices that are most relevant to users.”
OK. This is a mixture of bottom-up and top-down processes, but privacy is a complex issue that encompasses other kinds of regulatory knowledge, i.e. legal knowledge, to frame it. It is not only a market issue. Perhaps this has not been taken sufficiently into account in the knowledge acquisition process. Please specify the methodological boundaries of the PrivOnto approach, and the role that relevance and compliance (which have not been defined) play in it. By the way, why “crowdsourcing”? From what I have seen, no information aggregation method is used in the project.
(However, it is true that the second functionality –extracting data practice patterns- can be valued as a good research result and triggers the final part of the paper, in which inner consumers' inconsistencies are highlighted).

(vii) “Identifying suitable queries to extract privacy information, is a data intensive task. In the UPP project we address this issue with an extensive data annotation effort conducted by domain experts”.

This is not clear enough, and this is crucial. What kind of legal and political knowledge are the experts bringing in? What is their expertise about? Which content is actually formalised? The knowledge acquisition process should be clarified in more detail, beyond the outline already provided.

The general points are: (i) Transparency, (ii) Data minimisation, (iii) Proportionality, (iv) Purpose limitation, (v) Consent, (vi) Accountability, (vii) Data security, (viii) Rights of access, (ix) Rights of correction, (x) Third country transfers, (xi) Rights of erasure. But not all of them have been equally considered, because they are treated differently in different legal cultures, and PrivOnto is no exception. E.g. “Rights of erasure” have been officially adopted by the GDPR in Europe, but not in the US (they are not the same as “deletion”); rights can (and will) be enforced under the new EU Regulation, but not in the US; a system of fines (according to the size of the company) will apply in Europe, but it does not make sense in the US, etc. The end-users’ questions (and the extracted patterns) may mirror or reflect the type of principles and rules that are implemented in each different system and that consumers may have in mind.

The annotators are US case-law students; it comes as no surprise that they consider (from what I see in the references) only nationally centred regulations under the FIPs framework set through A. Westin’s tradition. The sources have been CalOPPA, COPPA, and the HIPAA Privacy Rule, in the most classical US tradition. In fact, the selection process to build the corpus has been conducted according to these exclusive sources. This is stated clearly: “(…) for uniformity, selection from geographic sectors was restricted to ensure that all privacy policies in the corpus were subject to US legal and regulatory requirements.”

Nevertheless, if I understand the project correctly, as shown by the selection of websites, PrivOnto has a broader and global scope (encompassing customary and commercial international law). This should be better explained. Perhaps a down-to-earth, more realistic description of the context as a non-homogeneous regulatory context would help the reader to situate and calibrate the risk scenarios that are emerging from the controversies.

The linguistic frame analysis carried out by experts (lawyers) is not refined, or not refined enough, for such a purpose. They have adopted the pragmatic point of view of a generic user as ideal-type. The general categories (User Choice/Control; User Access, Edit, & Deletion; Data Retention; Data Security; Policy Change; Do Not Track; and International & Specific Audiences) could perhaps be matched with the particular choice of the points mentioned above.

(vii) The PrivOnto ontology has been built and populated stemming from these previous categories. I found very interesting the construction of the annotator and the structure of the ontology. However, the results could be somewhat “directed” or “guided” by the previous conceptual work. If so, they are the direct result of a cognitive pre-established design (rather than an inductive empirical process).

I tried to have a closer look at the ontology: I couldn’t, as it is only given a shallow description, with just the broadest categories in it. Data practices and the differentiation into segments and fragments seem to reflect a practical way to display the content of the 115 policies.

However, why has a more extended use of NLP and data mining been excluded from the extraction and the ontology-building process? Why have they not been combined with the manual work described in the article?

Authors write “Our work required only marginal effort for translating unstructured natural language questions into formal queries, as our frame-based annotation process embedded ‘saliency’ in the corpus of annotations in the form of ontology categories and attributes. For this reason, the ontology-based analysis of privacy policies proposed in this article did not require dealing with the diversity and ambiguity of natural language text. The queries we present in Section 4.2 match by design the privacy questions that domain experts deemed as relevant for policy analysis, and that originated the PrivOnto framework in the first place.”

This can produce a kind of looping, leaning on the previous conceptual work produced by the annotators (legal experts). For at least ten years now, similar tasks have been carried out on privacy policy analysis using NLP techniques (e.g. parsing) to avoid the kind of bias that constructing ontologies from scratch (annotations) might produce. The authors mention natural language processing from the beginning, but I cannot see a good description of how it has been implemented in the project (there is a loose reference to Gandon’s work). I’d appreciate a discussion e.g. about the use of terminologies, large term banks, and term bases. A comparison with some works could be fruitful: Rodríguez-Doncel, V., M.C. Suárez-Figueroa, A. Gómez-Pérez, M. Poveda, 2013a. License linked data resources pattern, in: Proc. of the 4th Int. Workshop on Ontology Patterns, 2013, Sydney. Proceedings of the 4th Workshop on Ontology and Semantic Web Patterns. CEUR-WS n. 1188, http://ceur-ws.org/Vol-1188/; J. Gracia, E. Montiel-Ponsoda, D. Vila-Suero, and G. Aguado-de Cea, "Enabling language resources to expose translations as linked data on the web," in Proc. of 9th Language Resources and Evaluation Conference (LREC'14), Reykjavik (Iceland). European Language Resources Association (ELRA), May 2014, pp. 409-413; Bosque-Gil, Julia, Jorge Gracia, Guadalupe Aguado-de-Cea, and Elena Montiel-Ponsoda. "Applying the OntoLex Model to a Multilingual Terminological Resource." In European Semantic Web Conference, pp. 283-294. Springer International Publishing, 2015.

There is another point to be brought into the discussion: there is no mention of the deontic dimension of practices, which has so far attracted the attention of legal ontology builders (legal core and domain ontologies) and W3C policy engineering experts (Rights Expression Languages, R. Iannella). Surely, for the purposes of the article, it is not necessary to describe actions and data practices in terms of permissions, prohibitions and prescriptions. But in the end this is the regular normative form of policies and rights (think of e-Hohfeld language).
(viii) “The notion of “sociotechnical system”, according to which the interaction between people and technology is a central aspect of our society, can be a useful paradigm to understand privacy in the Digital Era”
I would suggest having a look at the notion of “artificial socio-cognitive systems”, as they have been described by Noriega, Padget, Verhagen, and d'Inverno, "The challenge of artificial socio-cognitive systems." (2014) (and in several Dagstuhl Seminars).