PriPal: Your Privacy Pal

May 03, 2021

Problem

Privacy policies are long and hard to read. They are full of jargon, and working out what they actually mean is subjective. Users often do not know what they are signing up for and end up accepting terms uninformed. They are concerned about their privacy but have no way to find out what information they are giving away. There is no easy-to-read information on how companies use the data and who has access to it.

Why is it relevant

Nowadays there is heavy reliance on online services. People sign up for and use new services every day. However, the average user carelessly skips one of the most important steps: verifying what they are agreeing to when signing up. From that point on, whatever the user does can be tracked and sold for business profit. Policies that are cleverly crafted to shift the blame onto the user are a problem that needs to be fixed.
Recently there was a major backlash after WhatsApp updated its policies, which gave it a green light to share user data among internal services. Misinformed users must therefore be educated, and a solution that gives better insight into crafty policies is needed urgently.

Survey

We conducted a survey to find out how relevant the problem was. The following were the results:

Survey chart: "How often do you read privacy policies before accepting the terms and conditions of an application?" (55 responses)

82% of our participants don’t read privacy policies before accepting them.

Survey chart: "When was the last time you updated your privacy settings on your social media accounts?" (55 responses)

72% of the participants have not changed their privacy settings for a long time, even after many policy updates.

Survey chart: "How easy are privacy policies to read, understand and interpret?" (55 responses)

Most of the participants (96%) find privacy policies difficult to understand. This shows a need to simplify the policies so that people can understand them easily.

Survey chart: "Are you concerned about your privacy and privacy policies that you accept?" (55 responses)

90% of the participants are concerned about their privacy and the policies that they accept, even though most don’t read them. This is very alarming and puts the consumers in a very vulnerable situation.

Survey chart: "Do you feel uninformed about how businesses use your data?" (55 responses)

70.9% of the participants feel uninformed about how businesses use their data.

Solution

We aimed to identify metrics for categorizing the different sections of a privacy policy, analyze in detail the segments corresponding to each category, and compare the privacy policies of different companies. We trained an ML model to divide a privacy policy into the specified categories. Normalized scores were then assigned to each privacy policy for each of the identified metrics/categories.

Dataset

We used the OPP-115 Corpus (ACL 2016). Each privacy policy in this dataset has been read and annotated by three graduate students in law. The following categories were formulated:

First Party Collection/Use
Third Party Sharing/Collection
User Choice/Control
User Access, Edit and Deletion
Data Retention
Data Security
Policy Change
Do Not Track
International and Specific Audiences

Most privacy policies are longer than 2000 words. Thus, on average it takes longer than 15 minutes to read a privacy policy. Given their complexity, interpreting them takes even longer.

For each category, certain commonly occurring words indicate that a piece of text belongs to it. A machine learning model should therefore be able to categorize the text.
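The idea that characteristic words signal a category can be illustrated with a very small keyword-overlap classifier. This is only a sketch and a deliberately simple stand-in for a trained model, not the model used in PriPal. The keyword lists and the function names `tokenize` and `classify_segment` are assumptions made up for this example; they are not taken from the OPP-115 annotations.

```python
import re
from collections import Counter

# Hypothetical keyword lists chosen for illustration only; they are NOT
# derived from the OPP-115 annotations or from the PriPal model.
CATEGORY_KEYWORDS = {
    "Third Party Sharing/Collection": {"share", "third", "partners", "advertisers", "affiliates"},
    "Data Retention": {"retain", "retention", "store", "stored", "delete"},
    "Data Security": {"encrypt", "encryption", "secure", "security", "safeguards"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify_segment(segment):
    """Return the category whose keywords occur most often, or None if none match."""
    counts = Counter(tokenize(segment))
    scores = {
        category: sum(counts[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A trained model would learn such word weights from the annotated corpus instead of relying on hand-picked lists, but the underlying signal is the same.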

Pipeline

Baseline Model

To label unseen privacy policies, we needed a baseline model. We first tried to build a basic classification model from scratch using the dataset. With further research, however, we found an off-the-shelf model that classifies text into these categories. We took inspiration from it and built our own model. Currently the model classifies only into 7 main categories. We wish to extend it to cover the subcategories available in the corpus.

Flask Web App

We created a Flask web app in which the above model is deployed. Users can input any privacy policy; the app divides it into categories and shows the text associated with each category. This helps users understand the privacy policy better. You can access the web app from here.
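The core of what the app does with a submitted policy, splitting it into segments and grouping them by predicted category, could look roughly like the sketch below. It assumes that segments are paragraphs separated by blank lines and that the classifier is any callable returning a category name or None. The names `split_into_segments`, `group_by_category` and `toy_classifier` are hypothetical and not taken from the PriPal code; the web framework itself is left out.

```python
from collections import defaultdict


def split_into_segments(policy_text):
    """Split a policy into paragraph-level segments on blank lines."""
    return [p.strip() for p in policy_text.split("\n\n") if p.strip()]


def group_by_category(policy_text, classify):
    """Group each segment under the label returned by `classify`."""
    grouped = defaultdict(list)
    for segment in split_into_segments(policy_text):
        # Segments the classifier cannot label go into a catch-all bucket.
        label = classify(segment) or "Uncategorized"
        grouped[label].append(segment)
    return dict(grouped)


def toy_classifier(segment):
    """Placeholder for the trained model: label by one hard-coded keyword."""
    return "Data Retention" if "retain" in segment.lower() else None
```

In a web app, a request handler would pass the submitted text and the deployed model's prediction function to `group_by_category` and render the resulting dictionary.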

Scoring

To assign a score to a privacy policy:

An overall score is assigned to each category, equal to the mean of the scores of all text in that category.
For example, the Data Retention category has subcategories such as retention period and purpose of retention. The retention period takes values such as limited and unspecified.
An unspecified retention period counts negatively: the user has no information about how long the data is retained. This can be coupled with the purpose of retention, such as legal reasons.
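The averaging step can be sketched as follows. The numeric values in `RETENTION_PERIOD_SCORES` are an assumption for illustration (1.0 as most privacy-friendly, 0.0 as least); the post only states that an unspecified retention period counts negatively. The function names are likewise hypothetical.

```python
from statistics import mean

# Hypothetical value-to-score mapping: 1.0 is most privacy-friendly, 0.0 least.
# The post only says that an "unspecified" retention period counts negatively.
RETENTION_PERIOD_SCORES = {"limited": 1.0, "unspecified": 0.0}


def category_score(segment_scores):
    """Mean of the scores of all text segments labelled with one category."""
    if not segment_scores:
        return None  # the category does not appear in this policy
    return mean(segment_scores)


def score_policy(scores_by_category):
    """Map each category name to the mean score of its segments."""
    return {
        category: category_score(scores)
        for category, scores in scores_by_category.items()
    }
```

For instance, a policy with one limited and one unspecified retention statement would get a Data Retention score halfway between the two assumed values.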

Evaluation Criteria

To gain insight into our scoring, we cross-referenced ToS;DR grades. This website has crowdsourced information about the privacy policies of various websites. By analyzing the snippets for selected websites, we compared our scores and inferred that they were closely related to the inferences from ToS;DR data. The top highlighted points for most websites were reflected in the scores we assigned to the various categories. Furthermore, the overall score correlates with the grade given on ToS;DR.
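One way to quantify such agreement, not necessarily how the comparison was done here, is to encode the letter grades as numbers and compute a Pearson correlation with the overall scores. The encoding in `GRADE_TO_NUMBER` and the function names are assumptions for this sketch, and no real scores or grades are included.

```python
from statistics import mean

# Assumed numeric encoding of ToS;DR letter grades (A is best, E is worst).
GRADE_TO_NUMBER = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def agreement_with_grades(overall_scores, grades):
    """Correlate overall policy scores with ToS;DR letter grades."""
    return pearson(overall_scores, [GRADE_TO_NUMBER[g] for g in grades])
```

A value close to 1 would mean that policies with higher overall scores also tend to receive better grades; with only a handful of websites, such a number should be read with caution.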

Future Work

Crowdsourcing

We wish to get user feedback on the PriPal score card prototype and improve the baseline model. We will create a crowdsourced platform to improve the reliability and accuracy of PriPal. This will help keep the model and its evaluation up to date with upcoming changes and trends.

Complete extension

We aim to create a Privacy Management Portal, a web-based tool that will detect privacy policy pages, provide live analysis and scores, and compare them with other privacy policies on the go. The platform will save and track all the policies that you have accepted. It will also allow users to define their own standard for an acceptable privacy score.

CSE648-PSOSM-2021
