PriPal: Your Privacy Pal
Problem
Privacy policies are long and hard to read. They are full of jargon, and working out what they actually mean is subjective. Users often do not know what they are signing up for and end up accepting terms uninformed. They are concerned about their privacy but have no way to find out what information they are giving away. There is no easy-to-read information on how companies use the data or who has access to it.
Why is it relevant
People rely heavily on online services today and sign up for new ones daily. Yet the average user carelessly skips one of the most important steps: checking what they are agreeing to when signing up. From that point on, whatever the user does can be tracked and sold for profit. Policies that are cleverly crafted to shift the blame onto the user are a problem that needs to be fixed.
Recently there was a major backlash after WhatsApp updated its policies, which gave it a green light to share user data amongst internal services. Misinformed users need to be educated, and a way to give better insight into crafty policies is urgently needed.
Survey
We conducted a survey to find out how relevant the problem is. The results were as follows:
82% of our participants don’t read privacy policies when accepting them.
72% of the participants have not changed their privacy settings for a long time, even after many policy updates.
Most of the participants (96%) find privacy policies difficult to understand. This shows a need to simplify the policies so that people can understand them easily.
90% of the participants are concerned about their privacy and the policies they accept, even though most don’t read them. This is alarming and leaves consumers in a very vulnerable position.
About 70.9% of the participants are uninformed about how businesses use their data.
Solution
We aimed to identify metrics for categorizing the different sections of a privacy policy, to analyze in detail the segments of a policy that correspond to each category, and to compare the privacy policies of different companies. We trained an ML model to divide a privacy policy into the specified categories. Normalized scores were then assigned to each privacy policy for each of the identified metrics/categories.
Dataset
We used the OPP-115 Corpus (ACL 2016) dataset. Each privacy policy in this dataset has been read and annotated by three graduate students in law. The following categories were formulated:
- First Party Collection/Use
- Third Party Sharing/Collection
- User Choice/Control
- User Access, Edit and Deletion
- Data Retention
- Data Security
- Policy Change
- Do Not Track
- International and Specific Audiences
For each of the categories above, certain commonly occurring words indicate that a piece of text belongs to that category. A machine learning model should therefore be able to categorize the text.
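To make the intuition about indicative words concrete, here is a minimal sketch of a keyword-count tagger. It is only a stand-in for a trained model, not the model we built: the cue words in `CATEGORY_KEYWORDS` are hypothetical examples rather than words extracted from OPP-115, and only four of the categories are shown.

```python
import re
from collections import Counter

# Hypothetical cue words per category (illustrative only, not derived from OPP-115).
CATEGORY_KEYWORDS = {
    "First Party Collection/Use": {"collect", "use", "information", "provide"},
    "Third Party Sharing/Collection": {"share", "third", "partners", "advertisers"},
    "Data Retention": {"retain", "retention", "store", "period"},
    "Data Security": {"secure", "security", "encrypt", "encryption", "protect"},
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def classify_segment(text):
    """Return the category whose cue words occur most often, or None if none occur."""
    counts = Counter(tokenize(text))
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A learned classifier would effectively replace the hand-written word sets with weights estimated from the annotated segments, but the input and output stay the same: a text segment goes in, a category label comes out.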
Pipeline
Baseline Model
To label unseen privacy policies, we needed a baseline model. We first tried to build a basic classification model from scratch using the dataset. With further research, however, we found an off-the-shelf model that classifies text into the said categories. We took inspiration from that model and built our own. Currently the model only classifies into 7 main categories. We wish to extend this to cover the subcategories available in the corpus.
Flask Web App
We created a Flask web app in which the above model is deployed. Users can input any privacy policy; the app divides it into categories and shows the text associated with each category. This helps users understand the privacy policy better. You can access the web app from here.
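The core of what the app does with a submitted policy can be sketched as below. This is an assumption-laden outline, not the deployed code: the web layer is left out, splitting on blank lines is a guess at how segments are formed, and `toy_classify` is a hypothetical placeholder for the trained model.

```python
from collections import defaultdict


def split_into_segments(policy_text):
    """Treat each non-empty, blank-line-separated paragraph as one segment (an assumption)."""
    return [part.strip() for part in policy_text.split("\n\n") if part.strip()]


def group_by_category(policy_text, classify):
    """Map each category label to the list of segments the classifier assigned to it."""
    grouped = defaultdict(list)
    for segment in split_into_segments(policy_text):
        grouped[classify(segment)].append(segment)
    return dict(grouped)


def toy_classify(segment):
    """Hypothetical placeholder for the trained model."""
    return "Data Retention" if "retain" in segment.lower() else "Other"


policy = "We retain logs for 30 days.\n\nWe may update this policy."
grouped = group_by_category(policy, toy_classify)
```

Passing the classifier in as a function keeps the grouping logic independent of whichever model is plugged in.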
Scoring
To assign a score to a privacy policy:
- Each category was assigned an overall score equal to the mean of the scores of all the text segments in that category.
- For example, the data retention category has subcategories such as retention period and purpose of retention. The retention period takes values such as limited and unspecified.
- An unspecified retention period counts negatively: the user has no information about how long the data is retained. This can be combined with the purpose of retention, such as legal reasons.
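The scoring steps above can be sketched as follows. The numeric values in `RETENTION_PERIOD_SCORES` are assumptions chosen for illustration (a 0 to 1 scale where higher is more privacy-friendly); the actual values and normalization we used are not spelled out here.

```python
from statistics import mean

# Hypothetical mapping from an annotated value to a score in [0, 1].
# "unspecified" scores lowest because it tells the user nothing about retention.
RETENTION_PERIOD_SCORES = {"limited": 1.0, "unspecified": 0.0}


def category_score(segment_scores):
    """Mean of the per-segment scores in one category; None if the category has no text."""
    return mean(segment_scores) if segment_scores else None


def score_policy(scores_by_category):
    """Compute the overall score of every category in a policy."""
    return {
        category: category_score(scores)
        for category, scores in scores_by_category.items()
    }


# Three retention segments: one with a limited period, two unspecified.
retention_scores = [RETENTION_PERIOD_SCORES[v] for v in ["limited", "unspecified", "unspecified"]]
policy_scores = score_policy({"Data Retention": retention_scores, "Data Security": []})
```

Returning `None` for an empty category, rather than 0, keeps "the policy says nothing about this" distinct from "the policy says something bad about this".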
To check our scoring, we cross-referenced ToS;DR grades. That website hosts crowdsourced information about the privacy policies of various websites. We analyzed the snippets for selected websites, compared them with our scores, and found that our scores were closely related to the conclusions drawn from the ToS;DR data. The top highlighted points for most websites were reflected in the scores we assigned to the various categories. Furthermore, the overall score correlates with the grade given on ToS;DR.
Future Work
Crowdsourcing
We wish to get user feedback on the PriPal score card prototype and use it to improve the baseline model. We will create a crowdsourced platform to improve the reliability and accuracy of PriPal. This will help keep the model and its evaluation up to date with upcoming changes and trends.
Complete extension
We aim to create a Privacy Management Portal, a web-based tool that will detect privacy policy pages, provide live analysis and scores, and compare them with other privacy policies on the go. The platform will save and track all the policies that you have accepted. It will also allow users to define their own threshold for an acceptable privacy score.