How Much Do They Know?

Check out the website live on:

hmdtk.herokuapp.com

Motivation

When a person signs up for a new social media service, whether for work, on the recommendation of a friend, or because they heard of it through the grapevine, they invariably agree to a privacy policy: a legally binding agreement that lets the company harvest and sell the data they generate. It comes with the caveat that the company may update the policy to harvest and sell this data in any way it desires in the future, without informing the user of the changes.

While a person agrees to a privacy policy only once, it takes effect every time any user data is generated. By design, the user thinks about it only once, often without comprehending the entire text.

Solution

To make users more aware of their data and how it can be used, we have to actively combat the design of privacy policies. With the right presentation of information, we can counter both the ease of signing up for a social media service and the difficulty of controlling user information through its obtuse menus. Making people aware of the methods of modern data analysis also puts the data harvested from them into sharp relief.

Our project, ‘How Much Do They Know’, catalogues and presents this data as succinctly as possible, shows the capabilities of modern data analysis on data collected from users, and gives readers updates whenever privacy policies are changed. With this perspective, we hope that users of social media view privacy policies not as a one-time checkbox, but as an ongoing legal agreement with a corporate interest.

Homepage of our website

Features

1. The What do they know? table

We have crafted a table that conveniently shows exactly what a company knows about you. This makes users aware of the extent of their data that these companies have access to.

What do they know? table

2. Summarizing Privacy Policies Algorithm

By design, privacy policies aren't user-friendly, even to tech-savvy consumers, and so we decided to make them easy to read by summarizing them.

One can select any company from the set of tech giants like Google and Facebook, or one can upload the privacy policy of any company.

The machine learning model answers key questions, such as how data or personal information is retained and whether data is shared with third parties, and presents the answers as a summarized version of the verbose privacy policy.

Snapshot of the summarized version of the privacy policy for LinkedIn

The Algorithm

The summarizer consists of two main parts. The first retrieves the sections of the privacy policy that answer the given questions; the second summarizes those relevant sections into shorter points.

We used TF-IDF and Okapi BM25 to get the top 10 relevant sections, then a cross-encoder (SBERT) trained on the MS MARCO dataset to narrow those down to the top 5.
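The two-stage retrieval described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the first stage is a plain-Python Okapi BM25 scorer, and the cross-encoder is left as an optional callable that a real setup would back with a trained model. The names tokenize, bm25_scores, retrieve and rerank, and the k1 and b defaults, are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, sections: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every section against the query."""
    docs = [tokenize(s) for s in sections]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of sections that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query: str, sections: List[str],
             first_k: int = 10, final_k: int = 5,
             rerank: Optional[Callable[[str, str], float]] = None) -> List[str]:
    """Stage 1: keep the first_k best BM25 sections.
    Stage 2: if a rerank(query, section) scorer is given (a stand-in for
    the cross-encoder), reorder the candidates by it, then keep final_k."""
    scores = bm25_scores(query, sections)
    order = sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)
    candidates = order[:first_k]
    if rerank is not None:
        candidates = sorted(candidates,
                            key=lambda i: rerank(query, sections[i]),
                            reverse=True)
    return [sections[i] for i in candidates[:final_k]]
```

Calling retrieve(question, policy_sections) with the defaults mirrors the top 10 then top 5 narrowing; without a rerank callable, the second stage simply truncates the BM25 ranking.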

The shortening into points is done using LexRankSummarizer, which has one of the highest BLEU scores among algorithms for the same purpose.
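To give a feel for what a LexRank-style summarizer does, here is a simplified plain-Python sketch. It is not the LexRankSummarizer used in the project. It builds a graph of sentences weighted by TF-IDF cosine similarity, ranks them with damped power iteration, and keeps the top-ranked sentences. The function names, the smoothed IDF, the threshold and damping defaults, and the choice to ignore self-similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_vectors(sentences: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vector per sentence, using a smoothed IDF."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: c * math.log(1 + n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences: List[str], threshold: float = 0.1,
                   damping: float = 0.85, iterations: int = 50) -> List[float]:
    """Centrality of each sentence in the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = tfidf_vectors(sentences)
    rows = []
    for i in range(n):
        row = [0.0] * n
        for j in range(n):
            if i != j:  # assumption: self-similarity is ignored
                sim = cosine(vecs[i], vecs[j])
                row[j] = sim if sim >= threshold else 0.0
        total = sum(row)
        # Row-normalize; a sentence similar to nothing links uniformly.
        rows.append([w / total for w in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def summarize(text: str, max_sentences: int = 3) -> List[str]:
    """Return the highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = lexrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:max_sentences]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share vocabulary with many others end up with high scores, while a sentence unrelated to the rest of the section drops to the bottom, which is why the surviving points tend to reflect the section's main theme.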

Together, these give a much more human-readable form of the privacy policy, answering the main questions that otherwise stay hidden in its long and verbose text.

3. User Profiling Algorithm

Our original plan was just to collect our team's data in accordance with social media privacy policies and run data analysis on it, to convince the average user that data collection is much more insidious and comprehensive than they would assume.

But now, we have generalized it. Any person can upload their recent search history, and our algorithm can predict things they’re interested in like football, music, smartphones, etc.

To predict what topics a user is interested in, we had to create a machine learning model that, given some text, predicts the probability of each topic in a fixed set.

To do this, we first had to manually create a training dataset of text labelled with 23 topics: Automotives, Basketball, Business, Cricket, Education, Food, Football, Gaming, Health/Medicine, Laptops, Literature, Movies/TV shows, Music, Nature, Politics, Religion, Smartphones, Space, Tennis, Tourism, Other News, Other Sports, Other Tech.

We then trained our model with this data and achieved an accuracy of around 80%.
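The post does not say which kind of model was trained, so the sketch below uses a multinomial Naive Bayes classifier purely as a stand-in. It shows how a model can turn a piece of text into a probability for each topic in a fixed set. The TopicModel class, its method names, and the add-one smoothing are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TopicModel:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in model)."""

    def fit(self, examples: List[Tuple[str, str]]) -> "TopicModel":
        """examples is a list of (text, topic) pairs."""
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        for text, topic in examples:
            self.doc_counts[topic] += 1
            self.word_counts[topic].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.topics = sorted(self.doc_counts)
        return self

    def predict_proba(self, text: str) -> Dict[str, float]:
        """Return a probability for every known topic; values sum to 1."""
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]
        log_post = {}
        for topic in self.topics:
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)
            log_p = math.log(self.doc_counts[topic] / total_docs)
            for w in words:
                log_p += math.log((counts[w] + 1) / denom)
            log_post[topic] = log_p
        # Normalize in log space for numerical stability.
        peak = max(log_post.values())
        exp = {t: math.exp(v - peak) for t, v in log_post.items()}
        z = sum(exp.values())
        return {t: v / z for t, v in exp.items()}
```

A model trained this way on (text, topic) pairs returns one probability per topic for any new title, which is the shape of output the rest of the pipeline relies on.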

Then, to predict what topics a user is interested in, we extract the article titles from the URLs in the user’s browsing history (which is collected by a browser extension) and pass these titles into the model. The model returns a vector containing the probability of each of the 23 topics. Using this, we can find the general topics that the user browses.
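The last two steps, pulling a title out of a page and combining per-title probabilities into a ranked list of interests, could look roughly like this. It is a sketch under assumptions: the post does not describe how titles are extracted or how the vectors are combined, so reading the HTML title tag and averaging the probabilities are illustrative choices, and extract_title and aggregate_interests are hypothetical names. Fetching the pages is left out.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class _TitleParser(HTMLParser):
    """Collects the text inside the first-level <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the page title from already-downloaded HTML, or ''."""
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip()


def aggregate_interests(per_title_probs: List[Dict[str, float]],
                        top_k: int = 5) -> List[Tuple[str, float]]:
    """Average the topic probabilities over all titles and
    return the top_k (topic, mean probability) pairs."""
    if not per_title_probs:
        return []
    totals: Dict[str, float] = {}
    for probs in per_title_probs:
        for topic, p in probs.items():
            totals[topic] = totals.get(topic, 0.0) + p
    n = len(per_title_probs)
    averaged = {topic: s / n for topic, s in totals.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

With the default top_k of 5, the result corresponds to a top 5 list of interests, and the full averaged mapping could feed the sizes of a bubble chart.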

For easier readability, we display a bubble chart of the user's interests and also mention the top 5 topics they're interested in.

Top 5 interests & bubble chart of one of our members

This shows that with just the search history, one can learn a great deal about a user. Social media companies collect far more data than just search histories.

4. Newsletter and Updates

To test our algorithm, we had to read the privacy policies quite extensively. In doing so, we found a disturbing clause: every company could update its policy at any time without informing the user. This has included severe changes in the past, most notably WhatsApp's last month.

In order to increase awareness and, as mentioned above, to make privacy an ongoing concern, we decided to publish newsletters about any changes made to privacy policies, using MailChimp.

Newsletter for the privacy policy update of WhatsApp

These are published any time the privacy policies are updated, or if a new site feature has an invasive effect on privacy.

The Website

The tech stack used in the website is the following:

Python
Flask
HTML/CSS
JavaScript

The website is hosted on Heroku on a free server. You can click on this link to be redirected to it. Since we have used a free server, it is a bit slow during heavy traffic. The source code is available on GitHub here.

User Evaluation

We conducted a user evaluation study to survey the effect of our website. Here are some of the charts, analysis, feedback, etc.

Video

You can also watch this video to learn more about the project.

Team

Mohnish Agrawal
Rhythm Patel
Abhinav Ennazhiyil
Shaunak Pal
Shashikant Kumar
Ankit Mishra
Rajat Prakash
Aatish Sahai
Sitaram
Prince Yadav

CSE648-PSOSM-2021