InfoExtends

May 03, 2021

We are sharing too much personal information online, even when we think that what we share can’t be used against us. A man in Japan was charged with stalking after he used a woman’s pictures, including the reflection in her eyes, to work out her location. Netflix’s series You depicts this type of Internet-age stalking and obsession. Any content that we post online can be misused against us, to the point of enabling serious crimes.

Proposed solution

A person can make an informed decision if we highlight the sensitive* data that can be inferred from their social media presence and communicate it to them effectively. We have created a browser extension that shows users a visualization of their sensitive information, highlighted based on the text we can extract from their screen.

*What counts as sensitive data differs from person to person, so we let users sign in and capture their sensitive data from the details they provide.

Dataset

We needed a dataset that contained sensitive information, annotated so that we could train a model to identify such information automatically. The dataset also needed to contain both information shared by users and user profile information, so that sensitivity levels could differ from user to user. We explored several datasets for this purpose:

Reddit Dataset
LinkedIn Profile Dataset
Twitter User Data
Gretel.AI

We used Gretel.AI to generate emails that contain personally identifiable information (PII). This is the dataset we used for our project.
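To make the idea of an annotated dataset concrete, here is a minimal sketch of one annotated sample in the character-offset style commonly used for NER training data. The sample text, the helper name annotate and the exact record layout are assumptions for illustration; the post does not state which annotation format the project used.

```python
# Hypothetical sample: the text below is made up, not taken from the project's dataset.
text = "Hi, I am Jane Doe and my email is jane.doe@example.com."


def annotate(text, spans):
    """Convert (substring, label) pairs into (start, end, label) character offsets.

    Only the first occurrence of each substring is annotated.
    """
    entities = []
    for substring, label in spans:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        entities.append((start, start + len(substring), label))
    # One training record: the raw text plus its labelled spans.
    return (text, {"entities": entities})


sample = annotate(text, [("Jane Doe", "Person"), ("jane.doe@example.com", "Email")])
```

Each record pairs the raw text with labelled spans, so slicing the text with a span's start and end offsets gives back the annotated entity.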

Methodology

We annotated the data with classes such as 'Person', 'Address', 'IDs', 'Email', 'Org' and 'Location'.
We used this annotated data to train a custom spaCy model for Named Entity Recognition (NER).
This model was pickled and loaded onto our server (created using Django and hosted using Ngrok).
The server lets users sign up, sign in, edit their details and analyse text through API calls.
Entities found in the text are compared with the user's data using string similarity measures and word embeddings in order to generate separate warnings. When the distance score is below a threshold of 16% (i.e. similarity above 84%), we display a warning to the user that their information is present in the text field.
When the Chrome extension loads, it checks local storage for login credentials stored as authorization tokens; if they are not present, the user is redirected to a login page.
Clicking the extension icon activates the extension, which can be verified by the icon that appears near the tweet buttons (as shown in the figure below).
Clicking that icon triggers the function that reads the tweet text and sends it to the server along with the user’s saved credentials. The extension then receives a JSON response from the server containing the identified entities and warnings, which it displays in the following manner.
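The comparison and response steps above can be sketched roughly as follows. This is a minimal illustration, not the project's server code: it uses Python's standard difflib.SequenceMatcher as a stand-in string similarity measure, leaves out the word-embedding comparison, and the function names, profile fields and JSON keys are all hypothetical. Only the 84% similarity threshold comes from the description above.

```python
import json
from difflib import SequenceMatcher

# From the post: warn when distance is below 16%, i.e. similarity above 84%.
SIMILARITY_THRESHOLD = 0.84


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for the project's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_warnings(entities, profile, threshold=SIMILARITY_THRESHOLD):
    """Compare each detected entity with each saved profile value."""
    warnings = []
    for entity in entities:
        for field, value in profile.items():
            score = similarity(entity["text"], value)
            if score > threshold:
                warnings.append({
                    "entity": entity["text"],
                    "label": entity["label"],
                    "matches": field,
                    "similarity": round(score, 2),
                })
    return warnings


def analyse(entities, profile):
    """Build the JSON string a server could return to the extension (assumed shape)."""
    return json.dumps({
        "entities": entities,
        "warnings": build_warnings(entities, profile),
    })


# Hypothetical NER output for a tweet and a hypothetical saved user profile.
detected = [
    {"text": "Jane Doe", "label": "Person"},
    {"text": "Delhi", "label": "Location"},
]
profile = {"name": "jane doe", "email": "jane.doe@example.com"}
response = analyse(detected, profile)
```

In this sketch only the name entity crosses the threshold, so the response carries both detected entities but a single warning. A lower threshold would catch more near-matches at the cost of more false alarms.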

User Feedback/Survey

We shared the Chrome extension with some of our peers and asked them to evaluate their experience with it. Some useful insights that we were able to gather:

Users gave a second thought to whether they wanted to share the information they were about to tweet.
Sometimes the extension highlighted text that wasn’t sensitive.

Video

Drive Link for the video of our Chrome Extension.

Group members

Deepali Chawla (2017043)
Kritika V (2017061)
Raunqvir Kour (2017181)
Eshita (2017149)
Shreya Prasad (2017191)
Shubhi Sharma (2017195)
Aanchal Bakshi (2018377)
