Post It Safe!
Privacy and Security in Online Social Media
Introduction
Platforms like Twitter and Instagram are used by millions of people around the world every day. For some, they are the main source of news and information on current events, but they can also be misused for all sorts of cybercrime. Here we tackle one such problem: the retrieval of sensitive information from images and videos.
Twitter is a globally known microblogging and social networking site that allows users to post and interact with messages known as tweets. Tweets now support images and videos, but when Twitter launched in 2006 it supported neither. Images could be attached to tweets from June 2011, and video support followed in 2016. Twitter has 330 million monthly active users, who post more than 200 thousand tweets every minute.[1]
Instagram is a photo- and video-sharing social network. The service allows users to upload media that can be edited with filters and organized by hashtags and geographical tagging. Instagram has more than 1 billion registered users, with more than 700 thousand images and videos being uploaded every minute.[2]
Millions of images and videos are uploaded to Twitter and Instagram every day. Beyond these social networking sites, users also upload images and videos to cloud storage and data backups. So, one must be cautious about what is uploaded to the internet. People knowingly or unknowingly upload images containing sensitive information. Usernames and passwords, credit card or payment card details, bank account credentials, protected health information, customer data, student data, and personally identifiable information (PII) such as email addresses, phone numbers, social security numbers, and Aadhar numbers can all be labelled as sensitive information.
People can scrape such information from these social networking sites and use it to commit crimes. Earlier, scraping had to be done manually and could target only a small number of people. With advances in technology and methods such as web scraping and data mining, extracting such information from the images and videos of large numbers of people is now possible. This sensitive information can be misused for anything from stalking to serious crimes such as financial fraud or identity theft.
Our project aims to alert users of any possible sensitive information in images. Through this, the users can make an informed decision whether to upload the picture or not. Also, we have curated datasets from both Twitter and Instagram. We have tried to classify the photos as containing sensitive information or not and present some analysis on the same. We aim to create awareness about the content (which may be sensitive) users upload knowingly or unknowingly on the internet through our study.
Sensitive Information Extraction
First, we explored various types of sensitive information that can be extracted from media files on social networking sites.
EXIF Data
Other than the information visible in the image itself, there is also crucial information stored in the image file that can reveal a lot. This is termed EXIF data: device-generated information about an image or a video, which includes the timestamp, the latitude and longitude of the place where the media was shot, and information about the device that shot it.
This information can be categorized as sensitive because it can be used to track a user’s activity and location. Furthermore, from a collection of a user’s images and videos, the user’s whole timeline can be traced.
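To make the timeline risk concrete, here is a minimal sketch of how capture times and GPS coordinates from several files can be combined. It assumes the EXIF tags have already been read into plain dictionaries keyed by the standard EXIF tag names (DateTimeOriginal, GPSLatitude, GPSLatitudeRef, GPSLongitude, GPSLongitudeRef); the helper names dms_to_decimal and build_timeline are hypothetical and are not part of our application.

```python
from datetime import datetime


def dms_to_decimal(dms, ref):
    """Convert EXIF (degrees, minutes, seconds) plus a hemisphere letter to signed decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    return -value if ref in ("S", "W") else value


def build_timeline(records):
    """Return (capture time, latitude, longitude) tuples sorted by capture time.

    Each record is a dict of already-extracted EXIF tags (an assumed input shape).
    Records that lack a timestamp or a GPS position are skipped.
    """
    required = (
        "DateTimeOriginal",
        "GPSLatitude", "GPSLatitudeRef",
        "GPSLongitude", "GPSLongitudeRef",
    )
    timeline = []
    for tags in records:
        if not all(key in tags for key in required):
            continue
        # EXIF stores timestamps as "YYYY:MM:DD HH:MM:SS".
        taken = datetime.strptime(tags["DateTimeOriginal"], "%Y:%m:%d %H:%M:%S")
        lat = dms_to_decimal(tags["GPSLatitude"], tags["GPSLatitudeRef"])
        lon = dms_to_decimal(tags["GPSLongitude"], tags["GPSLongitudeRef"])
        timeline.append((taken, lat, lon))
    return sorted(timeline)
```

Even this small amount of code turns a folder of photos into an ordered list of places and times, which is why EXIF data deserves the same care as the visible content of an image.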
Different social networking sites have different privacy policies when it comes to handling EXIF data. Sites such as Facebook, Instagram, and Twitter do not store EXIF data, but they use it to give location recommendations for the post. Other popular sites such as LinkedIn, Snapchat, Pinterest, and Tumblr store the EXIF data with every image and video uploaded.[3]
We sometimes upload and share pictures on social media without realizing that they might contain sensitive information, which can then be used to harm us financially and socially. People extract sensitive information from images in many ways, and high-end tools such as computer vision have made it easier to do so.
One example: when people go on trips, they often upload a photo of their boarding pass to social media without realising that it gives away a lot of personal information. First, it shows where you are going and which flight you are taking, which may be enough to reveal your whereabouts. In addition, the barcode on the boarding pass is usually visible in these photos and can be decoded with barcode-reading tools. Once decoded, the extracted information may even be used to access the passenger’s account on the airline’s website or to obtain personal information such as address and phone number from the airline’s records.[4]
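As an illustration of how little effort the decoding step takes, the sketch below splits an already-decoded barcode string into named fields. It assumes the fixed-width layout of the mandatory fields in the IATA Bar Coded Boarding Pass (BCBP) format as we understand it; the field names, the function parse_boarding_pass, and the sample passenger are all hypothetical, and reading the barcode out of a photo is not shown.

```python
# Assumed fixed-width layout of the mandatory BCBP fields: (name, width in characters).
BCBP_FIELDS = [
    ("format_code", 1),
    ("number_of_legs", 1),
    ("passenger_name", 20),
    ("e_ticket_indicator", 1),
    ("booking_reference", 7),
    ("from_airport", 3),
    ("to_airport", 3),
    ("carrier", 3),
    ("flight_number", 5),
    ("day_of_year", 3),
    ("compartment", 1),
    ("seat", 4),
    ("checkin_sequence", 5),
    ("passenger_status", 1),
]


def parse_boarding_pass(payload):
    """Split a decoded boarding pass string into a dict of trimmed field values."""
    needed = sum(width for _, width in BCBP_FIELDS)
    if len(payload) < needed:
        raise ValueError(f"expected at least {needed} characters, got {len(payload)}")
    fields = {}
    position = 0
    for name, width in BCBP_FIELDS:
        fields[name] = payload[position:position + width].strip()
        position += width
    return fields


# Made-up sample, padded field by field so the widths are easy to check.
SAMPLE = (
    "M1" + "DOE/JANE".ljust(20) + "E" + "ABC123".ljust(7)
    + "DEL" + "BOM" + "AI".ljust(3) + "0865".ljust(5)
    + "123" + "Y" + "012A" + "0042".ljust(5) + "1" + "00"
)

if __name__ == "__main__":
    for name, value in parse_boarding_pass(SAMPLE).items():
        print(f"{name}: {value}")
```

The name and booking reference recovered this way are typically the same two values an airline asks for when a passenger manages a booking online, which is what makes a casually shared boarding pass photo risky.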
Car number plates reveal a lot about the owner of the car. Given the registration number, some publicly available websites search available databases for the owner’s name, vehicle registration date, RTO, vehicle chassis/engine number, and even details of the financer. Together, this data gives the owner’s name, approximate address, and a lot of information about the vehicle. This is private information that an individual would not want to disclose publicly, so posting such images threatens their privacy and creates an opportunity for crime.
People share their phone numbers in images very often, and this can lead to many problems. It can severely compromise an individual’s privacy and security, and spam and fraudulent calls can become a problem. Using modern-day tools, mobile phones can be tracked and cloned. All of this can start from a single image containing a phone number.
Other documents with sensitive information, such as Aadhar numbers, PAN card numbers, social security numbers, passport details, banking credentials, and payment card details, can be used directly to commit financial fraud, identity theft, and document forgery.
As things become more digital, different documents are being linked to one another in central databases. If someone obtains one piece of your PII, they may gain access to others as well, which makes you more vulnerable to identity theft, financial fraud, and similar crimes.
Data Collection
Seeing how many publicly available images and videos on sites like Instagram and Twitter could contain users’ personal information, we decided to collect as many of them as possible to find out how much information can be dug out of social media.
We used hashtags that could potentially be associated with various sorts of sensitive information or PII. For instance, we used #Aadhar to collect pictures that might show an Aadhar card, #SSN for social security numbers, and #PANcard for images that may contain PAN cards. We collected matching images from both Instagram and Twitter, built a dataset for each platform, and analysed what percentage of images contained sensitive information and of what type.
We used the Twitter API function api.search to find tweets with these hashtags and downloaded the images attached to those tweets. For Instagram, we used the Python package insta-scrape and scraped all images associated with a hashtag.
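The step from search results to downloadable images can be sketched as follows. This is a minimal illustration rather than our collection script: it assumes each tweet is available as a dictionary shaped like a Twitter API v1.1 tweet object, with attached media listed under entities or extended_entities, and the helper names photo_urls and collect_photo_urls are hypothetical. The search call and the download itself are not shown.

```python
def photo_urls(tweet):
    """Return the HTTPS URLs of the photos attached to one tweet.

    `tweet` is assumed to be a dict shaped like a Twitter API v1.1 tweet object.
    """
    # Prefer extended_entities, which lists every attached item when present.
    entities = tweet.get("extended_entities") or tweet.get("entities") or {}
    return [
        media["media_url_https"]
        for media in entities.get("media", [])
        if media.get("type") == "photo" and "media_url_https" in media
    ]


def collect_photo_urls(tweets):
    """Collect photo URLs across tweets, dropping duplicates but keeping order."""
    seen = set()
    urls = []
    for tweet in tweets:
        for url in photo_urls(tweet):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

De-duplicating matters in practice because retweets and reposts of the same picture would otherwise inflate the dataset.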
Findings from the Dataset
From all the data collected, a curated batch of 8271 images, mixed from both Instagram and Twitter, was selected for statistical analysis. We ran our application script on this batch, and it detected 8653 PIIs in around 3300 of the 8271 images; that is, about 40% of the images in the batch contained PII in one form or another. This was one indication of the need to alert users to check their images for sensitive information. The distribution of the 8653 detected PIIs was as follows:
Relative percentages for the PIIs obtained looked as follows:
Almost 50% of the PII was location-related, such as a home address or the city the person lives in, and 90% of the PII belonged to only 4 of the 14 identified categories. Although the relative percentages of crucial sensitive content such as voter ID, licence, and Aadhar information were low, there is still a clear need to alert users about sensitive content, because this batch is a tiny sample compared with the images uploaded to these social networks daily.
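The headline figures for the batch can be re-derived from the three counts reported above. Because the number of flagged images is given only as "around 3300", the results are approximate.

```python
total_images = 8271     # images in the curated batch
flagged_images = 3300   # "around 3300" images with at least one PII (approximate)
detected_pii = 8653     # PIIs detected across the batch

share_flagged = flagged_images / total_images
pii_per_flagged_image = detected_pii / flagged_images

print(f"{share_flagged:.1%} of images contained PII")            # 39.9%
print(f"{pii_per_flagged_image:.2f} PIIs per flagged image")      # 2.62
```

So a flagged image carried between two and three pieces of PII on average, which suggests that images exposing one kind of sensitive detail tend to expose others too.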
The percentage of different PII in the considered batch of images was as follows:
Web Application
As seen above, a significant proportion of users post potentially sensitive information on social media, knowingly or unknowingly. To tackle this behaviour, we designed a web application that identifies sensitive information in an image and alerts the user to it. After the alert, the user can make an informed decision about whether to post the image. The application can identify several kinds of sensitive information, such as locations, mobile numbers, vehicle numbers, and addresses, and PII such as Aadhar numbers, PAN card numbers, and social security numbers.
After the image is uploaded locally, the application classifies the data in the image as sensitive or non-sensitive and displays the type of personally identifiable information it contains. If the image is classified as sensitive, the application warns the user that it contains sensitive information before it is uploaded to a social networking site or the cloud.
The classifier that detects and outlines the PII in the picture was developed using the Google Cloud Vision APIs and customized regular expressions for the various information types. Other technologies used in the application were Amazon Web Services, Flutter, Firebase, and Python.
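The regular-expression stage can be sketched as follows. This is a minimal illustration, not the expressions used in our application: the patterns are simplified assumptions about common formats (12-digit Aadhar numbers, 10-character PAN numbers, hyphenated US social security numbers, Indian mobile numbers, email addresses, and Indian vehicle registration numbers), the names PII_PATTERNS and detect_pii are hypothetical, and the text is assumed to have already been extracted from the image by an OCR step that is not shown.

```python
import re

# Illustrative, simplified patterns (assumptions, not the application's own expressions).
PII_PATTERNS = {
    "aadhar": re.compile(r"(?<!\d)[2-9]\d{3}[\s-]?\d{4}[\s-]?\d{4}(?!\d)"),
    "pan": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    "phone": re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "vehicle": re.compile(r"\b[A-Z]{2}[\s-]?\d{1,2}[\s-]?[A-Z]{1,3}[\s-]?\d{4}\b"),
}


def detect_pii(text):
    """Return a dict mapping each PII type found in `text` to its list of matches."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        matches = [match.group(0) for match in pattern.finditer(text)]
        if matches:
            found[label] = matches
    return found


def is_sensitive(text):
    """An image's text is treated as sensitive if any pattern matches."""
    return bool(detect_pii(text))
```

Returning the matches grouped by type, rather than a single yes/no flag, is what lets an application tell the user which kind of information was found. Patterns this simple will produce false positives on arbitrary digit strings, so a real deployment would need stricter validation.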
Video
The Team
- Ritik Khanna
- Naman Jain
- Aditya Singh
- Navya Aggarwal
- Manish Kumar
- Shubham Sonthalia
- Bhavam Hans
- Nishant Chaubey
References
- Twitter. (2021). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Twitter&oldid=1020355289
- Instagram. (2021). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Instagram&oldid=1021620659
- Johani, M. A. (n.d.). Personal Information Disclosure and Privacy in Social Networking Sites. 152.
- What’s in a Boarding Pass Barcode? A Lot. (n.d.). Krebs on Security. Retrieved 2021, from https://krebsonsecurity.com/2015/10/whats-in-a-boarding-pass-barcode-a-lot/