#स्वदेशीSocialMedia: Analyzing Privacy on Indian Social Media Platforms







Abstract

Our project, #स्वदेशीSocialMedia, studies three major Indian platforms: Koo, Tooter and Kutumb. First, we wrote scripts to scrape data using Selenium and Beautiful Soup, and used publicly available APIs for Koo and Tooter. For Kutumb, we took screenshots manually and used Optical Character Recognition (OCR) to extract the data. After collecting the content, we extracted personal information from it using regex patterns and manual extraction. We then analysed the extracted data along several parameters to identify how these platforms differ in the Personally Identifiable Information (PII) shared on them. These parameters included the kinds of PII available, the scale of PII sharing, and variation by region, language and gender. Finally, we draw some inferences from our analysis and suggest future extensions to our work.

Introduction

Indian social media platforms are expanding rapidly. With the Indian government promoting the ‘Made In India’ and ‘Vocal for Local’ campaigns, many people are shifting from conventional social media platforms to Indian ones. As the number of users and the volume of content increase, so do concerns about the privacy and security of user data. We aim to extract publicly available data from these platforms and analyse it for the different kinds of personal data it contains.


Key Outcomes

  • Data collected across three Indian social media platforms: Koo, Kutumb and Tooter. The scripts can also be used to collect more data from these platforms.
  • PII analysis of the Indian social media platforms.
  • PII analysis of conventional platforms such as Twitter, Reddit, Facebook and Instagram.
  • Interaction with users to understand their perspectives.
  • Ground-truth data for 214 people for profile-linking tasks.

Methodology

Data Collection

    • Koo: We extracted user profiles and their posts. To collect user profiles, we recursively crawled each user's 'followers page', which can be reached on Koo by appending '/followers' to the user id. We used API requests and Beautiful Soup to extract the followers. To collect a user's posts, we used the user's profile page itself. For each user, we were able to extract around 5-6 posts and 15-20 followers on average. By starting from different initial user ids, we extracted around 87k posts and around 17k user profiles. Koo's HTML has well-defined divs and is statically rendered, so Beautiful Soup and requests were sufficient to extract the data.

    • Tooter: The main problem was that the classes in Tooter's HTML are not well defined. The second challenge was that Tooter renders its pages dynamically, so traditional scraping methods were ineffective. We therefore used the Chrome web driver and added a sleep time so that each page could load before we extracted data. We followed the same methodology as for Koo: we collected user profiles by recursing over the 'followers list' and then extracted user data from the profiles. Within the user profiles, we focused on the bio and the identity resolution of the user.

    • Kutumb: Kutumb is a mobile app, so we could not use web scraping. Instead, we collected data manually by taking screenshots of user profiles and then running OCR to extract the text. We tried several OCR tools; the challenge was that most people on Kutumb write their bios in Hindi, so we needed an OCR tool that supports Hindi. We settled on EasyOCR, specifying all the languages we wanted to detect.
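The follower-based crawl used for Koo and Tooter can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the function name `crawl_followers`, the `max_profiles` cap and the toy follower graph are all assumptions made for the example. The crawl is written iteratively with a queue rather than with literal recursion, and the page fetching is passed in as a function, so the same loop could sit on top of requests and Beautiful Soup (Koo) or a Selenium driver with a sleep (Tooter).

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable


def crawl_followers(
    seed_ids: Iterable[str],
    fetch_followers: Callable[[str], list[str]],
    max_profiles: int = 1000,
) -> list[str]:
    """Collect user ids by repeatedly expanding follower lists from the seeds."""
    seen: set[str] = set()
    queue: deque[str] = deque()
    for seed in seed_ids:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    collected: list[str] = []
    while queue and len(collected) < max_profiles:
        user_id = queue.popleft()
        collected.append(user_id)
        # fetch_followers stands in for "load <user id>/followers and parse it".
        for follower in fetch_followers(user_id):
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return collected


# Toy follower graph standing in for real follower pages (hypothetical data).
toy_graph = {
    "seed_user": ["user_a", "user_b"],
    "user_a": ["user_b", "user_c"],
    "user_b": ["seed_user"],
    "user_c": [],
}

profiles = crawl_followers(["seed_user"], lambda uid: toy_graph.get(uid, []))
print(profiles)
```

The `seen` set keeps the crawl from revisiting users who follow each other, and starting from several different seed ids widens the coverage in the same way as described for Koo.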

Extraction of PII

PII was extracted using regex patterns, or manually in the case of Hindi and regional languages. We attempted to extract the following kinds of PII:

    • Linked PII

      • Full name
      • Home address
      • Email address
      • Aadhaar number
      • Passport number
      • Driver’s license number
      • Credit card numbers
      • Date of birth
      • Telephone number

    • Linkable PII

      • First or last name (if common)
      • Country, state, city, zip code
      • Gender
      • Race/Religion
      • Non-specific age (e.g. 30-40 instead of 30)
      • Job position and workplace
      • Profiles across different social media platforms
      • Political Affiliation
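As a rough illustration of regex-based extraction for a few of the linked PII types above, consider the sketch below. The patterns are assumptions written for this example (the project's actual patterns are not listed here), and the sample text is made up. They only cover Latin-script text, which matches the note that Hindi and regional-language content was handled manually.

```python
from __future__ import annotations

import re

# Illustrative patterns only; real-world formats vary more than this.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Indian mobile number: optional +91 or 0 prefix, then 10 digits starting with 6-9.
    "phone": re.compile(r"(?<!\d)(?:\+91[\s-]?|0)?[6-9]\d{9}(?!\d)"),
    # Aadhaar-style number: 12 digits, optionally grouped as 4-4-4.
    "aadhaar": re.compile(r"(?<!\d)\d{4}[\s-]?\d{4}[\s-]?\d{4}(?!\d)"),
    # Numeric date such as 05/08/1996 or 5-8-96.
    "date_of_birth": re.compile(r"(?<!\d)\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})(?!\d)"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return all matches per PII type; types with no match are omitted."""
    found: dict[str, list[str]] = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


# Made-up bio text for demonstration.
sample_bio = "Contact: ravi.kumar@example.com, +91 9876543210, DOB 05/08/1996"
print(find_pii(sample_bio))
```

Patterns like these over-match (any numeric date is treated as a possible date of birth, any 12-digit run as a possible Aadhaar number), so the matches would still need manual review.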


Data Extracted

  

The number of posts and profiles collected across the three platforms.

Analysis

PII on Kutumb

                                                               
Users sharing personal information on the platform

Kutumb is a community-based social networking platform. We analysed 250 profiles across 7 communities on the Kutumb app: Shree-Vishwakarma-samaj (83), Vaishya-Samaj-UP (37), Khatu Shyam (35), Muslim-Mahasabha (32), Hindu_samaj_Party (25), Brahmin Biradri Ekta Manch (22) and Railway Karamchari Parivar (15).
The extracted profiles were analysed for PII present in the users' bios.

  • Demographics of users analysed: most of the users analysed were from Rajasthan, followed by Uttar Pradesh.

  • A total of 80 phone numbers were retrieved from the 250 profiles. Interestingly, in one community, Khatu Shyam, more than 90% of users shared their numbers.
  • A few dates of birth (7) and email addresses (6) were also extracted.
  • 88 addresses were found in the 250 profiles, meaning that around 35% of users shared their address. The fraction of users sharing an address was between 0.3 and 0.5 in every community.

  • Around 17% of people expressed religious opinions and 8% expressed political opinions.

  • Around 30% of people shared their profession in the bio.
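The percentages above follow directly from the reported counts. The short calculation below reproduces them, assuming (as the figures above implicitly do) that each retrieved item comes from a different profile; the variable names are only illustrative.

```python
TOTAL_PROFILES = 250

# Counts reported above for the Kutumb sample.
pii_counts = {"phone number": 80, "address": 88, "date of birth": 7, "email": 6}

# Share of profiles containing each kind of PII.
shares = {kind: count / TOTAL_PROFILES for kind, count in pii_counts.items()}

for kind, share in shares.items():
    print(f"{kind}: {share:.1%}")
```

For addresses this gives 88 / 250 = 35.2%, which is the "around 35%" quoted above; phone numbers come out at 32%.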


PII on Koo

Koo is a micro-blogging platform for sharing personal updates and opinions. We analysed 80k+ posts and 17k+ user profiles. The platform also supports Hindi and other regional languages.

Language distribution of posts collected. 

  • Most of the Koos collected were in Hindi, followed by English, Marathi, Kannada, Telugu and others.
  • We found 901 phone numbers in 744 posts, of which only 393 were unique. Phone numbers were most common in Tamil Koos, where more than 2% of posts contained one.

  • A few people also shared credit card or LPG ID details, but the percentage of people sharing this kind of PII was very low.

  • Similarly, a few email addresses (52) were retrieved, most of them from English Koos.
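The three phone-number figures above (numbers found, posts containing them, unique numbers) can be computed in one pass over the posts. The sketch below shows one possible way to do that; the pattern, the normalisation rule (keep the last 10 digits) and the sample posts are assumptions for illustration, not the project's actual code or data.

```python
from __future__ import annotations

import re

# Illustrative Indian mobile pattern: optional +91 or 0 prefix, then 10 digits.
PHONE = re.compile(r"(?<!\d)(?:\+91[\s-]?|0)?[6-9]\d{9}(?!\d)")


def normalise(number: str) -> str:
    """Keep only the last 10 digits so prefixed and unprefixed forms compare equal."""
    digits = re.sub(r"\D", "", number)
    return digits[-10:]


def phone_stats(posts: list[str]) -> tuple[int, int, int]:
    """Return (numbers found, posts containing a number, unique numbers)."""
    total = 0
    posts_with_number = 0
    unique: set[str] = set()
    for post in posts:
        matches = PHONE.findall(post)
        if matches:
            posts_with_number += 1
        total += len(matches)
        unique.update(normalise(m) for m in matches)
    return total, posts_with_number, len(unique)


# Made-up posts for demonstration.
sample_posts = [
    "Call 9876543210 or 9123456780 for help",
    "Helpline: +91 9876543210",
    "Good morning everyone",
    "Contact 09123456780",
]

print(phone_stats(sample_posts))
```

Without normalisation, the same number written with and without a prefix would be counted as two different numbers, which would inflate the unique count.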


PII on Tooter

Tooter is a microblogging social network offering multiple regional languages along with Hindi and English. 



Personal Information getting shared over Tooter

  • 214 accounts contained profile linking information.  


    Usernames for different platforms extracted from Tooter About section.

  • Around 5% of users shared their phone numbers in their profile.

  • About 44% of users shared their political affiliations on the platform. 
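Profile-linking information of the kind mentioned above can be pulled out of an About section by looking for profile URLs of other platforms. The following is a minimal sketch under that assumption; the URL pattern, the list of platforms and the sample text are hypothetical, and handles written without a URL would not be caught.

```python
from __future__ import annotations

import re

# Hypothetical pattern: profile URLs of the form <platform>.com/<handle>.
PROFILE_URL = re.compile(
    r"(?:https?://)?(?:www\.)?(twitter|instagram|facebook)\.com/@?([A-Za-z0-9_.]+)",
    re.IGNORECASE,
)


def linked_profiles(about: str) -> dict[str, str]:
    """Map platform name to handle for every profile URL found in an About text."""
    return {
        platform.lower(): handle.rstrip(".")  # drop sentence-ending full stops
        for platform, handle in PROFILE_URL.findall(about)
    }


# Made-up About text for demonstration.
sample_about = (
    "Journalist. Follow me: https://twitter.com/example_user "
    "and instagram.com/example.user."
)

print(linked_profiles(sample_about))
```

Each account for which this returns a non-empty mapping yields one candidate ground-truth pair for a profile-linking task.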



Inferences

  • People on community-based platforms like Kutumb readily share personal information such as full name, profession and phone number in their bio itself.
  • Tooter promotes the linking of profiles across different platforms and makes this data publicly available. Most people tend to reveal their profession on Tooter as well.
  • Koo is a counterpart of Twitter and thus a politically oriented platform, so the direct PII shared on it is limited.
  • In a random sample, the PII found on Indian social media platforms like Tooter and Kutumb is greater than that found on Twitter or Facebook.

Future extensions
  • Study the platforms through the lens of opinion mining to see what prompts users to reveal their personal information other than when seeking help.

  • Infer PII from images and videos posted.

  • Develop interventions to prevent leakage of PII on different platforms.

  • Educate users about the potential risks and harms associated with the easy availability of PII.

Slide Link
Video Link

Team Members

  • Anshul Raj 2018020
  • Anuj 2018220
  • Bhavey Wadhwa 2018135
  • Dibya Gautam 2018282
  • Himanshi Mathur 2018037
  • Khushali Verma 2018290
  • Nandini Bansal 2018056
  • Pruthwiraj Nanda 2018075
  • Ritik Garg 2018305
  • Snigdha Gupta 2018316
  • Utsav Singla 2018321
  • Vibhu Agrawal 2018116
