Hashtag Hijacker


Hashtag Hijacking Analysis

Since the advent of hashtags, online communication has become trend-driven. Hashtags direct people's attention, shape opinions, help users find others with similar interests, and even let brands build an audience for marketing.

At the same time, hashtags suffer from spam and malicious content being shared under their name, which is counterproductive for the brands and organizations that use them.

This tool analyzes the characteristics of tweets from different topics (entertainment, politics, sports, etc.) through exploratory data analysis and then develops a model that can efficiently flag whether a tweet is hijacked.









Introduction and Problem Statement

  • Twitter is one of the most active and open platforms for sharing information and opinions on any subject matter.
  • Since the advent of hashtags, online communication has become trend-driven.
  • Hashtags have also made the content on the platform somewhat structured: people can choose to follow any trend popular at any location in the world.
  • Hashtags, in a way, direct people's attention and shape their opinions, help them find people with similar interests, and even build an audience for marketing.
  • Brands, political parties, and other organizations try to leverage hashtag promotions for campaigns and endorsements.
  • With enough people talking about a particular topic, the topic appears in the "trending" section and becomes more likely to get people's attention.
  • At the same time, hashtags suffer from spam and malicious content being shared under their name, which is counterproductive for the brands and organizations that use them.


Motivation

  • A difficult problem for Twitter is flagging hijacked tweets from accounts with little interaction.

  • Spreading malicious and spam content through links is easy because external web documents are outside the scope of what the Twitter ecosystem can process.

  • Many sensitive and important topics, such as mental health, get diverted. #JusticeforSSR and #FarmersProtest were hijacked for personal agendas by certain people and parties. This discourse has promoted the spread of fake news and manipulated media, bullying, trolling, and slut-shaming, among many other problems.

  • It is necessary to create a healthy ecosystem to exchange information.


Solution/Implementation


Theme: Fraud and Awareness of Manipulated Media


We will use the Twitter API to collect tweets corresponding to trending hashtags. These tweets will form the dataset, which will be manually annotated with whether each tweet is hijacked or not. Once the dataset reaches a sufficient size, we will clean it and apply NLP techniques such as tokenization, stop-word removal, and stemming, with TF-IDF used to score words. In addition, we will analyze whether the sentiment of a given tweet matches the sentiment of the majority of tweets under that hashtag. The cumulative score from both of these signals ultimately decides whether the tweet is hijacked.
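The cleaning and scoring steps described above can be sketched roughly as follows. This is a minimal illustration, assuming NLTK and scikit-learn; the function and variable names are ours, and the sentiment-matching signal is omitted here:

```python
import re
from nltk.stem import PorterStemmer            # pure-Python stemmer, no corpus download
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def clean_tweet(text):
    """Lowercase, strip URLs/mentions/hashtag marks, and stem the remaining tokens."""
    text = re.sub(r"https?://\S+|@\w+|#", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(stemmer.stem(t) for t in tokens)

# Two toy tweets standing in for the collected data.
tweets = [
    "Check out the #AllStarGame highlights tonight!",
    "Buy cheap followers now http://spam.example #AllStarGame",
]
cleaned = [clean_tweet(t) for t in tweets]

# TF-IDF scores each remaining word; the vectorizer drops English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)          # shape: (n_tweets, vocabulary_size)
```

The resulting TF-IDF matrix is the word-scoring input that, combined with the sentiment comparison, feeds the hijacked/not-hijacked decision.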



Dataset:

We built our own dataset by collecting tweets with trending hashtags and manually annotated whether the trending hashtag in each tweet was hijacked. We collected about 20,000 tweets and manually annotated around 13,894 of them. The final dataset includes hashtags such as #snyderscut, #modirozgardo, #allstargame, and #dragrace.
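For illustration, the annotated data can be kept in a simple tweet/hashtag/label layout and its class balance inspected before modeling. This is a toy stand-in assuming pandas; the real file has roughly 13,894 labeled rows:

```python
import pandas as pd

# Toy stand-in for the manually annotated dataset (texts elided).
df = pd.DataFrame({
    "tweet":    ["...", "...", "...", "..."],
    "hashtag":  ["#snyderscut", "#allstargame", "#dragrace", "#snyderscut"],
    "hijacked": [0, 1, 0, 1],   # manual annotation: 1 = hijacked, 0 = genuine
})

# Class balance matters: a skewed split is why the training stage upsamples.
counts = df["hijacked"].value_counts()
```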










Implementation Details / Methodology:

We applied various techniques to the collected posts/tweets to obtain the sentiment, retweet/repost count, likes and dislikes, the uploader, the uploader's metadata (followers, following, verified status), and related hashtags. We then trained a neural network along with several other models: random forest, logistic regression, SVM, KNN, multinomial Naive Bayes, and three boosting classifiers (XGBoost, AdaBoost, and gradient boosting). These were run on the manually prepared dataset, which contains a detailed examination of the posts and the annotated classes (hijacked or not) and which we upsampled to remove class imbalance. The learned weights were then used to validate the rest of the dataset, which ultimately gave us accuracy and precision measures. Finally, we applied a voting classifier over these models to obtain the best-generalized model and avoid overfitting, and saved this model and the other individual best models with pickle for future use.
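A compressed sketch of that training stage with scikit-learn is shown below. The features here are synthetic stand-ins (the real pipeline uses the TF-IDF and metadata features described above), and only a subset of the listed models appears in the ensemble:

```python
import pickle
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 10))            # stand-in for TF-IDF + metadata features
y = np.array([0] * 250 + [1] * 50)   # imbalanced: hijacked tweets are the minority

# Upsample the minority (hijacked) class so both classes are equally represented.
X_min = resample(X[y == 1], replace=True, n_samples=(y == 0).sum(), random_state=0)
X_bal = np.vstack([X[y == 0], X_min])
y_bal = np.array([0] * (y == 0).sum() + [1] * len(X_min))

# Soft-voting ensemble over a few of the individual models from the write-up.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", MultinomialNB()),     # fine here: the stand-in features are non-negative
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_bal, y_bal)

# Persist the fitted ensemble with pickle for future use, as described above.
restored = pickle.loads(pickle.dumps(ensemble))
```

Soft voting averages the predicted class probabilities across the fitted models, which is one common way to get a better-generalized decision than any single classifier.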

The best model is the MLP, which gave us an accuracy of 97.1%.


Team Members

  1. Mukul Kumar: 2018054 (CSE)

  2. Tejas Dubhir: 2018110 (CSE)

  3. Abhinandan Kainth: 2018001 (CSE)

  4. Anmol Kumar: 2018382 (CSB)

  5. Rakshit Singh: 2018079 (CSE)

  6. Pratham Gupta: 2018072 (CSE)

  7. Tanmaya Gupta: 2018200 (ECE)












