BotShot: A deep dive into Twitter Bots

Introduction

Twitter is one of the most popular social networking sites, with over 100 million daily active users and 500 million tweets sent daily. Because of this popularity, it is used not only to connect with other people but also as a platform for following news and the latest events across the globe. This presents a major problem: information on Twitter is not verified, so people with malicious intent can post misinformation and use bot accounts that, on command, spread it to a large audience.

But what exactly is a Twitter bot?

Twitter bots are automated user accounts that interact with Twitter through its application programming interface (API). These bots can be programmed to perform tasks normally associated with human interaction, including following users, favoriting tweets, and sending direct messages (DMs) to other users. Most importantly, they can tweet content and retweet anything posted by a specific set of users or featuring a specific hashtag.

Many bots perform important functions, such as tweeting about earthquakes in real time and serving as part of a warning system. However, other bots are programmed with malicious intent, such as spreading misinformation (fake news) or creating confusion. With enough resources, it is possible to build or even buy an army of bots to flood Twitter with any information, news, or #hashtag until people start to believe it and it takes off on its own. Such constant spam posting can also manipulate the opinions of a large number of people. For these reasons, it is important to be able to distinguish between a human account and a bot account pretending to be human.

Features to identify bots

A few features or behaviors consistent with bot accounts are discussed below.

Tweets per day
Bots usually post far more tweets per day than humans do.
Retweet ratio
On most occasions, bots just retweet tweets by other users or other bots rather than tweeting original content, so they have a high retweet ratio.
Follower and following count
Bots often have a high number of followers and may also follow a lot of accounts. In other cases, bot accounts are identifiable because they send a lot of tweets but have only a few followers.
Repetitive content
Most bots tweet the same content as other users at roughly the same time. Most of their tweets also concern the same or a similar issue, such as political propaganda.
Recent creation date
Many Twitter bots have a relatively recent creation date. This is because they are created to cater to a specific event or controversy, so they tend to be created just after such an event begins.
Username
Since bot usernames are often autogenerated, they frequently contain a lot of numbers.
DP / Bio
Bot accounts often have no biography or profile photo. If they do have a DP, it is most likely a generic photo rather than a picture of a specific person.
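
As a rough illustration, the signals above can be computed from basic account data. The following is a minimal Python sketch, not the procedure used in our analysis: the Account fields, the bot_signals helper, and the sample values are all assumptions made up for illustration, and no thresholds are implied.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Account:
    # Hypothetical record; field names are illustrative, not a Twitter API schema.
    username: str
    created: date
    total_tweets: int
    retweets: int
    followers: int
    following: int
    has_bio: bool
    has_photo: bool


def bot_signals(acc: Account, today: date) -> dict:
    """Compute the simple signals discussed above for one account."""
    age_days = max((today - acc.created).days, 1)  # avoid division by zero
    digits = sum(ch.isdigit() for ch in acc.username)
    return {
        "account_age_days": age_days,
        "tweets_per_day": acc.total_tweets / age_days,
        "retweet_ratio": acc.retweets / acc.total_tweets if acc.total_tweets else 0.0,
        "followers_per_following": acc.followers / max(acc.following, 1),
        "username_digit_share": digits / len(acc.username) if acc.username else 0.0,
        "missing_profile": not (acc.has_bio and acc.has_photo),
    }


# Made-up account: recent, very active, mostly retweets, numeric username.
example = Account(
    username="user84713902",
    created=date(2021, 1, 20),
    total_tweets=3000,
    retweets=2850,
    followers=12,
    following=900,
    has_bio=False,
    has_photo=False,
)
signals = bot_signals(example, today=date(2021, 2, 19))
for name, value in signals.items():
    print(f"{name}: {value}")
```

None of these signals is decisive on its own; they are only useful when read together, which is why we combined them with a manual look at each profile.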

Botometer

Botometer is a machine learning algorithm trained to calculate a score for a Twitter account to distinguish between humans and bots. A low score indicates likely human accounts and a high score indicates likely bot accounts.

How is it done?

To calculate the score, Botometer compares an account to tens of thousands of labeled examples. When an account is checked, the browser fetches its public profile and hundreds of its public tweets and mentions using the Twitter API. This data is then passed to the Botometer API, which extracts over a thousand features to characterize the account's profile, friends, social network structure, temporal activity patterns, language, and sentiment. Finally, the features are used by various machine learning models to compute the bot scores.

Data Collected

Using the Tweepy library, we retrieved accounts that posted tweets with specific hashtags. Since bots are mostly used to spread propaganda and false news or to push a hashtag into the trends, we collected tweets about the farmers' protest, which became infamous after the Lal Quila (Red Fort) incident. We collected tweets that contain the following keywords:

Farmer's Protest, #IndiaTogether, #IndiaAgainstPropaganda, #IStandwithFarmers, #StandWithFarmers, #FarmerProtestHijacked, #farmerslivesmatter, #KisanAndolan, #KisanMajdoorEktaZindabaad, #Tractor2twitter, #RedFortAttack, #Modi_Hates_Farmer, #KisanoKiGundagardi, #RedFortCaptured.
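
A keyword list like this one can be turned into a single search query and the matching tweets reduced to a set of distinct accounts. The sketch below shows one possible way to do that in Python; it is not our exact collection script. The build_query and unique_authors helpers and the "author" key are hypothetical names, and the actual Tweepy call is left out because it needs credentials.

```python
KEYWORDS = [
    "Farmer's Protest", "#IndiaTogether", "#IndiaAgainstPropaganda",
    "#IStandwithFarmers", "#StandWithFarmers", "#FarmerProtestHijacked",
    "#farmerslivesmatter", "#KisanAndolan", "#KisanMajdoorEktaZindabaad",
    "#Tractor2twitter", "#RedFortAttack", "#Modi_Hates_Farmer",
    "#KisanoKiGundagardi", "#RedFortCaptured",
]


def build_query(keywords):
    """Join keywords with OR; quote multi-word phrases so they match as a phrase."""
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return " OR ".join(terms)


def unique_authors(tweets, limit):
    """Return up to `limit` distinct author handles, in first-seen order.

    `tweets` is assumed to be an iterable of dicts with an "author" key;
    that shape is an assumption for this sketch, not a Twitter API format.
    """
    ordered, seen = [], set()
    for tweet in tweets:
        handle = tweet["author"].lower()
        if handle not in seen:
            seen.add(handle)
            ordered.append(handle)
            if len(ordered) == limit:
                break
    return ordered


query = build_query(KEYWORDS)
print(query)
# With Tweepy, a string like this would typically be passed as the `q`
# argument of the search call (API.search in Tweepy 3.x, API.search_tweets
# in 4.x).
```

Deduplicating by handle matters because a single very active account can otherwise take up many of the 2000 slots.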

For the analysis, we took 2000 accounts that tweeted with the above hashtags and created an Excel sheet containing all the results and comparisons, as shown below.

Manual Detection of Bots

Beyond the factors listed in the Features to identify bots section, our analysis included further manual checks, such as comparing tweets and examining how tweets posted at specific times or on specific days relate to one another.

In addition to the data in the Excel sheet, we visited each account on Twitter and examined factors such as the profile and cover pictures, the profile bio, and the relationship between the profile and the username. Most people pick either something flashy, like programmednoob, or something based on their name, like harryk9 or namannerd.

After analyzing every profile on Twitter, we recorded the results in the Excel sheet, marking whether each account was a bot and highlighting in a different color the attribute that led to that conclusion. If the deciding factor was not among the sheet's columns, we noted it in an additional remarks column.

Result

Activity Comparison

The above plots show the difference between the activity of a bot account and that of a human account. Bot activity is limited: these accounts become active once or twice a week, at a particular hour of the day. Human accounts, on the other hand, show more uniform activity with no clear pattern.
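
One way to put a number on this kind of difference is to histogram an account's tweet timestamps by hour and by weekday and measure how concentrated the histogram is. The sketch below uses normalized Shannon entropy for that. The choice of entropy is ours for illustration and is not how the plots were produced, and the timestamps are made up.

```python
import math
from collections import Counter
from datetime import datetime


def hourly_histogram(timestamps):
    """Number of tweets in each hour of the day (index 0-23)."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts[h] for h in range(24)]


def weekday_histogram(timestamps):
    """Number of tweets on each weekday (index 0 = Monday)."""
    counts = Counter(ts.weekday() for ts in timestamps)
    return [counts[d] for d in range(7)]


def normalized_entropy(histogram):
    """Shannon entropy of a histogram, scaled to the range [0, 1].

    Values near 0 mean activity is packed into a few bins;
    values near 1 mean it is spread evenly across all bins.
    """
    total = sum(histogram)
    if total == 0 or len(histogram) < 2:
        return 0.0
    probs = [c / total for c in histogram if c > 0]
    entropy = sum(p * math.log(1 / p) for p in probs)
    return entropy / math.log(len(histogram))


# Made-up data: one account tweets 20 times within a single hour,
# the other tweets once every hour for a whole week.
bursty = [datetime(2021, 1, 26, 14, minute) for minute in range(20)]
spread = [datetime(2021, 1, 20 + day, hour) for day in range(7) for hour in range(24)]

for label, stamps in (("bursty", bursty), ("spread", spread)):
    print(
        label,
        round(normalized_entropy(hourly_histogram(stamps)), 3),
        round(normalized_entropy(weekday_histogram(stamps)), 3),
    )
```

A low value on both histograms matches the bot-like pattern described above; real human accounts fall somewhere in between rather than at either extreme.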

Percentage of Bot vs Human Accounts

The above plots show the percentage of human and bot accounts identified by Botometer and by our method. The two methods differ widely: Botometer identified 53% of the accounts as human and 43% as bots, whereas our method found 81% to be human and only 19% to be bots.

Confusion Matrix

Out of 2000 users, treating our manual labels as the reference and "bot" as the positive class, 960 were true negatives (both the manual analysis and Botometer said they are not bots), 215 were false negatives (manually classified as a bot but not flagged by Botometer), 460 were false positives (Botometer said bot but the manual analysis did not), and 365 were true positives (both Botometer and our method classified them as bots).
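
The four counts can be summarized with standard agreement metrics. The sketch below assumes the manual labels are the reference and "bot" is the positive class, and it reads each count by its description (both said not bot: 960; only the manual analysis said bot: 215; only Botometer said bot: 460; both said bot: 365). The agreement_metrics helper is a hypothetical name, and this reading of the cells is our assumption.

```python
# Counts taken from the text above, named by what each cell describes.
both_not_bot = 960        # manual: human, Botometer: human
manual_only_bot = 215     # manual: bot,   Botometer: human
botometer_only_bot = 460  # manual: human, Botometer: bot
both_bot = 365            # manual: bot,   Botometer: bot


def agreement_metrics(tp, fp, fn, tn):
    """Standard metrics with manual labels as reference and 'bot' as positive."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # of Botometer's bots, share confirmed manually
        "recall": tp / (tp + fn),               # of manual bots, share Botometer caught
        "false_positive_rate": fp / (fp + tn),  # of manual humans, share flagged as bots
    }


metrics = agreement_metrics(
    tp=both_bot,
    fp=botometer_only_bot,
    fn=manual_only_bot,
    tn=both_not_bot,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and the false positive rate are the figures most relevant here, since they describe how often an account flagged by Botometer was judged human in the manual pass.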

WordCloud

The above wordcloud shows the most frequent words in the tweets posted by these bot accounts. The most frequent word is rt, which means “retweet”, indicating that most of the tweets posted by these bots are retweets.
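
A wordcloud is a visual form of a word-frequency table, so the same observation can be checked with a simple count. The sketch below uses only the Python standard library rather than a wordcloud package; the word_frequencies helper, the stopword set, and the sample tweets are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(tweets, stopwords=frozenset()):
    """Count lowercase words, hashtags, and mentions across tweets."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop links
        words = re.findall(r"[#@]?\w+", text)
        counts.update(w for w in words if w not in stopwords)
    return counts


# Made-up tweets in the classic "RT @handle: ..." retweet format.
sample_tweets = [
    "RT @user_a: farmers stand together #KisanAndolan",
    "RT @user_b: support the farmers #StandWithFarmers",
    "RT @user_a: tractor rally today #KisanAndolan",
    "Our own thoughts on the protest https://example.com/post",
]

frequencies = word_frequencies(sample_tweets, stopwords={"the", "on", "our"})
for word, count in frequencies.most_common(5):
    print(word, count)
```

When most tweets in the input are retweets, rt ends up at the top of the table, which is what the wordcloud shows as the largest word.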

Conclusion

After comparing the results of our manual analysis with those of the Botometer API, we noticed a high false positive rate relative to the overall number of accounts, that is, many accounts that the API classified as bots when in fact they are not. This is an inaccuracy in the Botometer API. Botometer itself uses machine learning classification to label accounts as human or bot, so we felt that building another, replica ML model was not a good idea. It is more useful to analyze where Botometer goes wrong. Our aim was therefore to list the factors and flaws that could help improve the classification and its accuracy, and to compare the activity of bots and humans on Twitter.

Looking at the wordcloud built only from the tweets of bot accounts, we can clearly see that the most dominant word is RT, which stands for retweet. This indicates that the bots are mostly and heavily engaged in retweeting rather than posting original content. We can also see that the words in the wordcloud all relate to the same issue, the farmers' protest, which indicates that these accounts are used only for posting about that one issue. A human account's tweets, by contrast, would be mixed and would discuss several issues, not just one.

Keeping in mind the features for identifying bots, we found that the API fails, and is thus not truly reliable, in the following cases:

Highly active accounts - Accounts that post a lot of content, whether tweets or retweets, are mostly classified as bots.
Inactive accounts - Inactive accounts are identified as bots, with a warning that they may be misclassified; this is still an inaccuracy.
Considering only the last few tweets - Botometer's ML model considers only the account's most recent activity. However, people contributing to an agenda may post heavily about a single issue, and that does not make them bots.
Retweet ratio - It is true that bots retweet heavily, but that does not mean that every account that mostly retweets is a bot. For our chosen topic, the farmers' protest, we found that most of the bot accounts posted only retweets from the handle @Tractor2twittr.
Spam accounts - Some people are paid just to spam on different agendas, and they are consistently mispredicted as bots.
‘Echo chambers’ - People promoting certain hashtags, or supporters of a certain party, post only content related to that cause, and people with similar interests are then shown in each other's feeds, creating an ‘echo chamber’. Botometer gives this high weight and thus ends up misclassifying such accounts.
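
The retweet ratio point can be examined a little more closely by checking whether an account's retweets all come from one source, rather than only how many retweets it posts. The sketch below is one hedged way to do that, assuming tweet text in the classic "RT @handle: ..." format. The helper names and sample tweets are hypothetical, and this was not part of our analysis.

```python
import re
from collections import Counter

# Classic retweets start with "RT @handle:"; this format is an assumption
# about how the tweet text was stored.
RT_PATTERN = re.compile(r"^RT @(\w+):")


def retweet_sources(tweets):
    """Count how often each handle is retweeted."""
    sources = Counter()
    for text in tweets:
        match = RT_PATTERN.match(text)
        if match:
            sources[match.group(1).lower()] += 1
    return sources


def top_source_share(tweets):
    """Return (handle, share of all tweets that retweet that handle)."""
    sources = retweet_sources(tweets)
    if not sources:
        return None, 0.0
    handle, count = sources.most_common(1)[0]
    return handle, count / len(tweets)


# Made-up timeline: three of four tweets retweet the same handle.
timeline = [
    "RT @some_handle: rally update #KisanAndolan",
    "RT @some_handle: join us tomorrow",
    "RT @other_handle: news report",
    "RT @some_handle: thank you all",
]
handle, share = top_source_share(timeline)
print(handle, share)
```

An account that retweets many different people looks quite different under this measure from one that amplifies a single handle, even if both have the same overall retweet ratio.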

Video

Team Members
