BustIT: A User Credibility App Based on Fake News Detection
Motivation Behind Our Project
The spread of coronavirus across the world has been alarming, and authorities everywhere are making efforts to tackle it. It is therefore important for the globe to work as a unit by exchanging information easily and freely. While governments can communicate with each other securely, the information that reaches the public is often not fully true. The spread of fake news has been alarming and causes panic among common people who already find it difficult to cope with the stress. Recent incidents involving the home use of certain medicines, false news of lockdowns and fake information about locations for help led to wasted time, hoarding of important supplies in fear of shortage, and even loss of lives.
A recent report published by the news website BBC took up this topic. The headline read “'Hundreds dead' because of Covid-19 misinformation”, and it was justified: researchers say at least 800 people may have died around the world because of coronavirus-related misinformation in the first three months of the year. A study published in the American Journal of Tropical Medicine and Hygiene also estimates that about 5,800 people were admitted to hospital as a result of false information on social media. Many died from drinking methanol or alcohol-based cleaning products, wrongly believing them to be a cure for the virus. Another rumour stated that “Women To Not Take COVID Vaccine During Periods”, which is totally false and was later debunked, as we can read in this article.
With the spread of the internet, both Twitter and WhatsApp have become easy channels for fake information; an example is shown below:
![]()
Fake Message being passed around via WhatsApp
Tweets containing false information are pretty easy to find and some of them are given below:
Moreover, false information tends to become popular far more readily than true information, because the fabricated facts and promises sound so good or so alarming that users are tempted to share them further.
![]()
Ways in which Fake News on COVID-19 is spread
All of this motivated us to come up with a solution that is both easy to use and effective in tackling the spread of fake news, so that only verified information reaches users and shows them the real situation.
Solution
BustIT is a project that aims to help reduce the spread of fake news and false information on Twitter. Built by a team of undergraduate students at IIIT Delhi, BustIT is an app that uses various existing learning algorithms to differentiate tweets containing true information from those containing fake news. In addition, the app gives each user a credibility score based on the tweets (true or fake) they have posted. This credibility score is an easy way to identify whether a user is a trusted source of information. To expand the application's usability, we also integrated searches based on hashtags, users and tweets, which can be used to retrieve similar tweets and information. We hope this project helps reduce the spread of fake news during these tough times, and we further aim to get the app used by the masses.
Methodology
Dataset
We collected a dataset of 8558 tweets with their corresponding labels (real and fake), of which 4480 tweets are labelled real and 4078 are labelled fake.
Analysing the most common bigrams in the dataset showed that bigrams mentioning public figures such as Donald Trump and Bill Gates appear mainly in fake news, along with alarming phrases like "novel coronavirus", "new coronavirus", "coronavirus pandemic" and "cure covid" that are used to gain attention. In real news, reporting phrases such as "confirmed case" and "state reported" are more common.
We also studied word clouds of real and fake news from the dataset:

The word clouds suggest that politically charged words such as "chinese", "american", "bill gates", "donald trump" and "government" are common in fake news, whereas reporting words such as "today", "confirmed" and "mohfw" are prominent in real news.
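A rough sketch of how such word clouds can be generated is shown below; it uses the wordcloud and pandas packages, and the file and column names (covid_tweets.csv, tweet, label) are placeholders rather than the exact ones from our pipeline.

```python
# Rough sketch of generating the real vs. fake word clouds.
# The file name "covid_tweets.csv" and the columns "tweet" / "label"
# are placeholders, not the exact names used in our pipeline.
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv("covid_tweets.csv")

for label in ("real", "fake"):
    text = " ".join(df.loc[df["label"] == label, "tweet"].astype(str))
    wc = WordCloud(width=800, height=400, background_color="white",
                   stopwords=STOPWORDS).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud of {label} tweets")
    plt.show()
```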
Feature Extraction
Using the above analysis, we extracted the 1000 most common unigrams and the 1000 most common bigrams as features for prediction on new data. These are termed direct content-based features, as they are extracted directly from the content of the tweets.
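A minimal sketch of this direct-feature extraction with scikit-learn's CountVectorizer is given below; it assumes a list tweets of preprocessed tweet texts, and the exact preprocessing and vectorizer settings in our pipeline may differ.

```python
# Sketch of the "direct" content-based features: the 1000 most frequent
# unigrams and the 1000 most frequent bigrams. `tweets` is assumed to be
# a list of preprocessed tweet texts.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack

unigram_vec = CountVectorizer(ngram_range=(1, 1), max_features=1000)
bigram_vec = CountVectorizer(ngram_range=(2, 2), max_features=1000)

X_unigrams = unigram_vec.fit_transform(tweets)   # shape: (n_tweets, 1000)
X_bigrams = bigram_vec.fit_transform(tweets)     # shape: (n_tweets, 1000)

# 2000 direct features per tweet
X_direct = hstack([X_unigrams, X_bigrams])
```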
Along with the 2000 direct features (1000 unigram and 1000 bigram), we also extracted 7 indirect features based on the differences between real and fake tweets.
The following are the extracted 'indirect' features:
- Count of Words
- Count of Unique Words
- Count of Letters (Length of a Tweet)
- Count of Stop Words
- Count of Hashtags
- Polarity Score
- Subjectivity Score
The indirect features selected are based on the insights we collected from the graphs shown below. All the indirect features were normalized.
![]()
Histogram of Count of Words (left) and Unique Words (right) in Tweets
![]()
Histogram of Count of Letters (left) and Stop Words (right) in Tweets
The plot above on the left shows that real tweets contain more letters, whereas the plot on the right shows that fake tweets contain more stop words.
![]()
Histogram of Count of Hashtags (left) and Polarity Score (right) in Tweets
The plot above on the left shows that, for any given number of hashtags, there are more real tweets than fake tweets. The plot on the right shows that fake tweets are concentrated at polarity scores close to 0 (neutral), whereas real tweets are more spread out in polarity.
Histogram of Subjectivity Score in Tweets
Fake tweets had lower subjectivity than real tweets, as the bar around 0 (lower subjectivity, higher objectivity) is larger for fake news than for real news. This suggests that many fake tweets may be produced by bots or simply repeated, whereas real tweets are written manually and are therefore less objective and more subjective.
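The sketch below shows one way to compute the seven indirect features listed earlier, assuming a list tweets of tweet texts; polarity and subjectivity come from TextBlob and the stop-word list from NLTK, so exact values may differ slightly from our pipeline.

```python
# Sketch of the seven "indirect" features, computed per tweet.
# Polarity and subjectivity come from TextBlob; the stop-word list comes
# from NLTK (requires nltk.download("stopwords")). `tweets` is assumed to
# be a list of tweet texts; the original preprocessing may differ.
import pandas as pd
from textblob import TextBlob
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def indirect_features(tweet: str) -> dict:
    words = tweet.split()
    sentiment = TextBlob(tweet).sentiment
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "letter_count": len(tweet),
        "stop_word_count": sum(w.lower() in STOP_WORDS for w in words),
        "hashtag_count": sum(w.startswith("#") for w in words),
        "polarity": sentiment.polarity,          # in [-1, 1]
        "subjectivity": sentiment.subjectivity,  # in [0, 1]
    }

# Compute features for all tweets and min-max normalize each column to [0, 1]
indirect = pd.DataFrame([indirect_features(t) for t in tweets])
indirect = (indirect - indirect.min()) / (indirect.max() - indirect.min())
```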
Modelling and Classification
Our dataset was first oversampled to give 8960 tweets (4480 each of the real and fake categories), with 2007 features per tweet. We then used a 70:30 train-test split for training various classifiers and checking the accuracy of the models; a minimal sketch of this pipeline follows the list below. The following models were used:
- Logistic Regression
- Random Forests
- Naive Bayes
- Support Vector Machines (SVM)
- Stochastic Gradient Descent (SGD)
- Extreme Gradient Boosting (XGBoost)
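Below is the minimal pipeline sketch referenced above. It assumes a feature matrix X (one row per tweet, 2007 features) and labels y (0 = fake, 1 = real), uses random oversampling to balance the classes (the specific oversampling method is an assumption), and compares the six classifiers using scikit-learn, imbalanced-learn and xgboost.

```python
# Minimal sketch of the training setup described above. X and y are assumed
# to exist; RandomOverSampler is an assumed way of balancing the classes and
# may differ from the exact method used.
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X_bal, y_bal = RandomOverSampler(random_state=42).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, random_state=42, stratify=y_bal)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(probability=True),
    "SGD": SGDClassifier(),
    "XGBoost": XGBClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```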
Following are the corresponding accuracies that each of the models achieved:
Accuracies achieved by Each Model
We also used the ROC Curve to compare the models used:
![]()
ROC Curves of Each Model
The areas under the ROC curves are shown below:
![]()
Area Under ROC Curves for Each Model
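Continuing the same sketch, the ROC curves and the areas under them can be compared as follows, reusing the fitted models dictionary and the held-out test split from the previous snippet.

```python
# Sketch of the ROC / AUC comparison, reusing `models`, X_test and y_test
# from the training sketch above.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(8, 6))
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:  # e.g. SGDClassifier with hinge loss has no predict_proba
        scores = model.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.3f})")

plt.plot([0, 1], [0, 1], "k--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```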
By looking at the accuracies and the areas under the curves, we observed that Logistic Regression, SVM and XGBoost performed the best. We used Logistic Regression as our final model.
Source Credibility
Since our model can predict the nature of tweets, we also want to tell the user whether the source that posted the content is credible. So, we made a function that gives each user a credibility score from 0 to 1, where 0 is the lowest credibility and 1 is the highest.
The credibility score is calculated as:
Credibility Score = 1 - (Number of Fake Tweets Posted by User) / (Total Content Posted by User)
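This formula translates directly into code. In the sketch below, classify is assumed to be a thin wrapper around our trained model that returns "real" or "fake" for a single tweet; it is a hypothetical helper, not the deployed function.

```python
# Direct translation of the credibility formula above. `classify` is an
# assumed wrapper around the trained model returning "real" or "fake",
# and `user_tweets` is the list of content posted by the user.
def credibility_score(user_tweets, classify):
    if not user_tweets:
        return None  # no content to judge
    fake = sum(1 for tweet in user_tweets if classify(tweet) == "fake")
    return 1 - fake / len(user_tweets)
```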
For our application, we incorporated both Hindi and English tweets by first converting the Hindi tweets to English using Google Translate.
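As a rough sketch, the translation step could look like the following, using the unofficial googletrans package; the client actually used in our pipeline may differ, and googletrans behaviour varies across versions.

```python
# Rough sketch of the Hindi-to-English step using the unofficial googletrans
# package; the exact client and version used in the project may differ.
from googletrans import Translator

translator = Translator()

def to_english(text: str) -> str:
    # Detect the language first and translate only Hindi tweets
    if translator.detect(text).lang == "hi":
        return translator.translate(text, src="hi", dest="en").text
    return text
```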
About the Application
The application we built is an Android app that displays the tweet and user credibility results produced by our ML algorithm. The app's home page displays the latest 80 tweets related to COVID-19 and their real/fake labels as determined by our algorithm. We also have a custom search button where a user can search by selecting the (i) hashtag, (ii) username or (iii) tweet option.
- Using the hashtag option, a user can input a hashtag and a list of 20 most recent tweets containing the hashtag will be displayed along with their real/fake label.
- Using the tweet option, a user can input a text and can check whether that text is real/fake. The text can be copied from a tweet or can be any new text input by a user.
- Using the username option, a user can enter a Twitter id and the user credibility of that Twitter id will be displayed on the app.
To connect the app to our ML algorithm, an API was required. A Flask API was built which obtains the input from the app and fetches the required tweets using the tweepy library. The pre-trained ML models are then run on these tweets, and the results are returned to the app. The API was deployed on the Heroku platform and was then ready to be used by the Android app.
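The sketch below illustrates this flow with hypothetical endpoints. The route names, credential placeholders and pickle file are assumptions rather than the deployed implementation, and the saved model is assumed to be a full pipeline (vectorizer plus classifier) that accepts raw tweet text and predicts 1 = real, 0 = fake.

```python
# Hypothetical sketch of the Flask API: route names, credentials and the
# pickle file are assumptions, not the deployed implementation.
import pickle
import tweepy
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))               # assumed text->label pipeline

auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")        # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

@app.route("/classify", methods=["POST"])
def classify_text():
    text = request.json["text"]
    label = "real" if model.predict([text])[0] == 1 else "fake"
    return jsonify({"text": text, "label": label})

@app.route("/hashtag/<tag>")
def classify_hashtag(tag):
    tweets = [t.full_text for t in tweepy.Cursor(
        api.search_tweets, q=f"#{tag}", tweet_mode="extended").items(20)]
    labels = ["real" if p == 1 else "fake" for p in model.predict(tweets)]
    return jsonify([{"tweet": t, "label": l} for t, l in zip(tweets, labels)])

@app.route("/user/<username>")
def user_credibility(username):
    tweets = [t.full_text for t in tweepy.Cursor(
        api.user_timeline, screen_name=username, tweet_mode="extended").items(50)]
    if not tweets:
        return jsonify({"user": username, "credibility": None})
    preds = model.predict(tweets)
    score = 1 - sum(p == 0 for p in preds) / len(preds)    # credibility formula from above
    return jsonify({"user": username, "credibility": round(score, 2)})

if __name__ == "__main__":
    app.run()
```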
Following are the screenshots from our application:
Future Directions
- The accuracy of our predictions depends on the dataset. If we can obtain newly annotated tweets, we can constantly improve our model by retraining it on them.
- We can further improve our accuracy if we can get Hindi tweets as well.
- In our analysis, we can also make use of Twitter social features such as retweets, favourites and the number of followers of the source to improve our model.
- Moreover, due to Twitter API rate limits, usage of our app is restricted; this restriction could be reduced if we obtain elevated (non-standard) API access.
User Feedback
We shared our application with some of our friends and family members to get their views and suggestions regarding our app.
![]()
Was the Application Helpful?
![]()
Was the User Credibility Feature Useful?
![]()
Was the App User Interface Appealing?
Feedback or Suggestions that we got:
- UI is great. Credibility was helpful but can be improved.
- I should be told what credibility score counts as good, which was not mentioned; the rest is good.
- Include some more features like the user's credibility history, etc.
- If possible, expand this app for non-covid tweets as well.
- Response Time for user credibility could be faster.
- The classification of tweets is not that accurate.
- Would be very helpful if you could also include other topics as well. Other than that it is a very nice idea and could be very helpful to a lot of users.
- Make it available for other topics.
- Improve the UI for easy usability.
Check out our app by downloading it from here.
Video
The Team
- Aishwarya Kumar
- Ayush Yadav
- Nakul Gupta
- Sanket Khajuria
- Syed Ali Abbas Rizvi
- Durvish Singh
- Sohaib Fazal
- Utsav Gangwar
- Vaibhav Jayant
- Sachin Singh
Acknowledgement
This project was done under the supervision and guidance of Prof. Ponnurangam Kumaraguru at Indraprastha Institute of Information Technology, Delhi, as a part of our course, CSE648: Privacy and Security in Online Social Media, Winter Semester 2021.