BustIT: A User Credibility App Based on Fake News Detection
Motivation Behind Our Project
The spread of coronavirus across the world has been alarming, and authorities everywhere are making efforts to tackle it. It is therefore important for the globe to work as a unit by exchanging information easily and freely. While governments can communicate with each other securely, the information that reaches the public is often not fully true. The spread of fake news has been alarming and causes panic among common people who already find it difficult to cope with the stress. Recent incidents involving the home use of certain medicines, false news of lockdowns and fake information about locations for help led to wasted time, hoarding of important supplies in fear of shortage, and even loss of lives.
A recent report published by the news website BBC took up this topic. The headline read “'Hundreds dead' because of Covid-19 misinformation”, and it was justified: researchers say at least 800 people may have died around the world because of coronavirus-related misinformation in the first three months of the year. A study published in the American Journal of Tropical Medicine and Hygiene also estimates that about 5,800 people were admitted to hospital as a result of false information on social media. Many died from drinking methanol or alcohol-based cleaning products, wrongly believing them to be a cure for the virus. Another rumour stated that “Women To Not Take COVID Vaccine During Periods”, which is totally false and was later debunked, as we can read in this article.
With the spread of the internet, both Twitter and WhatsApp have become easy channels for fake information; an example is shown below:
![]()
Fake Message being passed around via WhatsApp
Tweets containing false information are pretty easy to find and some of them are given below:
Moreover, false information tends to become popular far more readily than true information, because the fabricated facts and promises sound so good or so alarming that users are tempted to share them further.
![]()
Ways in which Fake News on COVID-19 is spread
All of this motivated us to come up with a solution that is both easy to use and effective in tackling the spread of fake news, so that only verified information reaches users and shows them the real situation.
Solution
BustIT is a project that aims to help reduce the spread of fake news and false information on Twitter. Built by a team of undergraduate students at IIIT Delhi, BustIT is an app that uses various existing learning algorithms to differentiate tweets containing true information from those containing fake news. In addition, the app gives each user a credibility score based on the tweets (true or fake) they have posted. This credibility score is an easy way to identify whether a user is a trusted source of information. To expand the application's usability, we also integrated searches based on hashtags, users and tweets, which can be used to retrieve similar tweets and information. We hope this project helps reduce the spread of fake news during these tough times, and we further aim to get the app used by the masses.
Methodology
Dataset
We collected a dataset of 8558 tweets with their corresponding labels (real and fake), of which 4480 tweets are labelled real and 4078 are labelled fake.
Analysing the most common bigrams in the dataset showed that bigrams mentioning public figures such as Donald Trump and Bill Gates appear mainly in fake news, along with alarming phrases like "novel coronavirus", "new coronavirus", "coronavirus pandemic" and "cure covid" that are used to gain attention. In real news, reporting phrases such as "confirmed case" and "state reported" are more common.
We also studied word clouds of real and fake news from the dataset:

The word clouds suggest that politically charged words such as "chinese", "american", "bill gates", "donald trump" and "government" are common in fake news, whereas reporting words such as "today", "confirmed" and "mohfw" are prominent in real news.
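A rough sketch of how such word clouds can be generated is shown below; it uses the wordcloud and pandas packages, and the file and column names (covid_tweets.csv, tweet, label) are placeholders rather than the exact ones from our pipeline.

```python
# Rough sketch of generating the real vs. fake word clouds.
# The file name "covid_tweets.csv" and the columns "tweet" / "label"
# are placeholders, not the exact names used in our pipeline.
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv("covid_tweets.csv")

for label in ("real", "fake"):
    text = " ".join(df.loc[df["label"] == label, "tweet"].astype(str))
    wc = WordCloud(width=800, height=400, background_color="white",
                   stopwords=STOPWORDS).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud of {label} tweets")
    plt.show()
```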
Feature Extraction
Using the above analysis, we extracted the 1000 most common unigrams and the 1000 most common bigrams as features for prediction on new data. These are termed direct content-based features, as they are extracted directly from the content of the tweets.
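A minimal sketch of this direct-feature extraction with scikit-learn's CountVectorizer is given below; it assumes a list tweets of preprocessed tweet texts, and the exact preprocessing and vectorizer settings in our pipeline may differ.

```python
# Sketch of the "direct" content-based features: the 1000 most frequent
# unigrams and the 1000 most frequent bigrams. `tweets` is assumed to be
# a list of preprocessed tweet texts.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack

unigram_vec = CountVectorizer(ngram_range=(1, 1), max_features=1000)
bigram_vec = CountVectorizer(ngram_range=(2, 2), max_features=1000)

X_unigrams = unigram_vec.fit_transform(tweets)   # shape: (n_tweets, 1000)
X_bigrams = bigram_vec.fit_transform(tweets)     # shape: (n_tweets, 1000)

# 2000 direct features per tweet
X_direct = hstack([X_unigrams, X_bigrams])
```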
Along with the 2000 direct features (1000 unigram and 1000 bigram), we also extracted 7 indirect features based on the differences between real and fake tweets.
The following are the extracted 'indirect' features:
- Count of Words
- Count of Unique Words
- Count of Letters (Length of a Tweet)
- Count of Stop Words
- Count of Hashtags
- Polarity Score
- Subjectivity Score
The indirect features selected are based on the insights we collected from the graphs shown below. All the indirect features were normalized.
![]()
Histogram of Count of Words (left) and Unique Words (right) in Tweets
![]()
Histogram of Count of Letters (left) and Stop Words (right) in Tweets
The plot above on the left shows that real tweets contain more letters, whereas the plot on the right shows that fake tweets contain more stop words.
![]()
Histogram of Count of Hashtags (left) and Polarity Score (right) in Tweets
The plot above on the left shows that, for any given number of hashtags, there are more real tweets than fake tweets. The plot on the right shows that fake tweets are concentrated at polarity scores close to 0 (neutral), whereas real tweets are more spread out in polarity.
Histogram of Subjectivity Score in Tweets
Fake tweets had lower subjectivity than real tweets, as the bar around 0 (lower subjectivity, higher objectivity) is larger for fake news than for real news. This suggests that many fake tweets may be produced by bots or simply repeated, whereas real tweets are written manually and are therefore less objective and more subjective.
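The sketch below shows one way to compute the seven indirect features listed earlier, assuming a list tweets of tweet texts; polarity and subjectivity come from TextBlob and the stop-word list from NLTK, so exact values may differ slightly from our pipeline.

```python
# Sketch of the seven "indirect" features, computed per tweet.
# Polarity and subjectivity come from TextBlob; the stop-word list comes
# from NLTK (requires nltk.download("stopwords")). `tweets` is assumed to
# be a list of tweet texts; the original preprocessing may differ.
import pandas as pd
from textblob import TextBlob
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def indirect_features(tweet: str) -> dict:
    words = tweet.split()
    sentiment = TextBlob(tweet).sentiment
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "letter_count": len(tweet),
        "stop_word_count": sum(w.lower() in STOP_WORDS for w in words),
        "hashtag_count": sum(w.startswith("#") for w in words),
        "polarity": sentiment.polarity,          # in [-1, 1]
        "subjectivity": sentiment.subjectivity,  # in [0, 1]
    }

# Compute features for all tweets and min-max normalize each column to [0, 1]
indirect = pd.DataFrame([indirect_features(t) for t in tweets])
indirect = (indirect - indirect.min()) / (indirect.max() - indirect.min())
```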
Modelling and Classification
Our dataset was first oversampled to give 8960 tweets (4480 each of the real and fake categories), with 2007 features per tweet. We then used a 70:30 train-test split for training various classifiers and checking the accuracy of the models; a minimal sketch of this pipeline follows the list below. The following models were used:
- Logistic Regression
- Random Forests
- Naive Bayes
- Support Vector Machines (SVM)
- Stochastic Gradient Descent (SGD)
- Extreme Gradient Boosting (XGBoost)
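Below is the minimal pipeline sketch referenced above. It assumes a feature matrix X (one row per tweet, 2007 features) and labels y (0 = fake, 1 = real), uses random oversampling to balance the classes (the specific oversampling method is an assumption), and compares the six classifiers using scikit-learn, imbalanced-learn and xgboost.

```python
# Minimal sketch of the training setup described above. X and y are assumed
# to exist; RandomOverSampler is an assumed way of balancing the classes and
# may differ from the exact method used.
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X_bal, y_bal = RandomOverSampler(random_state=42).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, random_state=42, stratify=y_bal)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(probability=True),
    "SGD": SGDClassifier(),
    "XGBoost": XGBClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```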
Following are the corresponding accuracies that each of the models achieved:
Accuracies achieved by Each Model
We also used the ROC Curve to compare the models used:
![]()
ROC Curves of Each Model
The areas under the ROC curves are shown below:
![]()
Area Under ROC Curves for Each Model
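Continuing the same sketch, the ROC curves and the areas under them can be compared as follows, reusing the fitted models dictionary and the held-out test split from the previous snippet.

```python
# Sketch of the ROC / AUC comparison, reusing `models`, X_test and y_test
# from the training sketch above.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(8, 6))
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:  # e.g. SGDClassifier with hinge loss has no predict_proba
        scores = model.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.3f})")

plt.plot([0, 1], [0, 1], "k--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```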
By looking at the accuracies and the areas under the curves, we observed that Logistic Regression, SVM and XGBoost performed the best. We used Logistic Regression as our final model.
Source Credibility
Since our model can predict the nature of tweets, we also want to tell the user whether the source that posted the content is credible. So, we made a function that gives each user a credibility score from 0 to 1, where 0 is the lowest credibility and 1 is the highest.
The credibility score is calculated as:
Credibility Score = 1 - (Number of Fake Tweets Posted by User) / (Total Content Posted by User)
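This formula translates directly into code. In the sketch below, classify is assumed to be a thin wrapper around our trained model that returns "real" or "fake" for a single tweet; it is a hypothetical helper, not the deployed function.

```python
# Direct translation of the credibility formula above. `classify` is an
# assumed wrapper around the trained model returning "real" or "fake",
# and `user_tweets` is the list of content posted by the user.
def credibility_score(user_tweets, classify):
    if not user_tweets:
        return None  # no content to judge
    fake = sum(1 for tweet in user_tweets if classify(tweet) == "fake")
    return 1 - fake / len(user_tweets)
```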
For our application, we incorporated both Hindi and English tweets by first converting the Hindi tweets to English using Google Translate.
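As a rough sketch, the translation step could look like the following, using the unofficial googletrans package; the client actually used in our pipeline may differ, and googletrans behaviour varies across versions.

```python
# Rough sketch of the Hindi-to-English step using the unofficial googletrans
# package; the exact client and version used in the project may differ.
from googletrans import Translator

translator = Translator()

def to_english(text: str) -> str:
    # Detect the language first and translate only Hindi tweets
    if translator.detect(text).lang == "hi":
        return translator.translate(text, src="hi", dest="en").text
    return text
```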
About the Application
The application we built is an Android app that displays the tweet and user credibility results produced by our ML algorithm. The app's home page displays the latest 80 tweets related to COVID-19 and their real/fake labels as determined by our algorithm. We also have a custom search button where a user can search by selecting the (i) hashtag, (ii) username or (iii) tweet option.
- Using the hashtag option, a user can input a hashtag and a list of 20 most recent tweets containing the hashtag will be displayed along with their real/fake label.
- Using the tweet option, a user can input a text and can check whether that text is real/fake. The text can be copied from a tweet or can be any new text input by a user.
- Using the username option, a user can enter a Twitter id and the user credibility of that Twitter id will be displayed on the app.
To connect the app to our ML algorithm, an API was required. A Flask API was built which obtains the input from the app and fetches the required tweets using the tweepy library. The pre-trained ML models are then run on these tweets, and the results are returned to the app. The API was deployed on the Heroku platform and was then ready to be used by the Android app.
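The sketch below illustrates this flow with hypothetical endpoints. The route names, credential placeholders and pickle file are assumptions rather than the deployed implementation, and the saved model is assumed to be a full pipeline (vectorizer plus classifier) that accepts raw tweet text and predicts 1 = real, 0 = fake.

```python
# Hypothetical sketch of the Flask API: route names, credentials and the
# pickle file are assumptions, not the deployed implementation.
import pickle
import tweepy
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))               # assumed text->label pipeline

auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")        # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

@app.route("/classify", methods=["POST"])
def classify_text():
    text = request.json["text"]
    label = "real" if model.predict([text])[0] == 1 else "fake"
    return jsonify({"text": text, "label": label})

@app.route("/hashtag/<tag>")
def classify_hashtag(tag):
    tweets = [t.full_text for t in tweepy.Cursor(
        api.search_tweets, q=f"#{tag}", tweet_mode="extended").items(20)]
    labels = ["real" if p == 1 else "fake" for p in model.predict(tweets)]
    return jsonify([{"tweet": t, "label": l} for t, l in zip(tweets, labels)])

@app.route("/user/<username>")
def user_credibility(username):
    tweets = [t.full_text for t in tweepy.Cursor(
        api.user_timeline, screen_name=username, tweet_mode="extended").items(50)]
    if not tweets:
        return jsonify({"user": username, "credibility": None})
    preds = model.predict(tweets)
    score = 1 - sum(p == 0 for p in preds) / len(preds)    # credibility formula from above
    return jsonify({"user": username, "credibility": round(score, 2)})

if __name__ == "__main__":
    app.run()
```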
Following are the screenshots from our application:
Future Directions
- The accuracy of our predictions depends on the dataset. If we can obtain newly annotated tweets, we can constantly improve our model by retraining it on them.
- We can further improve our accuracy if we can get Hindi tweets as well.
- In our analysis, we can also make use of Twitter social features such as retweets, favourites and the number of followers of the source to improve our model.
- Moreover, due to Twitter API rate limits, usage of our app is restricted; this restriction could be reduced if we obtain elevated (non-standard) API access.
User Feedback
We shared our application with some of our friends and family members to get their views and suggestions regarding our app.
![]()
Was the Application Helpful?
![]()
Was the User Credibility Feature Useful?
![]()
Was the App User Interface Appealing?
Feedback or Suggestions that we got:
- UI is great. Credibility was helpful but can be improved.
- I should be told what credibility score counts as good, which was not mentioned; the rest is good.
- Include some more features like the user's credibility history, etc.
- If possible, expand this app for non-covid tweets as well.
- Response Time for user credibility could be faster.
- The classification of tweets is not that accurate.
- Would be very helpful if you could also include other topics as well. Other than that it is a very nice idea and could be very helpful to a lot of users.
- Make it available for other topics.
- Improve the UI for easy usability.
Check out our app by downloading it from here.
Video
The Team
- Aishwarya Kumar
- Ayush Yadav
- Nakul Gupta
- Sanket Khajuria
- Syed Ali Abbas Rizvi
- Durvish Singh
- Sohaib Fazal
- Utsav Gangwar
- Vaibhav Jayant
- Sachin Singh
Acknowledgement
This project was done under the supervision and guidance of Prof. Ponnurangam Kumaraguru at Indraprastha Institute of Information Technology, Delhi, as a part of our course, CSE648: Privacy and Security in Online Social Media, Winter Semester 2021.