A PROJECT REPORT SUBMITTED ON
“Fake Social Media Account Detection”
MAHARASHTRA STATE BOARD OF TECHNICAL
EDUCATION, MUMBAI
FOR THE AWARD OF THE DEGREE OF
DIPLOMA IN ENGINEERING
(COMPUTER ENGINEERING)
SUBMITTED BY
Mr. Aryan Sopan Surve (2115230021)
Mr. Soham Shivaji Shirgire (2115230017)
Mr. Suraj Dipak Javir (2215230198)
Under the guidance of
Ms. Sugandhi P. S.
DEPARTMENT OF COMPUTER ENGINEERING
New Satara College of Engineering & Management
(Poly.), Korti-Pandharpur 413304
Academic year 2023-2024
DEPARTMENT OF COMPUTER ENGINEERING
NEW SATARA COLLEGE OF ENGINEERING & MANAGEMENT
(POLY.), KORTI-PANDHARPUR 413304
CERTIFICATE
This is to certify that the project work entitled
“Fake Social Media Account Detection”
Submitted by
Mr. Aryan Sopan Surve (2115230021)
Mr. Soham Shivaji Shirgire (2115230017)
Mr. Suraj Dipak Javir (2215230198)
is a bona fide work carried out by them under the guidance of Ms. Sugandhi P. S., and it is
submitted towards the partial fulfilment of the requirements of the Maharashtra State Board
of Technical Education, Mumbai, for the award of the degree of Diploma in Computer
Engineering.
Place: Korti
Date:
Ms. Sugandhi P. S. Mr. Puri S. B. Prof. Londhe V. H.
(Project Guide) (H.O.D) (Principal)
DECLARATION
We hereby declare that the project report entitled “Fake Social Media Account
Detection”, completed and written by us for the award of the Diploma in Computer Engineering
of the Maharashtra State Board of Technical Education, Mumbai, has not previously formed the
basis for the award of any diploma, degree or similar title of this or any other university or
examining body.
Place: -Korti
Date: -
Student Name Sign.
Mr. Aryan Sopan Surve
Mr. Soham Shivaji Shirgire
Mr. Suraj Dipak Javir
ACKNOWLEDGEMENT
It gives us great pleasure to submit our project report on “Fake Social Media
Account Detection”. We thank all those who helped us in this work and provided the
facilities to develop this application.
We are very thankful to our project guide, Ms. Sugandhi P. S., and our project
coordinator, Mr. Puri S. B., for the encouragement, technical guidance and valuable
assistance they rendered to us.
We are also thankful to all the faculty of the Computer Department for their valuable
guidance, advice and assistance in our project right from the initial stages.
We also express sincere thanks to all the faculty members of our college. Last but not
least, we would like to thank all our friends, fellow students and our parents for their whole-
hearted support.
Thanking You.
INDEX
1. Abstract
2. Chapter 1: Introduction and Motivation (Purpose of the Problem Statement and
Societal Benefit)
3. Chapter 2: Review of Existing Methods and their Limitations
4. Chapter 3: Proposed Method with System Architecture / Flow Diagram
5. Chapter 4: Modules Description
6. Chapter 5: Implementation Requirements
7. Chapter 6: Output Screenshots
8. Conclusion
9. References
10. Appendix A – Source Code
ABSTRACT
With the advent of the Internet and social media, while hundreds of people have benefitted
from the vast sources of information available, there has also been an enormous rise in
cyber-crime. According to a 2019 report in the Economic Times, India witnessed a 457%
rise in cybercrime in the five-year span between 2011 and 2016. Most speculate that this is
due to the impact of social media platforms such as Instagram on our daily lives. While
these platforms certainly help in creating a sound social network, creating a user account on
them usually requires just an email ID. A real person can create multiple fake IDs, and
hence impostor accounts can easily be made. Unlike the real-world scenario, where multiple
rules and regulations are imposed to identify a person uniquely (for example, while issuing a
passport or driver’s licence), admission to the virtual world of social media does not require
any such checks. In this project, we study Instagram accounts in particular and try to assess
whether an account is fake or real.
INTRODUCTION & MOTIVATION
Having the ability to check the authenticity of a user’s following is crucial for brands looking
to work with influencers. Social media is one of the most important platforms, especially for
youth, to express themselves to the world.
They can use these platforms to interact with like-minded people of their own age
group, or to present their views. However, the use of this technology also comes with various
implications: people can misuse it to cause harm and spread hatred via the very same social
media platforms.
Keeping this in mind, we have attempted a basic solution to this problem by
implementing a deep learning algorithm over a dataset of Instagram account attributes, to
check whether a neural network can actually help predict if a user profile is fake or real.
Proposed Method with Flow Diagram
An artificial neural network (ANN) is a computing system designed to simulate how the
human brain analyses and processes information. It is a foundation of artificial intelligence
(AI) and can solve problems that would prove impossible or difficult by human or statistical
standards.
Artificial neural networks are primarily designed to mimic and simulate the functioning
of the human brain: using a mathematical structure, an ANN is constructed to replicate
biological neurons.
The concept of an ANN follows the same process as that of a natural neural network. The
objective of an ANN is to make a machine or system understand and imitate how a human
brain makes a decision and then ultimately takes action. Inspired by the human brain, a
neural network is built from interconnected neurons, or nodes.
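As a purely illustrative sketch (not part of the project code), the snippet below shows how a
single artificial neuron computes a weighted sum of its inputs and passes it through an
activation function; the input values, weights and bias here are arbitrary example numbers.

import numpy as np

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through a sigmoid activation
    z = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values, chosen only for illustration
x = np.array([0.5, 0.2, 0.9])    # three input features
w = np.array([0.4, -0.6, 0.1])   # one weight per input
print(neuron(x, w, bias=0.05))   # output lies between 0 and 1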
MODULES OF THE PROJECT
▪ Module I - Initial Data Exploration: This is the initial step in data analysis, in
which we use data visualization and statistical techniques to describe dataset
characteristics, such as size, quantity and accuracy, in order to better
understand the nature of the data.
▪ Module II - Data Wrangling: In this step, messy and complex data sets are
cleaned and unified for easy access and analysis. With the amount of data and
the number of data sources growing rapidly, it is increasingly essential for the
available data to be organized before analysis.
▪ Module III - Data Insights: Basic statistical and visual analysis of the scraped
dataset, which provides an overview of how the data needs to be cleaned or
further processed before the core neural network development.
▪ Module IV - Core Neural Network Development: This module comprises the
core neural network development – a basic artificial neural network (ANN)
that takes the basic independent features of the dataset as input and tries to
predict the target feature – fake or not.
▪ Module V – Evaluation: After the neural network is developed, this module is
implemented to check how the model performs during training and on unseen
test data, in terms of model accuracy and loss.
▪ Module VI - Testing and Inference: Once the desired, tuned model is obtained,
this module is implemented to test the model (which is saved and later loaded
for future use) on random unseen data attributes to determine whether a user
is fake or not. A minimal end-to-end sketch of this pipeline is given after this
list.
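The following is a minimal sketch, not the project's exact code, of how the six modules could
chain together. It assumes the train/test CSV paths and the 'fake' target column used in
Appendix A; the layer sizes and the saved file name are illustrative choices.

import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

# Modules I & II: load and inspect the data
train = pd.read_csv('datasets/Insta_Fake_Profile_Detection/train.csv')
test = pd.read_csv('datasets/Insta_Fake_Profile_Detection/test.csv')
print(train.info())

# Module III: visual/statistical insights are produced with seaborn (see Appendix A)

# Module IV: split features and target, scale, and build a small ANN
X_train, y_train = train.drop(columns=['fake']), train['fake']
X_test, y_test = test.drop(columns=['fake']), test['fake']
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # single sigmoid output; Appendix A uses two softmax outputs instead
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, validation_split=0.1)

# Module V: evaluate on unseen test data
loss, acc = model.evaluate(X_test, y_test)

# Module VI: save the tuned model for later inference
model.save('fake_account_ann.h5')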
IMPLEMENTATION REQUIREMENTS
1) Initial packages – Pandas, NumPy, Matplotlib, Seaborn – for basic statistical
analysis and mathematical insights.
2) TensorFlow – a free and open-source software library for machine learning and
artificial intelligence. It can be used across a range of tasks but has a particular
focus on training and inference of deep neural networks.
3) Scikit-learn – a free machine learning library for the Python programming
language.
4) Python – the programming language used to run and execute the application.
5) Google Colab – a free Jupyter notebook environment that runs entirely in the
cloud; a cloud-based instance that helps set up a virtual Python environment and
run machine learning or deep learning models. (A brief environment check is
sketched after this list.)
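As mentioned above, the short snippet below is a sketch for checking that the required
packages are available in a Colab or local Python environment; the printed version numbers
will depend on the installation.

# Verify that the required packages are installed and print their versions
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import sklearn
import tensorflow as tf

for name, module in [('pandas', pd), ('numpy', np), ('matplotlib', matplotlib),
                     ('seaborn', sns), ('scikit-learn', sklearn), ('tensorflow', tf)]:
    print(name, module.__version__)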
OUTPUT SCREENSHOTS
Load Data (Pre-processing)
Bar Plot – Visualization (Data Insights)
KDE Plot (Data Insights)
Heat Map – Correlation Check (Data Insights)
Model Training (Sequential Training)
Training Progress - Loss (Training)
Training Progress - Accuracy (Training)
Classification Report (Evaluation)
Confusion Matrix (Evaluation)
CONCLUSION
The proposed project mainly focuses on how deep learning algorithms, in
particular Artificial Neural Networks (ANNs), can be leveraged for better insight
exploration over a well-distributed dataset. The proposed framework shows how
different attributes of a user’s activity can be learned and analysed by machine
learning or deep learning algorithms to predict suspicious activity and give the
probability of a specific account being fake or genuine. Furthermore, the approach
can be improved by scraping more metadata, such as visual features (images, posts,
captions) and time spent on the platform, and heavier deep learning models, such as
multimodal deep learning ensembles, can be applied for even better results.
REFERENCES
1. Instagram Fake Spammer Dataset – Kaggle
2. Easy ways to analyse if account is fake or not – WikiBlog
3. TensorFlow – Basic Code Base
4. Instagram Fake and Automated Account Detection – Fatih Cagatay Akyon,
M. Esat Kalfaoglu
APPENDIX A - Source Code
# Initial Data Exploration and Data Wrangling
# Core libraries for data handling, visualization, modelling and evaluation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Accuracy
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, roc_curve, confusion_matrix
# Initial (exploratory) dataset paths
train_data_path = 'datasets/Fake-Instagram-Profile-Detection-main/insta_train.csv'
test_data_path = 'datasets/Fake-Instagram-Profile-Detection-main/insta_test.csv'
pd.read_csv(test_data_path)

# Quick check (likely the total number of train + test samples)
576 + 120

# Final dataset paths used for the rest of the notebook
train_data_path = 'datasets/Insta_Fake_Profile_Detection/train.csv'
test_data_path = 'datasets/Insta_Fake_Profile_Detection/test.csv'
pd.read_csv(train_data_path)
# Load the training dataset
instagram_df_train=pd.read_csv(train_data_path)
instagram_df_train
# Load the testing data
instagram_df_test=pd.read_csv(test_data_path)
instagram_df_test
instagram_df_train.head()
instagram_df_train.tail()
instagram_df_test.head()
instagram_df_test.tail()
# Getting dataframe info
instagram_df_train.info()
# Get the statistical summary of the dataframe
instagram_df_train.describe()
# Checking if null values exist
instagram_df_train.isnull().sum()
# Get the number of unique values in the "profile pic" feature
instagram_df_train['profile pic'].value_counts()
# Get the number of unique values in "fake" (Target column)
instagram_df_train['fake'].value_counts()
instagram_df_test.info()
instagram_df_test.describe()
instagram_df_test.isnull().sum()
instagram_df_test['fake'].value_counts()
# Perform data visualizations
# Visualize the target column 'fake'
sns.countplot(x=instagram_df_train['fake'])
plt.show()
# Visualize the 'private' column data
sns.countplot(x=instagram_df_train['private'])
plt.show()
# Visualize the 'profile pic' column data
sns.countplot(x=instagram_df_train['profile pic'])
plt.show()
# Visualize the distribution of 'nums/length username'
# (distplot is deprecated in recent seaborn releases; histplot/displot are the modern equivalents)
plt.figure(figsize = (20, 10))
sns.distplot(instagram_df_train['nums/length username'])
plt.show()
# Correlation plot
plt.figure(figsize=(20, 20))
cm = instagram_df_train.corr()
ax = plt.subplot()
sns.heatmap(cm, annot = True, ax = ax)
plt.show()
sns.countplot(x=instagram_df_test['fake'])
sns.countplot(x=instagram_df_test['private'])
sns.countplot(x=instagram_df_test['profile pic'])
# Preparing Data to Train the Model
# Training and testing dataset (inputs)
X_train = instagram_df_train.drop(columns = ['fake'])
X_test = instagram_df_test.drop(columns = ['fake'])
X_train
X_test
# Training and testing dataset (Outputs)
y_train = instagram_df_train['fake']
y_test = instagram_df_test['fake']
y_train
y_test
# Scale the data before training the model
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler_x = StandardScaler()
X_train = scaler_x.fit_transform(X_train)
X_test = scaler_x.transform(X_test)
y_train = tf.keras.utils.to_categorical(y_train, num_classes = 2)
y_test = tf.keras.utils.to_categorical(y_test, num_classes = 2)
y_train
y_test
# print the shapes of training and testing datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Training_data = len(X_train) / (len(X_test) + len(X_train)) * 100
Training_data
Testing_data = len(X_test)/( len(X_test) + len(X_train) ) * 100
Testing_data
# Building and Training the Deep Learning Model
import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(50, input_dim=11, activation='relu'))
model.add(Dense(150, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(150, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(25, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(2,activation='softmax'))
model.summary()
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
epochs_hist = model.fit(X_train, y_train, epochs = 50, verbose = 1, validation_split = 0.1)
# Assess the performance of the model
print(epochs_hist.history.keys())
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])
plt.title('Model Loss Progression During Training/Validation')
plt.ylabel('Training and Validation Losses')
plt.xlabel('Epoch Number')
plt.legend(['Training Loss', 'Validation Loss'])
plt.show()
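# The output screenshots also include an accuracy curve; the original accuracy-plot cell is
# not reproduced above, so the lines below are a sketch of how it was presumably generated
# from the same training history object.
plt.plot(epochs_hist.history['accuracy'])
plt.plot(epochs_hist.history['val_accuracy'])
plt.title('Model Accuracy Progression During Training/Validation')
plt.ylabel('Training and Validation Accuracy')
plt.xlabel('Epoch Number')
plt.legend(['Training Accuracy', 'Validation Accuracy'])
plt.show()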
predicted = model.predict(X_test)
predicted_value = []
test = []
for i in predicted:
    predicted_value.append(np.argmax(i))
for i in y_test:
    test.append(np.argmax(i))
print(classification_report(test, predicted_value))
plt.figure(figsize=(10, 10))
cm=confusion_matrix(test, predicted_value)
sns.heatmap(cm, annot=True)
plt.show()
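# Module VI (Testing and Inference) describes saving the trained model and later loading it
# to classify unseen profiles. That cell is not included above, so the lines below are only a
# sketch of how it might look; the file name and the sample feature values are illustrative.
model.save('fake_account_ann.h5')

loaded_model = tf.keras.models.load_model('fake_account_ann.h5')

# One hypothetical profile with the same 11 input features as the training data,
# scaled with the scaler fitted earlier
sample = np.array([[1, 0.27, 0, 0.0, 0, 53, 0, 0, 32, 1000, 955]])
sample_scaled = scaler_x.transform(sample)
prediction = loaded_model.predict(sample_scaled)
print('fake' if np.argmax(prediction) == 1 else 'genuine')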