Machine Learning Projects
2/20/2022
Mayank Gupta
PGP-DSBA Online
Table of Contents
S.No. Topic
01. Problem 1: - Election Data
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?
02. Problem 2: - Inaugural Corpora
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words (after removing the stopwords).
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
List of Figures
S.No. Topic
1.2 Multivariate Analysis
1.2 Heatmap
2.4 Word cloud for 1941-Roosevelt.txt
2.4 Word cloud for 1961-Kennedy.txt
2.4 Word cloud for 1973-Nixon.txt
List of Tables
S.No. Topic
1.1 Statistical Description of Dataset
1.2 Model Summary Table
Executive Summary
This report brings together two projects built on different concepts of Machine
Learning. The first is based on the analysis of numerical data, whereas the second
is based entirely on text analytics. The main aim is to gain a better understanding
of machine learning concepts and their implementation. The two projects are
independent of each other: each has its own dataset and its own methods of analysis.
The report also presents the detailed inferences and insights obtained from
analysing and modelling the data.
Problem 1: - This project is based on the analysis of election data. I have assumed that I
have been hired by a media channel that wants me to analyse the data of a survey which
was answered by 1525 people, with their answers recorded in 9 variables. I have to build a
model to predict which party a voter will vote for on the basis of the given information, in
order to create an exit poll that will help in predicting the overall win and the seats covered
by a particular party.
Problem 2: - In this project, I am going to work on the inaugural corpora, which will be
extracted from nltk in Python. For this project, I will look at the following speeches of
different Presidents of the United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
Problem 1: - Election Data
Problem Statement
You are hired by one of the leading news channels CNBE who wants to analyze recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model, to predict which party a voter will vote for on the basis of the given information, to
create an exit poll that will help in predicting overall win and seats covered by a particular
party.
Dataset for Problem: Election_Data.xlsx
Summary of Dataset
There are 9 variables in total, for which data has been collected from 1525 people. Some of
the respondents are male and some are female. Each of them has voted for either the Labour
Party or the Conservative Party.
Party: - The two parties contesting the election, namely the Labour Party and the
Conservative Party.
Age: - Age of the voter who took the survey conducted by the CNBE news
channel.
Gender: - Gender of the voter
economic.cond.national
economic.cond.household
Blair
Hague
Europe
political.knowledge
1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.
Importing all of the relevant libraries
Checking for the data in the Dataset
Statistical Description of the Dataset
Looking for the null values in the Dataset
Finding datatypes of different variables
Finding total number of duplicate entries in the dataset
Finding Shape of the Dataset
Finding vote counts of each party
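The checks listed above could be reproduced along the following lines. This is a minimal sketch on a small made-up frame, not the original notebook code: the column names vote and gender are inferred from the dummy column names used later in this report, the name age and all values are hypothetical, and the real data would presumably be loaded with pd.read_excel("Election_Data.xlsx").

```python
import pandas as pd

# Hypothetical miniature of the survey data (values are made up for illustration).
# The real file would be loaded with: df = pd.read_excel("Election_Data.xlsx")
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour", "Labour"],
    "age": [43, 36, 35, 24, 35],
    "gender": ["female", "male", "male", "female", "male"],
    "Blair": [4, 4, 5, 2, 5],
})

summary = df.describe()                     # count, mean, std, min, quartiles, max
null_counts = df.isnull().sum()             # number of missing values per column
dtypes = df.dtypes                          # data type of each variable
n_duplicates = int(df.duplicated().sum())   # rows that repeat an earlier row exactly
shape = df.shape                            # (number of rows, number of columns)
vote_counts = df["vote"].value_counts()     # number of votes per party

print(summary, null_counts, dtypes, n_duplicates, shape, vote_counts, sep="\n\n")
```

On the real dataset, the same calls would give the figures quoted in the inferences (mean age, null count, duplicate count and votes per party).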
Observed Inferences
The mean age of the voters is 54.18 years, with a standard deviation of about
15.71 years.
The ages of the voters in the sample dataset span a wide range: the minimum
recorded age is 24 years, whereas the maximum recorded age is 93 years.
There are no null values in the dataset.
The dataset contains 8 duplicate entries.
As per the vote count of the survey data, the Labour Party received 1063 votes
and the Conservative Party received 462 votes, which is less than half of the
votes received by the Labour Party.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Check for Outliers.
Univariate Analysis
Blair Boxplot
Political Knowledge Boxplot
Bivariate Analysis
Vote vs. economic_cond_national
Vote vs. economic_cond_household
Vote vs. Blair
Vote vs. Hague
Vote vs. Europe
Vote vs. political knowledge
Multivariate Analysis
Heatmap
Outlier Analysis
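One common way to check a variable for outliers numerically is the 1.5 x IQR rule that boxplots use for their whiskers. The sketch below assumes that rule; the helper name iqr_outliers and the sample values are hypothetical and are not taken from the election data.

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5 x IQR boxplot whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

# Made-up values for illustration only.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 15], name="example")
outliers = iqr_outliers(values)
print(outliers.tolist())
```

Applying the helper to each numeric column gives a count of flagged points per variable, which can be compared against the boxplots.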
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary
here or not? Data Split: Split the data into train and test (70:30).
Describing the dataset
Creating dummy variables by converting Party and Gender into integer values, assigning
them the values 0 and 1.
Changing column names: vote_Labour to IsLabour_or_not and gender_male to
IsMale_or_not
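The encoding and renaming steps could look roughly like this. It is a sketch under assumptions: the source columns are taken to be named vote and gender (which is what the dummy names vote_Labour and gender_male suggest), the age column and all values are made up, and drop_first=True is assumed so that each string column yields a single 0/1 column.

```python
import pandas as pd

# Hypothetical rows for illustration; the real frame comes from Election_Data.xlsx.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "gender": ["female", "male", "male"],
})

# One 0/1 column per string variable; the first category of each is dropped.
encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True, dtype=int)

# Rename the dummy columns as described in the report.
encoded = encoded.rename(columns={
    "vote_Labour": "IsLabour_or_not",
    "gender_male": "IsMale_or_not",
})
print(encoded)
```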
Age ranges from 24 to 93 years, a much wider range than that of the encoded 0/1 variables,
so the data needs to be scaled before further analysis; otherwise, distance-based models such
as KNN would be dominated by the variables with the largest range.
Data Split in 70:30
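A 70:30 split can be sketched with scikit-learn's train_test_split. The feature matrix below is random stand-in data with the same number of rows as the survey and 8 predictor columns; the random_state value is an assumption, chosen only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 1525 rows and 8 predictors, with a 0/1 target.
rng = np.random.default_rng(1)
X = rng.normal(size=(1525, 8))
y = rng.integers(0, 2, size=1525)

# test_size=0.30 keeps roughly 70% of the rows for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```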
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
Logistic Regression
Train Data
y_train_prob
Logistic Model Score of Train Data
AUC ROC curve for Logistic Regression Train
Test Data
y_test_prob
AUC ROC curve for Logistic Regression Test
Linear Discriminant Analysis
y_train_predict
AUC ROC curve for LDA Train
y_test_predict
AUC ROC curve for LDA Test
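The Logistic Regression and LDA steps above follow the same fit, score and AUC pattern, which can be sketched as below. The data is a synthetic stand-in from make_classification, so the numbers it prints are not the report's results; max_iter=1000 and random_state=1 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded election data (8 predictors, binary target).
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class, used for the ROC curve and AUC.
    y_train_prob = model.predict_proba(X_train)[:, 1]
    y_test_prob = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "train_score": model.score(X_train, y_train),
        "train_auc": roc_auc_score(y_train, y_train_prob),
        "test_score": model.score(X_test, y_test),
        "test_auc": roc_auc_score(y_test, y_test_prob),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Passing the same probabilities to sklearn.metrics.roc_curve returns the false and true positive rates needed to draw the ROC plots.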
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model
Transforming dataset by applying zscore
KNN Model Score of train data
Confusion Matrix and Classification Report of train data
AUC ROC Curve KNN Train
KNN model score of Test Data
Confusion Matrix and Classification Report of test data
AUC ROC Curve KNN Test
KNN Model with n=7
KNN Model Score, Confusion Matrix and Classification Report of train data
KNN Model Score, Confusion Matrix and Classification Report of test data
KNeighborsClassifier(n_neighbors=5)
KNN Model Score, Confusion Matrix and Classification Report of train data
KNN Model Score, Confusion Matrix and Classification Report of test data
ac_score
AUC ROC curve after n classifier for train data set
AUC ROC curve after n classifier for test data set
Number of Neighbours K vs. Misclassification Error
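The z-score transformation and the K versus misclassification error comparison can be sketched as follows. The data is a synthetic stand-in, the range of odd K values from 1 to 19 is an assumption, and the error is taken as 1 minus the test accuracy.

```python
from scipy.stats import zscore
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)

# Standardise every column to mean 0 and standard deviation 1.
X_scaled = zscore(X, axis=0)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1
)

# Misclassification error on the test set for a range of odd K values.
errors = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    errors[k] = 1 - knn.score(X_test, y_test)

best_k = min(errors, key=errors.get)
print(errors)
print("K with the lowest misclassification error:", best_k)
```

The sketch scales the whole dataset before splitting, as the report describes. Fitting a StandardScaler on the training rows only is a common alternative that keeps test-set statistics out of the training step.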
Naive Bayes
Model Score, Confusion Matrix and Classification Report of train data
AUC ROC Curve for Train Data
Model Score, Confusion Matrix and Classification Report of test data
AUC ROC Curve for Test Data
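The model score, confusion matrix and classification report reported for each model can be produced as in this sketch for Naive Bayes. GaussianNB is assumed as the Naive Bayes variant, and the data is a synthetic stand-in rather than the election data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

y_test_predict = nb_model.predict(X_test)
test_score = nb_model.score(X_test, y_test)          # accuracy on the test set
cm = confusion_matrix(y_test, y_test_predict)        # rows: actual, columns: predicted
report = classification_report(y_test, y_test_predict)

print(test_score)
print(cm)
print(report)
```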
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging),
and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model.
Bagging
Bagging Train
Model Score, Confusion Matrix and Classification Report of train data
AUC_ROC Curve Bagging Train
Bagging Test
Model Score, Confusion Matrix and Classification Report of test data
AUC_ROC Curve Bagging Test
Boosting Method
Ada Boost
Model Score, Confusion Matrix and Classification Report of train data
AUC_ROC Curve Boosting Train
Gradient Boosting
Model Score, Confusion Matrix and Classification Report of train data
AUC_ROC Curve Boosting Train
ADA Boosting Test
Model Score, Confusion Matrix and Classification Report of test data
AUC_ROC Curve Boosting Test
Gradient Boosting Test
Gradient Boosting AUC_ROC Curve Test
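The three ensemble models can be compared with one loop, as sketched below. Wrapping a RandomForestClassifier inside a BaggingClassifier is one reading of "Random Forest should be applied for Bagging" and is an assumption, as are all hyperparameter values; the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded election data.
X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

ensembles = {
    # Bagging with a random forest as the base estimator (assumed setup).
    "Bagging": BaggingClassifier(
        RandomForestClassifier(n_estimators=20, random_state=1),
        n_estimators=10,
        random_state=1,
    ),
    "Ada Boost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

summary = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    summary[name] = {
        "train_score": train_score,
        "test_score": test_score,
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
        # A large positive gap points to overfitting on the training data.
        "gap": train_score - test_score,
    }

for name, metrics in summary.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The gap column makes it easy to see which model fits the training data much better than the test data.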
Final Model: Compare the models and write inference which model is best/optimized.
After an in-depth comparison of the accuracy, recall, model score, and AUC score on the
training and testing data across the different models, the KNN model with n = 7 appears
to be the most optimised model in this case.
1.8 Based on these predictions, what are the insights?
Based on these predictions, the following results were obtained.
Method                        Train Score  Train AUC  Test Score  Test AUC
Logistic Regression           0.840        0.889      0.823       0.882
Linear Discriminant Analysis  0.837        0.889      0.819       0.884
KNN                           0.867        0.93       0.824       0.870
KNN (n=7)                     0.853        0.904      0.835       0.900
KNN (n=5)                     0.867        -          0.824       -
Naïve Bayes                   0.833        0.886      0.825       0.885
Bagging                       0.999        1.000      0.797       0.877
Boosting (ADA Boost)          0.847        0.913      0.819       0.877
Boosting (Gradient Boost)     0.886        0.950      0.831       0.904
Model Summary Table
The following inferences have been drawn from the above data analysis.
The data needed scaling in order to make it more uniform for the analysis.
Outliers are present in some of the variables.
Training and testing with the different methods have given similar results, which
suggests that the data modelling, model tuning and scaling have been done properly.
Bagging has shown a large gap between its training and testing results, whereas the
other models have shown similar results or only a very small gap between testing and
training.
The mean age of the voters is 54.18 years, with a standard deviation of about
15.71 years.
The ages of the voters in the sample dataset span a wide range: the minimum
recorded age is 24 years, whereas the maximum recorded age is 93 years.
There are no null values in the dataset.
The dataset contains 8 duplicate entries.
As per the vote count of the survey data, the Labour Party received 1063 votes and
the Conservative Party received 462 votes, which is less than half of the votes
received by the Labour Party.
Problem 2: - Inaugural Corpora
Problem Statement
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
Code Snippet to extract the three speeches:
"
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
"
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Importing all of the relevant txt files
1941-Roosevelt.txt
Total number of words
Total number of sentences
Total number of characters
1961-Kennedy.txt
Total number of words
Total number of sentences
Total number of characters
1973-Nixon.txt
Total number of words
Total number of sentences
Total number of characters
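The three counts can be illustrated on a short made-up string. This is a library-free approximation that uses regular expressions; with the nltk corpus itself, len(inaugural.raw(fileid)), len(inaugural.words(fileid)) and len(inaugural.sents(fileid)) give the tokenizer-based counts, which will differ from a regex split (for example, nltk counts punctuation marks as tokens). The helper name text_counts and the sample text are hypothetical.

```python
import re

def text_counts(text: str) -> dict:
    """Count characters, words and sentences with simple regular expressions."""
    words = re.findall(r"[A-Za-z']+", text)
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }

# Made-up sample text, not taken from any of the speeches.
sample = "We meet today. The nation is strong! Will it stay strong?"
counts = text_counts(sample)
print(counts)
```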
2.2 Remove all the stopwords from all three speeches.
Importing libraries for removing stopwords
1941-Roosevelt.txt
1961-Kennedy.txt
1973-Nixon.txt
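Stopword removal amounts to lower-casing the text, splitting it into tokens and dropping every token found in a stopword list. The sketch below uses a tiny illustrative stopword set, which is an assumption made to keep the example self-contained; the fuller list is available as nltk.corpus.stopwords.words('english') after nltk.download('stopwords'). The sample text is made up.

```python
import re

# Tiny illustrative stopword set (an assumption). With nltk, use instead:
#   from nltk.corpus import stopwords
#   STOPWORDS = set(stopwords.words("english"))
STOPWORDS = {"the", "is", "it", "we", "will", "a", "of", "and", "to", "in"}

def remove_stopwords(text: str, stopwords=STOPWORDS) -> list:
    """Lower-case the text, tokenize it and drop the stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]

# Made-up sample text, not taken from any of the speeches.
sample = "We meet today. The nation is strong! Will it stay strong?"
filtered = remove_stopwords(sample)
print(filtered)
```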
2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stopwords)
1941-Roosevelt.txt
Most frequent word
Top 3 Words
1961-Kennedy.txt
Most frequent word
Top 3 Words
1973-Nixon.txt
Most frequent word
Top 3 Words
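Once the stopwords are removed, the most frequent word and the top three words can be read off a frequency count. The sketch below uses collections.Counter from the standard library; nltk.FreqDist offers the same most_common method. The token list is made up for illustration and does not reflect the actual results for any speech.

```python
from collections import Counter

def top_words(tokens, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

# Hypothetical tokens left after stopword removal (not real speech results).
tokens = ["nation"] * 4 + ["freedom"] * 3 + ["peace"] * 2 + ["world"]

top3 = top_words(tokens)
most_frequent_word = top3[0][0]
print(most_frequent_word, top3)
```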
2.4 Plot the word cloud of each of the speeches of the variable. (after removing
the stopwords)
1941-Roosevelt.txt
1961-Kennedy.txt
1973-Nixon.txt