Machine Learning Project

This document summarizes two machine learning projects related to election data analysis and text analytics on presidential inaugural speeches. Project 1 analyzes survey data from 1525 voters to predict which political party they will vote for using logistic regression, LDA, KNN, and naive bayes models. The best model is determined by comparing performance metrics. Project 2 examines inaugural speeches from Roosevelt, Kennedy, and Nixon to analyze word counts, remove stop words, and create word clouds for each speech. The projects apply different machine learning techniques to numerical and text data to gain insights from political surveys and presidential addresses. Model performance is evaluated using metrics like accuracy and AUC to identify the most effective approaches.

Uploaded by

Aish Gupta


Machine Learning Projects

2/20/2022

Mayank Gupta
PGP-DSBA Online
Table of Contents

01. Problem 1: - Election Data
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?

02. Problem 2: - Inaugural Corpora
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words (after removing the stopwords).
2.4 Plot the word cloud of each of the speeches of the variable (after removing the stopwords).
List of Figures

1.2 Multivariate Analysis
1.2 Heatmap
2.4 Word cloud for 1941-Roosevelt.txt
2.4 Word cloud for 1961-Kennedy.txt
2.4 Word cloud for 1973-Nixon.txt

List of Tables

1.1 Statistical Description of Dataset
1.2 Model Summary Table
Executive Summary
This report combines two projects built on different machine learning concepts. The first
analyses numerical data, whereas the second is based entirely on text analytics. The main aim
is to gain a better understanding of machine learning concepts and their implementation. The
two projects are independent of each other: each has its own dataset and its own methods of
analysis. The report also presents the detailed inferences and insights obtained from analysing
and modelling the data.
Problem 1: - This project analyses election data. I assume that I have been hired by a media
channel that wants an analysis of a survey answered by 1525 people, with their answers
recorded in 9 variables. I have to build a model to predict which party a voter will vote for on
the basis of the given information, to create an exit poll that will help in predicting the overall
win and the seats covered by a particular party.
Problem 2: - In this project, I work on the inaugural corpora extracted from nltk in Python,
looking at the following speeches of Presidents of the United States of America:
• President Franklin D. Roosevelt in 1941
• President John F. Kennedy in 1961
• President Richard Nixon in 1973

Problem 1: - Election Data
Problem Statement
You are hired by CNBE, one of the leading news channels, which wants to analyze recent
elections. The survey was conducted on 1525 voters with 9 variables. You have to build a
model to predict which party a voter will vote for on the basis of the given information, to
create an exit poll that will help in predicting the overall win and the seats covered by a
particular party.
Dataset for Problem: Election_Data.xlsx
Summary of Dataset
Data on 9 variables has been collected from 1525 people, both male and female, each of
whom voted for either the Labour Party or the Conservative Party.
• Party: - The two parties contesting the election, namely the Labour Party and the
Conservative Party.
• Age: - Age of the voter who took the survey conducted by the CNBE news channel.
• Gender: - Gender of the voter
• economic.cond.national
• economic.cond.household
• Blair
• Hague
• Europe
• political.knowledge

1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.

Importing all of the relevant libraries

Checking for the data in the Dataset

Statistical Description of the Dataset

Looking for the null values in the Dataset

Finding datatypes of different variables

Finding total number of duplicate entries in the dataset

Finding Shape of the Dataset

Finding vote counts of each party

Observed Inferences

• The mean age of the voters is 54.18 years, with a standard deviation of about
15.71 years.
• There is a wide gap between the minimum and maximum ages of voters in the
sample dataset. The youngest voter recorded is 24 years old, whereas the oldest
is 93 years old.
• No entry in the dataset has null values.
• The dataset contains 8 duplicate entries.
• As per the vote count of the survey data, the Labour Party received 1063 votes
and the Conservative Party received 462 votes, which is less than half of the
Labour Party's count.
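
The checks listed above can be sketched with pandas. This is a minimal sketch rather than the report's actual code: the small DataFrame holds made-up rows standing in for Election_Data.xlsx, and the column names vote, age and gender are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for Election_Data.xlsx; with the real file one would
# load the survey with something like pd.read_excel("Election_Data.xlsx").
df = pd.DataFrame({
    "vote":   ["Labour", "Conservative", "Labour", "Labour", "Labour"],
    "age":    [43, 36, 35, 24, 24],
    "gender": ["female", "male", "male", "female", "female"],
})

summary = df.describe()                     # count, mean, std, min, quartiles, max
null_counts = df.isnull().sum()             # missing values per column
n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
vote_counts = df["vote"].value_counts()     # number of votes per party

print("shape:", df.shape)
print("duplicates:", n_duplicates)
print(null_counts)
print(vote_counts)
```

On the real survey the same calls would give the figures quoted in the inferences (mean age, null count, duplicate count and votes per party).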

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

Check for Outliers.

Univariate Analysis

Blair Boxplot

Political Knowledge Boxplot

Bivariate Analysis

Vote vs. economic_cond_national

Vote vs. economic_cond_household

Vote vs. Blair

Vote vs. Hague

Vote vs. Europe

Vote vs. political knowledge

Multivariate Analysis

Heatmap

Outlier Analysis
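
Boxplots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly; it assumes this is the rule behind the report's outlier check, and the scores are made-up values, not survey data.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the entries lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Made-up scores for illustration only.
scores = pd.Series([3, 3, 4, 3, 3, 4, 3, 11])
print(iqr_outliers(scores).tolist())
```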

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary
here or not? Data Split: Split the data into train and test (70:30).

Describing the dataset

Creating dummy variables by converting Party and Gender into integer columns that take
the values 0 and 1.

Changing column names: vote_Labour to IsLabour_or_not and gender_male to
IsMale_or_not

Scaling is needed for the further analysis. Age ranges from 24 to 93 while the encoded columns
take only the values 0 and 1, so without scaling age would dominate distance-based models
such as KNN.

Data Split in 70:30
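
The encoding, renaming and 70:30 split described in this section can be sketched as follows. The rows are made up, the source column names vote and gender are assumptions, and random_state is an arbitrary choice; only the renamed column names come from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the survey data.
df = pd.DataFrame({
    "vote":   ["Labour", "Conservative"] * 5,
    "age":    [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
    "gender": ["female", "male"] * 5,
})

# One-hot encode the string columns, keeping a single 0/1 column for each.
encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True, dtype=int)
encoded = encoded.rename(columns={
    "vote_Labour": "IsLabour_or_not",
    "gender_male": "IsMale_or_not",
})

X = encoded.drop(columns="IsLabour_or_not")   # predictors
y = encoded["IsLabour_or_not"]                # target: 1 = Labour, 0 = Conservative

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```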

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

Logistic Regression

Train Data

y_train_prob

Logistic Model Score of Train Data

AUC ROC curve for Logistic Regression Train

Test Data

y_test_prob

AUC ROC curve for Logistic Regression Test

Linear Discriminant Analysis

y_train_predict

AUC ROC curve for LDA Train

y_test_predict

AUC ROC curve for LDA Test
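
A minimal sketch of fitting both models and collecting accuracy and ROC AUC on the train and test sets. The data comes from make_classification as a synthetic stand-in for the encoded survey, and the hyperparameters are assumptions, so the numbers will not match the report.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (not the election survey).
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class, which the ROC curve and AUC need.
    y_train_prob = model.predict_proba(X_train)[:, 1]
    y_test_prob = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
        "train_auc": roc_auc_score(y_train, y_train_prob),
        "test_auc": roc_auc_score(y_test, y_test_prob),
    }

for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```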

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

KNN Model

Transforming dataset by applying zscore

KNN Model Score of train data

Confusion Matrix and Classification Report of train data

AUC ROC Curve KNN Train

KNN model score of Test Data

Confusion Matrix and Classification Report of test data

AUC ROC Curve KNN Test

KNN Model with n=7

KNN Model Score, Confusion Matrix and Classification Report of train data

KNN Model Score, Confusion Matrix and Classification Report of test data

KNeighborsClassifier(n_neighbors=5)

KNN Model Score, Confusion Matrix and Classification Report of train data

KNN Model Score, Confusion Matrix and Classification Report of test data

ac_score

AUC ROC curve after n classifier for train data set

AUC ROC curve after n classifier for test data set

Number of Neighbours K vs. Misclassification Error
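
The KNN steps in this section (z-score scaling, then comparing values of K by misclassification error) can be sketched as below. The data is synthetic, the range of K values is an assumption, and the error is measured on the test split purely for illustration.

```python
from scipy.stats import zscore
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the encoded survey.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_scaled = zscore(X)  # column-wise: each feature gets mean 0 and std 1

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1
)

# Misclassification error (1 - accuracy) for odd values of K.
errors = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    errors[k] = 1 - knn.score(X_test, y_test)

best_k = min(errors, key=errors.get)
print("error by K:", {k: round(e, 3) for k, e in errors.items()})
print("K with lowest error:", best_k)
```

Scaling the whole dataset before splitting mirrors the order of the captions above; fitting the scaler on the training split only would avoid leaking test-set statistics.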

Naive Bayes

Model Score, Confusion Matrix and Classification Report of train data

AUC ROC Curve for Train Data

Model Score, Confusion Matrix and Classification Report of test data

AUC ROC Curve for Test Data

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging),
and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model.

Bagging

Bagging Train

Model Score, Confusion Matrix and Classification Report of train data

AUC_ROC Curve Bagging Train

Bagging Test

Model Score, Confusion Matrix and Classification Report of test data

AUC_ROC Curve Bagging Test

Boosting Method

Ada Boost

Model Score, Confusion Matrix and Classification Report of train data

AUC_ROC Curve Boosting Train

Gradient Boosting

Model Score, Confusion Matrix and Classification Report of train data

AUC_ROC Curve Boosting Train

ADA Boosting Test

Model Score, Confusion Matrix and Classification Report of test data

AUC_ROC Curve Boosting Test

Gradient Boosting Test

Gradient Boosting AUC_ROC Curve Test
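
A hedged sketch of the three ensemble models named in this section: bagging with a Random Forest base estimator, AdaBoost and Gradient Boosting. The data is synthetic and every hyperparameter below is an assumption, not the report's tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded survey.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    # Base estimator passed positionally so the call does not depend on the
    # keyword name, which differs between scikit-learn versions.
    "Bagging": BaggingClassifier(
        RandomForestClassifier(n_estimators=20, random_state=1),
        n_estimators=10,
        random_state=1,
    ),
    "Ada Boost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient Boost": GradientBoostingClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }

for name, metrics in scores.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Comparing train_accuracy with test_accuracy for each model is one simple way to spot overfitting.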

Final Model: Compare the models and write inference which model is best/optimized.

A comparison of the accuracy, recall, model score and AUC score on the training and testing
data of the different models shows that the KNN model with n = 7 is the best-optimised of the
models tried: it has the highest test score (0.835), with only a small gap from its training
score (0.853).

1.8 Based on these predictions, what are the insights?

The predictions lead to the following results.

Method                        Train Data   AUC Score   Test Data   AUC Score
Logistic Regression           0.840        0.889       0.823       0.882
Linear Discriminant Analysis  0.837        0.889       0.819       0.884
KNN                           0.867        0.93        0.824       0.870
KNN (n=7)                     0.853        0.904       0.835       0.900
KNN (n=5)                     0.867        -           0.824       -
Naïve Bayes                   0.833        0.886       0.825       0.885
Bagging                       0.999        1.000       0.797       0.877
Boosting (ADA Boost)          0.847        0.913       0.819       0.877
Boosting (Gradient Boost)     0.886        0.950       0.831       0.904

Model Summary Table

The following inferences have been drawn from the above analysis.

• The data needed scaling to make it more uniform for analysis.
• Outliers are present in some variables.
• Training and testing with the different methods gave similar results, which suggests
that the data modelling, model tuning and scaling were done properly.
• Bagging showed a large difference between training and testing results; the other
models showed similar results or only a very small gap between the two.
• The mean age of the voters is 54.18 years, with a standard deviation of about
15.71 years.
• There is a wide gap between the minimum and maximum ages of voters in the
sample dataset. The youngest voter recorded is 24 years old, whereas the oldest
is 93 years old.
• No entry in the dataset has null values.
• The dataset contains 8 duplicate entries.
• As per the vote count of the survey data, the Labour Party received 1063 votes and
the Conservative Party received 462 votes, which is less than half of the Labour
Party's count.
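
The gap between training and testing results can be checked with simple arithmetic on the Model Summary Table. The sketch below assumes that the first and third numeric columns of that table are the train and test model scores; the values are copied from the table.

```python
# (train score, test score) pairs copied from the Model Summary Table.
scores = {
    "Logistic Regression": (0.840, 0.823),
    "Linear Discriminant Analysis": (0.837, 0.819),
    "KNN": (0.867, 0.824),
    "KNN (n=7)": (0.853, 0.835),
    "Naive Bayes": (0.833, 0.825),
    "Bagging": (0.999, 0.797),
    "Boosting (ADA Boost)": (0.847, 0.819),
    "Boosting (Gradient Boost)": (0.886, 0.831),
}

# Train-minus-test gap per model; a large gap points to overfitting.
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30} {gap:.3f}")

largest_gap = max(gaps, key=gaps.get)
best_test = max(scores, key=lambda name: scores[name][1])
print("largest gap:", largest_gap)
print("highest test score:", best_test)
```

Under that reading of the table, Bagging has by far the largest gap and KNN (n=7) the highest test score, which matches the inferences above.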

Problem 2: - Inaugural Corpora
Problem Statement

In this project, we work on the inaugural corpora from nltk in Python. We will be looking at
the following speeches of Presidents of the United States of America:

• President Franklin D. Roosevelt in 1941
• President John F. Kennedy in 1961
• President Richard Nixon in 1973

Code Snippet to extract the three speeches:

import nltk

nltk.download('inaugural')

from nltk.corpus import inaugural

inaugural.fileids()

inaugural.raw('1941-Roosevelt.txt')

inaugural.raw('1961-Kennedy.txt')

inaugural.raw('1973-Nixon.txt')

2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Importing all of the relevant txt files

1941-Roosevelt.txt

Total number of words

Total number of sentences

Total number of characters

1961-Kennedy.txt

Total number of words

Total number of sentences

Total number of characters

1973-Nixon.txt

Total number of words

Total number of sentences

Total number of characters
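
A minimal sketch of the three counts for one text. It uses naive regular expressions on a made-up sample sentence so that it runs without downloading the corpus; the helper text_counts is hypothetical. With the corpus available, inaugural.raw(fileid), inaugural.words(fileid) and inaugural.sents(fileid) would supply the text, tokens and sentences instead, and their counts would differ from this simple rule.

```python
import re

def text_counts(raw: str) -> dict:
    """Count characters, words and sentences in a raw text string."""
    words = re.findall(r"[A-Za-z']+", raw)
    # Naive rule: a sentence ends at '.', '!' or '?'.
    sentences = [part for part in re.split(r"[.!?]+", raw) if part.strip()]
    return {
        "characters": len(raw),
        "words": len(words),
        "sentences": len(sentences),
    }

# Made-up sample, not a quotation from any of the speeches.
sample = "We are one nation. We share one future!"
print(text_counts(sample))
```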

2.2 Remove all the stopwords from all three speeches.
Importing libraries for removing stopwords

1941-Roosevelt.txt

1961-Kennedy.txt

1973-Nixon.txt
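
Stopword removal can be sketched as a filter over lowercased tokens. The stopword set below is a small illustrative list and the sample sentence is made up; the report presumably uses a fuller list such as nltk.corpus.stopwords.words("english"), which is an assumption here.

```python
import re

# Small illustrative stopword set, not the full NLTK list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "our", "the", "to", "we"}

def remove_stopwords(raw: str) -> list:
    """Lowercase the text, split it into word tokens and drop stopwords."""
    tokens = re.findall(r"[a-z']+", raw.lower())
    return [token for token in tokens if token not in STOPWORDS]

# Made-up sample, not a quotation from any of the speeches.
sample = "We are the heirs of the nation and the keepers of our freedom."
print(remove_stopwords(sample))
```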

2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stopwords)
1941-Roosevelt.txt

Most frequent word

Top 3 Words

1961-Kennedy.txt

Most frequent word

Top 3 Words

1973-Nixon.txt

Most frequent word

Top 3 Words
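
Once the stopwords are removed, the most frequent word and the top three words follow from a frequency count. A minimal sketch with collections.Counter on a made-up token list (not the actual speech tokens):

```python
from collections import Counter

# Made-up tokens standing in for a speech with stopwords already removed.
tokens = [
    "nation", "freedom", "peace", "nation", "world",
    "freedom", "nation", "peace", "freedom", "nation",
]

counts = Counter(tokens)
top_three = counts.most_common(3)   # (word, count) pairs, most frequent first
most_frequent_word = top_three[0][0]

print(most_frequent_word)
print(top_three)
```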

2.4 Plot the word cloud of each of the speeches of the variable. (after removing
the stopwords)
1941-Roosevelt.txt

1961-Kennedy.txt

1973-Nixon.txt
