GIET UNIVERSITY, GUNUPUR
SCHOOL OF ENGINEERING AND TECHNOLOGY
DEPARTMENT OF CSE (AIML)
Step 1: Import Python Libraries:-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Step 2: Read the Dataset:-
df= pd.read_csv('kerala.csv')
df.head(5)
Step 3: Explore the Dataset:-
1)df.info()
2)df.shape
3)df.describe()
4)df.corr(numeric_only=True)
Replace:- For the steps that follow, the target output must be numeric (0 and
1). So, we'll replace the values in the FLOODS column (YES, NO) with (1, 0)
respectively.
df['FLOODS'] = df['FLOODS'].replace(['YES', 'NO'], [1, 0]).astype(int)
df.head(5)
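As a hedged alternative to the replacement above, the sketch below encodes the labels with `Series.map`, which leaves any label that is not in the mapping as NaN so that malformed entries can be counted. The toy column and its deliberately malformed last value are hypothetical and not taken from kerala.csv.

```python
import pandas as pd

# Hypothetical labels for illustration; the last one is deliberately malformed.
toy = pd.DataFrame({"FLOODS": ["YES", "NO", "NO", "YES", "yes"]})

# map() returns NaN for every label that is missing from the dictionary.
encoded = toy["FLOODS"].map({"YES": 1, "NO": 0})

# Count the labels that could not be encoded.
unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print("unmapped labels:", unmapped)
```

If the unmapped count is not zero, the offending rows can be inspected before training instead of being silently kept as text.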
null values:- To find the percentage of null values in each column of the dataset
df.isnull().mean().sort_values(ascending=False) * 100
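To make the output of that expression concrete, here is a minimal sketch on a hypothetical three-column frame (the values are invented, not taken from kerala.csv). `isnull()` yields booleans, `mean()` turns them into the fraction of missing entries per column, and multiplying by 100 gives a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing rainfall readings.
toy = pd.DataFrame({
    "JUN": [500.0, np.nan, 620.0, 710.0],
    "JUL": [300.0, 410.0, np.nan, np.nan],
    "FLOODS": [1, 0, 0, 1],
})

# Percentage of missing values per column, largest first.
missing_pct = toy.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct)
```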
corr:- To identify the correlation between the columns using a heat map
NAME OF THE STUDENT: Arijeet Mishra ROLL NO – 21CSEAIML008
corr = df.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)
Step 3: Feature Selection:-
Start by importing SelectKBest and the chi2 scoring function:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
Next, define X & Y:-
X = df.iloc[:, 1:14]  # all features
Y = df.iloc[:, -1]    # target output (floods)
Select the top 3 features:-
best_features= SelectKBest(score_func=chi2, k=3)
fit= best_features.fit(X,Y)
Now we create data frames for the features and the score of each
feature:
df_scores= pd.DataFrame(fit.scores_)
df_columns= pd.DataFrame(X.columns)
Finally, we’ll combine all the features and their corresponding scores in
one data frame:
features_scores= pd.concat([df_columns, df_scores], axis=1)
features_scores.columns= ['Features', 'Score']
features_scores.sort_values(by = 'Score')
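The feature-selection steps above can be sketched end to end on synthetic data. Everything in this example is an assumption made for illustration: the column names A to D, the sample size, and the way the target is generated. The features are kept non-negative because the chi2 score function requires that.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # synthetic 0/1 target

# Synthetic, non-negative features; A and B are shifted upward when y == 1.
X = pd.DataFrame({
    "A": rng.uniform(0, 100, size=n) + 80 * y,
    "B": rng.uniform(0, 100, size=n) + 40 * y,
    "C": rng.uniform(0, 100, size=n),  # unrelated to y
    "D": rng.uniform(0, 100, size=n),  # unrelated to y
})

# Score every feature against the target and keep the best two.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# One row per feature, highest score first.
scores = pd.DataFrame({"Features": X.columns, "Score": selector.scores_})
scores = scores.sort_values(by="Score", ascending=False)

# get_support() returns a boolean mask over the original columns.
selected = list(X.columns[selector.get_support()])
print(scores)
print(selected)
```

Sorting with `ascending=False` puts the strongest features at the top of the table, which makes the top-k choice easier to read off.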
Step 4: Build the Model:-
X = df[['SEP', 'JUN', 'JUL']]  # the top 3 features
Y = df['FLOODS']               # the target output
Splitting the dataset into train and test:-
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.4, random_state=100)
Create a logistic regression model:-
logreg= LogisticRegression()
logreg.fit(X_train,y_train)
We predict whether a flood occurs (class 1) or not (class 0) for each test row
using the logistic regression model we created:-
y_pred = logreg.predict(X_test)
print(X_test)  # test dataset
print(y_pred)  # predicted values
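The split, fit and predict steps can be tried without the dataset. The sketch below uses synthetic features and a synthetic target (both assumptions, not the Kerala data) and contrasts `predict`, which returns class labels, with `predict_proba`, which returns the estimated likelihood of each class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data: three features, target is 1 when their noisy sum is large.
X = rng.uniform(0, 10, size=(300, 3))
y = (X.sum(axis=1) + rng.normal(0, 1, size=300) > 15).astype(int)

# Same split settings as in the text: 40% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=100)

model = LogisticRegression().fit(X_train, y_train)

labels = model.predict(X_test)              # hard 0/1 predictions
probs = model.predict_proba(X_test)[:, 1]   # estimated probability of class 1
print(labels[:5])
print(probs[:5].round(3))
```

For a binary model the predicted label is 1 whenever the class-1 probability exceeds 0.5.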
Step 5: Evaluate the Model’s Performance:-
• 5.1:- Mean Absolute Error(MAE):- MAE is the average of the absolute differences
between the actual and predicted values, so it reports the typical size of a
prediction error over the whole test set.
from sklearn.metrics import mean_absolute_error
print("MAE", mean_absolute_error(y_test, y_pred))
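As a small worked check, the sketch below computes MAE by hand on five hypothetical 0/1 labels and compares it with scikit-learn's result. With 0/1 labels every absolute error is either 0 or 1, so MAE coincides with the fraction of misclassified rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted labels, chosen by hand.
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0])

# Absolute errors are [0, 1, 0, 1, 0], so the mean is 2 / 5.
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)

# Fraction of rows where the prediction differs from the actual label.
error_rate = np.mean(y_true != y_hat)
print(mae_manual, mae_sklearn, error_rate)
```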
• 5.2:- Mean Squared Error(MSE)
MSE is a variation on mean absolute error: it averages the squared differences
between the actual and predicted values, so large errors are penalized more
heavily than small ones.
from sklearn.metrics import mean_squared_error
print("MSE", mean_squared_error(y_test, y_pred))
• 5.3:- Root Mean Squared Error(RMSE)
As the name implies, RMSE is simply the square root of the mean squared error,
which brings the error back to the units of the target.
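The text gives no code for RMSE, so here is a minimal sketch that takes the square root of scikit-learn's MSE with NumPy. The four value pairs are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

# Squared errors are [1, 0, 4, 1], so MSE = 6 / 4 = 1.5.
mse = mean_squared_error(y_true, y_hat)

# RMSE is the square root of MSE.
rmse = np.sqrt(mse)
print("MSE", mse)
print("RMSE", rmse)
```

With the variables from Step 4, the same pattern would be `np.sqrt(mean_squared_error(y_test, y_pred))`.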
• 5.4:- R Squared (R2)
The R2 score, also called the coefficient of determination, is one of the
performance evaluation measures for regression-based machine learning
models. Simply put, it measures how close the target data points are to the fitted
line. MAE and MSE depend on the scale of the target, whereas the R2 score is
scale-free: it compares the model against a baseline that always predicts the
mean of the target, a comparison that none of the other metrics gives. A score
of 1 is a perfect fit, and 0 is no better than that baseline.
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(r2)
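The baseline comparison behind R2 can be written out directly as 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below does this on hypothetical values and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

# Squared error of the model and of the mean-only baseline.
ss_res = np.sum((y_true - y_hat) ** 2)           # 6.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 14.75
r2_manual = 1 - ss_res / ss_tot

# Predicting the mean for every row scores exactly 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)
print(r2_manual, r2_score(y_true, y_hat), r2_baseline)
```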
Classification Report:-
A classification report evaluates the performance of a machine learning
model by the following 5 criteria:
• Accuracy is the fraction of all predictions that are correct. The
higher it is, the better.
• Recall is the ratio of true positives to the sum of true positives and
false negatives, i.e. the share of actual positives the model finds.
• Precision is the ratio of true positives to the sum of both true and
false positives.
• F-score combines precision and recall into one metric (their harmonic
mean). The closer its value is to 1, the better.
• Support is the number of actual occurrences of each class in the
dataset.
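The five criteria can be recomputed from the four confusion-matrix counts. The sketch below does so on ten hypothetical labels and checks each hand-computed value against the corresponding scikit-learn function.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical actual and predicted labels (4 positives, 6 negatives).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 8 / 10
precision = tp / (tp + fp)                            # 3 / 4
recall = tp / (tp + fn)                               # 3 / 4
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
support_pos, support_neg = tp + fn, tn + fp           # actual class counts

print(accuracy, precision, recall, f1, support_pos, support_neg)
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```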
from sklearn import metrics
from sklearn.metrics import classification_report
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
print('Recall: ', metrics.recall_score(y_test, y_pred,
      zero_division=1))
print('Precision:', metrics.precision_score(y_test, y_pred,
      zero_division=1))
print('CL Report:', metrics.classification_report(y_test,
      y_pred, zero_division=1))
ROC Curve:-
The receiver operating characteristic (ROC) curve plots the true positive rate
(sensitivity) against the false positive rate (1 - specificity) of the logistic
regression model across all classification thresholds.
From the ROC curve, we can calculate the area under the curve (AUC), whose
value ranges from 0 to 1. The closer it is to 1, the better the model separates
the two classes; 0.5 corresponds to random guessing.
• To determine the ROC curve, first get the predicted probability of the
positive class:-
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
• Then, calculate the true positive and false positive rates:-
false_positive_rate, true_positive_rate, _ = metrics.roc_curve(
    y_test, y_pred_proba)
• Next, calculate the AUC to see the model's performance:-
auc= metrics.roc_auc_score(y_test, y_pred_proba)
• Finally, plot the ROC curve:-
plt.plot(false_positive_rate,
true_positive_rate,label="AUC="+str(auc))
plt.title('ROC Curve')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
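To see what the AUC value means, the sketch below computes it three ways on four hypothetical scores: from the curve points, directly with `roc_auc_score`, and as the fraction of (positive, negative) pairs in which the positive row gets the higher score. The labels and scores are hand-picked for illustration and contain no ties.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve and the trapezoidal area under them.
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
auc_from_curve = metrics.auc(fpr, tpr)

# The same value computed directly from labels and scores.
auc_direct = metrics.roc_auc_score(y_true, y_score)

# Pairwise view: share of positive/negative pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])
print(auc_from_curve, auc_direct, pairwise)
```

Three of the four pairs are ranked correctly here, which is why all three values come out the same.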