DA WEEK – 9
REGRESSION [LINEAR AND LOGISTIC]
Regression: -
Regression is a statistical method used to model relationships between variables,
predicting outcomes based on input data.
Types Of Regression: -
Linear Regression
Polynomial Regression
Support Vector Regression
Decision Tree Regression
Random Forest Regression
Ridge Regression
Lasso Regression
Logistic Regression
Steps To Be Carried Out for Regression: -
Data collection
Understanding data
Choosing the regression type
Handling missing values and outliers
Feature engineering
Check for multi-collinearity
Split dataset [train, test]
Assumption check
Model evaluation
Interpret results
Make predictions
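Before walking through these steps in detail, a minimal end-to-end sketch of the workflow on a small synthetic dataset (hypothetical columns 'x' and 'y', not the poll data) could look like this:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Toy data with a roughly linear relationship plus noise
rng = np.random.default_rng(42)
toy = pd.DataFrame({'x': rng.uniform(0, 100, 200)})
toy['y'] = 3 * toy['x'] + rng.normal(0, 10, 200)
# Split, fit, predict, evaluate
X_train, X_test, y_train, y_test = train_test_split(toy[['x']], toy['y'], test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}, R-squared: {r2_score(y_test, y_pred):.2f}")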
Dataset Description: -
The dataset is based on a Teen Education Poll in which teens responded to the question "Are you getting a good education?". Over 5,400 high school students responded to this poll in a single day on the After School app. Nearly 58% of students said yes and 42% said no.
Regression Type: -
The dataset suits linear regression since it involves continuous percentages (PercentYes, PercentNo). To perform logistic regression, the PercentYes, PercentNo, and State columns were removed.
Linear Regression: -
Linear regression is a statistical technique that models the relationship between two
variables by fitting a straight line through data points.
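In standard notation (not specific to this dataset), the fitted line and the least-squares criterion used to estimate it are:
\[ y = \beta_0 + \beta_1 x + \varepsilon, \qquad (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \]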
Procedure To Do Linear Regression On A Dataset: -
Step 1: Data collection
Data is collected from all possible sources and the following libraries are imported.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading the dataset into the Jupyter notebook with the following command.
df = pd.read_csv(r'C:\Users\aksha\OneDrive\Desktop\afterschool-teen-education poll\pollingData.csv')
Step 2: Define objectives
Illustration:- To define the Dependent and Independent variables.
X = df[['Total']]
y = df['PercentYes']
Inference:- The independent variable is Total, as it does not depend on the other columns, and the dependent variable is PercentYes, since it is derived from the Yes and Total columns.
Step 3: Data cleansing
Illustration: - To handle missing data, detect and treat outliers, and remove
irrelevant features.
o Handle Missing Values
print("Missing values in each column:")
print(df.isnull().sum())
o Identify Outlier
def identify_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]
outliers_total = identify_outliers(df, 'Total')
outliers_percent_yes = identify_outliers(df, 'PercentYes')
print("Outliers in 'Total' votes:")
print(outliers_total)
print("Outliers in 'PercentYes':")
print(outliers_percent_yes)
o Treat outlier
df = df[~df.index.isin(outliers_total.index)]
df = df[~df.index.isin(outliers_percent_yes.index)]
o Remove irrelevant feature
df = df.drop(columns=['State'])
df.head()
Inference: - There are no missing values in the dataset. Outliers are detected and treated.
The State column is removed as it is an irrelevant feature for further analysis.
Step 4: Exploratory data analysis
Illustration: - To Visualize the data to understand relationships, trends, and
distributions.
o Scatter plots for total votes vs percent yes votes
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df['Total'], df['PercentYes'], color='blue')
plt.title('Scatterplot: Total Votes vs. Percent Yes')
plt.xlabel('Total Votes')
plt.ylabel('Percent Yes')
o Histograms for percent yes distribution
plt.subplot(1, 2, 2)
plt.hist(df['PercentYes'], bins=10, color='green', edgecolor='black')
plt.title('Histogram: Distribution of Percent Yes')
plt.xlabel('Percent Yes')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
o Correlation Heatmap
plt.figure(figsize=(6, 5))
correlation_matrix = df[['Total', 'PercentYes']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
Inference: - Scatter plots are used to represent linear relationships between variables,
histograms are used to understand the distribution of the data, and the correlation heatmap
is used to check relationships between predictors.
Step 5: Feature Engineering
Illustration: - To encode categorical variables, transform variables, and create
interaction terms.
o Encode categorical variables
# State was already dropped in Step 3, so the encoding is guarded and shown here for completeness
if 'State' in df.columns:
    df_encoded = pd.get_dummies(df, columns=['State'], drop_first=True)
    print(df_encoded.head())
o Log transformation
df['LogTotal'] = np.log1p(df['Total'])
print(df['LogTotal'])
o Interactive term
df['Total_PercentYes_Interaction'] = df['Total'] * df['PercentYes']
Inference: - Categorical variables are encoded using one-hot encoding, the Total variable is
transformed with a log transformation, and an interaction term is created between Total and
PercentYes.
Step 6: Check for multi-collinearity
Illustration: - To check for multi-collinearity of the dataset.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = df[['Total', 'LogTotal', 'Total_PercentYes_Interaction']]
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF values for features:")
print(vif_data)
Inference: - The Variance Inflation Factor (VIF) is used to ensure the independent variables are
not highly correlated with each other, since high multicollinearity can distort the model.
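For reference, the VIF of predictor j is derived from the R-squared obtained by regressing that predictor on all the other predictors; values above roughly 5-10 are commonly treated as a sign of problematic multicollinearity:
\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} \]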
Step 7: Split the data
Illustration: - Divide the data as training set and Testing set.
from sklearn.model_selection import train_test_split
X = df[['Total', 'LogTotal', 'Total_PercentYes_Interaction']]
y = df['PercentYes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
Inference: - The data is divided into training and testing sets (commonly 70-80% for
training, 20-30% for testing) to validate model performance on unseen data.
Step 8: Fit the model
Illustration: - To fit a model to the dataset.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
Inference: - The model is initialized and fitted, predictions are made on the testing set, and
the model is evaluated; the mean squared error and R-squared on the test data are
displayed.
Step 9: Evaluate Assumptions
Illustration: - To evaluate assumptions of linear regression.
o Linearity
y_train_pred = model.predict(X_train)  # predictions on the training set, needed for the residuals
residuals = y_train - y_train_pred
X_train_sm = sm.add_constant(X_train)
model_sm = sm.OLS(y_train, X_train_sm).fit()
y_train_pred_sm = model_sm.predict(X_train_sm)
residuals_sm = y_train - y_train_pred_sm
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
plt.scatter(y_train_pred, residuals, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
o Independence
from statsmodels.graphics.tsaplots import plot_acf
plt.subplot(2, 2, 2)
plot_acf(residuals_sm, lags=20, ax=plt.gca(), alpha=0.05)
plt.title('Autocorrelation of Residuals')
o Homoscedasticity
plt.subplot(2, 2, 3)
plt.scatter(y_train_pred_sm, residuals_sm, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted (statsmodels)')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
o Normality of Residuals
plt.subplot(2, 2, 4)
sm.qqplot(residuals_sm, line='s', ax=plt.gca())
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()
Histogram for residuals:
plt.figure(figsize=(6, 5))
plt.hist(residuals_sm, bins=20, color='blue', edgecolor='black')
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
Inference: - The assumptions checked are linearity (the relationship between the predictors
and the target is linear), independence (the residuals are independent of each other),
homoscedasticity (the residuals have constant variance), and normality of residuals (the
residuals follow a normal distribution).
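As a complement to these visual checks, the same assumptions can be probed with formal tests. A minimal sketch, assuming the residuals_sm, model_sm, and X_train_sm objects from Step 9 are still in memory (these specific tests are not part of the original report):
from statsmodels.stats.stattools import durbin_watson        # independence
from statsmodels.stats.diagnostic import het_breuschpagan    # homoscedasticity
from scipy.stats import shapiro                              # normality
dw = durbin_watson(residuals_sm)                             # values near 2 suggest little autocorrelation
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model_sm.resid, X_train_sm)
shapiro_stat, shapiro_pvalue = shapiro(residuals_sm)
print(f"Durbin-Watson: {dw:.2f}")
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")     # small p suggests heteroscedasticity
print(f"Shapiro-Wilk p-value: {shapiro_pvalue:.4f}") # small p suggests non-normal residuals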
Step 10: Model Evaluation
Illustration: - To evaluate the model using metrics.
from sklearn.metrics import mean_squared_error, r2_score
X_test_sm = sm.add_constant(X_test)
y_test_pred_sm = model_sm.predict(X_test_sm)
r2 = r2_score(y_test, y_test_pred_sm)
mse = mean_squared_error(y_test, y_test_pred_sm)
rmse = np.sqrt(mse)
n = X_test.shape[0]
p = X_test.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R-squared: {r2:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Adjusted R-squared: {adj_r2:.4f}"}
Inference: The R-squared, mean squared error, root mean squared error, and adjusted
R-squared are calculated and displayed. The model explains a good proportion of the
variance (63.32%) in the dependent variable, and the predictions are quite close to the
actual values, with low error metrics (MSE and RMSE).
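For reference, the adjusted R-squared printed above follows the standard formula, where n is the number of test observations and p the number of predictors:
\[ R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]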
Step 11: Interpret Results
Illustration:- To display the model's coefficients, p-values, and confidence intervals.
print(model_sm.summary())
coefficients = model_sm.params
p_values = model_sm.pvalues
conf_int = model_sm.conf_int()
print("\nCoefficients:")
print(coefficients)
print("\nP-values:")
print(p_values)
print("\nConfidence Intervals:")
print(conf_int)
Inference:- The model explains a substantial portion of the variance (60.9%). Total and the
interaction term significantly affect PercentYes, while LogTotal does not contribute significantly
to the model. The high condition number suggests potential multicollinearity or numerical
issues.
Step 12: Make Predictions
Illustration: - To make predictions on new data.
new_data = pd.DataFrame({
'Total': [500, 1500],
'LogTotal': np.log1p([500, 1500]),
'Total_PercentYes_Interaction': [500 * 0.6, 1500 * 0.5]
})
new_data_sm = sm.add_constant(new_data)
predictions = model_sm.predict(new_data_sm)
print("\nPredictions on new data:")
print(predictions)
Inference: - Predictions are made on new (unseen) data.
Logistic Regression: -
Logistic regression is a statistical method used for binary classification, modelling the
probability of an outcome using a logistic function.
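In standard form, the model relates the predictors to the probability of the positive class through the logistic (sigmoid) function, which is equivalent to a linear model on the log-odds:
\[ P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}, \qquad \log\frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \]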
Note: - Data collection, defining objectives, data cleansing, feature engineering, and
exploratory data analysis are the same as for linear regression. The State, PercentYes, and
PercentNo columns have been discarded as irrelevant features.
Procedure To Do Logistic Regression On A Dataset: -
Step 1: Exploratory data analysis
Illustration: To Understand the distribution of features and their relationship to
the target variable.
o Histograms
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
sns.histplot(df['Total'], kde=True, bins=20)
plt.title('Histogram of Total Responses')
plt.subplot(2, 2, 2)
sns.histplot(df['Yes'], kde=True, bins=20)
plt.title('Histogram of Yes Responses')
plt.subplot(2, 2, 3)
sns.histplot(df['No'], kde=True, bins=20)
plt.title('Histogram of No Responses')
plt.subplot(2, 2, 4)
sns.histplot(df['PercentYes'], kde=True, bins=20)
plt.title('Histogram of PercentYes')
plt.tight_layout()
plt.show()
o Scatter plots
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
sns.scatterplot(x='Total', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('Total Responses vs. PercentYes')
plt.subplot(2, 2, 2)
sns.scatterplot(x='Yes', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('Yes Responses vs. PercentYes')
plt.subplot(2, 2, 3)
sns.scatterplot(x='No', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('No Responses vs. PercentYes')
plt.tight_layout()
plt.show()
o Correlation matrix
corr_matrix = df[['Total', 'Yes', 'No', 'PercentYes']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Inference: Histograms and scatter plots are plotted for the continuous features to
see their relationships with the target, and the correlation matrix shows the
relationships among the numerical features.
Step 2: Feature Selection
Illustration: - Features are ranked by ANOVA F-value to select the top predictors.
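The selection code below uses X_poly and X_poly_df, which are not constructed anywhere in this report. A minimal, hypothetical sketch of how they could be built with PolynomialFeatures from the numeric columns (shown only so the code that follows can run; column choices are assumptions):
from sklearn.preprocessing import PolynomialFeatures
X_base = df[['Total', 'Yes', 'No']]                    # numeric predictors (assumed)
y = (df['Yes'] > df['No']).astype(int)                 # binary target, as derived in Step 3 below
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_base)
X_poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(X_base.columns), index=df.index)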
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k='all')
selector.fit(X_poly, y)
feature_scores = pd.DataFrame({
'Feature': X_poly_df.columns,
'Score': selector.scores_
}).sort_values(by='Score', ascending=False)
print("Feature importance scores:")
print(feature_scores)
top_features = feature_scores.head(10)['Feature']
X_selected = X_poly_df[top_features]
print("Data with selected top features:")
print(X_selected.head())
Inference: - Top features have the strongest relationships with the target variable.
Step 3: Addressing Class Imbalance
Illustration: - To find if one class is significantly underrepresented.
o Over sampling
df['Target'] = (df['Yes'] > df['No']).astype(int)
minority_class = df[df['Target'] == 1]
majority_class = df[df['Target'] == 0]
num_majority = len(majority_class)
num_minority = len(minority_class)
if num_minority < num_majority:
    num_to_oversample = num_majority - num_minority
    minority_oversampled = minority_class.sample(num_to_oversample, replace=True, random_state=42)
    df_oversampled = pd.concat([majority_class, minority_oversampled])
    print(f"Oversampled dataset shape: {df_oversampled.shape}")
else:
    print("No oversampling needed. Minority class is not smaller than the majority class.")
o Under sampling
if num_majority > num_minority:
    majority_class_undersampled = majority_class.sample(num_minority, random_state=42)
    df_undersampled = pd.concat([majority_class_undersampled, minority_class])
    print(f"Undersampled dataset shape: {df_undersampled.shape}")
else:
    print("No undersampling needed. Majority class is not larger than the minority class.")
o Class weights
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression
df['Target'] = (df['Yes'] > df['No']).astype(int)  # Binary target
X = df.drop(['Target', 'State'], axis=1)
y = df['Target']
print("Original target variable distribution:")
print(y.value_counts())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print("Training set target distribution:")
print(y_train.value_counts())
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))
print("Class weights:")
print(class_weights_dict)
model = LogisticRegression(class_weight=class_weights_dict, random_state=42)
model.fit(X_train, y_train)
print("Class weights passed to the model:")
print(model.class_weight)
Inference: - No class imbalance, so no sampling or class weighting
required.
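For reference, scikit-learn's class_weight='balanced' option (used above) derives each class weight from the class counts, so the rarer class receives the larger weight:
\[ w_k = \frac{n_{\text{samples}}}{n_{\text{classes}} \cdot n_k} \]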
Step 4: Check for multi-collinearity
Illustration:- Using the Variance Inflation Factor (VIF) to check for highly correlated
predictors, illustrated on a small toy DataFrame (named df_demo so the poll data in df is not overwritten).
from statsmodels.tools.tools import add_constant
df_demo = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': [2, 4, 6, 8, 10],
    'Feature3': [5, 4, 3, 2, 1]
})
X = add_constant(df_demo)
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
Inference:- Severe multicollinearity exists; features are highly correlated, causing
instability.
Step 5: Split the dataset
Illustration: - Random toy features are used to generate a binary target; the data is split
into training and testing sets.
X = pd.DataFrame({
'feature1': np.random.rand(100),
'feature2': np.random.rand(100),
'feature3': np.random.rand(100)
})
y = (X['feature1'] + X['feature2'] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Inference: - A binary target y is generated from the toy features; the 70/30 split is used to
check model generalization on unseen data.
Step 6: Fit the model
Illustration:- To fit the logistic regression model on standardized features.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # create the scaled arrays used below
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(penalty='l2', solver='liblinear', random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
Inference: - This code standardizes the features, fits a logistic regression model with L2
regularization, predicts class labels, and calculates probabilities for class 1.
Step 7: Evaluate assumptions
Illustration: - To evaluate assumptions such as Linearity of log odds,
Independence of observations, No multicollinearity, Sufficient sample size.
o Linearity of Log odds
df['PercentYes'] = df['Yes'] / df['Total']
df['Target'] = (df['PercentYes'] > 0.5).astype(int)
X = df[['Yes', 'No']]
y = df['Target']
X = sm.add_constant(X)
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary())
df['LogOdds'] = np.log(df['PercentYes'] / (1 - df['PercentYes'])) # Logit function
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='Yes', y='LogOdds', data=df)
plt.title('Log Odds vs Yes')
plt.xlabel('Yes')
plt.subplot(1, 2, 2)
sns.scatterplot(x='No', y='LogOdds', data=df)
plt.title('Log Odds vs No')
plt.xlabel('No')
plt.tight_layout()
plt.show()
o Independence of observations
X_train = pd.DataFrame({
'feature1': np.random.rand(100),
'feature2': np.random.rand(100)
})
y_train = (X_train['feature1'] + X_train['feature2'] > 1).astype(int)
X_with_const = sm.add_constant(X_train)
logit_model = sm.Logit(y_train, X_with_const)
result = logit_model.fit()
residuals = y_train - result.predict(X_with_const)
plot_acf(residuals, lags=30)
plt.show()
o No multicollinearity
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data
vif_data = calculate_vif(X_with_const)
print("Variance Inflation Factors:")
print(vif_data)
o Sufficient sample size
num_predictors = X_with_const.shape[1] - 1 # Subtract 1 for the intercept
sample_size = X_with_const.shape[0]
print(f"Sample size: {sample_size}")
print(f"Number of predictors: {num_predictors}")
if sample_size >= 10 * num_predictors:
    print("Sample size is adequate.")
else:
    print("Sample size may be insufficient.")
Inference: - The logit regression model shows perfect prediction with
complete separation, indicating non-identifiable parameters and failed
convergence after 35 iterations.
Step 8: Model Evaluation
Illustration:- To evaluate the model using metrics such as accuracy, confusion
matrix, precision, recall, F1 score, ROC Curve and AUC.
o Accuracy:
from sklearn.metrics import (accuracy_score, confusion_matrix, ConfusionMatrixDisplay,
    precision_score, recall_score, f1_score, roc_curve, roc_auc_score)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
o Confusion matrix:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1']).plot()
plt.title('Confusion Matrix')
plt.show()
o Precision:
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")
o Recall (Sensitivity):
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")
o F1 Score:
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.2f}")
o ROC Curve and AUC:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()
print(f"AUC: {roc_auc:.2f}")
Inference:- The model shows perfect accuracy, precision, recall, F1 score, and AUC, and a
clean confusion matrix, indicating flawless classification.
Step 9: Regularization
Illustration: - To regularize the data if it is overfitted.
o L2 Regularization: Penalizes large coefficients but does not shrink them exactly to
zero.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve,
roc_auc_score, ConfusionMatrixDisplay
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
X = pd.DataFrame({
'feature1': np.random.rand(100),
'feature2': np.random.rand(100),
'feature3': np.random.rand(100)
})
y = (X['feature1'] + X['feature2'] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_l2 = LogisticRegression(penalty='l2', solver='liblinear', random_state=42)
model_l2.fit(X_train_scaled, y_train)
y_pred_l2 = model_l2.predict(X_test_scaled)
y_pred_proba_l2 = model_l2.predict_proba(X_test_scaled)[:, 1]
print("L2 Regularization (Ridge) Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_l2):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_l2):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_l2):.2f}")
print(f"F1 Score: {f1_score(y_test, y_pred_l2):.2f}")
roc_auc_l2 = roc_auc_score(y_test, y_pred_proba_l2)
print(f"AUC: {roc_auc_l2:.2f}")
o L1 Regularization: Can shrink coefficients of less important variables to zero,
performing feature selection.
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)
model_l1.fit(X_train_scaled, y_train)
y_pred_l1 = model_l1.predict(X_test_scaled)
y_pred_proba_l1 = model_l1.predict_proba(X_test_scaled)[:, 1]
print("L1 Regularization (Lasso) Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_l1):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_l1):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_l1):.2f}")
print(f"F1 Score: {f1_score(y_test, y_pred_l1):.2f}")
roc_auc_l1 = roc_auc_score(y_test, y_pred_proba_l1)
print(f"AUC: {roc_auc_l1:.2f}")
Inference: - Both the Lasso (L1) and Ridge (L2) models achieved perfect performance,
suggesting possible overfitting.
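For reference, both penalized models minimize the logistic loss plus a penalty on the coefficients; only the penalty term differs (scikit-learn's C parameter is the inverse of the penalty strength λ):
\[ \text{L2 (Ridge): } \lambda \sum_j \beta_j^2, \qquad \text{L1 (Lasso): } \lambda \sum_j |\beta_j| \]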
Step 10: Interpret Results
Illustration:- To interpret the results of the model.
import numpy as np
import pandas as pd
coefficients = model.coef_[0]
feature_names = X.columns
odds_ratios = np.exp(coefficients)
interpretation_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print("Model Interpretation:")
print(interpretation_df)
Inference:- Feature1 and feature2 strongly increase odds, while feature3 has a
minor positive impact on the predicted outcome.
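For reference, the odds ratio reported for each feature is the exponentiated coefficient; a one-unit increase in feature j multiplies the odds of the positive class by
\[ \mathrm{OR}_j = e^{\beta_j} \]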
Step 11: Make Predictions
Illustration:- To make predictions on unseen data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# new_data is assumed to be a DataFrame with the same feature columns used for training
new_data_scaled = scaler.transform(new_data)  # reuse the scaler already fitted on the training data
predicted_probs = model.predict_proba(new_data_scaled)[:, 1]
threshold = 0.5
predicted_classes = (predicted_probs >= threshold).astype(int)
predictions_df = pd.DataFrame({
'Predicted_Probability': predicted_probs,
'Predicted_Class': predicted_classes
})
print("Predictions:")
print(predictions_df)
Inference: - The predictions show that the first instance has a very low predicted
probability (0.0062) and is classified as 0, while the second instance has a high
predicted probability (0.987) and is classified as 1.