
DA WEEK – 9

REGRESSION [LINEAR AND LOGISTIC]


Regression: -
Regression is a statistical method used to model relationships between variables,
predicting outcomes based on input data.

Types Of Regression: -
 Linear Regression
 Polynomial Regression
 Support Vector Regression
 Decision Tree Regression
 Random Forest Regression
 Ridge Regression
 Lasso Regression
 Logistic Regression
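
Most of these regression types can be tried with the same fit/predict workflow. The sketch below compares a few of them on synthetic data (not the poll dataset); the model choices and hyperparameters are illustrative assumptions, not recommendations.

# A minimal sketch comparing a few of the regression types listed above on
# synthetic data; all of them share the scikit-learn fit/predict interface.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 2, size=200)   # noisy linear relationship

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=3),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: R-squared = {r2_score(y, model.predict(X)):.3f}")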

Steps To Be Carried Out for Regression: -


 Data collection
 Understanding data
 Choosing the regression type
 Handling missing values and outliers
 Feature engineering
 Check for multi-collinearity
 Split dataset [train, test]
 Assumption check
 Model evaluation
 Interpret results
 Make predictions
Dataset Description: -
The dataset is based on a Teen Education Poll in which teens responded to the question
"Are you getting a good education?" Over 54,00 high school students responded to this
poll in a single day on the After School app. Nearly 58% of students said yes and 42%
said no.

Regression Type: -
The dataset suits linear regression since it involves continuous percentages (PercentYes
and PercentNo). To perform logistic regression, the PercentYes, PercentNo, and State
columns were removed.

Linear Regression: -
Linear regression is a statistical technique that models the relationship between two
variables by fitting a straight line through data points.
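As a minimal illustration of the idea (toy numbers, not the poll data), a straight line y = b0 + b1*x can be fitted by least squares:

# Illustrative sketch: fit a line y = b0 + b1*x to toy data with numpy's polyfit.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
print(f"Fitted line: y = {b0:.2f} + {b1:.2f} * x")
print("Prediction at x = 6:", b0 + b1 * 6)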
Procedure To Do Linear Regression On A Dataset: -
 Step 1: Data collection

Data is collected from all possible sources, and the following libraries are
imported.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Loading the dataset into the Jupyter notebook with the following command.
df = pd.read_csv(r'C:\Users\aksha\OneDrive\Desktop\afterschool-teen-education poll\pollingData.csv')

 Step 2: Define objectives


Illustration: - To define the dependent and independent variables.

X = df[['Total']]
y = df['PercentYes']

Inference: - The independent variable is Total, since it does not directly depend on other
values, and the dependent variable is PercentYes, since it depends on the Yes and Total columns.

 Step 3: Data cleansing


Illustration: - To handle missing data, detect and treat outliers, and remove
irrelevant features.
o Handle Missing Values
print("Missing values in each column:")
print(df.isnull().sum())

o Identify Outlier
def identify_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]
outliers_total = identify_outliers(df, 'Total')
outliers_percent_yes = identify_outliers(df, 'PercentYes')
print("Outliers in 'Total' votes:")
print(outliers_total)
print("Outliers in 'PercentYes':")
print(outliers_percent_yes)
o Treat outlier
df = df[~df.index.isin(outliers_total.index)]
df = df[~df.index.isin(outliers_percent_yes.index)]

o Remove irrelevant feature


df = df.drop(columns=['State'])
df.head()

Inference: - There are no missing values in the dataset. Outliers are detected and treated.
The State column is removed as it is an irrelevant feature for further analysis.

 Step 4: Exploratory data analysis


Illustration: - To visualize the data to understand relationships, trends, and distributions.
o Scatter plots for total votes vs percent yes votes
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df['Total'], df['PercentYes'], color='blue')
plt.title('Scatterplot: Total Votes vs. Percent Yes')
plt.xlabel('Total Votes')
plt.ylabel('Percent Yes')

o Histograms for percent yes distribution


plt.subplot(1, 2, 2)
plt.hist(df['PercentYes'], bins=10, color='green', edgecolor='black')
plt.title('Histogram: Distribution of Percent Yes')
plt.xlabel('Percent Yes')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
o Correlation Heatmap
plt.figure(figsize=(6, 5))
correlation_matrix = df[['Total', 'PercentYes']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

Inference: - Scatter plots are used to represent linear relationships between variables,
histograms are used to understand the distribution of the data, and the correlation heatmap
is used to check relationships between predictors.

 Step 5: Feature Engineering


Illustration: - To encode categorical variables, transform variables, and create
interaction terms.
o Encode Categorical variables
# Note: this assumes the State column has not yet been dropped (it is removed in Step 3)
df_encoded = pd.get_dummies(df, columns=['State'], drop_first=True)
print(df_encoded.head())
o Log transformation
df['LogTotal'] = np.log1p(df['Total'])
print(df['LogTotal'])

o Interactive term
df['Total_PercentYes_Interaction'] = df['Total'] * df['PercentYes']
Inference: - Categorical variables are encoded using one-hot encoding, variables are
transformed using a log transformation, and interaction terms are created between variables.

 Step 6: Check for multi-collinearity


Illustration: - To check for multi-collinearity of the dataset.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = df[['Total', 'LogTotal', 'Total_PercentYes_Interaction']]
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF values for features:")
print(vif_data)

Inference: - The Variance Inflation Factor (VIF) is used to ensure the independent variables
are not highly correlated with each other, since high multicollinearity can distort the model.
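
As a hedged follow-up to the VIF table above, features exceeding a common rule-of-thumb threshold (10 is used here as an assumption; 5 is also common) can be flagged as candidates for removal or combination:

# Flag features whose VIF exceeds the threshold, using the vif_data table computed above.
high_vif = vif_data[(vif_data["Feature"] != "const") & (vif_data["VIF"] > 10)]
if high_vif.empty:
    print("No features exceed the VIF threshold.")
else:
    print("Features with high multicollinearity:")
    print(high_vif)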

 Step 7: Split the data


Illustration: - Divide the data into a training set and a testing set.
from sklearn.model_selection import train_test_split
X = df[['Total', 'LogTotal', 'Total_PercentYes_Interaction']]
y = df['PercentYes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
Inference: - Divided the data into training and testing sets (commonly 70-80% for
training, 20-30% for testing) to validate model performance on unseen data.

 Step 8: Fit the model


Illustration: - To fit the model to the dataset.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

Inference: - The model is initialized and fitted, predictions are made on the testing set,
and the model is evaluated; the mean squared error and R-squared of the test dataset are
displayed.

 Step 9: Evaluate Assumptions


Illustration: - To evaluate assumptions of linear regression.
o Linearity
y_train_pred = model.predict(X_train)   # predictions from the fitted sklearn model
residuals = y_train - y_train_pred
X_train_sm = sm.add_constant(X_train)
model_sm = sm.OLS(y_train, X_train_sm).fit()
y_train_pred_sm = model_sm.predict(X_train_sm)
residuals_sm = y_train - y_train_pred_sm

plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
plt.scatter(y_train_pred, residuals, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')

o Independence
from statsmodels.graphics.tsaplots import plot_acf
plt.subplot(2, 2, 2)
plot_acf(residuals_sm, lags=20, ax=plt.gca(), alpha=0.05)
plt.title('Autocorrelation of Residuals')

o Homoscedasticity
plt.subplot(2, 2, 3)
plt.scatter(y_train_pred_sm, residuals_sm, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted (statsmodels)')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')

o Normality of Residuals
plt.subplot(2, 2, 4)
sm.qqplot(residuals_sm, line='s', ax=plt.gca())
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()

Histogram for residuals:


plt.figure(figsize=(6, 5))
plt.hist(residuals_sm, bins=20, color='blue', edgecolor='black')
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

Inference: - The assumptions checked are linearity (the relationship between the predictors
and the target is linear), independence (the residuals are independent of each other),
homoscedasticity (the residuals have constant variance), and normality of residuals (the
residuals follow a normal distribution).
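
These visual checks can be complemented with numeric tests. The sketch below assumes the residuals_sm and X_train_sm objects from this step are still in scope:

# Numeric assumption checks that complement the plots above.
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

dw = durbin_watson(residuals_sm)                        # values near 2 suggest independent residuals
bp_lm, bp_pvalue, _, _ = het_breuschpagan(residuals_sm, X_train_sm)  # p > 0.05 suggests homoscedasticity
shapiro_stat, shapiro_p = shapiro(residuals_sm)         # p > 0.05 suggests approximately normal residuals

print(f"Durbin-Watson statistic: {dw:.2f}")
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")
print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")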

 Step 10: Model Evaluation


Illustration: - To evaluate the model using metrics.
from sklearn.metrics import mean_squared_error, r2_score
X_test_sm = sm.add_constant(X_test)
y_test_pred_sm = model_sm.predict(X_test_sm)
r2 = r2_score(y_test, y_test_pred_sm)
mse = mean_squared_error(y_test, y_test_pred_sm)
rmse = np.sqrt(mse)
n = X_test.shape[0]
p = X_test.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R-squared: {r2:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Adjusted R-squared: {adj_r2:.4f}"}

Inference: The R-squared, mean squared error, root mean squared error, and adjusted
R-squared are calculated and displayed. The model explains a good proportion of the
variance (63.32%) in the dependent variable, and the predictions are quite close to the
actual values, with low error metrics (MSE and RMSE).

 Step 11: Interpret Results


Illustration: - To display the model's coefficients, p-values, and confidence intervals.
print(model_sm.summary())
coefficients = model_sm.params
p_values = model_sm.pvalues
conf_int = model_sm.conf_int()
print("\nCoefficients:")
print(coefficients)
print("\nP-values:")
print(p_values)
print("\nConfidence Intervals:")
print(conf_int)

Inference: - The model explains a substantial portion of the variance (60.9%). Total and the
interaction term significantly affect PercentYes, while LogTotal does not contribute
significantly to the model. The high condition number suggests potential multicollinearity
or numerical issues.

 Step 12: Make Predictions


Illustration: - To make predictions on new data.
new_data = pd.DataFrame({
    'Total': [500, 1500],
    'LogTotal': np.log1p([500, 1500]),
    'Total_PercentYes_Interaction': [500 * 0.6, 1500 * 0.5]
})
new_data_sm = sm.add_constant(new_data)
predictions = model_sm.predict(new_data_sm)
print("\nPredictions on new data:")
print(predictions)

Inference: - Predictions are made on new (unseen) data.

Logistic Regression: -
Logistic regression is a statistical method used for binary classification, modelling the
probability of an outcome using a logistic function.
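
The logistic function itself maps any linear combination of predictors z = b0 + b1*x1 + ... to a probability between 0 and 1, as the small sketch below illustrates:

# Sketch of the logistic (sigmoid) function used by logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-4, -1, 0, 1, 4]:
    print(f"z = {z:+d} -> probability = {sigmoid(z):.3f}")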

Note: - The data collection, objective definition, data cleansing, feature engineering, and
exploratory data analysis steps are the same as for linear regression. The State, PercentYes,
and PercentNo columns have been discarded as irrelevant features.

Procedure To Do Logistic Regression On A Dataset: -


 Step 1: Exploratory data analysis
Illustration: To understand the distribution of features and their relationship to the
target variable.
o Histograms
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
sns.histplot(df['Total'], kde=True, bins=20)
plt.title('Histogram of Total Responses')
plt.subplot(2, 2, 2)
sns.histplot(df['Yes'], kde=True, bins=20)
plt.title('Histogram of Yes Responses')
plt.subplot(2, 2, 3)
sns.histplot(df['No'], kde=True, bins=20)
plt.title('Histogram of No Responses')
plt.subplot(2, 2, 4)
sns.histplot(df['PercentYes'], kde=True, bins=20)
plt.title('Histogram of PercentYes')
plt.tight_layout()
plt.show()

o Scatter plots
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
sns.scatterplot(x='Total', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('Total Responses vs. PercentYes')
plt.subplot(2, 2, 2)
sns.scatterplot(x='Yes', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('Yes Responses vs. PercentYes')
plt.subplot(2, 2, 3)
sns.scatterplot(x='No', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('No Responses vs. PercentYes')
plt.tight_layout()
plt.show()
o Correlation matrix
corr_matrix = df[['Total', 'Yes', 'No', 'PercentYes']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.show()

Inference: Histograms and scatter plots are plotted for the continuous features to show
their relationships with the target, and the correlation matrix summarizes the
relationships among the numerical features.

 Step 2: Feature Selection


Illustration: - Features are ranked by ANOVA F-value to select top predictors.

from sklearn.feature_selection import SelectKBest, f_classif


# X_poly and X_poly_df are assumed to come from an earlier polynomial feature
# expansion step that is not shown here.
selector = SelectKBest(score_func=f_classif, k='all')
selector.fit(X_poly, y)
feature_scores = pd.DataFrame({
    'Feature': X_poly_df.columns,
    'Score': selector.scores_
}).sort_values(by='Score', ascending=False)
print("Feature importance scores:")
print(feature_scores)
top_features = feature_scores.head(10)['Feature']
X_selected = X_poly_df[top_features]
print("Data with selected top features:")
print(X_selected.head())

Inference: - Top features have the strongest relationships with the target variable.

 Step 3: Addressing Class Imbalance


Illustration: - To find if one class is significantly underrepresented.

o Over sampling
df['Target'] = (df['Yes'] > df['No']).astype(int)
minority_class = df[df['Target'] == 1]
majority_class = df[df['Target'] == 0]
num_majority = len(majority_class)
num_minority = len(minority_class)
if num_minority < num_majority:
    num_to_oversample = num_majority - num_minority
    minority_oversampled = minority_class.sample(num_to_oversample, replace=True, random_state=42)
    df_oversampled = pd.concat([majority_class, minority_oversampled])
    print(f"Oversampled dataset shape: {df_oversampled.shape}")
else:
    print("No oversampling needed. Minority class is not smaller than the majority class.")

o Under sampling
if num_majority > num_minority:
    majority_class_undersampled = majority_class.sample(num_minority, random_state=42)
    df_undersampled = pd.concat([majority_class_undersampled, minority_class])
    print(f"Undersampled dataset shape: {df_undersampled.shape}")
else:
    print("No undersampling needed. Majority class is not larger than the minority class.")
o Class weights
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

df['Target'] = (df['Yes'] > df['No']).astype(int)  # Binary target
X = df.drop(['Target', 'State'], axis=1)
y = df['Target']
print("Original target variable distribution:")
print(y.value_counts())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print("Training set target distribution:")
print(y_train.value_counts())
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))
print("Class weights:")
print(class_weights_dict)
model = LogisticRegression(class_weight=class_weights_dict, random_state=42)
model.fit(X_train, y_train)
print("Model class weights used:")
print(model.class_weight)   # the class-weight mapping passed to the model

Inference: - No class imbalance, so no sampling or class weighting is required.
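
Had an imbalance been present, SMOTE from the separate imbalanced-learn package (assumed installed; features must be numeric) is a common alternative to simple resampling, since it synthesizes new minority samples rather than duplicating existing ones:

# Sketch of SMOTE oversampling, assuming the numeric X and y defined above.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("Class distribution after SMOTE:")
print(pd.Series(y_resampled).value_counts())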

 Step 4: Check for multi-collinearity


Illustration:- Using Variance Inflation Factor (VIF) to check for highly correlated
predictors.

from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy DataFrame (Feature1 and Feature2 are perfectly correlated) used only to
# illustrate how VIF flags multicollinearity.
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': [2, 4, 6, 8, 10],
    'Feature3': [5, 4, 3, 2, 1]
})
X = add_constant(df)
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
Inference:- Severe multicollinearity exists; features are highly correlated, causing
instability.

 Step 5: Split The dataset


Illustration: - Random features are generated, a binary target is derived from them, and
the data is split into training and testing sets.

X = pd.DataFrame({
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100),
    'feature3': np.random.rand(100)
})
y = (X['feature1'] + X['feature2'] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Inference: - Binary classification predicts y; the 70/30 split is used to assess model
generalization on held-out data.

 Step 6: Fit the model


Illustration: - To fit the logistic regression model.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # scale features before fitting
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(penalty='l2', solver='liblinear', random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

Inference: - This code fits a logistic regression model with L2 regularization, predicts
class labels, and calculates probabilities for class 1.

 Step 7: Evaluate assumptions


Illustration: - To evaluate assumptions such as Linearity of log odds,
Independence of observations, No multicollinearity, Sufficient sample size.

o Linearity of Log odds


df['PercentYes'] = df['Yes'] / df['Total']
df['Target'] = (df['PercentYes'] > 0.5).astype(int)
X = df[['Yes', 'No']]
y = df['Target']
X = sm.add_constant(X)
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary())
df['LogOdds'] = np.log(df['PercentYes'] / (1 - df['PercentYes'])) # Logit function
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='Yes', y='LogOdds', data=df)
plt.title('Log Odds vs Yes')
plt.xlabel('Yes')
plt.subplot(1, 2, 2)
sns.scatterplot(x='No', y='LogOdds', data=df)
plt.title('Log Odds vs No')
plt.xlabel('No')
plt.tight_layout()
plt.show()

o Independence of observations
X_train = pd.DataFrame({
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100)
})
y_train = (X_train['feature1'] + X_train['feature2'] > 1).astype(int)
X_with_const = sm.add_constant(X_train)
logit_model = sm.Logit(y_train, X_with_const)
result = logit_model.fit()
residuals = y_train - result.predict(X_with_const)
plot_acf(residuals, lags=30)
plt.show()

o No multicollinearity
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data
vif_data = calculate_vif(X_with_const)
print("Variance Inflation Factors:")
print(vif_data)

o Sufficient sample size


num_predictors = X_with_const.shape[1] - 1  # Subtract 1 for the intercept
sample_size = X_with_const.shape[0]
print(f"Sample size: {sample_size}")
print(f"Number of predictors: {num_predictors}")
if sample_size >= 10 * num_predictors:
    print("Sample size is adequate.")
else:
    print("Sample size may be insufficient.")

Inference: - The logit regression model shows perfect prediction with complete separation,
indicating non-identifiable parameters and failed convergence after 35 iterations.
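
One common workaround for complete separation (a sketch, assuming the X and y from the log-odds check above are in scope) is to fit a penalized logit so the coefficients stay finite:

# Penalized (L1-regularized) logit keeps coefficients finite under complete separation.
penalized_result = sm.Logit(y, X).fit_regularized(method='l1', alpha=1.0, disp=False)
print("Penalized logit coefficients:")
print(penalized_result.params)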

 Step 8: Model Evaluation


Illustration:- To evaluate the model using metrics such as accuracy, confusion
matrix, precision, recall, F1 score, ROC Curve and AUC.

o Accuracy:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, roc_curve, roc_auc_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

o Confusion matrix:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1']).plot()
plt.title('Confusion Matrix')
plt.show()
o Precision:
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")

o Recall (Sensitivity):
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")

o F1 Score:
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.2f}")

o ROC Curve and AUC:


fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()
print(f"AUC: {roc_auc:.2f}")

Inference: - The model shows perfect accuracy, confusion matrix, precision, recall, ROC and
AUC, and F1 score, indicating flawless classification.

 Step 9: Regularization
Illustration: - To regularize the model if it is overfitting.
o L2 Regularization: Penalizes large coefficients, shrinking them toward zero but not
exactly to zero.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve,
roc_auc_score, ConfusionMatrixDisplay
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
X = pd.DataFrame({
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100),
    'feature3': np.random.rand(100)
})
y = (X['feature1'] + X['feature2'] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_l2 = LogisticRegression(penalty='l2', solver='liblinear', random_state=42)
model_l2.fit(X_train_scaled, y_train)
y_pred_l2 = model_l2.predict(X_test_scaled)
y_pred_proba_l2 = model_l2.predict_proba(X_test_scaled)[:, 1]
print("L2 Regularization (Ridge) Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_l2):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_l2):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_l2):.2f}")
print(f"F1 Score: {f1_score(y_test, y_pred_l2):.2f}")
roc_auc_l2 = roc_auc_score(y_test, y_pred_proba_l2)
print(f"AUC: {roc_auc_l2:.2f}")

o L1 Regularization: Shrinks the coefficients of less important variables to exactly zero,
effectively performing feature selection.

model_l1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)


model_l1.fit(X_train_scaled, y_train)
y_pred_l1 = model_l1.predict(X_test_scaled)
y_pred_proba_l1 = model_l1.predict_proba(X_test_scaled)[:, 1]
print("L1 Regularization (Lasso) Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_l1):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_l1):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_l1):.2f}")
print(f"F1 Score: {f1_score(y_test, y_pred_l1):.2f}")
roc_auc_l1 = roc_auc_score(y_test, y_pred_proba_l1)
print(f"AUC: {roc_auc_l1:.2f}")
Inference: - Both the Lasso and Ridge models achieved perfect performance, suggesting
possible overfitting.
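
One way to address the possible overfitting is to tune the regularization strength by cross-validation rather than fixing it. The sketch below assumes the scaled training data (X_train_scaled, y_train) from this step is in scope:

# Tune C and the penalty type with GridSearchCV instead of using fixed values.
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    param_grid, cv=5, scoring='roc_auc'
)
grid.fit(X_train_scaled, y_train)
print("Best parameters:", grid.best_params_)
print(f"Best cross-validated AUC: {grid.best_score_:.3f}")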

 Step 10: Interpret Results


Illustration:- To interpret the results of the model.
import numpy as np
import pandas as pd
coefficients = model.coef_[0]
feature_names = X.columns
odds_ratios = np.exp(coefficients)
interpretation_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print("Model Interpretation:")
print(interpretation_df)

Inference:- Feature1 and feature2 strongly increase odds, while feature3 has a
minor positive impact on the predicted outcome.

 Step 11: Make Predictions


Illustration:- To make predictions on unseen data.
import pandas as pd
# new_data is assumed to be a DataFrame with the same feature columns used in training.
# Reuse the scaler fitted on the training data; do not refit it on the new data.
new_data_scaled = scaler.transform(new_data)
predicted_probs = model.predict_proba(new_data_scaled)[:, 1]
threshold = 0.5
predicted_classes = (predicted_probs >= threshold).astype(int)
predictions_df = pd.DataFrame({
    'Predicted_Probability': predicted_probs,
    'Predicted_Class': predicted_classes
})
print("Predictions:")
print(predictions_df)

Inference: - The predictions show that the first instance has a very low predicted
probability (0.0062) and is classified as 0, while the second instance has a high
predicted probability (0.987) and is classified as 1.
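
The 0.5 threshold used above is a default rather than a requirement. A threshold can instead be chosen from the precision-recall trade-off (a sketch, assuming y_test and y_pred_proba from the evaluation step are in scope):

# Choose the classification threshold that maximizes F1 on the test set.
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[f1_scores.argmax()]
print(f"Threshold that maximizes F1: {best_threshold:.3f}")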
