
DA WEEK – 9

REGRESSION [LINEAR AND LOGISTIC]


Regression: -
Regression is a statistical method used to model relationships between variables,
predicting outcomes based on input data.

Types Of Regression: -
 Linear Regression
 Polynomial Regression
 Support Vector Regression
 Decision Tree Regression
 Random Forest Regression
 Ridge Regression
 Lasso Regression
 Logistic Regression
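
Most of these regression types can be tried with the same fit/predict workflow. The sketch below compares a few of them on synthetic data (not the poll dataset); the model choices and hyperparameters are illustrative assumptions, not recommendations.

# A minimal sketch comparing a few of the regression types listed above on
# synthetic data; all of them share the scikit-learn fit/predict interface.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 2, size=200)   # noisy linear relationship

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=3),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: R-squared = {r2_score(y, model.predict(X)):.3f}")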

Steps To Be Carried Out for Regression: -


 Data collection
 Understanding data
 Choosing the regression type
 Handling missing values and outliers
 Feature engineering
 Check for multi-collinearity
 Split dataset [train, test]
 Assumption check
 Model evaluation
 Interpret results
 Make predictions
Dataset Description: -
The dataset is based on a Teen Education Poll in which teens responded to the question
"Are you getting a good education?" Over 54,00 high school students responded to this
poll in a single day on the After School app. Nearly 58% of students said yes and 42%
said no.

Regression Type: -
The dataset suits linear regression since it involves continuous percentages (PercentYes
and PercentNo). To perform logistic regression, the PercentYes, PercentNo, and State
columns were removed.

Linear Regression: -
Linear regression is a statistical technique that models the relationship between two
variables by fitting a straight line through data points.
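As a minimal illustration of the idea (toy numbers, not the poll data), a straight line y = b0 + b1*x can be fitted by least squares:

# Illustrative sketch: fit a line y = b0 + b1*x to toy data with numpy's polyfit.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
print(f"Fitted line: y = {b0:.2f} + {b1:.2f} * x")
print("Prediction at x = 6:", b0 + b1 * 6)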
Procedure To Do Linear Regression On A Dataset: -
 Step 1: Data collection

Data is collected from all possible sources, and the following libraries are
imported.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Loading the dataset into the Jupyter notebook with the following command.
df = pd.read_csv(r'C:\Users\aksha\OneDrive\Desktop\afterschool-teen-education poll\pollingData.csv')

 Step 2: Define objectives


Illustration: - To define the dependent and independent variables.

X = df[['Total']]
y = df['PercentYes']

Inference: - The independent variable is Total, since it does not directly depend on other
values, and the dependent variable is PercentYes, since it depends on the Yes and Total columns.

 Step 3: Data cleansing


Illustration: - To handle missing data, detect and treat outliers, and remove
irrelevant features.
o Handle Missing Values
print("Missing values in each column:")
print(df.isnull().sum())

o Identify Outlier
def identify_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]
outliers_total = identify_outliers(df, 'Total')
outliers_percent_yes = identify_outliers(df, 'PercentYes')
print("Outliers in 'Total' votes:")
print(outliers_total)
print("Outliers in 'PercentYes':")
print(outliers_percent_yes)
o Treat outlier
df = df[~df.index.isin(outliers_total.index)]
df = df[~df.index.isin(outliers_percent_yes.index)]

o Remove irrelevant feature


df = df.drop(columns=['State'])
df.head()

Inference: - There are no missing values in the dataset. Outliers are detected and treated.
The State column is removed as it is an irrelevant feature for further analysis.

 Step 4: Exploratory data analysis


Illustration: - To visualize the data to understand relationships, trends, and distributions.
o Scatter plots for total votes vs percent yes votes
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df['Total'], df['PercentYes'], color='blue')
plt.title('Scatterplot: Total Votes vs. Percent Yes')
plt.xlabel('Total Votes')
plt.ylabel('Percent Yes')

o Histograms for percent yes distribution


plt.subplot(1, 2, 2)
plt.hist(df['PercentYes'], bins=10, color='green', edgecolor='black')
plt.title('Histogram: Distribution of Percent Yes')
plt.xlabel('Percent Yes')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
o Correlation Heatmap
plt.figure(figsize=(6, 5))
correlation_matrix = df[['Total', 'PercentYes']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

Inference: - Scatter plots are used to represent linear relationships between variables,
histograms are used to understand the distribution of the data, and the correlation heatmap
is used to check relationships between predictors.

 Step 5: Feature Engineering


Illustration: - To encode categorical variables, transform variables, and create
interaction terms.
o Encode Categorical variables
# Note: this assumes the State column has not yet been dropped (it is removed in Step 3)
df_encoded = pd.get_dummies(df, columns=['State'], drop_first=True)
print(df_encoded.head())
o Log transformation
df['LogTotal'] = np.log1p(df['Total'])
print(df['LogTotal'])

o Interactive term
df['Total_PercentYes_Interaction'] = df['Total'] * df['PercentYes']
Inference: - Categorical variables are encoded using one-hot encoding, variables are
transformed using a log transformation, and interaction terms are created between variables.

 Step 6: Check for multi-collinearity


Illustration: - To check for multi-collinearity of the dataset.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = df[['Total', 'LogTotal', 'Total_PercentYes_Interaction']]
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF values for features:")
print(vif_data)

Inference: - The Variance Inflation Factor (VIF) is used to ensure the independent variables
are not highly correlated with each other, since high multicollinearity can distort the model.
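
As a hedged follow-up to the VIF table above, features exceeding a common rule-of-thumb threshold (10 is used here as an assumption; 5 is also common) can be flagged as candidates for removal or combination:

# Flag features whose VIF exceeds the threshold, using the vif_data table computed above.
high_vif = vif_data[(vif_data["Feature"] != "const") & (vif_data["VIF"] > 10)]
if high_vif.empty:
    print("No features exceed the VIF threshold.")
else:
    print("Features with high multicollinearity:")
    print(high_vif)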

 Step 7: Split the data


Illustration: - Divide the data into a training set and a testing set.
from sklearn.model_selection import train_test_split
X = df[['Total', 'LogTotal', 'Total_PercentYes_Interaction']]
y = df['PercentYes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
Inference: - Divided the data into training and testing sets (commonly 70-80% for
training, 20-30% for testing) to validate model performance on unseen data.

 Step 8: Fit the model


Illustration: - To fit the model to the dataset.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

Inference: - The model is initialized and fitted, predictions are made on the testing set,
and the model is evaluated; the mean squared error and R-squared of the test dataset are
displayed.

 Step 9: Evaluate Assumptions


Illustration: - To evaluate assumptions of linear regression.
o Linearity
y_train_pred = model.predict(X_train)   # predictions from the fitted sklearn model
residuals = y_train - y_train_pred
X_train_sm = sm.add_constant(X_train)
model_sm = sm.OLS(y_train, X_train_sm).fit()
y_train_pred_sm = model_sm.predict(X_train_sm)
residuals_sm = y_train - y_train_pred_sm

plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
plt.scatter(y_train_pred, residuals, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')

o Independence
from statsmodels.graphics.tsaplots import plot_acf
plt.subplot(2, 2, 2)
plot_acf(residuals_sm, lags=20, ax=plt.gca(), alpha=0.05)
plt.title('Autocorrelation of Residuals')

o Homoscedasticity
plt.subplot(2, 2, 3)
plt.scatter(y_train_pred_sm, residuals_sm, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted (statsmodels)')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')

o Normality of Residuals
plt.subplot(2, 2, 4)
sm.qqplot(residuals_sm, line='s', ax=plt.gca())
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()

Histogram for residuals:


plt.figure(figsize=(6, 5))
plt.hist(residuals_sm, bins=20, color='blue', edgecolor='black')
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

Inference: - The assumptions checked are linearity (the relationship between the predictors
and the target is linear), independence (the residuals are independent of each other),
homoscedasticity (the residuals have constant variance), and normality of residuals (the
residuals follow a normal distribution).
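
These visual checks can be complemented with numeric tests. The sketch below assumes the residuals_sm and X_train_sm objects from this step are still in scope:

# Numeric assumption checks that complement the plots above.
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

dw = durbin_watson(residuals_sm)                        # values near 2 suggest independent residuals
bp_lm, bp_pvalue, _, _ = het_breuschpagan(residuals_sm, X_train_sm)  # p > 0.05 suggests homoscedasticity
shapiro_stat, shapiro_p = shapiro(residuals_sm)         # p > 0.05 suggests approximately normal residuals

print(f"Durbin-Watson statistic: {dw:.2f}")
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")
print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")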

 Step 10: Model Evaluation


Illustration: - To evaluate the model using metrics.
from sklearn.metrics import mean_squared_error, r2_score
X_test_sm = sm.add_constant(X_test)
y_test_pred_sm = model_sm.predict(X_test_sm)
r2 = r2_score(y_test, y_test_pred_sm)
mse = mean_squared_error(y_test, y_test_pred_sm)
rmse = np.sqrt(mse)
n = X_test.shape[0]
p = X_test.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R-squared: {r2:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Adjusted R-squared: {adj_r2:.4f}"}

Inference: The R-squared, mean squared error, root mean squared error, and adjusted
R-squared are calculated and displayed. The model explains a good proportion of the
variance (63.32%) in the dependent variable, and the predictions are quite close to the
actual values, with low error metrics (MSE and RMSE).

 Step 11: Interpret Results


Illustration: - To display the model's coefficients, p-values, and confidence intervals.
print(model_sm.summary())
coefficients = model_sm.params
p_values = model_sm.pvalues
conf_int = model_sm.conf_int()
print("\nCoefficients:")
print(coefficients)
print("\nP-values:")
print(p_values)
print("\nConfidence Intervals:")
print(conf_int)

Inference: - The model explains a substantial portion of the variance (60.9%). Total and the
interaction term significantly affect PercentYes, while LogTotal does not contribute
significantly to the model. The high condition number suggests potential multicollinearity
or numerical issues.

 Step 12: Make Predictions


Illustration: - To make predictions on new data.
new_data = pd.DataFrame({
    'Total': [500, 1500],
    'LogTotal': np.log1p([500, 1500]),
    'Total_PercentYes_Interaction': [500 * 0.6, 1500 * 0.5]
})
new_data_sm = sm.add_constant(new_data)
predictions = model_sm.predict(new_data_sm)
print("\nPredictions on new data:")
print(predictions)

Inference: - Predictions are made on new (unseen) data.

Logistic Regression: -
Logistic regression is a statistical method used for binary classification, modelling the
probability of an outcome using a logistic function.
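
The logistic function itself maps any linear combination of predictors z = b0 + b1*x1 + ... to a probability between 0 and 1, as the small sketch below illustrates:

# Sketch of the logistic (sigmoid) function used by logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-4, -1, 0, 1, 4]:
    print(f"z = {z:+d} -> probability = {sigmoid(z):.3f}")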

Note: - The data collection, objective definition, data cleansing, feature engineering, and
exploratory data analysis steps are the same as for linear regression. The State, PercentYes,
and PercentNo columns have been discarded as irrelevant features.

Procedure To Do Logistic Regression On A Dataset: -


 Step 1: Exploratory data analysis
Illustration: To understand the distribution of features and their relationship to the
target variable.
o Histograms
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
sns.histplot(df['Total'], kde=True, bins=20)
plt.title('Histogram of Total Responses')
plt.subplot(2, 2, 2)
sns.histplot(df['Yes'], kde=True, bins=20)
plt.title('Histogram of Yes Responses')
plt.subplot(2, 2, 3)
sns.histplot(df['No'], kde=True, bins=20)
plt.title('Histogram of No Responses')
plt.subplot(2, 2, 4)
sns.histplot(df['PercentYes'], kde=True, bins=20)
plt.title('Histogram of PercentYes')
plt.tight_layout()
plt.show()

o Scatter plots
plt.figure(figsize=(14, 10))
plt.subplot(2, 2, 1)
sns.scatterplot(x='Total', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('Total Responses vs. PercentYes')
plt.subplot(2, 2, 2)
sns.scatterplot(x='Yes', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('Yes Responses vs. PercentYes')
plt.subplot(2, 2, 3)
sns.scatterplot(x='No', y='PercentYes', hue='Target', data=df, palette='coolwarm')
plt.title('No Responses vs. PercentYes')
plt.tight_layout()
plt.show()
o Correlation matrix
corr_matrix = df[['Total', 'Yes', 'No', 'PercentYes']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.show()

Inference: Histograms and scatter plots are plotted for the continuous features to show
their relationships with the target, and the correlation matrix summarizes the
relationships among the numerical features.

 Step 2: Feature Selection


Illustration: - Features are ranked by ANOVA F-value to select top predictors.

from sklearn.feature_selection import SelectKBest, f_classif


# X_poly and X_poly_df are assumed to come from an earlier polynomial feature
# expansion step that is not shown here.
selector = SelectKBest(score_func=f_classif, k='all')
selector.fit(X_poly, y)
feature_scores = pd.DataFrame({
    'Feature': X_poly_df.columns,
    'Score': selector.scores_
}).sort_values(by='Score', ascending=False)
print("Feature importance scores:")
print(feature_scores)
top_features = feature_scores.head(10)['Feature']
X_selected = X_poly_df[top_features]
print("Data with selected top features:")
print(X_selected.head())

Inference: - Top features have the strongest relationships with the target variable.

 Step 3: Addressing Class Imbalance


Illustration: - To find if one class is significantly underrepresented.

o Over sampling
df['Target'] = (df['Yes'] > df['No']).astype(int)
minority_class = df[df['Target'] == 1]
majority_class = df[df['Target'] == 0]
num_majority = len(majority_class)
num_minority = len(minority_class)
if num_minority < num_majority:
    num_to_oversample = num_majority - num_minority
    minority_oversampled = minority_class.sample(num_to_oversample, replace=True, random_state=42)
    df_oversampled = pd.concat([majority_class, minority_oversampled])
    print(f"Oversampled dataset shape: {df_oversampled.shape}")
else:
    print("No oversampling needed. Minority class is not smaller than the majority class.")

o Under sampling
if num_majority > num_minority:
    majority_class_undersampled = majority_class.sample(num_minority, random_state=42)
    df_undersampled = pd.concat([majority_class_undersampled, minority_class])
    print(f"Undersampled dataset shape: {df_undersampled.shape}")
else:
    print("No undersampling needed. Majority class is not larger than the minority class.")
o Class weights
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

df['Target'] = (df['Yes'] > df['No']).astype(int)  # Binary target
X = df.drop(['Target', 'State'], axis=1)
y = df['Target']
print("Original target variable distribution:")
print(y.value_counts())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print("Training set target distribution:")
print(y_train.value_counts())
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))
print("Class weights:")
print(class_weights_dict)
model = LogisticRegression(class_weight=class_weights_dict, random_state=42)
model.fit(X_train, y_train)
print("Model class weights used:")
print(model.class_weight)   # the class-weight mapping passed to the model

Inference: - No class imbalance, so no sampling or class weighting is required.
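
Had an imbalance been present, SMOTE from the separate imbalanced-learn package (assumed installed; features must be numeric) is a common alternative to simple resampling, since it synthesizes new minority samples rather than duplicating existing ones:

# Sketch of SMOTE oversampling, assuming the numeric X and y defined above.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("Class distribution after SMOTE:")
print(pd.Series(y_resampled).value_counts())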

 Step 4: Check for multi-collinearity


Illustration:- Using Variance Inflation Factor (VIF) to check for highly correlated
predictors.

from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy DataFrame (Feature1 and Feature2 are perfectly correlated) used only to
# illustrate how VIF flags multicollinearity.
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': [2, 4, 6, 8, 10],
    'Feature3': [5, 4, 3, 2, 1]
})
X = add_constant(df)
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
Inference:- Severe multicollinearity exists; features are highly correlated, causing
instability.

 Step 5: Split The dataset


Illustration: - Random features are generated, a binary target is derived from them, and
the data is split into training and testing sets.

X = pd.DataFrame({
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100),
    'feature3': np.random.rand(100)
})
y = (X['feature1'] + X['feature2'] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Inference: - Binary classification predicts y; the 70/30 split is used to assess model
generalization on held-out data.

 Step 6: Fit the model


Illustration: - To fit the logistic regression model.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # scale features before fitting
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(penalty='l2', solver='liblinear', random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

Inference: - This code fits a logistic regression model with L2 regularization, predicts
class labels, and calculates probabilities for class 1.

 Step 7: Evaluate assumptions


Illustration: - To evaluate assumptions such as Linearity of log odds,
Independence of observations, No multicollinearity, Sufficient sample size.

o Linearity of Log odds


df['PercentYes'] = df['Yes'] / df['Total']
df['Target'] = (df['PercentYes'] > 0.5).astype(int)
X = df[['Yes', 'No']]
y = df['Target']
X = sm.add_constant(X)
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary())
df['LogOdds'] = np.log(df['PercentYes'] / (1 - df['PercentYes'])) # Logit function
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='Yes', y='LogOdds', data=df)
plt.title('Log Odds vs Yes')
plt.xlabel('Yes')
plt.subplot(1, 2, 2)
sns.scatterplot(x='No', y='LogOdds', data=df)
plt.title('Log Odds vs No')
plt.xlabel('No')
plt.tight_layout()
plt.show()

o Independence of observations
X_train = pd.DataFrame({
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100)
})
y_train = (X_train['feature1'] + X_train['feature2'] > 1).astype(int)
X_with_const = sm.add_constant(X_train)
logit_model = sm.Logit(y_train, X_with_const)
result = logit_model.fit()
residuals = y_train - result.predict(X_with_const)
plot_acf(residuals, lags=30)
plt.show()

o No multicollinearity
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data
vif_data = calculate_vif(X_with_const)
print("Variance Inflation Factors:")
print(vif_data)

o Sufficient sample size


num_predictors = X_with_const.shape[1] - 1  # Subtract 1 for the intercept
sample_size = X_with_const.shape[0]
print(f"Sample size: {sample_size}")
print(f"Number of predictors: {num_predictors}")
if sample_size >= 10 * num_predictors:
    print("Sample size is adequate.")
else:
    print("Sample size may be insufficient.")

Inference: - The logit regression model shows perfect prediction with complete separation,
indicating non-identifiable parameters and failed convergence after 35 iterations.
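
One common workaround for complete separation (a sketch, assuming the X and y from the log-odds check above are in scope) is to fit a penalized logit so the coefficients stay finite:

# Penalized (L1-regularized) logit keeps coefficients finite under complete separation.
penalized_result = sm.Logit(y, X).fit_regularized(method='l1', alpha=1.0, disp=False)
print("Penalized logit coefficients:")
print(penalized_result.params)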

 Step 8: Model Evaluation


Illustration:- To evaluate the model using metrics such as accuracy, confusion
matrix, precision, recall, F1 score, ROC Curve and AUC.

o Accuracy:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, roc_curve, roc_auc_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

o Confusion matrix:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1']).plot()
plt.title('Confusion Matrix')
plt.show()
o Precision:
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")

o Recall (Sensitivity):
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")

o F1 Score:
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.2f}")

o ROC Curve and AUC:


fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()
print(f"AUC: {roc_auc:.2f}")

Inference: - The model shows perfect accuracy, confusion matrix, precision, recall, ROC and
AUC, and F1 score, indicating flawless classification.

 Step 9: Regularization
Illustration: - To regularize the model if it is overfitting.
o L2 Regularization: Penalizes large coefficients, shrinking them toward zero but not
exactly to zero.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve,
roc_auc_score, ConfusionMatrixDisplay
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
X = pd.DataFrame({
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100),
    'feature3': np.random.rand(100)
})
y = (X['feature1'] + X['feature2'] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_l2 = LogisticRegression(penalty='l2', solver='liblinear', random_state=42)
model_l2.fit(X_train_scaled, y_train)
y_pred_l2 = model_l2.predict(X_test_scaled)
y_pred_proba_l2 = model_l2.predict_proba(X_test_scaled)[:, 1]
print("L2 Regularization (Ridge) Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_l2):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_l2):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_l2):.2f}")
print(f"F1 Score: {f1_score(y_test, y_pred_l2):.2f}")
roc_auc_l2 = roc_auc_score(y_test, y_pred_proba_l2)
print(f"AUC: {roc_auc_l2:.2f}")

o L1 Regularization: Shrinks the coefficients of less important variables to exactly zero,
effectively performing feature selection.

model_l1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)


model_l1.fit(X_train_scaled, y_train)
y_pred_l1 = model_l1.predict(X_test_scaled)
y_pred_proba_l1 = model_l1.predict_proba(X_test_scaled)[:, 1]
print("L1 Regularization (Lasso) Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_l1):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_l1):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_l1):.2f}")
print(f"F1 Score: {f1_score(y_test, y_pred_l1):.2f}")
roc_auc_l1 = roc_auc_score(y_test, y_pred_proba_l1)
print(f"AUC: {roc_auc_l1:.2f}")
Inference: - Both the Lasso and Ridge models achieved perfect performance, suggesting
possible overfitting.
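
One way to address the possible overfitting is to tune the regularization strength by cross-validation rather than fixing it. The sketch below assumes the scaled training data (X_train_scaled, y_train) from this step is in scope:

# Tune C and the penalty type with GridSearchCV instead of using fixed values.
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    param_grid, cv=5, scoring='roc_auc'
)
grid.fit(X_train_scaled, y_train)
print("Best parameters:", grid.best_params_)
print(f"Best cross-validated AUC: {grid.best_score_:.3f}")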

 Step 10: Interpret Results


Illustration:- To interpret the results of the model.
import numpy as np
import pandas as pd
coefficients = model.coef_[0]
feature_names = X.columns
odds_ratios = np.exp(coefficients)
interpretation_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print("Model Interpretation:")
print(interpretation_df)

Inference:- Feature1 and feature2 strongly increase odds, while feature3 has a
minor positive impact on the predicted outcome.

 Step 11: Make Predictions


Illustration:- To make predictions on unseen data.
import pandas as pd
# new_data is assumed to be a DataFrame with the same feature columns used in training.
# Reuse the scaler fitted on the training data; do not refit it on the new data.
new_data_scaled = scaler.transform(new_data)
predicted_probs = model.predict_proba(new_data_scaled)[:, 1]
threshold = 0.5
predicted_classes = (predicted_probs >= threshold).astype(int)
predictions_df = pd.DataFrame({
    'Predicted_Probability': predicted_probs,
    'Predicted_Class': predicted_classes
})
print("Predictions:")
print(predictions_df)

Inference: - The predictions show that the first instance has a very low predicted
probability (0.0062) and is classified as 0, while the second instance has a high
predicted probability (0.987) and is classified as 1.
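
The 0.5 threshold used above is a default rather than a requirement. A threshold can instead be chosen from the precision-recall trade-off (a sketch, assuming y_test and y_pred_proba from the evaluation step are in scope):

# Choose the classification threshold that maximizes F1 on the test set.
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[f1_scores.argmax()]
print(f"Threshold that maximizes F1: {best_threshold:.3f}")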
