Basic-Level Questions & Answers
1. What is linear regression?
Linear regression is a statistical technique used to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables by fitting a linear equation to the
data.
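A minimal sketch of fitting such a model with scikit-learn on synthetic data (the variable names and the true line y = 2x + 1 are illustrative, not from any real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: true relationship y = 2x + 1, plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, size=100)

model = LinearRegression().fit(X, y)
slope, intercept = model.coef_[0], model.intercept_
# slope and intercept should land close to the true values 2 and 1
```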
2. What are the assumptions of linear regression?
1. Linearity – Relationship between X and Y is linear.
2. Independence – Observations are independent.
3. Homoscedasticity – Constant variance of errors.
4. Normality of residuals – Residuals are normally distributed.
5. No multicollinearity – Independent variables are not highly correlated.
3. Differentiate between simple and multiple linear regression.
Simple Linear Regression: One independent variable.
Multiple Linear Regression: Two or more independent variables.
4. What is the equation of a linear regression line?
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n + \epsilon
Y = dependent variable
\beta_0 = intercept
\beta_i = coefficient for feature X_i
\epsilon = error term
5. What do the coefficients in linear regression represent?
Each coefficient represents the expected change in the dependent variable for a one-unit change in
the corresponding independent variable, assuming all other variables remain constant.
6. What is the difference between dependent and independent variables?
Dependent variable (Y): The output or target we are trying to predict.
Independent variables (X): The inputs or predictors used to predict Y.
7. How do you interpret the R-squared value?
R² measures the proportion of variance in the dependent variable that is predictable from the
independent variables.
R² = 1: perfect prediction
R² = 0: no predictive power (no better than always predicting the mean)
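The two extremes can be checked directly with scikit-learn's `r2_score` (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_perfect = y_true.copy()                    # predictions exactly equal to the truth
y_mean_only = np.full(4, y_true.mean())      # always predict the mean of y

r2_perfect = r2_score(y_true, y_perfect)     # 1.0: perfect prediction
r2_mean = r2_score(y_true, y_mean_only)      # 0.0: no better than the mean
```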
8. What is a residual?
A residual is the difference between the observed value and the predicted value:
\text{Residual} = Y_{\text{actual}} - Y_{\text{predicted}}
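In code, residuals are just an element-wise subtraction (illustrative numbers):

```python
import numpy as np

y_actual = np.array([10.0, 12.0, 15.0])
y_predicted = np.array([9.5, 12.5, 14.0])

# Positive residual: the model under-predicted; negative: it over-predicted
residuals = y_actual - y_predicted
```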
9. What does it mean if residuals are not randomly scattered?
It means the model may be misspecified or that assumptions (like linearity or homoscedasticity) are
violated, indicating poor model fit.
10. Can linear regression be used for classification problems?
No. Linear regression is used for regression tasks. For classification, logistic regression or other
classification algorithms are preferred.
Intermediate-Level Questions & Answers
1. How do you evaluate the performance of a linear regression model?
Common metrics:
R² / Adjusted R²
MAE (Mean Absolute Error)
MSE (Mean Squared Error)
RMSE (Root Mean Squared Error)
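All four metrics are available via `sklearn.metrics` (RMSE computed here as the square root of MSE; the toy predictions are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 8.0])
y_pred = np.array([2.0, 5.0, 10.0])   # errors: -1, 0, +2

mae = mean_absolute_error(y_true, y_pred)   # mean of |errors| = (1 + 0 + 2) / 3
mse = mean_squared_error(y_true, y_pred)    # mean of squared errors = (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                         # back in the original units of y
```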
2. What is multicollinearity, and how do you detect it?
Multicollinearity occurs when independent variables are highly correlated. It can inflate variances of
coefficients.
Detection:
Correlation matrix
VIF (Variance Inflation Factor): VIF > 5 or 10 indicates multicollinearity.
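VIF can be computed from its definition, 1 / (1 − R²), where R² comes from regressing each feature on all the others. A sketch using only scikit-learn (the collinear `x2 = x1 + noise` setup is synthetic, to make the effect visible):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1 -> collinear
x3 = rng.normal(size=200)                   # independent feature
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF for column j: regress it on the other columns; VIF = 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
# vifs[0] and vifs[1] are large (collinear pair); vifs[2] stays near 1
```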
3. What is the role of p-values in linear regression?
P-values help test the statistical significance of each coefficient.
A p-value below a chosen significance level (commonly 0.05) suggests the variable is statistically significant.
4. What is the difference between R² and Adjusted R²?
R² increases with added variables, even if they’re irrelevant.
Adjusted R² adjusts for the number of predictors; it only increases if the new variable improves the
model meaningfully.
5. How do you handle outliers in linear regression?
Identify using residual plots or boxplots.
Treat by transformation, removing, or using robust models like RANSAC.
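A sketch of the robust-model option using scikit-learn's `RANSACRegressor` on synthetic data with a few injected outliers (the true slope of 3 is an assumption of the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.2, size=100)
y[:5] += 50    # inject a few large outliers

ols = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)   # fits on an inlier consensus set

ols_slope = ols.coef_[0]                  # pulled around by the outliers
ransac_slope = ransac.estimator_.coef_[0] # recovers the true slope of ~3
```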
6. How do you select variables for a linear regression model?
Methods:
Forward selection
Backward elimination
Stepwise selection
Regularization (Lasso)
7. What is the effect of adding irrelevant variables to the model?
It can lead to:
Overfitting
Lower interpretability
Decreased Adjusted R²
8. How do you handle categorical variables in linear regression?
Convert them to numeric form, typically with one-hot encoding for nominal categories. Plain label encoding imposes an artificial numeric ordering that linear regression will treat as meaningful, so it is only appropriate for genuinely ordinal variables.
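A sketch with `pandas.get_dummies` (the city/price columns are made up). `drop_first=True` drops one dummy per category to avoid the dummy-variable trap, i.e. perfect collinearity between the dummies and the intercept:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "price": [100, 200, 110, 150],
})

# One-hot encode "city"; the first category (alphabetically, Chennai) is dropped
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
```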
9. What is heteroscedasticity, and why is it a problem?
It refers to non-constant variance of residuals. It violates the assumptions of linear regression and
can lead to inefficient estimates and misleading inference.
10. How do you know if your model is overfitting or underfitting?
Overfitting: High accuracy on training, poor on testing.
Underfitting: Poor accuracy on both.
Advanced-Level Questions & Answers
1. What is the impact of correlated features on linear regression?
Correlated features lead to multicollinearity, which makes coefficient estimates unstable and
interpretation difficult.
2. How do you regularize a linear regression model?
By adding a penalty term to the loss function:
Lasso (L1): Shrinks some coefficients to 0 (feature selection)
Ridge (L2): Shrinks coefficients but doesn’t eliminate
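The contrast is easy to see on synthetic data where only two of five features actually matter (the true coefficients 3 and 2 and the alpha value are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features influence y; the other three are pure noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))   # L1 zeroes out irrelevant features
n_zero_ridge = int(np.sum(ridge.coef_ == 0))   # L2 shrinks but never hits exactly 0
```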
3. Explain the difference between Ridge, Lasso, and ElasticNet regression.
Ridge: L2 penalty, shrinks coefficients
Lasso: L1 penalty, shrinks and eliminates
ElasticNet: Combination of both
4. What are the consequences if linear regression assumptions are violated?
Linearity: Model misspecification
Independence: Invalid standard errors and misleading inference (coefficient estimates themselves may remain unbiased)
Homoscedasticity: Inefficient estimates
Normality: Inaccurate confidence intervals
Multicollinearity: Unstable coefficients
5. Can linear regression work with non-linear relationships? How?
Yes, by:
Transforming variables (log, square, etc.)
Adding polynomial features
Using non-linear regression models
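The polynomial-features approach stays linear in the parameters while capturing curvature. A sketch with a purely quadratic relationship (y = x², a synthetic choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)   # quadratic relationship

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

r2_linear = linear.score(X, y)   # near 0: a straight line cannot fit a symmetric parabola
r2_poly = poly.score(X, y)       # near 1: the x^2 term captures the curvature
```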
6. What is the difference between linear regression and logistic regression?
Linear regression: Predicts continuous output
Logistic regression: Predicts probability of categorical outcome (binary/multiclass)
7. Why might you prefer a robust regression method instead of OLS?
Robust methods like RANSAC or Huber Regression handle outliers and non-normal errors better than
OLS.
8. What is the role of the F-statistic in linear regression?
The F-test checks whether at least one predictor is statistically significant. It compares the full model
against an intercept-only (null) model.
9. Explain the use of interaction terms in linear regression.
Interaction terms (like X1 * X2) model how the effect of one variable depends on another. They help
capture non-additive effects.
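An interaction term can be added simply as an extra product column; the fitted coefficient on that column recovers the interaction effect (the true coefficients 1, 2, 3, 4 below are made up for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
# True model includes a non-additive x1*x2 term with coefficient 4
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + rng.normal(0, 0.1, size=300)

# Append the product x1*x2 as a third feature column
X_interact = np.column_stack([x1, x2, x1 * x2])
model = LinearRegression().fit(X_interact, y)
interaction_coef = model.coef_[2]   # should recover ~4
```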
10. How would you interpret a negative coefficient in your model?
A negative coefficient means the dependent variable decreases as the independent variable
increases, assuming all other variables are constant.