Section A – Question 1: Vehicle Price Analysis
Introduction:
Understanding the factors that influence vehicle sale prices is crucial for both buyers and
sellers in the automotive market. By applying statistical modeling techniques, we can
quantify the impact of various features on sale price, identify key drivers, and make
informed predictions. In this section, we use multiple linear regression to analyze a dataset
of vehicle sales, focusing on how sale date, model age, proximity to urban centres, and the
number of dealerships nearby affect the final sale price. The analysis also includes
diagnostic checks for model validity and an exploration of potential nonlinear effects.
1.1 Linear Regression Model and Interpretation
(a) Model Building:
A multiple linear regression model was constructed to predict vehicle sale price using the
following predictors: sale date, model age, proximity to urban centres, and number of
dealerships nearby. The regression equation is:
Vehicle Sale Price = −17,858.70 + 8.90 × Sale Date − 0.41 × Model Age − 0.0087 × Proximity + 1.98 × Dealerships
This equation allows us to estimate the expected sale price for any vehicle in the dataset,
given its characteristics. Each coefficient represents the average change in sale price
associated with a one-unit increase in the corresponding predictor, holding all other
variables constant.
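For concreteness, the sketch below shows how such a model could be fitted in Python with statsmodels. The file name and column names (sale_price, sale_date, model_age, proximity, dealerships) are illustrative assumptions, not the original dataset's labels.

```python
# A minimal sketch of the model fit, assuming the data sit in a CSV file
# "vehicle_sales.csv" with the hypothetical column names used below.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("vehicle_sales.csv")  # hypothetical file name

# Multiple linear regression of sale price on the four predictors
model = smf.ols(
    "sale_price ~ sale_date + model_age + proximity + dealerships", data=df
).fit()

print(model.summary())       # coefficients, R-squared, F-statistic
print(model.conf_int(0.05))  # 95% confidence intervals for each coefficient
```

The summary output reports the coefficient estimates, R², and F-statistic analogous to the tables below, and model.predict can be applied to new rows of the same DataFrame to obtain expected prices.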
(b) Interpretation:
• Model Fit: The model’s R² is 0.558, meaning about 56% of the variance in vehicle sale price is explained by the predictors. The F-statistic (137.06, p<0.001) indicates the model is highly significant, suggesting that the predictors collectively provide substantial explanatory power. This level of R² is typical for real-world economic data, where many unmeasured factors can influence price.
• Predictor Effects:
– Sale Date: Each unit increase in sale date (i.e., newer sales) increases price by 8.90 units, holding other factors constant (p<0.001). This likely reflects inflation or market trends over time, where newer sales tend to fetch higher prices.
– Model Age: Each additional year of model age reduces price by 0.41 units (p<0.001), reflecting the well-known effect of depreciation. Older vehicles are generally less valuable due to wear and technological obsolescence.
– Proximity to Urban Centres: Each unit increase in proximity (further from urban centres) reduces price by 0.0087 units (p<0.001), suggesting vehicles located closer to cities are more desirable, possibly due to better access to services and higher demand.
– Number of Dealerships Nearby: Each additional dealership nearby increases price by 1.98 units (p<0.001), possibly due to increased competition, better service options, or greater buyer confidence in areas with more dealerships.
• Intercept: The intercept is not directly interpretable in this context, as it represents
the expected price when all predictors are zero, which is not realistic for actual
vehicles.
Regression Statistics:
Statistic Value
Multiple R 0.7471
R Square 0.5581
Adjusted R Square 0.5541
Standard Error 14.0989
Observations 439
These statistics confirm that the model fits the data reasonably well, with a moderate
standard error and a large sample size, which increases the reliability of the estimates.
ANOVA Table:
Source        df     SS            MS           F          Significance F
Regression    4      108,978.01    27,244.50    137.0584   1.29 × 10⁻⁷⁵
Residual      434    86,270.63     198.78
Total         438    195,248.64
The ANOVA table shows that the regression model explains a significant portion of the total
variance in sale price, with a very small significance value indicating strong evidence
against the null hypothesis of no relationship.
Coefficients Table:
Variable                 Coeff.       Std. Err.   t Stat    P-value        Lower 95%    Upper 95%
Intercept                -17,858.70   4,770.62    -3.74     0.00021        -27,235.10   -8,482.30
sale date                8.90         2.37        3.76      0.00020        4.25         13.56
Model age                -0.41        0.06        -6.98     1.12 × 10⁻¹¹   -0.53        -0.30
proximity                -0.0087      0.0007      -12.59    3.50 × 10⁻³¹   -0.0101      -0.0073
number of dealerships    1.98         0.29        6.86      2.34 × 10⁻¹¹   1.42         2.55
Scatter plots show clear relationships between predictors and price, especially for model age
(negative) and proximity (negative). These visualizations help confirm the linear
relationships assumed by the model and provide intuitive support for the statistical findings.
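The scatter plots referred to above could be produced along the following lines, reusing the hypothetical DataFrame df from the earlier sketch (the column names remain assumptions):

```python
import matplotlib.pyplot as plt

# One scatter panel per predictor against sale price (df as in the earlier sketch)
predictors = ["sale_date", "model_age", "proximity", "dealerships"]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, col in zip(axes, predictors):
    ax.scatter(df[col], df["sale_price"], s=10, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("sale price")
fig.tight_layout()
plt.show()
```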
1.2 Heteroskedasticity and Multicollinearity
(a) Concepts:
• Heteroskedasticity refers to non-constant variance of residuals across levels of the
predicted value, which can lead to inefficient estimates and invalid significance
tests. In regression analysis, we assume that the spread of residuals is roughly the
same for all fitted values; violations of this assumption can undermine the reliability
of confidence intervals and hypothesis tests.
• Multicollinearity occurs when predictors are highly correlated with each other,
inflating standard errors and making it difficult to assess individual predictor
effects. Severe multicollinearity can make coefficient estimates unstable and
sensitive to small changes in the data.
(b) Evidence from Results:
• Heteroskedasticity: The residuals vs. fitted values plot does not display a clear
pattern or funnel shape, suggesting residual variance is roughly constant. This
indicates no strong evidence of heteroskedasticity, and the model’s standard errors
and significance tests are likely valid. If heteroskedasticity were present, we might
see a fan or cone shape, with residuals spreading out as fitted values increase.
• Multicollinearity: The Variance Inflation Factors (VIFs) for all predictors are low:
Predictor VIF
sale date 1.00
Model age 1.01
proximity 1.59
number of dealerships 1.59
VIFs below 5 (or even 2) indicate negligible multicollinearity. Thus, the model does not suffer from this issue, and the estimated effects of each predictor can be interpreted with confidence. A sketch of how both diagnostics can be produced follows this list.
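A brief sketch of both diagnostics, assuming the fitted model and DataFrame df from the earlier regression snippet (column names are still the hypothetical ones):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Residuals vs. fitted values: a funnel/cone shape would indicate heteroskedasticity
plt.scatter(model.fittedvalues, model.resid, s=10, alpha=0.5)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Variance Inflation Factors for each predictor (a constant column is added first)
X = sm.add_constant(df[["sale_date", "model_age", "proximity", "dealerships"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))
```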
(c) Remedies (if needed):
• Heteroskedasticity: Use robust standard errors, transform the dependent variable (e.g., log transformation), or apply weighted least squares; a sketch of the robust-error option appears after this list. These approaches help correct for non-constant variance and yield more reliable inference.
• Multicollinearity: Remove or combine correlated predictors, use principal
component analysis, or apply regularization techniques (e.g., ridge regression). In
this case, such remedies are unnecessary due to the low VIFs.
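Although no correction is required here, the robust-error remedy mentioned above is straightforward to apply; a minimal sketch, assuming the same hypothetical formula and DataFrame as before:

```python
import statsmodels.formula.api as smf

# Refit with heteroskedasticity-consistent (HC3) standard errors; the
# coefficients are unchanged, only the standard errors and p-values differ.
robust_fit = smf.ols(
    "sale_price ~ sale_date + model_age + proximity + dealerships", data=df
).fit(cov_type="HC3")
print(robust_fit.summary())
```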
1.3 Nonlinear Model Evaluation
A quadratic (nonlinear) model was fitted for model age to check for improvement over the
linear model. The quadratic regression coefficients were small, and visual inspection of the
scatter plot and quadratic fit did not show a substantial improvement in fit or pattern over
the linear model. The R² of the linear model is already moderate (0.56), and the residuals
do not show a clear nonlinear pattern. This suggests that the relationship between model
age and price is adequately captured by a linear term, and adding complexity does not yield
meaningful gains.
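The quadratic check can be reproduced with a small extension of the earlier sketch; I(model_age ** 2) adds the squared term within the (hypothetical) formula:

```python
import statsmodels.formula.api as smf

# Add a squared model-age term and compare adjusted R-squared with the linear fit
quad_model = smf.ols(
    "sale_price ~ sale_date + model_age + I(model_age ** 2) + proximity + dealerships",
    data=df,
).fit()
print(model.rsquared_adj, quad_model.rsquared_adj)  # little to no improvement expected
print(quad_model.pvalues["I(model_age ** 2)"])      # significance of the quadratic term
```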
Conclusion: The linear model is appropriate for this data. There is no strong evidence that
a nonlinear model would provide a significantly better fit. The analysis demonstrates the
value of checking for nonlinearity, but also the importance of parsimony in model selection.
Summary
The regression analysis reveals that newer vehicles, those closer to urban centres, and
those with more dealerships nearby tend to have higher sale prices, while older vehicles
are less valuable. The model is statistically sound, with no evidence of heteroskedasticity or
multicollinearity, and a linear approach is justified for this dataset. These findings can
inform pricing strategies and highlight the key factors that buyers and sellers should
consider in the vehicle market.
Section A – Question 2: Iris Classification with K-NN
Introduction:
Classification is a fundamental task in data science, with applications ranging from medical
diagnosis to species identification. The K-Nearest Neighbour (K-NN) algorithm is a simple
yet powerful non-parametric method for classifying observations based on their similarity
to known examples. In this section, we apply K-NN to the classic iris dataset, focusing on
distinguishing between the versicolor and virginica species using four flower
measurements. We also explore the impact of different values of K on classification
performance and discuss best practices for model selection.
2.1 K-NN Classification of a New Observation
A K-Nearest Neighbour (K-NN) algorithm using the Euclidean distance was applied to
classify a new iris observation with features (6.6, 3.2, 5.1, 1.5) (sepal length, sepal width,
petal length, petal width). The algorithm computes the distance from the test point to all
samples in the dataset and selects the K closest neighbors. This approach is intuitive: it
assumes that similar flowers (in terms of measurements) are likely to belong to the same
species.
Nearest Neighbors and Predicted Class: For K=5, 7, 9, the nearest neighbors and their classes are summarized below:

K    Versicolor Count    Virginica Count    Predicted Class
5    2                   3                  Iris-virginica
7    2                   5                  Iris-virginica
9    2                   7                  Iris-virginica
The majority of the nearest neighbors for each K are of the class Iris-virginica, so the model
predicts this class for the new observation. The use of Euclidean distance is standard for
continuous features and ensures that the closest points in feature space are selected. The
results are robust across different odd values of K, indicating that the prediction is not
sensitive to the exact choice of K within this range. This stability is desirable in practical
applications, as it suggests the model is not overfitting to noise.
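A minimal sketch of this prediction using scikit-learn is shown below. Since the report's working dataset is not reproduced here, the bundled iris data (restricted to versicolor and virginica) is used as a stand-in, so the neighbour counts may differ slightly from the table above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Keep only versicolor (target 1) and virginica (target 2)
iris = load_iris()
mask = iris.target > 0
X, y = iris.data[mask], iris.target[mask]

# The new observation: sepal length, sepal width, petal length, petal width
new_point = np.array([[6.6, 3.2, 5.1, 1.5]])

for k in (5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X, y)
    print(k, iris.target_names[knn.predict(new_point)[0]])
```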
2.2 The Problem with Even K and Solutions
When K is set to an even number (e.g., K=4), the algorithm may encounter a tie between classes. For the test observation, the 4 nearest neighbors are split evenly:
K    Versicolor Count    Virginica Count    Tie?
4    2                   2                  Yes
This tie makes the prediction ambiguous. The standard solution is to use an odd value for K, which reduces the likelihood of ties. Alternatively, a tie-breaking rule (such as always choosing the class with the closest neighbor, or random assignment) can be implemented, as sketched after this paragraph, but using odd K is preferred for interpretability and consistency. In real-world applications, ties can lead to inconsistent or arbitrary predictions, so careful selection of K is important for reliable classification.
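As an illustration, the sketch below implements one such tie-breaking rule by hand: a plain majority vote that falls back to the single closest neighbour's class when the leading classes are tied. X and y are assumed to be NumPy arrays of measurements and labels, as in the previous snippet; knn_predict is a hypothetical helper, not part of any library.

```python
from collections import Counter

import numpy as np


def knn_predict(X, y, point, k):
    """Majority vote over the k nearest neighbours (Euclidean distance),
    breaking a tie in favour of the single closest neighbour's class."""
    dists = np.linalg.norm(X - point, axis=1)   # distance to every training sample
    order = np.argsort(dists)[:k]               # indices of the k nearest samples
    votes = Counter(y[i] for i in order).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:  # top two classes tied
        return y[order[0]]                             # nearest neighbour decides
    return votes[0][0]


# Example: the tied case with K=4 for the new observation
# knn_predict(X, y, np.array([6.6, 3.2, 5.1, 1.5]), 4)
```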
2.3 Model Performance: Confusion Matrices and Accuracy
The K-NN model was evaluated using leave-one-out cross-validation for K=5, 7, 9. This
method involves using each observation in turn as a test case, with the remaining data
serving as the training set. The confusion matrices and accuracies are as follows:
K=5
Actual ∖ Predicted    Iris-versicolor    Iris-virginica
Iris-versicolor       48                 9
Iris-virginica        9                  40
Accuracy: 0.83

K=7
Actual ∖ Predicted    Iris-versicolor    Iris-virginica
Iris-versicolor       51                 6
Iris-virginica        8                  41
Accuracy: 0.87

K=9
Actual ∖ Predicted    Iris-versicolor    Iris-virginica
Iris-versicolor       52                 5
Iris-virginica        6                  43
Accuracy: 0.90
The accuracy improves as K increases from 5 to 9. For K=9, the model achieves the
highest accuracy (0.90), with the lowest number of misclassifications for both classes. The
confusion matrices show that most errors are between the two classes, with very few false
positives or negatives. The full prediction results for each instance (actual, predicted,
correct/incorrect) are available in the output, allowing for detailed error analysis and
identification of borderline cases.
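The leave-one-out evaluation can be sketched with scikit-learn as follows, reusing the X and y arrays from the earlier snippet (so the exact counts may differ from the tables above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

for k in (5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    # One held-out prediction per observation
    preds = cross_val_predict(knn, X, y, cv=LeaveOneOut())
    print(k, round(accuracy_score(y, preds), 2))
    print(confusion_matrix(y, preds))
```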
Best Model Choice: Based on the confusion matrices and accuracy, K=9 provides the best
balance of sensitivity and specificity for both classes in this dataset. However, it is
important to note that the optimal K may vary with different datasets, and cross-validation
is essential for robust model selection.
Summary
The K-NN algorithm, using Euclidean distance, effectively classifies iris varieties in this
dataset. For the new observation, the model robustly predicts Iris-virginica for all odd K
values tested. Using an odd K avoids ties, and K=9 yields the highest accuracy in cross-
validation. These results support the use of K-NN with odd K and highlight the importance
of model validation in classification tasks. In summary, K-NN is a flexible and interpretable
method for classification, but careful attention must be paid to parameter selection and
validation to ensure reliable results.