Advanced Machine Learning Tutorial
This Jupyter Notebook provides an in-depth tutorial on Machine Learning, suitable for an 8-hour advanced
class. We will cover key concepts, algorithms (with code examples using scikit-learn), and practical exercises.
The topics include: an introduction to ML, supervised learning (classification and regression) and its evaluation, unsupervised learning (clustering and dimensionality reduction), reinforcement learning, and common practice datasets.
Let's begin!
"A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E." 1
In simpler terms, ML algorithms identify patterns in data and use those patterns to make predictions or
decisions on new, unseen data without being hard-coded with specific rules. Instead of writing a rigid
algorithm for every scenario, we provide a model with data and let it figure out the underlying structure.
Types of Machine Learning: ML tasks generally fall into a few broad categories: Supervised learning,
Unsupervised learning, and Reinforcement learning (among others like semi-supervised and self-
supervised, which are hybrids).
• Supervised Learning: The algorithm learns from labeled data – each training example comes with
an output or target. The model makes predictions and is corrected when those predictions are
wrong. Over time it learns the mapping from inputs to the correct output 2 . Both classification
(predicting discrete labels) and regression (predicting continuous values) are forms of supervised
learning. (Example: predicting if an email is spam or not spam, based on a labeled training set of
emails 3 .)
• Unsupervised Learning: The algorithm works with unlabeled data – it must find structure in the
data on its own. Common tasks include clustering (grouping similar data points) and
dimensionality reduction (reducing data complexity while retaining structure). The model discovers
patterns such as groupings or anomalies without explicit feedback 4 . (Example: grouping
customers into segments based on purchasing behavior without predefined categories 5 .)
• Reinforcement Learning (RL): An agent learns by interacting with an environment, taking actions
and receiving rewards or penalties as feedback 6 . The goal is to learn a policy (strategy) that
maximizes cumulative reward. Unlike supervised learning, there is no direct “right/wrong” label for
each action – the agent discovers which actions yield the most reward through trial and error.
(Example: training a game-playing AI – the agent (AI) receives positive rewards for winning or
making good moves and negative rewards for losing or making bad moves.)
Real-World Applications of ML: Machine learning is ubiquitous in modern technology. Some notable
applications include:
• Image and Speech Recognition: e.g., facial recognition in photos, voice assistants transcribing
speech to text 7 .
• Natural Language Processing: e.g., language translation services, chatbots, sentiment analysis of
text.
• Recommender Systems: e.g., movie or product recommendations on Netflix and e-commerce sites,
which learn user preferences 8 .
• Healthcare: ML models assist in medical diagnosis by analyzing images (X-rays, MRIs) or patient
data to predict diseases.
• Finance: Fraud detection systems learn to flag unusual transactions; algorithmic trading uses ML for
decision making.
• Autonomous Vehicles: Self-driving cars use reinforcement learning and supervised learning for
tasks like path planning and object detection.
These examples barely scratch the surface – ML is also used in cybersecurity, robotics, agriculture, and
many other fields to improve efficiency and outcomes.
How Machines "Learn": At the core, an ML model makes predictions and then adjusts itself based on errors. This typically involves an optimization algorithm (like gradient descent) that tweaks model parameters to better fit the data. The process often includes:
• Making predictions on the training data with the current parameters.
• Measuring how wrong those predictions are with a loss function.
• Updating the parameters in the direction that reduces the loss.
• Repeating until the loss stops improving. A minimal sketch of this loop follows.
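To make this concrete, here is a minimal gradient-descent sketch (not tied to scikit-learn, and the toy data is invented for illustration) that fits a one-parameter linear model y ≈ w·x by repeatedly nudging w against the gradient of the squared error:

import numpy as np

# Toy data: y is roughly 3*x plus noise (the "pattern" we want the model to find)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

w = 0.0      # initial parameter guess
lr = 0.1     # learning rate
for step in range(200):
    y_pred = w * x                    # 1. predict with the current parameter
    error = y_pred - y                # 2. measure the error
    grad = 2 * np.mean(error * x)     # gradient of the mean squared error w.r.t. w
    w -= lr * grad                    # 3. update the parameter to reduce the loss
print("Learned w:", w)   # should end up close to 3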
Note: High-quality data is crucial. The saying "garbage in, garbage out" holds – models trained
on biased or noisy data will produce poor predictions. Indeed, data is the foundation of
machine learning – without relevant, clean data, even advanced models cannot perform well
9 .
Now that we have a high-level idea of what ML is, let's dive deeper into the main types of learning, starting
with supervised learning.
2. Supervised Learning
In supervised learning, we have input variables X (features) and an output variable Y (target), and we use an
algorithm to learn the mapping function from X to Y such that the model’s predictions $\hat{Y}$ for new X
are as close as possible to the true Y. This section is divided into two parts: classification (predicting discrete labels) and regression (predicting continuous values).
We will explore various algorithms for each and discuss how to evaluate their performance.
2.1 Classification
Classification is the task of predicting a discrete class label for an input. For example, given attributes of a
tumor, predict whether it is "benign" or "malignant" (binary classification), or given an image of a
handwritten digit, identify which digit 0-9 it represents (multi-class classification).
A key concept in classification is the decision boundary. This is the surface (in feature space) that separates
different class predictions. Formally, a decision boundary is a hypersurface that partitions the underlying
feature space into regions, one for each class 10 . Points lying on different sides of the boundary are predicted
as different classes. For a simple 2D example, the decision boundary is a curve (or line, if the classes are
linearly separable) dividing the plane into regions (class A vs class B).
If the decision boundary is a linear hyperplane, the classes are linearly separable. More
complex boundaries can be non-linear. Different classifiers have different shaped decision
boundaries (e.g., linear models vs. decision trees vs. k-NN all produce different boundaries).
To illustrate decision boundaries, let's look at a visual comparison of several classification algorithms on a
toy 2D dataset (with two features):
Figure 1: Decision boundaries of various classifiers on example 2D datasets. Each subplot shows a classifier (e.g.,
Nearest Neighbors, Linear SVM, RBF SVM, Decision Tree, Random Forest, Neural Net, AdaBoost, Naive Bayes, QDA)
trained on a synthetic dataset with two classes (red vs blue). The colored regions represent the predicted class for
any point, and points show the training data (solid) and test data (semi-transparent). This illustrates how different
algorithms partition the feature space with different shaped boundaries 11 12 .
In Figure 1, notice how:
• Linear models (like Linear SVM, or Logistic Regression, not shown but similar to Linear SVM) produce a single straight-line boundary (or hyperplane in higher dimensions).
• k-Nearest Neighbors (k-NN) creates a very wiggly, locally defined boundary that can be quite irregular (following the training points closely).
• Decision Tree yields a piecewise-constant boundary (axis-aligned rectangular segments in 2D).
• RBF SVM (a non-linear SVM) and Neural Network can produce smooth but complex curved boundaries.
• Naive Bayes here produces roughly linear boundaries (for Gaussian NB, boundaries are quadratic curves but often close to linear).
• Ensembles like Random Forest (many trees) or AdaBoost can produce complex boundaries as well, often less smooth.
This highlights that model choice matters – some models are more flexible (able to capture complex
patterns) but may risk overfitting, while others are simpler and may underfit if the true boundary is
complex.
Now, let's discuss common classification algorithms, with brief explanations and code examples for each:
2.1.1 Logistic Regression
Despite its name, Logistic Regression is actually a classification algorithm (binary classification by default).
It models the probability of the positive class using a logistic (sigmoid) function. The model is a linear
function of the input features passed through the sigmoid, giving an output between 0 and 1 that can be
interpreted as $P(Y=1|X)$.
• Model form: $P(Y=1|X) = \sigma(w^T X + b)$, where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid. The
decision boundary corresponds to $w^T X + b = 0$ (where probability = 0.5 for binary case, often
thresholded at 0.5 to decide class).
• Training: Optimize the coefficients $w, b$ to minimize logistic loss (equivalent to maximizing
likelihood). Often solved with gradient descent. No closed-form solution (unlike linear regression)
due to the sigmoid, but convex optimization ensures a single optimum.
• Regularization: Logistic regression often uses L2 (ridge) or L1 (lasso) regularization to prevent
overfitting, especially when feature count is high.
Use cases: Logistic regression is popular for its simplicity, interpretability (the coefficients indicate feature
influence on log-odds of the outcome), and efficiency on high-dimensional sparse data (e.g., text
classification).
Let's see logistic regression in action on a simple dataset. We'll use the classic Iris dataset for demonstration (though Iris has 3 classes, scikit-learn's logistic regression can handle multi-class problems via one-vs-rest or multinomial (softmax) schemes).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Iris dataset (features: flower measurements, target: species index 0,1,2)
iris = load_iris()
X = iris.data
y = iris.target

# Binary classification example: to simplify, classify whether the species is
# "Virginica" (class 2) or not
y_binary = (y == 2).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3,
                                                    random_state=0, stratify=y_binary)

# Fit the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
In the code above, we turned Iris into a binary problem ("Virginica vs not"). The output will show accuracy
and the learned coefficients. Logistic regression would draw a linear boundary in the 4D feature space to
separate Virginica from the other species.
2.1.2 Decision Trees
A Decision Tree is a versatile supervised learning algorithm capable of both classification and regression. It
learns a set of hierarchical if-else rules to predict the target.
• Model form: A tree structure where each internal node tests a feature (e.g., "petal length > 2.5?"),
and each leaf node outputs a prediction (class label for classification, or a value for regression).
• Learning: The tree is built by recursively splitting the data on the feature and threshold that yields
the largest information gain (or equivalently, minimizes impurity like Gini index or entropy in
classification) 13 . Splitting continues until leaves are pure (all one class) or a stopping criterion is
met (max depth, min samples, etc.).
• Nature: Decision trees are non-parametric and can model complex decision boundaries by their
piecewise splits 14 . However, they easily overfit if grown too deep (essentially memorizing the
training set).
To combat overfitting, we can prune the tree (stop splitting early or trim branches post-training). Simpler
trees generalize better. An advantage of trees is interpretability – one can follow the path and see the
conditions leading to a prediction.
Use cases: Decision trees are intuitive and handle heterogeneous data (categorical and numerical features)
and missing values well. They don't require feature scaling and can capture non-linear relationships.
However, a single tree often is not the most accurate predictor compared to other methods, unless it's
boosted or in a forest.
Let's train a decision tree on a dataset. We will use the Wine dataset (classification of wine cultivars based
on chemical analysis) as an example:
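The training cell did not survive the export, so here is a minimal reconstruction (a sketch; variable names such as dtc are assumptions) that loads the Wine data, fits a depth-limited tree, and prints the quantities discussed below:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, stratify=wine.target, random_state=0)

# Restrict depth to keep the tree readable and reduce overfitting
dtc = DecisionTreeClassifier(max_depth=3, random_state=0)
dtc.fit(X_train, y_train)

print("Tree depth:", dtc.get_depth(), "Leaves:", dtc.get_n_leaves())
print("Train accuracy:", dtc.score(X_train, y_train))
print("Test accuracy:", dtc.score(X_test, y_test))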
With max_depth=3 , we restrict the complexity to prevent overfitting and for readability. The output will
show the tree’s depth and number of leaves, as well as accuracy on training vs test. (A large gap between
train and test accuracy signals overfitting; we might see train accuracy is 100% if tree was deep enough to
memorize data.)
We can visualize the trained tree (though in practice, large trees are hard to interpret fully):
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(dtc, feature_names=wine.feature_names,
          class_names=wine.target_names.astype(str), filled=True)
plt.show()
This will plot the tree structure with conditions at each node and class distributions at leaves. Each node
shows the splitting rule (e.g., "proline <= 755.5"), and colored leaves indicate predicted class.
2.1.3 Random Forests
A Random Forest is an ensemble of decision trees, built via the technique of bagging (bootstrap
aggregating) plus random feature selection. The idea is to reduce the variance of a single decision tree by
averaging many trees trained on random subsets of data and features.
• Training: We train multiple trees (e.g., 100) on bootstrap samples of the training data. Each tree is
grown typically to full depth (or with minimal pruning), but at each split, the algorithm is limited to
choose the best split among a random subset of features (this decorrelates the trees).
• Prediction: For classification, each tree votes for a class, and the forest predicts the majority vote.
For regression, it averages the predictions.
Random forests generally achieve better accuracy than individual trees and are more robust to noise. They
also come with an out-of-bag estimation for error (using those samples not in a tree’s bootstrap sample to
test that tree) and measures of feature importance (based on how much each feature split improves the
purity, averaged over trees).
Use cases: When you need a strong default classifier/regressor that works out-of-the-box, random forest is
often a good choice. It handles nonlinear relations and interactions well, requires little parameter tuning,
and tends not to overfit badly with enough trees 15 . It can still struggle with very high-dimensional sparse
data or purely linear relationships where simpler models suffice.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Test Accuracy:", clf.score(X_test, y_test))

# Feature importance (normalized values that sum to 1)
for name, importance in zip(wine.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
This will train a forest of 100 trees and report accuracy. It also prints feature importances, showing which
features the model found most informative (these are normalized values that sum to 1).
2.1.4 k-Nearest Neighbors (k-NN)
The k-Nearest Neighbors algorithm is a simple yet effective approach that makes predictions based on the
memorized training dataset. For classification, the predicted class of a new point is determined by the
majority class among its k closest training points (neighbors) 16 . "Closeness" is typically defined by a
distance metric, usually Euclidean distance in feature space.
• Training phase: There really isn’t one – k-NN simply stores the training data.
• Prediction phase: Compute distance from the new point to all training points, find the k nearest,
and take a majority vote (for classification) or average (for regression) of their outputs.
• Parameter: The choice of k (and distance metric) is important. A small k (like 1) can lead to
overfitting (very jagged decision boundary following individual points), while a large k smooths out
predictions but may underfit.
Characteristics: k-NN is intuitive and can model very irregular decision boundaries (by local voting) 17 .
However, it becomes slow as the dataset grows (for each prediction, distance to all training points must be
computed – though techniques like KD-trees or ball trees can speed this up in lower dimensions). It also
suffers if features are on different scales or if irrelevant features introduce noise in distance – feature
scaling and selection can help.
from sklearn.neighbors import KNeighborsClassifier
# use only the first two Wine features so the boundary can be plotted (the feature choice is illustrative)
X_train2, X_test2 = X_train[:, :2], X_test[:, :2]
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train2, y_train)
print("Test accuracy (k=5):", knn.score(X_test2, y_test))
We restricted to 2 features for potential plotting. The accuracy might be a bit lower than using all features,
but we can visualize the decision boundary in 2D for k-NN:
import numpy as np
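# The rest of this plotting cell was lost in the export; below is a minimal
# reconstruction (a sketch, not the original code) that draws the k-NN
# decision regions over the two selected Wine features.
import matplotlib.pyplot as plt

# Grid covering the 2D feature space
x_min, x_max = X_train2[:, 0].min() - 1, X_train2[:, 0].max() + 1
y_min, y_max = X_train2[:, 1].min() - 1, X_train2[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))

# Predict the class for every grid point and shade the regions
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap="viridis")
plt.scatter(X_train2[:, 0], X_train2[:, 1], c=y_train, cmap="viridis", edgecolor="k")
plt.xlabel(wine.feature_names[0]); plt.ylabel(wine.feature_names[1])
plt.title("k-NN (k=5) decision regions")
plt.show()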
In the contour plot above, you’ll see how k-NN partitions the feature plane into complex regions (especially
near the class boundaries, it can be quite irregular). More training points and a higher k would smooth the
boundaries.
2.1.5 Support Vector Machines (SVM)
Support Vector Machines are powerful classifiers that find the optimal separating hyperplane between
classes by maximizing the margin (distance) between the hyperplane and the nearest points (support
vectors) 18 . Key points:
• For linearly separable data, SVM finds the hyperplane with maximum margin.
• For non-linear data, SVM can employ the kernel trick: implicitly map data to a higher-dimensional
space where it is linearly separable, without explicitly computing the coordinates in that space.
Common kernels: polynomial, RBF (Gaussian), sigmoid.
• Soft-margin SVM allows some misclassifications (with a penalty) to handle overlapping classes and
improve generalization (controlled by hyperparameter C).
SVMs are effective in high dimensions, memory efficient (only support vectors matter), and can be robust.
However, they do not natively provide probabilistic outputs (though we can calibrate or use Platt scaling)
and can be slow on large datasets (time complexity can be quadratic in number of samples for training).
Let's use SVM on a dataset (say the Wine dataset again). We'll use an RBF kernel to allow non-linear decision
boundaries:
from sklearn.svm import SVC
svc = SVC(kernel='rbf', gamma='scale', C=1.0)
svc.fit(X_train, y_train)
print("Test accuracy (SVM):", svc.score(X_test, y_test))
You can experiment with kernel='linear' or 'poly' and see how it affects performance. The
gamma parameter controls the kernel width for RBF (higher gamma = smaller radius for influence of each
point, can lead to overfitting if too high).
2.1.6 Naive Bayes
Naive Bayes classifiers are simple probabilistic classifiers based on applying Bayes’ Theorem with a strong
(naive) assumption: that features are independent given the class label 19 . Despite this assumption often
being false in real data, Naive Bayes can perform surprisingly well, especially for high-dimensional
problems like text classification (where independence between words is assumed).
There are different variants depending on the distribution assumed for the features:
• Gaussian Naive Bayes: Assumes continuous features follow a Gaussian distribution per class.
• Multinomial Naive Bayes: For counts (e.g., word counts in text).
• Bernoulli Naive Bayes: For binary features (e.g., presence/absence of a word).
How it works: The model uses Bayes’ theorem: $P(C|X) \propto P(X|C) P(C)$. With the independence
assumption, $P(X|C) = \prod_{i} P(X_i | C)$. It estimates these probabilities from training data (e.g., mean/
variance for Gaussian NB, or frequency of feature values for others). Prediction is then $ \hat{y} =
\arg\max_C P(C) \prod_{i} P(x_i | C)$. Taking log (to avoid underflow) turns it into sums of log-probabilities.
Naive Bayes is extremely fast to train (just counting occurrences, basically) and to predict. It requires very
little data to estimate parameters (since it needs just per-feature statistics). However, if the independence
assumption is grossly violated, its probability estimates can be off (it may still get the class right, but the
confidence scores are unreliable).
Let's demonstrate Naive Bayes on a simple text classification task to illustrate (e.g., classifying very short
text as positive or negative sentiment). We will use MultinomialNB for a toy example:
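The cell that builds the toy corpus was lost in the export; the following minimal setup sketch (the phrases and labels are invented for illustration) defines the vec, X_docs, and y_docs objects used below:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 1 = positive sentiment, 0 = negative
train_docs = ["happy love joy", "love great happy", "hate awful bad", "hate sad awful"]
y_docs = [1, 1, 0, 0]

vec = CountVectorizer()
X_docs = vec.fit_transform(train_docs)   # bag-of-words count matrix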
nb = MultinomialNB()
nb.fit(X_docs, y_docs)
test_docs = ["happy love", "hate joy"]
X_test_docs = vec.transform(test_docs)
preds = nb.predict(X_test_docs)
print("Predictions:", preds) # expect [1, 0] perhaps
print("Predicted probabilities:", nb.predict_proba(X_test_docs))
This example converts text to features (bag-of-words) and applies NB. The predictions and probabilities
show how confident the model is that each test phrase is positive or negative. Despite the small dataset and
simplistic assumption, NB can correctly generalize that "happy love" is positive (both words are strongly
associated with the positive class in training) and "hate joy" might be predicted as negative (because "hate"
is a strong negative indicator, even though "joy" is positive, NB weighs them independently).
2.1.7 Gradient Boosting
Gradient Boosting refers to a class of ensemble techniques where new models are added sequentially to
correct the errors of the existing ensemble. Typically, this is done with decision trees as the base learners
(resulting in a Gradient Boosted Decision Trees model). Each new tree is fit to the residual errors
(gradients) of the current model, hence "gradient boosting" 20 .
Key points:
• Boosting vs Bagging: Unlike random forests (bagging), boosting builds trees sequentially, each one trying to fix the errors of the ensemble built so far. Rather than reweighting misclassified observations (as AdaBoost does), gradient boosting fits each new tree to the gradients (residuals) of the loss on the current predictions; stochastic variants optionally subsample rows and columns.
• Regularization: Boosting models have parameters like the learning rate (how much each new tree contributes), the max depth of individual trees (often kept small, e.g. 3-8), and the number of trees. A smaller learning rate with more trees usually improves generalization (at the cost of training time).
• Performance: Gradient boosting often outperforms random forests in pure predictive accuracy, especially when carefully tuned, but it is more prone to overfitting if not regularized. Techniques like shrinkage (the learning rate), row/column subsampling, and early stopping (stop adding trees when validation error stops improving) mitigate this.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Load data
cancer = load_breast_cancer()
X = cancer.data; y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
max_depth=3, random_state=0)
gbc.fit(X_train, y_train)
print("Accuracy (train):", gbc.score(X_train, y_train))
print("Accuracy (test):", gbc.score(X_test, y_test))
You will likely see a very high accuracy on both train and test (breast cancer is an “easy” dataset, and
boosting is powerful). If the train accuracy is significantly higher than test, that indicates some overfitting;
one might lower max_depth or learning_rate or use fewer estimators to remedy that.
Feature Importance: Like random forests, boosted trees also provide feature importance. You can inspect
gbc.feature_importances_ to see which features are most used in the splits.
Note: Modern gradient boosting libraries like XGBoost or LightGBM often have further optimizations and
conveniences (handling missing data, GPU training, etc.). But the essence is the same.
2.2 Regression
Regression is about predicting a continuous numeric value for each input. Many algorithms are similar to
classification counterparts but optimized for a numeric target with appropriate loss (e.g., mean squared
error).
Some common regression algorithms:
• Linear Regression: A fundamental model that assumes a linear relationship between inputs and the output. It minimizes the sum of squared errors and has an analytical solution (the normal equation) if no regularization is used. It is highly interpretable (coefficients show the effect of each feature) but limited to linear trends (which can be extended via polynomial features).
• Decision Tree Regression: A decision tree whose leaves predict the average (or median) value of the training samples in that leaf. Tends to create piecewise-constant prediction regions.
• Random Forest Regression: An ensemble of trees averaged to yield smoother predictions than a single tree.
• k-NN Regression: Predict by averaging the values of the nearest neighbors.
• Support Vector Regression (SVR): SVM adapted for regression, with a margin of tolerance (epsilon-insensitive loss).
• Gradient Boosting Regression: Analogous to boosting for classification, but optimizing squared error, absolute error, etc.
The process and considerations (overfitting, feature scaling for some models, etc.) are similar to
classification. The evaluation metrics differ (we use error metrics, see next section).
Example – Linear Regression on Boston Housing dataset: The Boston Housing dataset (historic dataset
of house prices in Boston areas) is a classic regression benchmark. It has features like average number of
rooms, crime rate, etc., and target is median house value in $1000s.
(Note: Scikit-Learn’s load_boston was deprecated due to ethical concerns and removed in scikit-learn 1.2; in practice one might use the California housing dataset or another. We'll proceed with Boston for demonstration.)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston   # removed in scikit-learn >= 1.2; use fetch_california_housing there

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                    random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)
print("Train R^2:", lr.score(X_train, y_train))
print("Test R^2:", lr.score(X_test, y_test))
print("Coefficients:", lr.coef_)
The .score method for regression returns the coefficient of determination $R^2$ (which is 1 - (MSE of
model)/(MSE of trivial mean model)). An $R^2$ of 1 is perfect fit, 0 means model is no better than predicting
the mean of y, and negative means it's worse than that mean prediction baseline.
The coefficients printed show the linear relationship learned: for each feature, holding others constant, how
many units the house price changes per unit increase in that feature (according to the model). For example,
a coefficient of -2 on “RAD” feature would mean if the accessibility to radial highways goes up by 1 (with
other features fixed), the predicted price goes down by $2k, suggesting perhaps houses near more
highways are cheaper (just an interpretation).
We can also examine non-linear or ensemble regression. Let's try a Random Forest Regressor on the same
data for comparison:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Test R^2 (Random Forest):", rf.score(X_test, y_test))
Often the random forest will have a higher $R^2$ than linear regression if the true relationships are non-
linear or involve complex interactions.
2.3 Model Evaluation
When we build models, we need ways to evaluate their performance in order to choose the best model
and to know if the model is good enough for our application. For classification, there are several important
metrics:
• Accuracy: The fraction of predictions the model got right. $$\text{Accuracy} = \frac{TP + TN}{TP + TN
+ FP + FN}$$ where TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.
Accuracy is intuitive but can be misleading in imbalanced datasets (e.g., 99% accuracy could mean
it’s just predicting the majority class always, ignoring the minority class).
• Precision: Out of all instances predicted as a certain class (say "positive"), how many were actually
that class? It’s a measure of exactness – low precision means many false positives. Formally, for
positive class, $\text{Precision} = \frac{TP}{TP + FP}$ 21 . High precision is important in scenarios
where false alarms are costly (e.g., precision of a spam detector – if low, you’d mislabel important
emails as spam).
• Recall (Sensitivity or True Positive Rate): Out of all actual instances of a class, how many did the
model correctly identify? It’s a measure of completeness – low recall means many false negatives. $
\text{Recall} = \frac{TP}{TP + FN}$ 22 . High recall is crucial when missing a positive case is very bad
(e.g., recall of a cancer diagnostic test – we want to catch as many cases as possible).
• F1-Score: The harmonic mean of precision and recall: $F1 = 2 \frac{\text{Precision} \cdot
\text{Recall}}{\text{Precision} + \text{Recall}}$ 23 . It gives a single score that balances both
concerns. A high F1 means both precision and recall are reasonably high. This is useful for
comparing models when one may have higher precision but lower recall, and another vice versa.
• Confusion Matrix: A table of model predictions vs. actual values. For binary classification, it is a 2x2 matrix:

                   Predicted Pos   Predicted Neg
    Actual Pos          TP              FN
    Actual Neg          FP              TN

It shows the counts in each category. The confusion matrix gives a fuller picture of performance, from which you can derive all the metrics above 24 25 .
• ROC Curve (Receiver Operating Characteristic): This is a plot of the True Positive Rate (Recall)
against the False Positive Rate (FPR = FP/(FP+TN)) at various threshold settings of the classifier. It
characterizes the trade-off between sensitivity and specificity. The AUC (Area Under the ROC Curve)
is often used as a threshold-independent summary of the classifier’s performance. An AUC of 0.5 is
random guessing, 1.0 is perfect. A useful interpretation: AUC is the probability that a randomly chosen
positive instance is ranked higher than a randomly chosen negative instance by the classifier 26 . ROC is
great for understanding model performance across all classification thresholds, particularly in binary
classification 27 .
Let's compute some of these metrics for an example model to see them in practice. We'll train a logistic
regression on the breast cancer dataset and evaluate it:
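The training cell itself was not preserved; here is a minimal reconstruction (a sketch, with the solver settings assumed) that produces the clf model evaluated below:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

clf = LogisticRegression(max_iter=5000)   # higher max_iter so the solver converges on unscaled features
clf.fit(X_train, y_train)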
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
For a well-performing model on this dataset, we might get something like: - Confusion matrix showing most
cases on the diagonal (correct predictions), e.g., [[57, 7], [ 5, 102]] meaning 57 malignant
correctly, 7 malignant misclassified, 5 benign misclassified, 102 benign correctly (just an example output). -
High precision and recall (both maybe around 0.93+).
We can better visualize performance with a confusion matrix heatmap and an ROC curve:
Figure 2: Confusion matrix for logistic regression on the breast cancer dataset (malignant vs benign). The matrix
shows counts of true vs predicted labels. E.g., here 57 malignant cases were correctly predicted as malignant (true
positives), 7 malignant cases were predicted as benign (false negatives), 5 benign cases were predicted malignant
(false positives), and 102 benign were correctly predicted (true negatives).
From Figure 2, we can derive: - Accuracy = (TP+TN)/total = (57+102)/171 ≈ 0.93 (93%). - Precision
(malignant) = 57/(57+5) ≈ 0.919, Recall (malignant) = 57/(57+7) ≈ 0.890. (Usually we compute precision/
recall for the positive class by default, or macro-average for all classes if multi-class.)
Figure 3: ROC Curve for the logistic regression model. The curve plots True Positive Rate vs False Positive Rate as
we vary the decision threshold. The dashed line is the diagonal (random performance). The model’s curve bows
towards the top-left, indicating good performance (AUC ≈ 0.99 in this case, meaning excellent separability
between classes) 26 .
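A curve like Figure 3 can be generated from the classifier's probability scores. A minimal sketch, assuming the clf model and test split from the evaluation code above:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Use scores (probability of the positive class) rather than hard labels
y_scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
print("AUC:", roc_auc_score(y_test, y_scores))

plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], "k--", label="Random guess")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.legend(); plt.show()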
When to use which metric:
• If classes are balanced and each prediction is equally important, accuracy is fine.
• If data is imbalanced or certain errors are more costly, use precision/recall or F1. For example, in fraud detection (fraud is rare), accuracy can be very high by always predicting "not fraud", but we care about detecting the frauds (recall) while not annoying too many customers with false alarms (precision).
• ROC AUC is useful for comparing models and seeing the trade-off behavior. However, for very imbalanced data, precision-recall curves can be more informative than ROC.
For multi-class problems, we extend these concepts:
• The confusion matrix becomes NxN.
• We calculate metrics per class (one vs. rest) and can report averages (macro-average = simple average of per-class metrics, micro-average = aggregate TP/FP counts); see the short sketch after this list.
• There are also specificity (true negative rate), NPV (negative predictive value), etc., which are simply the analogous measures for the negative class.
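As a quick illustration of the averaging options, a small self-contained sketch (the labels below are made up for demonstration):

from sklearn.metrics import precision_score, confusion_matrix

# Hypothetical 3-class true labels and predictions
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0]
y_hat  = [0, 1, 1, 1, 2, 2, 0, 1, 0]

print(confusion_matrix(y_true, y_hat))                               # 3x3 matrix
print("Macro precision:", precision_score(y_true, y_hat, average="macro"))
print("Micro precision:", precision_score(y_true, y_hat, average="micro"))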
Regression metrics (covered briefly, since classification was the focus above):
For regression, common metrics include:
• Mean Squared Error (MSE): Average of squared differences between predicted and actual values. Emphasizes larger errors (outliers are heavily penalized).
• Root Mean Squared Error (RMSE): $\sqrt{MSE}$, more interpretable since it is in the same units as the target.
• Mean Absolute Error (MAE): Average of absolute differences. More robust to outliers (errors are not squared).
• R^2 (Coefficient of Determination): $1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2}$. Interpretable as the proportion of variance in Y explained by the model. Can be negative if the model is worse than always predicting the mean.
In scikit-learn:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_pred_lr = lr.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred_lr))
print("RMSE:", mean_squared_error(y_test, y_pred_lr, squared=False))
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
print("R^2:", r2_score(y_test, y_pred_lr))
One must consider the context: e.g., an RMSE of 5 (thousands of dollars) on house prices might be
acceptable or not depending on the range of prices.
At this point, we have covered an array of supervised learning algorithms and how to evaluate them. Next,
we move on to unsupervised learning, where there are no labels guiding the training.
3. Unsupervised Learning
Unsupervised learning deals with finding patterns or structure in data without any labels or targets. We will
discuss two main unsupervised tasks: Clustering and Dimensionality Reduction, along with how to
evaluate clustering results.
3.1 Clustering
Clustering is the task of grouping a set of objects such that those in the same group (cluster) are more
similar to each other than to those in other groups 17 . It is a form of pattern discovery – for example,
segmenting customers into distinct groups based on purchasing behavior, or finding communities in a
social network.
Clustering is inherently subjective; its goal is to uncover whatever structure underlies the data:
• Hard clustering: each point belongs to exactly one cluster.
• Soft (fuzzy) clustering: a point can have degrees of belonging to multiple clusters.
• Different clustering algorithms have different notions of what constitutes a cluster (e.g., spherical clusters vs. density-based clusters vs. hierarchical groupings).
3.1.1 K-Means
K-Means is a popular and simple partitioning method:
• It requires the number of clusters k to be specified in advance.
• It starts with k initial cluster centroids (randomly chosen data points, or other initialization methods like k-means++ for better results).
• It then performs expectation-maximization-style iterative refinement. In each iteration:
  - Assign step: assign each data point to the nearest centroid (nearest in Euclidean distance, usually).
  - Update step: recompute each centroid as the mean of all points assigned to it.
  - Repeat until assignments do not change or the maximum number of iterations is reached.
The objective is to minimize the sum of squared distances of points to their cluster centroid (within-cluster
variance). K-Means effectively finds clusters that are convex and roughly spherical in shape because it
uses distance to a mean (centroid) as criterion 28 .
Pros: Scalable to large datasets, easy to implement, often finds reasonably good clusters if clusters are nice
and separated.
Cons: Must choose k. Sensitive to outliers (mean shifts), and to initial seeds (bad initialization can lead to
poor results or local minima). Only finds convex clusters; cannot handle complex shapes. Also assumes
equal importance (and scaling) of features due to Euclidean distance usage.
Let's run k-means on a simple dataset for demonstration. We'll use a synthetic dataset where we know the
true clusters for illustration, and see if k-means recovers them.
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
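from sklearn.datasets import make_blobs

# (Reconstructed sketch -- the rest of this cell was lost in the export.)
# Generate 300 synthetic points around three known centers, run k-means with
# k=3, and plot the clusters with their centroids.
centers = [[0, 0], [5, 5], [0, 8]]
X_syn, y_true_syn = make_blobs(n_samples=300, centers=centers, cluster_std=1.0,
                               random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_syn)

plt.scatter(X_syn[:, 0], X_syn[:, 1], c=labels, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="X", s=200, label="centroids")
plt.legend(); plt.title("K-Means clustering of synthetic data"); plt.show()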
In the scatter plot, points are colored by their k-means cluster assignment, and red 'X' marks are the
centroids. We would expect k-means to have identified clusters near (0,0), (5,5), (0,8) if it worked well
(perhaps not perfectly if clusters had some overlap or if initialization was poor, but with distinct clusters it
should do fine).
3.1.2 Hierarchical Clustering
Hierarchical clustering does not require a pre-specified number of clusters. Instead, it creates a hierarchy of clusterings that can be represented as a tree (dendrogram). There are two main types:
• Agglomerative (bottom-up): Start with each point as its own cluster, then iteratively merge the two closest clusters until eventually all points are in one cluster 29 . "Closest" is defined by a linkage criterion (e.g., single-linkage = closest pair of points between clusters, complete-linkage = farthest pair, average-linkage = average distance between points of the clusters, etc.).
• Divisive (top-down): Start with all points in one cluster and recursively split clusters.
Agglomerative is more common. It results in a dendrogram where cutting the tree at a certain level gives a
particular number of clusters.
Pros: You get a full hierarchy; you can choose any number of clusters post-hoc by cutting the dendrogram.
Can capture nested patterns. No need to commit to a particular k upfront (though you still need to decide a
stopping criterion or cut).
Cons: Typically $O(n^2)$ complexity (computing distance matrix) so not feasible for very large n (unless
using approximation). Sensitive to noise and outliers if not handled. Merging decisions are irreversible
(greedy), so early mistaken merges can’t be corrected.
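Scikit-learn's agglomerative implementation can also be applied directly; a brief sketch, assuming the synthetic X_syn data from the k-means example above:

from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that least increases within-cluster variance
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_labels = agg.fit_predict(X_syn)
print("Cluster sizes:", np.bincount(agg_labels))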
import scipy.cluster.hierarchy as sch

# Use scipy to plot a dendrogram for a random subset
# (plotting all 300 points would be too crowded)
sample_indices = np.random.choice(len(X_syn), size=50, replace=False)
X_sample = X_syn[sample_indices]

# compute the linkage matrix
linkage_matrix = sch.linkage(X_sample, method='ward')
plt.figure(figsize=(10, 5))
sch.dendrogram(linkage_matrix)
plt.title("Hierarchical Clustering Dendrogram (sample of points)")
plt.xlabel("Sample index"); plt.ylabel("Distance")
plt.show()
The dendrogram plot shows how clusters merge at increasing distances. You could decide a distance cutoff
to get a desired cluster partition. For instance, if you see three distinct merges far apart, cutting just before
those merges might give 3 clusters.
In practice, one might decide cluster count by looking for a "gap" in distances (large jump indicates merging
two very dissimilar clusters) or other criteria like the silhouette score for different k.
3.1.3 DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method that
groups points that are closely packed together (points with many neighbors in a radius) while marking
points in sparse regions as outliers 30 .
Key concepts:
• epsilon (ε): the radius of the neighborhood.
• minPts: the minimum number of points in a neighborhood (of radius ε) for a point to be considered a core point.
• Core points: have at least minPts points (including themselves) within ε.
• Border points: not core points, but within ε of a core point (reachable from a core).
• Noise points: neither core nor border (not enough neighbors, and not near a core).

DBSCAN algorithm:
• Find all core points.
• For each core point not yet assigned to a cluster, start a new cluster and include all core points reachable (directly or transitively) from it (if point A is within ε of core B, they are in the same cluster; and B within ε of C, etc.). Border points are assigned to the cluster of a nearby core point, if applicable.
• Noise points remain unassigned.

Pros: Can find arbitrarily shaped clusters (not just convex) and identifies outliers (noise) explicitly. No need to specify k beforehand – you set ε and minPts, which can be more intuitive if you have domain knowledge of what density constitutes a cluster.

Cons: The choice of ε is critical. If density varies, a single ε may not work well (clusters in dense areas vs. sparser areas). DBSCAN also struggles in high dimensions due to the curse of dimensionality (distances become less meaningful).
Example: Consider the famous "two moons" dataset – two interleaving half-circle shapes. K-means fails to
cluster them properly (would cut across the moons), but DBSCAN can.
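Most of the comparison cell was lost in the export; here is a minimal reconstruction (a sketch, with the ε value and other parameters assumed) that generates the two-moons data, clusters it with K-Means and DBSCAN, and draws both panels; the remaining lines below complete the figure:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)   # eps chosen by eye for this data

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=km_labels, cmap="coolwarm", s=20)
axes[0].set_title("K-Means Clustering")
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=db_labels, cmap="coolwarm", s=20)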
axes[1].set_title("DBSCAN Clustering")
plt.show()
Figure 4: Clustering results on the "two moons" dataset. Left: K-Means splits the data into two clusters but the
linear boundary misassigns points (one moon is split into two parts). Right: DBSCAN correctly finds the two
crescent-shaped clusters (red and blue) and would label any isolated noise points as -1 (if present, shown as a
separate color).
In Figure 4, K-Means forced spherical clusters, which doesn't fit the curved shapes. DBSCAN (with an
appropriate ε) identifies the moons properly as separate clusters, showing its advantage for non-globular
structures. DBSCAN also automatically treated some scattered noise (if any) as outliers (cluster label -1).
3.2 Dimensionality Reduction
Often data in high dimensions (many features) can be hard to visualize and even hard for models to handle due to the curse of dimensionality (distance metrics become less informative, too many parameters, risk of overfitting). Dimensionality reduction techniques seek to reduce the number of features while preserving as much data variance or structure as possible 31 . This can be used for:
• Data compression and efficiency.
• Noise reduction.
• Visualization (reducing to 2 or 3 dimensions to plot).
• Feature extraction (deriving new composite features that capture most of the information).
Principal Component Analysis (PCA): PCA is a linear technique that finds new orthogonal axes (principal
components) that maximize the variance of the data 31 . The first principal component is the direction of
highest variance. The second is orthogonal to the first and has the next highest variance, and so on. By
projecting data onto the top k components, we get a k-dimensional representation that retains most of the
variability of the original data (information).
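As a quick check of how much variance the leading components retain, a small sketch on the Wine features (assuming the wine data loaded earlier; the standardization step is standard practice rather than something stated in the text):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA directions are driven by variance, so raw feature scales matter
X_scaled = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=2).fit(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained by 2 components:", pca.explained_variance_ratio_.sum())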
t-SNE (t-Distributed Stochastic Neighbor Embedding): a non-linear technique designed to preserve local structure (points that are close in the original space should be close in the reduced space) 32 . It converts distances into probabilities (similarity measures) and then tries to minimize the divergence between those distributions in the original and reduced spaces. It is great for plotting high-dimensional data in 2D or 3D, where clusters or manifolds can become visually apparent (like grouping of images by type, or word embeddings by semantic clusters).
• t-SNE is non-linear and probabilistic, it will highlight local clusters but may distort global
relationships (distance between far clusters isn't necessarily meaningful).
• It has parameters like perplexity (related to how it balances local vs global) and can be slow on very
large datasets.
• It's only for visualization, not for general reduction to feed into other algorithms (because it
doesn't preserve a clear metric structure for all points, mainly the local neighbor relations).
Other techniques: UMAP (Uniform Manifold Approximation and Projection) is a newer non-linear
technique, often faster than t-SNE and preserves more global structure. There are also autoencoders
(neural network based reduction), Factor Analysis, Independent Component Analysis (ICA), etc.
Example – PCA on Digits dataset: The Digits dataset (8x8 images of handwritten digits 0-9, flattened to 64
features) is 64-dimensional. We can use PCA to reduce it to 2D, and t-SNE for another 2D, and compare.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
X_digits = digits.data
y_digits = digits.target

# PCA to 2D
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(X_digits)

# t-SNE to 2D (slower; used purely for visualization)
X_tsne2 = TSNE(n_components=2, random_state=0).fit_transform(X_digits)

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
for digit in range(10):
    plt.scatter(X_pca2[y_digits==digit, 0], X_pca2[y_digits==digit, 1],
                label=str(digit), alpha=0.6)
plt.title("PCA (2D) of Digits"); plt.legend()

plt.subplot(1,2,2)
for digit in range(10):
    plt.scatter(X_tsne2[y_digits==digit, 0], X_tsne2[y_digits==digit, 1],
                label=str(digit), alpha=0.6)
plt.title("t-SNE (2D) of Digits")
plt.legend()
plt.show()
Figure 5: Comparing dimensionality reduction on the digits dataset. Left: PCA projection to 2D – some digits form
clusters but there is overlap (e.g., 3 (pink) and 8 (yellow) mix). Right: t-SNE 2D embedding – clearer separation of
digit clusters (each color) 32 . For instance, green '6's and orange '9's form distinct clusters in t-SNE, whereas in
PCA they were closer and mixed with others.
In Figure 5, PCA, being a linear method, could not completely separate all classes (digits) with just 2 components, which capture only a fraction of the total variance. t-SNE, focusing on local neighbor structure, found tighter clusters for each digit, making the grouping by digit much more visually apparent. Each cluster corresponds to a digit label fairly well.
3.3 Evaluating Clustering
Evaluating clustering is tricky since we often don’t have ground truth labels (if we did, it wouldn’t be
unsupervised!). However, if we do have some labeled data for benchmarking, we can use external indices like
ARI (Adjusted Rand Index), NMI (Normalized Mutual Information), etc., which compare the cluster
assignments to true labels.
For unlabeled data, we use internal indices that examine the structure:
• Silhouette Coefficient: For each point, silhouette = (b - a) / max(a, b), where
  - a = the average distance from the point to the others in the same cluster (cohesion),
  - b = the average distance from the point to the points in the nearest other cluster (separation).
Silhouette ranges from -1 to 1. A high value means the point is much closer to its own cluster than to others (good) 33 . We often use the mean silhouette score over all points as a measure of cluster quality. It can also help to choose k: compute the silhouette for different k and pick the highest.
• Davies–Bouldin Index (DBI): This index measures the average “similarity” between each cluster and
its most similar cluster, where similarity is defined as the ratio of within-cluster scatter to between-
cluster separation 34 . Lower DBI is better – it means clusters are compact and far from each other.
It's computed as: for each cluster i, find R_ij = (scatter_i + scatter_j) / distance(center_i, center_j) for
every other cluster j; for cluster i take D_i = max_j R_ij (the worst case similarity with another cluster);
DBI = average of all D_i. In simpler terms, clusters that are far apart and tight will have low DBI 35
36 .
• Within-cluster SSE (Sum of Squared Errors) / Inertia: Used by k-means (sum of squared distances
of points to their centroid). By itself, this always decreases with more clusters, but one can look at
the "elbow" in SSE vs k curve to choose an optimal k (point where adding another cluster doesn't
significantly reduce error).
Let’s apply silhouette and DBI to compare k-means vs DBSCAN in the earlier two-moons example
quantitatively:
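The computation cell was not preserved; a minimal sketch, assuming the X_moons, km_labels, and db_labels variables from the two-moons reconstruction above:

from sklearn.metrics import silhouette_score, davies_bouldin_score

# Compare the two clusterings with internal indices (higher silhouette, lower DBI = "better" geometrically)
for name, labels in [("K-Means", km_labels), ("DBSCAN", db_labels)]:
    print(name,
          "silhouette:", round(silhouette_score(X_moons, labels), 3),
          "Davies-Bouldin:", round(davies_bouldin_score(X_moons, labels), 3))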
In this hypothetical, K-means had a higher silhouette (point labeling was more separated in distance than
DBSCAN clusters perhaps) and lower DBI (indicating better cluster separation in its sense). Does that mean
K-means was better? Not really – we know K-means actually "split" one true cluster incorrectly (so external
metrics would judge it poorly), but silhouette/DBI being internal, evaluated on distance, might sometimes
favor a method even if clusters don't align with the actual truth we expect.
Caution: Internal metrics aren’t perfect. For example, silhouette assumes convex clusters as ideal. In the
two moons, silhouette for DBSCAN was lower possibly because the clusters are close in some parts and far
in others (and not convex). Yet DBSCAN clearly gave the more correct grouping. So use these metrics as
guides, not absolutes. Visual inspection and domain sense often play a role in unsupervised learning
evaluation.
Use of evaluation:
• If you don't know how many clusters k to use, try a range and see where the silhouette is highest or DBI is lowest (and also consider interpretability – what makes sense for your application).
• If clusters are very clear, the metrics will reflect it; if they are muddled, the metrics help confirm that (you might get low silhouettes across the board).
• Watch out for degenerate solutions: e.g., the silhouette is undefined for a single cluster, and a very high DBI can mean the clusters are not well separated at all.
We’ve now covered unsupervised techniques for clustering and reducing dimensionality, and how to assess
clustering quality. The next section introduces a different paradigm: reinforcement learning, where learning
happens via interactions and feedback rather than from static datasets.
4. Reinforcement Learning
Reinforcement Learning (RL) is a learning paradigm inspired by behavioral psychology, where an agent
learns to make a sequence of decisions by interacting with an environment. The agent observes the state
of the environment, takes an action, and receives a reward (a scalar feedback signal). The goal is to learn a
policy (mapping from states to actions) that maximizes the cumulative reward over time 6 .
Key RL terminology:
• State (s): A representation of the current situation the agent is in.
• Action (a): A choice the agent can make in a state.
• Reward (r): Feedback from the environment for taking an action in a state (can be positive, zero, or negative).
• Episode: A sequence of states, actions, and rewards, typically terminating in a terminal state (like the end of a game).
• Policy (π): The agent's strategy, π(s) = action chosen in state s (can be deterministic or stochastic).
• Value (V) of a state: The expected cumulative reward (often discounted) from that state under a given policy.
• Q-value (Q) of a state-action pair: The expected cumulative reward from taking a certain action in a given state and following the policy thereafter.
The agent's objective is to learn an optimal policy that yields the highest long-term rewards. Unlike
supervised learning, the agent often does not get told the correct action – it must explore and discover
which actions yield more reward. Also, rewards may be delayed (an action now may yield a reward much
later), making credit assignment non-trivial.
A common framework is the Markov Decision Process (MDP) where states, actions, reward function, and
transition probabilities are defined. Many RL algorithms revolve around solving MDPs.
4.1 Q-Learning
Q-Learning is a model-free RL algorithm (meaning it doesn't require knowing the environment's dynamics;
it learns from experience) that learns the action-value function Q(s,a) directly 37 . The Q-value Q(s,a)
represents how good it is to take action a in state s (in terms of expected future rewards).
It uses the Bellman equation for Q-values:
$$ Q_{\text{new}}(s,a) \leftarrow Q(s,a) + \alpha \Big[ r(s,a) + \gamma \max_{a'} Q(s', a') - Q(s,a) \Big] $$
Here:
• s = current state, a = action taken, r(s,a) = reward received, s' = next state after the action.
• $\alpha$ = learning rate (how much new information overrides old).
• $\gamma$ = discount factor (how much future rewards are worth relative to immediate rewards, between 0 and 1).
• $\max_{a'} Q(s', a')$ = best estimated future value from the next state (assuming optimal actions onward).
This update is applied each time the agent experiences a transition (s,a,r,s'). It's basically moving Q(s,a)
towards the sampled value r + γ max Q(next_state, ·) which is an estimate of what Q(s,a) should
be if we assume we are one step closer to optimal.
Over many episodes of interaction, Q-values converge to the optimal Q-function (given sufficient
exploration). The learned Q can then define an optimal policy: in any state s, choose the action a with
highest Q(s,a) (greedy policy). In practice, during learning, one uses an exploration strategy (like $
\epsilon$-greedy: with probability $\epsilon$ choose a random action to explore, otherwise choose current
best action) to ensure all state-action pairs are tried.
Characteristics: Q-learning is off-policy (it learns the optimal policy regardless of how the agent is
currently exploring; the update uses the max over next actions, not following the agent's current policy
strictly). It can handle stochastic transitions and rewards, and will find an optimal policy for any finite MDP
given infinite exploration and a decaying learning rate 38 .
Let's illustrate Q-learning with a simple example: imagine a grid world where an agent must reach a goal
cell for reward +1, and gets 0 otherwise, moving up/down/left/right with no obstacles. (This is a common
didactic example.)
import numpy as np
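# (Reconstructed setup sketch -- the original cell was lost in the export.)
# A 4x4 grid world: states 0..15, goal at the bottom-right corner (state 15),
# reward +1 for reaching the goal and 0 otherwise.
# Action encoding (an assumption): 0 = up, 1 = down, 2 = right, 3 = left.
n_rows, n_cols = 4, 4
n_states = n_rows * n_cols
n_actions = 4
goal_state = n_states - 1

alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Deterministic transition: move within the grid, staying put at walls."""
    row, col = divmod(s, n_cols)
    if a == 0:   row = max(row - 1, 0)            # up
    elif a == 1: row = min(row + 1, n_rows - 1)   # down
    elif a == 2: col = min(col + 1, n_cols - 1)   # right
    elif a == 3: col = max(col - 1, 0)            # left
    s2 = row * n_cols + col
    r = 1.0 if s2 == goal_state else 0.0
    return s2, r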
# Q-learning loop
for episode in range(1000):
    s = 0            # start at the top-left corner (state 0)
    done = False
    while not done:
        # epsilon-greedy action
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = np.argmax(Q[s])
        s2, r = step(s, a)
        # update Q
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        if s == goal_state:
            done = True
Running this, we'd expect the agent learns to go to the goal (which is at bottom-right). Likely the optimal
move from state 0 is "down" or "right" (since goal is bottom-right, an optimal policy is to always move right
or down). The Q-values for state 0 might end up something like [something, high, high,
something] where the highest corresponds to moving either right or down depending on how ties broke.
This simple example demonstrates how Q-learning iteratively improves its estimates. In early episodes Q is
all zeros; by exploring, it finds the reward eventually and propagates value back along paths leading to the
goal. After sufficient episodes, Q converges and the derived policy consistently leads to the goal.
Q-learning has been pivotal in many RL successes (often combined with function approximation like deep
neural networks – the famous "Deep Q Network (DQN)" that learned to play Atari games used Q-learning
with a deep network to approximate Q).
4.2 Value Iteration
Value Iteration is a classical dynamic programming method to compute the optimal value function for an
MDP (assuming you have a model of the environment). Unlike Q-learning, this is not learning from real
experience but rather computing solution given known probabilities of transitions and rewards.
It uses the Bellman Optimality equation for state values: $$ V_{k+1}(s) = \max_a \sum_{s'} P(s' | s, a)
[ R(s,a,s') + \gamma V_k(s') ] $$ 39
• Start with an initial value function $V_0(s)$ (e.g., zero for all states).
• At each iteration, update $V(s)$ for all states using the above formula (take the action that
maximizes expected future reward plus immediate reward, using current $V$ estimates for next
states).
• Repeat until values converge (change is below a threshold).
Once converged, you have $V^*(s)$, the optimal state values. You can then derive the optimal policy: for each state, choose the action that achieves the max in the equation above (i.e., $\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$) 40 .
Value iteration is guaranteed to converge to optimal values for discounted finite MDPs. It's effectively
performing a contraction mapping update (Bellman operator).
Policy Iteration is another approach where you iterate between evaluating a given policy (calculate V^π)
and improving the policy (greedify it with respect to current V), until no change.
Example: For a simple grid world like above (if we had the transition probabilities table), value iteration
could compute the optimal values exactly. It's like solving a system of equations via iteration.
Pseudo-code:
# Reuses the deterministic grid world (step, n_states, n_actions, goal_state, gamma) defined above
states = range(n_states)
actions = range(n_actions)
V = np.zeros(n_states)
theta = 1e-6

while True:
    delta = 0
    for s in states:
        if s == goal_state:
            continue
        max_val = float('-inf')
        for a in actions:
            s2, r = step(s, a)   # deterministic transitions, so no sum over s' is needed
            val = r + gamma * V[s2]
            if val > max_val:
                max_val = val
        delta = max(delta, abs(max_val - V[s]))
        V[s] = max_val
    if delta < theta:
        break

# Derive the greedy policy from the converged values
policy = {}
for s in states:
    best_a, best_val = None, float('-inf')
    for a in actions:
        s2, r = step(s, a)
        val = r + gamma * V[s2]
        if val > best_val:
            best_val, best_a = val, a
    policy[s] = best_a
This would yield the same result as Q-learning given the model, but via computation rather than sampling.
When to use Value Iteration: It requires knowing the environment model (transition probabilities P(s'|s,a)
and rewards). It's useful in planning problems or small MDPs where you can enumerate states and actions.
For large state spaces, it's intractable to update every state (that's where function approximation or other
methods come in). Still, understanding value iteration is foundational to understanding how optimal
policies are defined.
Reinforcement learning has achieved fame with examples like:
• Teaching agents to play games (Atari video games, Go – AlphaGo used advanced RL, OpenAI Five for Dota, etc.).
• Robotics control (learning to make a robot walk, or to control a helicopter).
• Recommendation systems and personalized decisions, which can sometimes be framed as RL (learning from sequential interaction with a user).
• Resource management and operations research problems (e.g., RL for job scheduling in computing clusters).
It is a very powerful framework, though it is more complex to tune (many hyperparameters) and often requires a large number of trial-and-error interactions with the environment to get right.
We have given an overview of RL basics and simple algorithms. A full treatment is beyond the scope of this
tutorial (it typically involves deeper study of Markov decision processes, the exploration–exploitation dilemma,
function approximation, etc.), but this provides the foundation to understand how agents can learn by
themselves with minimal guidance.
5. Datasets for Practice
• Iris Dataset: Perhaps the most famous small dataset. 150 samples of iris flowers with 4 features
(sepal length/width, petal length/width) and 3 species. Great for demonstrating classification
algorithms (it's relatively easy: linear models, trees, etc. can get >95% on it), as well as clustering
(unsupervised) and even PCA (4D to 2D visualization).
• Use in practice: Quick tests of classifiers, visual examples (pair plots, etc.). We saw it in logistic
regression and K-NN sections.
• Titanic Dataset: Passenger data from the Titanic disaster (commonly from Kaggle). Features like
age, sex, passenger class, etc., and target is survival (yes/no). Useful for demonstrating data cleaning
(it has missing ages), feature engineering (titles from names, family size), and binary classification (a
slightly imbalanced dataset).
• Use: Teaching logistic regression or decision trees on a mix of numeric and categorical features.
Good for exercises in data preprocessing and model evaluation (e.g., computing precision/recall to
judge models properly, since even a simple rule-based baseline such as predicting survival from sex
alone can reach roughly 78% accuracy).
• Wine Dataset: Chemical composition of 178 wines from 3 cultivars. 13 continuous features. It's a
nice multi-class classification set (as we used with decision tree, SVM). It’s not very large, so good for
algorithms that scale poorly, and can also be used to demonstrate PCA (13D to 2D) or clustering
(maybe the classes correspond to clusters in feature space).
• Use: Classification (e.g., try logistic regression vs. KNN vs. SVM on it), or clustering vs actual labels
comparisons.
• Digits Dataset: 1797 8x8 images of handwritten digits (0 through 9). We saw it in the PCA/t-SNE context.
It's high-dimensional (64 features) but not too large. It’s a classic for classification (a smaller, easier
precursor to MNIST). Many algorithms can achieve >95% accuracy here as well. Also good for clustering
(unsupervised) – do the images cluster by digit without labels? (Often yes, as t-SNE showed.)
• Use: Could train a simple neural network or SVM on it in an exercise, or use it for a model-comparison
exercise (try multiple classifiers, compare accuracy and confusion matrices). Also useful for
demonstrating cross-validation, since there is enough data to hold out a test set or do k-fold.
• Boston Housing Dataset: 506 instances, 13 features (various demographic and housing statistics for
suburbs of Boston in the 1970s); the target is the median home value. We used it to demonstrate regression.
It's a bit outdated (and has some ethical concerns), but still widely referenced. It's good for showing
linear vs. non-linear regression, feature importance (e.g., LASSO selecting features), etc.
• Use: Could be an exercise to perform regression, evaluate with RMSE, maybe try polynomial
regression or tree-based regression and compare.
Other interesting datasets that an instructor might mention for exercises:
• Breast Cancer Wisconsin Dataset: binary classification (benign/malignant) with 30 features. We used it in the classification metrics section. Good for classification examples (especially demonstrating model evaluation on imbalanced data – malignant cases are the minority).
• MNIST (handwritten digits, 28x28 images): too large for inclusion here but a classic for image classification.
• COCO, ImageNet, etc.: for advanced image tasks (likely beyond an 8-hour intro class, but worth mentioning as real-world scale).
• CIFAR-10: image classification of small images in 10 classes.
• 20 Newsgroups: text classification into 20 categories; good for demonstrating text feature extraction (CountVectorizer/TfidfVectorizer) and Naive Bayes or SVM on text.
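For instance, a minimal sketch of text classification on 20 Newsgroups (the two categories and the TfidfVectorizer + MultinomialNB pipeline are just one reasonable setup; fetching the data downloads it on first use):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Two categories keep the example small and fast
cats = ['sci.space', 'rec.sport.hockey']
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

text_clf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
text_clf.fit(train.data, train.target)
print("Test accuracy:", text_clf.score(test.data, test.target))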
Using real datasets grounds the theory:
• It highlights practical issues like missing values (Titanic ages, etc.), categorical variables (embark town on Titanic), and feature scaling needs (some models need scaled data).
• It demonstrates that not all models perform equally on all data (e.g., k-NN might do worse on high-dimensional data than a neural net).
• It provides a playground for exercises, e.g., "Using the Titanic dataset, build a classifier to predict survival and tune it."
Often, the best way to solidify understanding is to try things out on these datasets. So in the exercises
section next, we will leverage some of them.
6. Exercises
Now that we've covered a broad range of topics, it's time for some hands-on practice. Below are a set of
exercises corresponding to major sections. Attempt to solve them before peeking at the answers. The
solutions (in code or explanation) are provided for verification and learning.
6.1 Supervised Learning Exercises
Exercise 6.1.1 (Logistic Regression & Evaluation): Using the Titanic dataset (you can load a preprocessed
version or use sns.load_dataset('titanic') if available), train a logistic regression model to predict
survival. Then:
• Compute the confusion matrix, precision, recall, and F1-score on the test set.
• Interpret the precision and recall in context (e.g., what does a certain precision value mean for predicting survivors?).
Solution Outline: First, load and preprocess the Titanic data (handle missing ages, encode categorical variables).
Fit LogisticRegression, then use confusion_matrix, precision_score, etc., from sklearn.metrics. You might get
something like precision ~0.75 and recall ~0.70 (depending on how you preprocess and set the threshold). That
would mean "out of all people the model predicted as survived, 75% actually survived" and "the
model catches 70% of actual survivors".
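A minimal sketch of one possible solution (the feature subset and median imputation of age are illustrative choices, not the only reasonable preprocessing):

import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

titanic = sns.load_dataset("titanic")
# Simple preprocessing: a few features, median-impute age, one-hot encode sex
df = titanic[["survived", "pclass", "sex", "age", "fare"]].copy()
df["age"] = df["age"].fillna(df["age"].median())
X = pd.get_dummies(df.drop(columns="survived"), columns=["sex"], drop_first=True)
y = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))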
Exercise 6.1.2 (Compare Classifiers): On the Wine dataset, compare at least three classifiers (e.g., Logistic
Regression, Decision Tree, k-NN, SVM) using 5-fold cross-validation. Which model performs best in terms of
accuracy? Does one model significantly outperform others?
Solution Outline: Use cross_val_score for each model on load_wine() . Likely, all will do fairly well
(>90%). SVM or RandomForest might slightly outperform Logistic or k-NN. But differences might not be
huge due to small dataset. It's a chance to show how to use cross-validation for fair comparison.
Example code (with the needed imports):

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

wine = load_wine()
models = [
    ("LogReg", LogisticRegression(max_iter=1000)),
    ("DecisionTree", DecisionTreeClassifier()),
    ("KNN", KNeighborsClassifier(n_neighbors=5)),
    ("SVM", SVC(kernel='linear'))
]
for name, clf in models:
    # 5-fold cross-validation accuracy for each model
    scores = cross_val_score(clf, wine.data, wine.target, cv=5)
    print(name, "accuracy:", scores.mean())
After running, you might see e.g. LogReg ~0.97, Tree ~0.86 (could be lower due to overfitting on small
folds), KNN ~0.93, SVM ~0.98. (Note that k-NN is sensitive to feature scaling, so its score can drop
considerably if the features are not standardized first.) Conclude which is best (SVM in this hypothetical)
and note any trade-offs (e.g., the tree is lower but offers interpretability).
Exercise 6.1.3 (Hyperparameter Tuning): Use the Digits dataset and a RandomForestClassifier. Perform a
grid search to find the best n_estimators (e.g., try 50, 100, 200) and max_depth (None, 5, 10). Use 5-
fold CV on the training set to select hyperparams. What combination gives the highest validation accuracy?
What is that accuracy, and how does the model perform on a held-out test set?
Pseudo-code (a runnable sketch, assuming a held-out test split and a grid over the values given in the exercise):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_, "CV accuracy:", gs.best_score_)

best_rf = gs.best_estimator_
print("Test accuracy:", best_rf.score(X_test, y_test))
6.2 Unsupervised Learning Exercises
Exercise 6.2.1 (Clustering Evaluation): The Iris dataset (ignoring labels) can be clustered. Apply K-Means
with k=3 to the iris features. Then compare the clusters to the true species labels:
• Compute the confusion matrix between predicted cluster assignments and true labels. (You may need to permute cluster indices to match labels, since clustering labels have no inherent order.)
• Compute the silhouette score of the clustering.
• Briefly discuss how well K-Means performed in recovering the known classes.
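A minimal solution sketch (n_init and random_state are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, silhouette_score

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)

# Cluster ids are arbitrary, so the columns of this matrix may be permuted
# relative to the true species (rows)
print(confusion_matrix(iris.target, km.labels_))
print("Silhouette score:", silhouette_score(iris.data, km.labels_))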
Exercise 6.2.2 (Visualizing Clusters): Use PCA to reduce the Wine dataset (13 features) to 2 principal
components. Plot the data points in 2D and color by the wine cultivar (target label). Can the three classes be
visually separated by the first two PCs? Then try running a clustering (say Agglomerative with 3 clusters) on
the PCA-reduced data and see if those clusters align with the true labels.
Solution Outline: Compute the PCA projection and plot it. The three wine classes should show some grouping,
though with overlaps. If Agglomerative clustering is run on the 2D data, the adjusted Rand index (ARI) or a
confusion matrix can check alignment with the true labels; it may lump two of the three cultivars together if
they are not clearly separated in 2D. The exercise teaches that PCA can help visualize structure but
might not perfectly separate classes.
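A minimal sketch (standardizing the features before PCA is an added choice here, not required by the exercise):

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

wine = load_wine()
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(wine.data))

# Scatter plot of the first two PCs, colored by the true cultivar
plt.scatter(X2[:, 0], X2[:, 1], c=wine.target, cmap="viridis")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()

clusters = AgglomerativeClustering(n_clusters=3).fit_predict(X2)
print("ARI vs. true labels:", adjusted_rand_score(wine.target, clusters))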
Exercise 6.2.3 (Anomaly Detection with DBSCAN): Generate a dataset of 2D points consisting of two
obvious clusters and some random noise points far away. Use DBSCAN to cluster with appropriate eps so
that it identifies the two clusters and marks noise as outliers. How many noise points did it find? Plot the
result showing noise vs clustered points.
Solution Outline: You could sample two Gaussian blobs plus 5 random outliers placed far away. DBSCAN with a
suitably chosen eps will isolate those 5 points as noise, so ideally the number of noise points is 5 (a few
fringe points from the blobs may also be flagged). The plot will show the outliers with the label -1. This
illustrates DBSCAN’s ability to find noise.
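A sketch of one way to do this (the blob parameters, the hand-placed outliers, and eps=0.8 are illustrative; eps may need tuning for other data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Two well-separated Gaussian blobs plus 5 hand-placed far-away outliers
X, _ = make_blobs(n_samples=200, centers=[(0, 0), (5, 5)], cluster_std=0.6, random_state=0)
outliers = np.array([[12.0, -8.0], [-9.0, 12.0], [15.0, 15.0], [-10.0, -10.0], [10.0, -10.0]])
X_all = np.vstack([X, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_all)
print("Noise points found:", np.sum(labels == -1))  # ideally 5; fringe points may add a few

plt.scatter(X_all[:, 0], X_all[:, 1], c=labels, cmap="viridis")
plt.title("DBSCAN clusters (label -1 = noise)")
plt.show()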
6.3 Reinforcement Learning Exercise
Exercise 6.3.1 (Q-Learning on Grid): Consider a 3x3 grid world. The top-left cell is the start (S) and the
bottom-right cell is the goal (G). The reward is +1 at G and 0 otherwise. Moves: up/down/left/right (no
wrap-around; hitting a wall leaves the state unchanged). Implement a simple Q-learning loop (as pseudo-code
or actual code) to find an optimal policy. Roughly how many episodes does it take until the agent consistently
reaches the goal? Provide the learned policy (sequence of moves from start) and the Q-values for the start state.
Solution Outline: This is similar to the earlier example but on a 3x3 grid. It typically learns in a few hundred
episodes. The optimal policy from S is any shortest path, e.g., "right, right, down, down" (indexing from the
top-left), and Q(start, right) and Q(start, down) become the highest values. One could show either a table or an
arrow diagram. This exercise is more conceptual if coding RL feels too advanced; one can simply reason through
it or write pseudo-code (the answer should highlight the process and result, as in the earlier Q-learning code cell).
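A compact sketch of one possible implementation (the hyperparameters alpha, gamma, eps and the 500-episode budget are illustrative):

import numpy as np

n_rows = n_cols = 3
n_states, n_actions = n_rows * n_cols, 4      # actions: 0=up, 1=down, 2=left, 3=right
goal = n_states - 1                           # bottom-right cell
gamma, alpha, eps = 0.9, 0.5, 0.2

def step(s, a):
    r, c = divmod(s, n_cols)
    if a == 0:   r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, n_rows - 1)
    elif a == 2: c = max(c - 1, 0)
    else:        c = min(c + 1, n_cols - 1)
    s2 = r * n_cols + c
    return s2, (1.0 if s2 == goal else 0.0), s2 == goal

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        s = s2

print("Q-values at start state:", Q[0])
print("Greedy move at start:", ["up", "down", "left", "right"][int(np.argmax(Q[0]))])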
These exercises provide hands-on engagement with key concepts. By working through them, one should
gain a deeper understanding of applying ML algorithms and interpreting their results.
7. Best Practices for ML Projects
• Data Preprocessing & Cleaning: Real-world data is messy. Always examine your data (use
descriptive stats, visualize distributions). Handle missing values (drop, impute, or use models that
handle them). Remove or fix outliers if they are errors. Ensure data types are correct and text is
normalized if needed. This step prevents garbage input to your model (remember: Garbage in,
garbage out).
• Feature Engineering: Good features can make a simple model powerful. This could include:
• Scaling/normalizing features (especially for methods like SVM, k-NN, neural nets).
• Encoding categorical variables properly (one-hot, ordinal encoding if order matters, etc.).
• Creating new features from existing ones (e.g., combining date fields into day of week; text to word
counts).
• Dimensionality reduction or feature selection to reduce noise and redundancy.
• Considering transformations (log, sqrt, etc.) if a feature has skewed distribution.
Keep in mind the context – domain knowledge can guide what features might be relevant. Feature
engineering is often where human insight adds value beyond automated algorithms 41 .
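A minimal sketch of these preprocessing steps combined in a scikit-learn pipeline (the column names here are hypothetical, chosen only to illustrate numeric vs. categorical handling):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric = ["age", "fare"]            # hypothetical column names for illustration
categorical = ["sex", "embarked"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train)  # X_train would be a DataFrame containing the columns above

Wrapping preprocessing and model in one pipeline also prevents leakage, since the imputers and scalers are fit only on the training folds.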
• Train/Test Split (and Cross-Validation): Always set aside a test set that your model doesn't see
during training, to evaluate generalization. Use cross-validation on the training set for model
selection and hyperparameter tuning. This avoids overfitting to the test data (a common mistake is
tuning hyperparameters on the test set – this leaks information and makes the measured test performance
overly optimistic).
• Use stratified splitting for classification to maintain class proportions.
• For time series, use time-based splits (train on past, test on future) instead of random.
• Cross-validation (e.g., 5-fold) gives a more stable estimate by averaging performance over folds 42 .
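A minimal sketch of this workflow (the breast cancer dataset and the logistic-regression pipeline are just placeholders for your own data and model):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# Hold out a test set first; stratify keeps class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation on the training data only, for model selection
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
# Only after all tuning: model.fit(X_train, y_train); model.score(X_test, y_test)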
• Model Selection and Tuning: Start with simple models as baselines (e.g., linear or logistic
regression, decision stump, etc.). Then try more complex ones. Use grid search or random search to
tune hyperparameters systematically. Keep track of what combinations you've tried (it's easy to
forget). Automate what you can (e.g., using GridSearchCV). Also consider ensemble strategies if
multiple models seem to have complementary strengths.
• Sometimes an ensemble of many individually overfit models (as in bagging) can, somewhat
counterintuitively, reduce overfitting through variance reduction – this is exactly what random forests exploit.
• Feature Importance and Interpretability: Whenever possible, examine which features are
important to your model. Models like random forests provide importances, linear models have
coefficients. This can yield insights (and catch issues, like a leaked feature that makes no sense but
has high importance). Interpretability is crucial in fields like healthcare or finance. If using a complex
model, you might use techniques like SHAP values or LIME to explain individual predictions.
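A small sketch of inspecting feature importances (the breast cancer data is a placeholder; impurity-based importances are one convenient but imperfect measure):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Top 5 features by impurity-based importance
top = np.argsort(rf.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {rf.feature_importances_[i]:.3f}")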
• Evaluation Metrics aligned with Objective: We covered many metrics; choose the one that
matches the problem needs. For example, if false negatives are worse than false positives, focus on
recall (sensitivity). If data is imbalanced, accuracy is not informative – use precision/recall or AUC, etc.
In ranking/recommendation, you might use MAP or NDCG. Always think about what the numbers
mean for the application (e.g., an AUC of 0.9 might sound good, but if it's a cancer test, what is the
recall at an acceptable precision?).
• Does it perform reasonably across different segments of data? (Check performance by subgroup –
e.g., does a model do much worse for a certain demographic? That might indicate bias.)
• If possible, have a small human-verified dataset to compare.
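As an illustration of the "recall at an acceptable precision" idea, a minimal sketch using a precision-recall curve (the 95% precision target is arbitrary, and the breast cancer data is a placeholder; the positive class here is label 1):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

prec, rec, thr = precision_recall_curve(y_te, probs)
ok = prec[:-1] >= 0.95   # thresholds reaching at least 95% precision
print("Best recall at >=95% precision:",
      rec[:-1][ok].max() if ok.any() else "not reachable")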
• Iterate and Experiment: The ML workflow is iterative. You might find your model underperforms –
then analyze error cases. That analysis might suggest new features or data collection. Or maybe you
realize some labels are wrong – fix them if you can. It's a loop of improvement.
• Data Splitting for Final Model: After doing all comparisons and choosing the best approach via
cross-validation on train, you typically train that final model on the entire training set (potentially
even include the validation folds now) to maximize data usage, then do a final evaluation on the
untouched test set for reporting results. Only at this final step should you evaluate on test (to avoid
biasing earlier decisions).
• Deployment considerations: (Beyond scope of this class, but noteworthy) If a model will be
deployed:
• Ensure the pipeline can handle missing or unexpected values at prediction time.
• Monitor model performance over time (data drift can degrade a model).
• Consider how to update the model with new data (will you retrain periodically? Does the model
automatically adapt online?).
• Ethical and Bias considerations: Be mindful of biases in data. If the training data reflects historical
biases (e.g., along gender or race), the model may learn them. Fairness in ML is an active area.
Sometimes feature selection (omitting certain attributes) or bias mitigation techniques are needed.
Always think of the societal impact of errors (e.g., false rejection in loan applications affecting certain
groups more).
In summary, a successful machine learning project is not just about choosing a fancy algorithm. It's about a
rigorous process of understanding the problem, preparing data, selecting appropriate models, tuning
them, and critically evaluating them in context 9 43 . By adhering to these best practices, you set yourself
up to build models that are not only accurate but also robust and reliable.
Congratulations on making it through this comprehensive journey of Machine Learning! We covered a lot
of ground, from fundamental concepts to specific algorithms in supervised, unsupervised, and
reinforcement learning, with practical examples and exercises. With these foundations and hands-on
practice, you are well-equipped to tackle real-world machine learning problems, continually refine your
approach, and keep learning more advanced techniques. Good luck, and happy modeling!
1 01-ml-overview.key
https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/01_ml-overview_slides.pdf
16 k-nearest neighbors algorithm - Wikipedia
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
37 38 Q-learning - Wikipedia
https://en.wikipedia.org/wiki/Q-learning