Advanced Machine Learning Tutorial
This Jupyter Notebook provides an in-depth tutorial on Machine Learning, suitable for an 8-hour advanced
class. We will cover key concepts, algorithms (with code examples using scikit-learn), and practical exercises.
The topics include: an introduction to ML, supervised learning (classification and regression) and its evaluation, unsupervised learning (clustering and dimensionality reduction), reinforcement learning, and common practice datasets.
Let's begin!
"A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E." 1
In simpler terms, ML algorithms identify patterns in data and use those patterns to make predictions or
decisions on new, unseen data without being hard-coded with specific rules. Instead of writing a rigid
algorithm for every scenario, we provide a model with data and let it figure out the underlying structure.
Types of Machine Learning: ML tasks generally fall into a few broad categories: Supervised learning,
Unsupervised learning, and Reinforcement learning (among others like semi-supervised and self-
supervised, which are hybrids).
• Supervised Learning: The algorithm learns from labeled data – each training example comes with
an output or target. The model makes predictions and is corrected when those predictions are
wrong. Over time it learns the mapping from inputs to the correct output 2 . Both classification
(predicting discrete labels) and regression (predicting continuous values) are forms of supervised
learning. (Example: predicting if an email is spam or not spam, based on a labeled training set of
emails 3 .)
• Unsupervised Learning: The algorithm works with unlabeled data – it must find structure in the
data on its own. Common tasks include clustering (grouping similar data points) and
dimensionality reduction (reducing data complexity while retaining structure). The model discovers
patterns such as groupings or anomalies without explicit feedback 4 . (Example: grouping
customers into segments based on purchasing behavior without predefined categories 5 .)
• Reinforcement Learning (RL): An agent learns by interacting with an environment, taking actions
and receiving rewards or penalties as feedback 6 . The goal is to learn a policy (strategy) that
maximizes cumulative reward. Unlike supervised learning, there is no direct “right/wrong” label for
each action – the agent discovers which actions yield the most reward through trial and error.
(Example: training a game-playing AI – the agent (AI) receives positive rewards for winning or
making good moves and negative rewards for losing or making bad moves.)
Real-World Applications of ML: Machine learning is ubiquitous in modern technology. Some notable
applications include:
• Image and Speech Recognition: e.g., facial recognition in photos, voice assistants transcribing
speech to text 7 .
• Natural Language Processing: e.g., language translation services, chatbots, sentiment analysis of
text.
• Recommender Systems: e.g., movie or product recommendations on Netflix and e-commerce sites,
which learn user preferences 8 .
• Healthcare: ML models assist in medical diagnosis by analyzing images (X-rays, MRIs) or patient
data to predict diseases.
• Finance: Fraud detection systems learn to flag unusual transactions; algorithmic trading uses ML for
decision making.
• Autonomous Vehicles: Self-driving cars use reinforcement learning and supervised learning for
tasks like path planning and object detection.
These examples barely scratch the surface – ML is also used in cybersecurity, robotics, agriculture, and
many other fields to improve efficiency and outcomes.
How Machines "Learn": At the core, an ML model makes predictions and then adjusts itself based on errors. This typically involves an optimization algorithm (like gradient descent) that tweaks model parameters to better fit the data. The process often includes:
• Making predictions on the training data with the current parameters.
• Measuring how wrong those predictions are with a loss function.
• Updating the parameters in the direction that reduces the loss.
• Repeating until the loss stops improving. A minimal sketch of this loop follows.
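To make this concrete, here is a minimal gradient-descent sketch (not tied to scikit-learn, and the toy data is invented for illustration) that fits a one-parameter linear model y ≈ w·x by repeatedly nudging w against the gradient of the squared error:

import numpy as np

# Toy data: y is roughly 3*x plus noise (the "pattern" we want the model to find)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

w = 0.0      # initial parameter guess
lr = 0.1     # learning rate
for step in range(200):
    y_pred = w * x                    # 1. predict with the current parameter
    error = y_pred - y                # 2. measure the error
    grad = 2 * np.mean(error * x)     # gradient of the mean squared error w.r.t. w
    w -= lr * grad                    # 3. update the parameter to reduce the loss
print("Learned w:", w)   # should end up close to 3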
Note: High-quality data is crucial. The saying "garbage in, garbage out" holds – models trained
on biased or noisy data will produce poor predictions. Indeed, data is the foundation of
machine learning – without relevant, clean data, even advanced models cannot perform well
9 .
Now that we have a high-level idea of what ML is, let's dive deeper into the main types of learning, starting
with supervised learning.
2. Supervised Learning
In supervised learning, we have input variables X (features) and an output variable Y (target), and we use an
algorithm to learn the mapping function from X to Y such that the model’s predictions $\hat{Y}$ for new X
are as close as possible to the true Y. This section is divided into two parts: classification (predicting discrete labels) and regression (predicting continuous values).
We will explore various algorithms for each and discuss how to evaluate their performance.
2.1 Classification
Classification is the task of predicting a discrete class label for an input. For example, given attributes of a
tumor, predict whether it is "benign" or "malignant" (binary classification), or given an image of a
handwritten digit, identify which digit 0-9 it represents (multi-class classification).
A key concept in classification is the decision boundary. This is the surface (in feature space) that separates
different class predictions. Formally, a decision boundary is a hypersurface that partitions the underlying
feature space into regions, one for each class 10 . Points lying on different sides of the boundary are predicted
as different classes. For a simple 2D example, the decision boundary is a curve (or line, if the classes are
linearly separable) dividing the plane into regions (class A vs class B).
If the decision boundary is a linear hyperplane, the classes are linearly separable. More
complex boundaries can be non-linear. Different classifiers have different shaped decision
boundaries (e.g., linear models vs. decision trees vs. k-NN all produce different boundaries).
To illustrate decision boundaries, let's look at a visual comparison of several classification algorithms on a
toy 2D dataset (with two features):
Figure 1: Decision boundaries of various classifiers on example 2D datasets. Each subplot shows a classifier (e.g.,
Nearest Neighbors, Linear SVM, RBF SVM, Decision Tree, Random Forest, Neural Net, AdaBoost, Naive Bayes, QDA)
trained on a synthetic dataset with two classes (red vs blue). The colored regions represent the predicted class for
any point, and points show the training data (solid) and test data (semi-transparent). This illustrates how different
algorithms partition the feature space with different shaped boundaries 11 12 .
In Figure 1, notice how:
• Linear models (like Linear SVM, or Logistic Regression, not shown but similar to Linear SVM) produce a single straight-line boundary (or hyperplane in higher dimensions).
• k-Nearest Neighbors (k-NN) creates a very wiggly, locally defined boundary that can be quite irregular (following the training points closely).
• Decision Tree yields a piecewise-constant boundary (axis-aligned rectangular segments in 2D).
• RBF SVM (a non-linear SVM) and Neural Network can produce smooth but complex curved boundaries.
• Naive Bayes here produces roughly linear boundaries (for Gaussian NB, boundaries are quadratic curves but often close to linear).
• Ensembles like Random Forest (many trees) or AdaBoost can produce complex boundaries as well, often less smooth.
This highlights that model choice matters – some models are more flexible (able to capture complex
patterns) but may risk overfitting, while others are simpler and may underfit if the true boundary is
complex.
Now, let's discuss common classification algorithms, with brief explanations and code examples for each:
2.1.1 Logistic Regression
Despite its name, Logistic Regression is actually a classification algorithm (binary classification by default).
It models the probability of the positive class using a logistic (sigmoid) function. The model is a linear
function of the input features passed through the sigmoid, giving an output between 0 and 1 that can be
interpreted as $P(Y=1|X)$.
• Model form: $P(Y=1|X) = \sigma(w^T X + b)$, where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid. The
decision boundary corresponds to $w^T X + b = 0$ (where probability = 0.5 for binary case, often
thresholded at 0.5 to decide class).
• Training: Optimize the coefficients $w, b$ to minimize logistic loss (equivalent to maximizing
likelihood). Often solved with gradient descent. No closed-form solution (unlike linear regression)
due to the sigmoid, but convex optimization ensures a single optimum.
• Regularization: Logistic regression often uses L2 (ridge) or L1 (lasso) regularization to prevent
overfitting, especially when feature count is high.
Use cases: Logistic regression is popular for its simplicity, interpretability (the coefficients indicate feature
influence on log-odds of the outcome), and efficiency on high-dimensional sparse data (e.g., text
classification).
Let's see logistic regression in action on a simple dataset. We'll use the classic Iris dataset for demonstration (though Iris has 3 classes, scikit-learn's logistic regression can handle multi-class problems via one-vs-rest or multinomial (softmax) schemes).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Iris dataset (features: flower measurements, target: species index 0,1,2)
iris = load_iris()
X = iris.data
y = iris.target

# Binary classification example: to simplify, classify whether the species is
# "Virginica" (class 2) or not
y_binary = (y == 2).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3,
                                                    random_state=0, stratify=y_binary)

# Fit the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
In the code above, we turned Iris into a binary problem ("Virginica vs not"). The output will show accuracy
and the learned coefficients. Logistic regression would draw a linear boundary in the 4D feature space to
separate Virginica from the other species.
2.1.2 Decision Trees
A Decision Tree is a versatile supervised learning algorithm capable of both classification and regression. It
learns a set of hierarchical if-else rules to predict the target.
• Model form: A tree structure where each internal node tests a feature (e.g., "petal length > 2.5?"),
and each leaf node outputs a prediction (class label for classification, or a value for regression).
• Learning: The tree is built by recursively splitting the data on the feature and threshold that yields
the largest information gain (or equivalently, minimizes impurity like Gini index or entropy in
classification) 13 . Splitting continues until leaves are pure (all one class) or a stopping criterion is
met (max depth, min samples, etc.).
• Nature: Decision trees are non-parametric and can model complex decision boundaries by their
piecewise splits 14 . However, they easily overfit if grown too deep (essentially memorizing the
training set).
To combat overfitting, we can prune the tree (stop splitting early or trim branches post-training). Simpler
trees generalize better. An advantage of trees is interpretability – one can follow the path and see the
conditions leading to a prediction.
Use cases: Decision trees are intuitive and handle heterogeneous data (categorical and numerical features)
and missing values well. They don't require feature scaling and can capture non-linear relationships.
However, a single tree often is not the most accurate predictor compared to other methods, unless it's
boosted or in a forest.
Let's train a decision tree on a dataset. We will use the Wine dataset (classification of wine cultivars based
on chemical analysis) as an example:
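The training cell did not survive the export, so here is a minimal reconstruction (a sketch; variable names such as dtc are assumptions) that loads the Wine data, fits a depth-limited tree, and prints the quantities discussed below:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, stratify=wine.target, random_state=0)

# Restrict depth to keep the tree readable and reduce overfitting
dtc = DecisionTreeClassifier(max_depth=3, random_state=0)
dtc.fit(X_train, y_train)

print("Tree depth:", dtc.get_depth(), "Leaves:", dtc.get_n_leaves())
print("Train accuracy:", dtc.score(X_train, y_train))
print("Test accuracy:", dtc.score(X_test, y_test))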
With max_depth=3 , we restrict the complexity to prevent overfitting and for readability. The output will
show the tree’s depth and number of leaves, as well as accuracy on training vs test. (A large gap between
train and test accuracy signals overfitting; we might see train accuracy is 100% if tree was deep enough to
memorize data.)
We can visualize the trained tree (though in practice, large trees are hard to interpret fully):
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(dtc, feature_names=wine.feature_names,
          class_names=wine.target_names.astype(str), filled=True)
plt.show()
This will plot the tree structure with conditions at each node and class distributions at leaves. Each node
shows the splitting rule (e.g., "proline <= 755.5"), and colored leaves indicate predicted class.
2.1.3 Random Forests
A Random Forest is an ensemble of decision trees, built via the technique of bagging (bootstrap
aggregating) plus random feature selection. The idea is to reduce the variance of a single decision tree by
averaging many trees trained on random subsets of data and features.
• Training: We train multiple trees (e.g., 100) on bootstrap samples of the training data. Each tree is
grown typically to full depth (or with minimal pruning), but at each split, the algorithm is limited to
choose the best split among a random subset of features (this decorrelates the trees).
• Prediction: For classification, each tree votes for a class, and the forest predicts the majority vote.
For regression, it averages the predictions.
Random forests generally achieve better accuracy than individual trees and are more robust to noise. They
also come with an out-of-bag estimation for error (using those samples not in a tree’s bootstrap sample to
test that tree) and measures of feature importance (based on how much each feature split improves the
purity, averaged over trees).
Use cases: When you need a strong default classifier/regressor that works out-of-the-box, random forest is
often a good choice. It handles nonlinear relations and interactions well, requires little parameter tuning,
and tends not to overfit badly with enough trees 15 . It can still struggle with very high-dimensional sparse
data or purely linear relationships where simpler models suffice.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Test Accuracy:", clf.score(X_test, y_test))

# Feature importance (normalized values that sum to 1)
for name, importance in zip(wine.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
This will train a forest of 100 trees and report accuracy. It also prints feature importances, showing which
features the model found most informative (these are normalized values that sum to 1).
2.1.4 k-Nearest Neighbors (k-NN)
The k-Nearest Neighbors algorithm is a simple yet effective approach that makes predictions based on the
memorized training dataset. For classification, the predicted class of a new point is determined by the
majority class among its k closest training points (neighbors) 16 . "Closeness" is typically defined by a
distance metric, usually Euclidean distance in feature space.
• Training phase: There really isn’t one – k-NN simply stores the training data.
• Prediction phase: Compute distance from the new point to all training points, find the k nearest,
and take a majority vote (for classification) or average (for regression) of their outputs.
• Parameter: The choice of k (and distance metric) is important. A small k (like 1) can lead to
overfitting (very jagged decision boundary following individual points), while a large k smooths out
predictions but may underfit.
Characteristics: k-NN is intuitive and can model very irregular decision boundaries (by local voting) 17 .
However, it becomes slow as the dataset grows (for each prediction, distance to all training points must be
computed – though techniques like KD-trees or ball trees can speed this up in lower dimensions). It also
suffers if features are on different scales or if irrelevant features introduce noise in distance – feature
scaling and selection can help.
from sklearn.neighbors import KNeighborsClassifier
# use only the first two Wine features so the boundary can be plotted (the feature choice is illustrative)
X_train2, X_test2 = X_train[:, :2], X_test[:, :2]
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train2, y_train)
print("Test accuracy (k=5):", knn.score(X_test2, y_test))
We restricted to 2 features for potential plotting. The accuracy might be a bit lower than using all features,
but we can visualize the decision boundary in 2D for k-NN:
import numpy as np
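# The rest of this plotting cell was lost in the export; below is a minimal
# reconstruction (a sketch, not the original code) that draws the k-NN
# decision regions over the two selected Wine features.
import matplotlib.pyplot as plt

# Grid covering the 2D feature space
x_min, x_max = X_train2[:, 0].min() - 1, X_train2[:, 0].max() + 1
y_min, y_max = X_train2[:, 1].min() - 1, X_train2[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))

# Predict the class for every grid point and shade the regions
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap="viridis")
plt.scatter(X_train2[:, 0], X_train2[:, 1], c=y_train, cmap="viridis", edgecolor="k")
plt.xlabel(wine.feature_names[0]); plt.ylabel(wine.feature_names[1])
plt.title("k-NN (k=5) decision regions")
plt.show()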
In the contour plot above, you’ll see how k-NN partitions the feature plane into complex regions (especially
near the class boundaries, it can be quite irregular). More training points and a higher k would smooth the
boundaries.
2.1.5 Support Vector Machines (SVM)
Support Vector Machines are powerful classifiers that find the optimal separating hyperplane between
classes by maximizing the margin (distance) between the hyperplane and the nearest points (support
vectors) 18 . Key points:
• For linearly separable data, SVM finds the hyperplane with maximum margin.
• For non-linear data, SVM can employ the kernel trick: implicitly map data to a higher-dimensional
space where it is linearly separable, without explicitly computing the coordinates in that space.
Common kernels: polynomial, RBF (Gaussian), sigmoid.
• Soft-margin SVM allows some misclassifications (with a penalty) to handle overlapping classes and
improve generalization (controlled by hyperparameter C).
SVMs are effective in high dimensions, memory efficient (only support vectors matter), and can be robust.
However, they do not natively provide probabilistic outputs (though we can calibrate or use Platt scaling)
and can be slow on large datasets (time complexity can be quadratic in number of samples for training).
Let's use SVM on a dataset (say the Wine dataset again). We'll use an RBF kernel to allow non-linear decision
boundaries:
from sklearn.svm import SVC
svc = SVC(kernel='rbf', gamma='scale', C=1.0)
svc.fit(X_train, y_train)
print("Test accuracy (SVM):", svc.score(X_test, y_test))
You can experiment with kernel='linear' or 'poly' and see how it affects performance. The
gamma parameter controls the kernel width for RBF (higher gamma = smaller radius for influence of each
point, can lead to overfitting if too high).
2.1.6 Naive Bayes
Naive Bayes classifiers are simple probabilistic classifiers based on applying Bayes’ Theorem with a strong
(naive) assumption: that features are independent given the class label 19 . Despite this assumption often
being false in real data, Naive Bayes can perform surprisingly well, especially for high-dimensional
problems like text classification (where independence between words is assumed).
There are different variants depending on the distribution assumed for the features:
• Gaussian Naive Bayes: Assumes continuous features follow a Gaussian distribution per class.
• Multinomial Naive Bayes: For counts (e.g., word counts in text).
• Bernoulli Naive Bayes: For binary features (e.g., presence/absence of a word).
How it works: The model uses Bayes’ theorem: $P(C|X) \propto P(X|C) P(C)$. With the independence
assumption, $P(X|C) = \prod_{i} P(X_i | C)$. It estimates these probabilities from training data (e.g., mean/
variance for Gaussian NB, or frequency of feature values for others). Prediction is then $ \hat{y} =
\arg\max_C P(C) \prod_{i} P(x_i | C)$. Taking log (to avoid underflow) turns it into sums of log-probabilities.
Naive Bayes is extremely fast to train (just counting occurrences, basically) and to predict. It requires very
little data to estimate parameters (since it needs just per-feature statistics). However, if the independence
assumption is grossly violated, its probability estimates can be off (it may still get the class right, but the
confidence scores are unreliable).
Let's demonstrate Naive Bayes on a simple text classification task to illustrate (e.g., classifying very short
text as positive or negative sentiment). We will use MultinomialNB for a toy example:
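The cell that builds the toy corpus was lost in the export; the following minimal setup sketch (the phrases and labels are invented for illustration) defines the vec, X_docs, and y_docs objects used below:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 1 = positive sentiment, 0 = negative
train_docs = ["happy love joy", "love great happy", "hate awful bad", "hate sad awful"]
y_docs = [1, 1, 0, 0]

vec = CountVectorizer()
X_docs = vec.fit_transform(train_docs)   # bag-of-words count matrix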
nb = MultinomialNB()
nb.fit(X_docs, y_docs)
test_docs = ["happy love", "hate joy"]
X_test_docs = vec.transform(test_docs)
preds = nb.predict(X_test_docs)
print("Predictions:", preds) # expect [1, 0] perhaps
print("Predicted probabilities:", nb.predict_proba(X_test_docs))
This example converts text to features (bag-of-words) and applies NB. The predictions and probabilities
show how confident the model is that each test phrase is positive or negative. Despite the small dataset and
simplistic assumption, NB can correctly generalize that "happy love" is positive (both words are strongly
associated with the positive class in training) and "hate joy" might be predicted as negative (because "hate"
is a strong negative indicator, even though "joy" is positive, NB weighs them independently).
2.1.7 Gradient Boosting
Gradient Boosting refers to a class of ensemble techniques where new models are added sequentially to
correct the errors of the existing ensemble. Typically, this is done with decision trees as the base learners
(resulting in a Gradient Boosted Decision Trees model). Each new tree is fit to the residual errors
(gradients) of the current model, hence "gradient boosting" 20 .
Key points:
• Boosting vs Bagging: Unlike random forests (bagging), boosting builds trees sequentially, each one trying to fix the errors of the ensemble built so far. Rather than reweighting misclassified observations (as AdaBoost does), gradient boosting fits each new tree to the gradients (residuals) of the loss on the current predictions; stochastic variants optionally subsample rows and columns.
• Regularization: Boosting models have parameters like the learning rate (how much each new tree contributes), the max depth of individual trees (often kept small, e.g. 3-8), and the number of trees. A smaller learning rate with more trees usually improves generalization (at the cost of training time).
• Performance: Gradient boosting often outperforms random forests in pure predictive accuracy, especially when carefully tuned, but it is more prone to overfitting if not regularized. Techniques like shrinkage (the learning rate), row/column subsampling, and early stopping (stop adding trees when validation error stops improving) mitigate this.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Load data
cancer = load_breast_cancer()
X = cancer.data; y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
max_depth=3, random_state=0)
gbc.fit(X_train, y_train)
print("Accuracy (train):", gbc.score(X_train, y_train))
print("Accuracy (test):", gbc.score(X_test, y_test))
You will likely see a very high accuracy on both train and test (breast cancer is an “easy” dataset, and
boosting is powerful). If the train accuracy is significantly higher than test, that indicates some overfitting;
one might lower max_depth or learning_rate or use fewer estimators to remedy that.
Feature Importance: Like random forests, boosted trees also provide feature importance. You can inspect
gbc.feature_importances_ to see which features are most used in the splits.
Note: Modern gradient boosting libraries like XGBoost or LightGBM often have further optimizations and
conveniences (handling missing data, GPU training, etc.). But the essence is the same.
2.2 Regression
Regression is about predicting a continuous numeric value for each input. Many algorithms are similar to
classification counterparts but optimized for a numeric target with appropriate loss (e.g., mean squared
error).
Some common regression algorithms:
• Linear Regression: A fundamental model that assumes a linear relationship between inputs and the output. It minimizes the sum of squared errors and has an analytical solution (the normal equation) if no regularization is used. It is highly interpretable (coefficients show the effect of each feature) but limited to linear trends (which can be extended via polynomial features).
• Decision Tree Regression: A decision tree whose leaves predict the average (or median) value of the training samples in that leaf. Tends to create piecewise-constant prediction regions.
• Random Forest Regression: An ensemble of trees averaged to yield smoother predictions than a single tree.
• k-NN Regression: Predict by averaging the values of the nearest neighbors.
• Support Vector Regression (SVR): SVM adapted for regression, with a margin of tolerance (epsilon-insensitive loss).
• Gradient Boosting Regression: Analogous to boosting for classification, but optimizing squared error, absolute error, etc.
The process and considerations (overfitting, feature scaling for some models, etc.) are similar to
classification. The evaluation metrics differ (we use error metrics, see next section).
Example – Linear Regression on Boston Housing dataset: The Boston Housing dataset (historic dataset
of house prices in Boston areas) is a classic regression benchmark. It has features like average number of
rooms, crime rate, etc., and target is median house value in $1000s.
(Note: Scikit-Learn’s load_boston was deprecated due to ethical concerns and removed in scikit-learn 1.2; in practice one might use the California housing dataset or another. We'll proceed with Boston for demonstration.)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston   # removed in scikit-learn >= 1.2; use fetch_california_housing there

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                    random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)
print("Train R^2:", lr.score(X_train, y_train))
print("Test R^2:", lr.score(X_test, y_test))
print("Coefficients:", lr.coef_)
The .score method for regression returns the coefficient of determination $R^2$ (which is 1 - (MSE of
model)/(MSE of trivial mean model)). An $R^2$ of 1 is perfect fit, 0 means model is no better than predicting
the mean of y, and negative means it's worse than that mean prediction baseline.
The coefficients printed show the linear relationship learned: for each feature, holding others constant, how
many units the house price changes per unit increase in that feature (according to the model). For example,
a coefficient of -2 on “RAD” feature would mean if the accessibility to radial highways goes up by 1 (with
other features fixed), the predicted price goes down by $2k, suggesting perhaps houses near more
highways are cheaper (just an interpretation).
We can also examine non-linear or ensemble regression. Let's try a Random Forest Regressor on the same
data for comparison:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Test R^2 (Random Forest):", rf.score(X_test, y_test))
Often the random forest will have a higher $R^2$ than linear regression if the true relationships are non-
linear or involve complex interactions.
2.3 Model Evaluation
When we build models, we need ways to evaluate their performance in order to choose the best model
and to know if the model is good enough for our application. For classification, there are several important
metrics:
• Accuracy: The fraction of predictions the model got right. $$\text{Accuracy} = \frac{TP + TN}{TP + TN
+ FP + FN}$$ where TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.
Accuracy is intuitive but can be misleading in imbalanced datasets (e.g., 99% accuracy could mean
it’s just predicting the majority class always, ignoring the minority class).
• Precision: Out of all instances predicted as a certain class (say "positive"), how many were actually
that class? It’s a measure of exactness – low precision means many false positives. Formally, for
positive class, $\text{Precision} = \frac{TP}{TP + FP}$ 21 . High precision is important in scenarios
where false alarms are costly (e.g., precision of a spam detector – if low, you’d mislabel important
emails as spam).
• Recall (Sensitivity or True Positive Rate): Out of all actual instances of a class, how many did the
model correctly identify? It’s a measure of completeness – low recall means many false negatives. $
\text{Recall} = \frac{TP}{TP + FN}$ 22 . High recall is crucial when missing a positive case is very bad
(e.g., recall of a cancer diagnostic test – we want to catch as many cases as possible).
• F1-Score: The harmonic mean of precision and recall: $F1 = 2 \frac{\text{Precision} \cdot
\text{Recall}}{\text{Precision} + \text{Recall}}$ 23 . It gives a single score that balances both
concerns. A high F1 means both precision and recall are reasonably high. This is useful for
comparing models when one may have higher precision but lower recall, and another vice versa.
• Confusion Matrix: A table of model predictions vs. actual values. For binary classification, it is a 2x2 matrix:

                   Predicted Pos   Predicted Neg
    Actual Pos          TP              FN
    Actual Neg          FP              TN

It shows the counts in each category. The confusion matrix gives a fuller picture of performance, from which you can derive all the metrics above 24 25 .
• ROC Curve (Receiver Operating Characteristic): This is a plot of the True Positive Rate (Recall)
against the False Positive Rate (FPR = FP/(FP+TN)) at various threshold settings of the classifier. It
characterizes the trade-off between sensitivity and specificity. The AUC (Area Under the ROC Curve)
is often used as a threshold-independent summary of the classifier’s performance. An AUC of 0.5 is
random guessing, 1.0 is perfect. A useful interpretation: AUC is the probability that a randomly chosen
positive instance is ranked higher than a randomly chosen negative instance by the classifier 26 . ROC is
great for understanding model performance across all classification thresholds, particularly in binary
classification 27 .
Let's compute some of these metrics for an example model to see them in practice. We'll train a logistic
regression on the breast cancer dataset and evaluate it:
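The training cell itself was not preserved; here is a minimal reconstruction (a sketch, with the solver settings assumed) that produces the clf model evaluated below:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

clf = LogisticRegression(max_iter=5000)   # higher max_iter so the solver converges on unscaled features
clf.fit(X_train, y_train)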
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
For a well-performing model on this dataset, we might get something like: - Confusion matrix showing most
cases on the diagonal (correct predictions), e.g., [[57, 7], [ 5, 102]] meaning 57 malignant
correctly, 7 malignant misclassified, 5 benign misclassified, 102 benign correctly (just an example output). -
High precision and recall (both maybe around 0.93+).
We can better visualize performance with a confusion matrix heatmap and an ROC curve:
Figure 2: Confusion matrix for logistic regression on the breast cancer dataset (malignant vs benign). The matrix
shows counts of true vs predicted labels. E.g., here 57 malignant cases were correctly predicted as malignant (true
positives), 7 malignant cases were predicted as benign (false negatives), 5 benign cases were predicted malignant
(false positives), and 102 benign were correctly predicted (true negatives).
From Figure 2, we can derive: - Accuracy = (TP+TN)/total = (57+102)/171 ≈ 0.93 (93%). - Precision
(malignant) = 57/(57+5) ≈ 0.919, Recall (malignant) = 57/(57+7) ≈ 0.890. (Usually we compute precision/
recall for the positive class by default, or macro-average for all classes if multi-class.)
Figure 3: ROC Curve for the logistic regression model. The curve plots True Positive Rate vs False Positive Rate as
we vary the decision threshold. The dashed line is the diagonal (random performance). The model’s curve bows
towards the top-left, indicating good performance (AUC ≈ 0.99 in this case, meaning excellent separability
between classes) 26 .
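A curve like Figure 3 can be generated from the classifier's probability scores. A minimal sketch, assuming the clf model and test split from the evaluation code above:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Use scores (probability of the positive class) rather than hard labels
y_scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
print("AUC:", roc_auc_score(y_test, y_scores))

plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], "k--", label="Random guess")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.legend(); plt.show()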
When to use which metric:
• If classes are balanced and each prediction is equally important, accuracy is fine.
• If data is imbalanced or certain errors are more costly, use precision/recall or F1. For example, in fraud detection (fraud is rare), accuracy can be very high by always predicting "not fraud", but we care about detecting the frauds (recall) while not annoying too many customers with false alarms (precision).
• ROC AUC is useful for comparing models and seeing the trade-off behavior. However, for very imbalanced data, precision-recall curves can be more informative than ROC.
For multi-class problems, we extend these concepts:
• The confusion matrix becomes NxN.
• We calculate metrics per class (one vs. rest) and can report averages (macro-average = simple average of per-class metrics, micro-average = aggregate TP/FP counts); see the short sketch after this list.
• There are also specificity (true negative rate), NPV (negative predictive value), etc., which are simply the analogous measures for the negative class.
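As a quick illustration of the averaging options, a small self-contained sketch (the labels below are made up for demonstration):

from sklearn.metrics import precision_score, confusion_matrix

# Hypothetical 3-class true labels and predictions
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0]
y_hat  = [0, 1, 1, 1, 2, 2, 0, 1, 0]

print(confusion_matrix(y_true, y_hat))                               # 3x3 matrix
print("Macro precision:", precision_score(y_true, y_hat, average="macro"))
print("Micro precision:", precision_score(y_true, y_hat, average="micro"))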
Regression metrics (covered briefly, since classification was the focus above):
For regression, common metrics include:
• Mean Squared Error (MSE): Average of squared differences between predicted and actual values. Emphasizes larger errors (outliers are heavily penalized).
• Root Mean Squared Error (RMSE): $\sqrt{MSE}$, more interpretable since it is in the same units as the target.
• Mean Absolute Error (MAE): Average of absolute differences. More robust to outliers (errors are not squared).
• R^2 (Coefficient of Determination): $1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2}$. Interpretable as the proportion of variance in Y explained by the model. Can be negative if the model is worse than always predicting the mean.
In scikit-learn:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_pred_lr = lr.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred_lr))
print("RMSE:", mean_squared_error(y_test, y_pred_lr, squared=False))
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
print("R^2:", r2_score(y_test, y_pred_lr))
One must consider the context: e.g., an RMSE of 5 (thousands of dollars) on house prices might be
acceptable or not depending on the range of prices.
At this point, we have covered an array of supervised learning algorithms and how to evaluate them. Next,
we move on to unsupervised learning, where there are no labels guiding the training.
3. Unsupervised Learning
Unsupervised learning deals with finding patterns or structure in data without any labels or targets. We will
discuss two main unsupervised tasks: Clustering and Dimensionality Reduction, along with how to
evaluate clustering results.
3.1 Clustering
Clustering is the task of grouping a set of objects such that those in the same group (cluster) are more
similar to each other than to those in other groups 17 . It is a form of pattern discovery – for example,
segmenting customers into distinct groups based on purchasing behavior, or finding communities in a
social network.
Clustering is inherently subjective; its goal is to uncover whatever structure underlies the data:
• Hard clustering: each point belongs to exactly one cluster.
• Soft (fuzzy) clustering: a point can have degrees of belonging to multiple clusters.
• Different clustering algorithms have different notions of what constitutes a cluster (e.g., spherical clusters vs. density-based clusters vs. hierarchical groupings).
3.1.1 K-Means
K-Means is a popular and simple partitioning method:
• It requires the number of clusters k to be specified in advance.
• It starts with k initial cluster centroids (randomly chosen data points, or other initialization methods like k-means++ for better results).
• It then performs expectation-maximization-style iterative refinement. In each iteration:
  - Assign step: assign each data point to the nearest centroid (nearest in Euclidean distance, usually).
  - Update step: recompute each centroid as the mean of all points assigned to it.
  - Repeat until assignments do not change or the maximum number of iterations is reached.
The objective is to minimize the sum of squared distances of points to their cluster centroid (within-cluster
variance). K-Means effectively finds clusters that are convex and roughly spherical in shape because it
uses distance to a mean (centroid) as criterion 28 .
Pros: Scalable to large datasets, easy to implement, often finds reasonably good clusters if clusters are nice
and separated.
Cons: Must choose k. Sensitive to outliers (mean shifts), and to initial seeds (bad initialization can lead to
poor results or local minima). Only finds convex clusters; cannot handle complex shapes. Also assumes
equal importance (and scaling) of features due to Euclidean distance usage.
Let's run k-means on a simple dataset for demonstration. We'll use a synthetic dataset where we know the
true clusters for illustration, and see if k-means recovers them.
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
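from sklearn.datasets import make_blobs

# (Reconstructed sketch -- the rest of this cell was lost in the export.)
# Generate 300 synthetic points around three known centers, run k-means with
# k=3, and plot the clusters with their centroids.
centers = [[0, 0], [5, 5], [0, 8]]
X_syn, y_true_syn = make_blobs(n_samples=300, centers=centers, cluster_std=1.0,
                               random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_syn)

plt.scatter(X_syn[:, 0], X_syn[:, 1], c=labels, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="X", s=200, label="centroids")
plt.legend(); plt.title("K-Means clustering of synthetic data"); plt.show()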
In the scatter plot, points are colored by their k-means cluster assignment, and red 'X' marks are the
centroids. We would expect k-means to have identified clusters near (0,0), (5,5), (0,8) if it worked well
(perhaps not perfectly if clusters had some overlap or if initialization was poor, but with distinct clusters it
should do fine).
3.1.2 Hierarchical Clustering
Hierarchical clustering does not require a pre-specified number of clusters. Instead, it creates a hierarchy of clusterings that can be represented as a tree (dendrogram). There are two main types:
• Agglomerative (bottom-up): Start with each point as its own cluster, then iteratively merge the two closest clusters until eventually all points are in one cluster 29 . "Closest" is defined by a linkage criterion (e.g., single-linkage = closest pair of points between clusters, complete-linkage = farthest pair, average-linkage = average distance between points of the clusters, etc.).
• Divisive (top-down): Start with all points in one cluster and recursively split clusters.
Agglomerative is more common. It results in a dendrogram where cutting the tree at a certain level gives a
particular number of clusters.
Pros: You get a full hierarchy; you can choose any number of clusters post-hoc by cutting the dendrogram.
Can capture nested patterns. No need to commit to a particular k upfront (though you still need to decide a
stopping criterion or cut).
Cons: Typically $O(n^2)$ complexity (computing distance matrix) so not feasible for very large n (unless
using approximation). Sensitive to noise and outliers if not handled. Merging decisions are irreversible
(greedy), so early mistaken merges can’t be corrected.
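Scikit-learn's agglomerative implementation can also be applied directly; a brief sketch, assuming the synthetic X_syn data from the k-means example above:

from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that least increases within-cluster variance
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_labels = agg.fit_predict(X_syn)
print("Cluster sizes:", np.bincount(agg_labels))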
import scipy.cluster.hierarchy as sch

# Use scipy to plot a dendrogram for a random subset
# (plotting all 300 points would be too crowded)
sample_indices = np.random.choice(len(X_syn), size=50, replace=False)
X_sample = X_syn[sample_indices]

# compute the linkage matrix
linkage_matrix = sch.linkage(X_sample, method='ward')
plt.figure(figsize=(10, 5))
sch.dendrogram(linkage_matrix)
plt.title("Hierarchical Clustering Dendrogram (sample of points)")
plt.xlabel("Sample index"); plt.ylabel("Distance")
plt.show()
The dendrogram plot shows how clusters merge at increasing distances. You could decide a distance cutoff
to get a desired cluster partition. For instance, if you see three distinct merges far apart, cutting just before
those merges might give 3 clusters.
In practice, one might decide cluster count by looking for a "gap" in distances (large jump indicates merging
two very dissimilar clusters) or other criteria like the silhouette score for different k.
3.1.3 DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method that
groups points that are closely packed together (points with many neighbors in a radius) while marking
points in sparse regions as outliers 30 .
Key concepts:
• epsilon (ε): the radius of the neighborhood.
• minPts: the minimum number of points in a neighborhood (of radius ε) for a point to be considered a core point.
• Core points: have at least minPts points (including themselves) within ε.
• Border points: not core points, but within ε of a core point (reachable from a core).
• Noise points: neither core nor border (not enough neighbors, and not near a core).

DBSCAN algorithm:
• Find all core points.
• For each core point not yet assigned to a cluster, start a new cluster and include all core points reachable (directly or transitively) from it (if point A is within ε of core B, they are in the same cluster; and B within ε of C, etc.). Border points are assigned to the cluster of a nearby core point, if applicable.
• Noise points remain unassigned.

Pros: Can find arbitrarily shaped clusters (not just convex) and identifies outliers (noise) explicitly. No need to specify k beforehand – you set ε and minPts, which can be more intuitive if you have domain knowledge of what density constitutes a cluster.

Cons: The choice of ε is critical. If density varies, a single ε may not work well (clusters in dense areas vs. sparser areas). DBSCAN also struggles in high dimensions due to the curse of dimensionality (distances become less meaningful).
Example: Consider the famous "two moons" dataset – two interleaving half-circle shapes. K-means fails to
cluster them properly (would cut across the moons), but DBSCAN can.
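Most of the comparison cell was lost in the export; here is a minimal reconstruction (a sketch, with the ε value and other parameters assumed) that generates the two-moons data, clusters it with K-Means and DBSCAN, and draws both panels; the remaining lines below complete the figure:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)   # eps chosen by eye for this data

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=km_labels, cmap="coolwarm", s=20)
axes[0].set_title("K-Means Clustering")
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=db_labels, cmap="coolwarm", s=20)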
axes[1].set_title("DBSCAN Clustering")
plt.show()
Figure 4: Clustering results on the "two moons" dataset. Left: K-Means splits the data into two clusters but the
linear boundary misassigns points (one moon is split into two parts). Right: DBSCAN correctly finds the two
crescent-shaped clusters (red and blue) and would label any isolated noise points as -1 (if present, shown as a
separate color).
In Figure 4, K-Means forced spherical clusters, which doesn't fit the curved shapes. DBSCAN (with an
appropriate ε) identifies the moons properly as separate clusters, showing its advantage for non-globular
structures. DBSCAN also automatically treated some scattered noise (if any) as outliers (cluster label -1).
3.2 Dimensionality Reduction
Often data in high dimensions (many features) can be hard to visualize and even hard for models to handle due to the curse of dimensionality (distance metrics become less informative, too many parameters, risk of overfitting). Dimensionality reduction techniques seek to reduce the number of features while preserving as much data variance or structure as possible 31 . This can be used for:
• Data compression and efficiency.
• Noise reduction.
• Visualization (reducing to 2 or 3 dimensions to plot).
• Feature extraction (deriving new composite features that capture most of the information).
Principal Component Analysis (PCA): PCA is a linear technique that finds new orthogonal axes (principal
components) that maximize the variance of the data 31 . The first principal component is the direction of
highest variance. The second is orthogonal to the first and has the next highest variance, and so on. By
projecting data onto the top k components, we get a k-dimensional representation that retains most of the
variability of the original data (information).
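As a quick check of how much variance the leading components retain, a small sketch on the Wine features (assuming the wine data loaded earlier; the standardization step is standard practice rather than something stated in the text):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA directions are driven by variance, so raw feature scales matter
X_scaled = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=2).fit(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained by 2 components:", pca.explained_variance_ratio_.sum())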
t-SNE (t-Distributed Stochastic Neighbor Embedding): a non-linear technique designed to preserve local structure (points that are close in the original space should be close in the reduced space) 32 . It converts distances into probabilities (similarity measures) and then tries to minimize the divergence between those distributions in the original and reduced spaces. It is great for plotting high-dimensional data in 2D or 3D, where clusters or manifolds can become visually apparent (like grouping of images by type, or word embeddings by semantic clusters).
• t-SNE is non-linear and probabilistic, it will highlight local clusters but may distort global
relationships (distance between far clusters isn't necessarily meaningful).
• It has parameters like perplexity (related to how it balances local vs global) and can be slow on very
large datasets.
• It's only for visualization, not for general reduction to feed into other algorithms (because it
doesn't preserve a clear metric structure for all points, mainly the local neighbor relations).
Other techniques: UMAP (Uniform Manifold Approximation and Projection) is a newer non-linear
technique, often faster than t-SNE and preserves more global structure. There are also autoencoders
(neural network based reduction), Factor Analysis, Independent Component Analysis (ICA), etc.
Example – PCA on Digits dataset: The Digits dataset (8x8 images of handwritten digits 0-9, flattened to 64
features) is 64-dimensional. We can use PCA to reduce it to 2D, and t-SNE for another 2D, and compare.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
X_digits = digits.data
y_digits = digits.target

# PCA to 2D
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(X_digits)

# t-SNE to 2D (slower; used purely for visualization)
X_tsne2 = TSNE(n_components=2, random_state=0).fit_transform(X_digits)

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
for digit in range(10):
    plt.scatter(X_pca2[y_digits==digit, 0], X_pca2[y_digits==digit, 1],
                label=str(digit), alpha=0.6)
plt.title("PCA (2D) of Digits"); plt.legend()

plt.subplot(1,2,2)
for digit in range(10):
    plt.scatter(X_tsne2[y_digits==digit, 0], X_tsne2[y_digits==digit, 1],
                label=str(digit), alpha=0.6)
plt.title("t-SNE (2D) of Digits")
plt.legend()
plt.show()
Figure 5: Comparing dimensionality reduction on the digits dataset. Left: PCA projection to 2D – some digits form
clusters but there is overlap (e.g., 3 (pink) and 8 (yellow) mix). Right: t-SNE 2D embedding – clearer separation of
digit clusters (each color) 32 . For instance, green '6's and orange '9's form distinct clusters in t-SNE, whereas in
PCA they were closer and mixed with others.
In Figure 5, PCA, being a linear method, could not completely separate all classes (digits) with just 2 components, which capture only a fraction of the total variance. t-SNE, focusing on local neighbor structure, found tighter clusters for each digit, making the grouping by digit much more visually apparent. Each cluster corresponds to a digit label fairly well.
3.3 Evaluating Clustering
Evaluating clustering is tricky since we often don’t have ground truth labels (if we did, it wouldn’t be
unsupervised!). However, if we do have some labeled data for benchmarking, we can use external indices like
ARI (Adjusted Rand Index), NMI (Normalized Mutual Information), etc., which compare the cluster
assignments to true labels.
For unlabeled data, we use internal indices that examine the structure:
• Silhouette Coefficient: For each point, silhouette = (b - a) / max(a, b), where
  - a = the average distance from the point to the others in the same cluster (cohesion),
  - b = the average distance from the point to the points in the nearest other cluster (separation).
Silhouette ranges from -1 to 1. A high value means the point is much closer to its own cluster than to others (good) 33 . We often use the mean silhouette score over all points as a measure of cluster quality. It can also help to choose k: compute the silhouette for different k and pick the highest.
• Davies–Bouldin Index (DBI): This index measures the average “similarity” between each cluster and
its most similar cluster, where similarity is defined as the ratio of within-cluster scatter to between-
cluster separation 34 . Lower DBI is better – it means clusters are compact and far from each other.
It's computed as: for each cluster i, find R_ij = (scatter_i + scatter_j) / distance(center_i, center_j) for
every other cluster j; for cluster i take D_i = max_j R_ij (the worst case similarity with another cluster);
DBI = average of all D_i. In simpler terms, clusters that are far apart and tight will have low DBI 35
36 .
• Within-cluster SSE (Sum of Squared Errors) / Inertia: Used by k-means (sum of squared distances
of points to their centroid). By itself, this always decreases with more clusters, but one can look at
the "elbow" in SSE vs k curve to choose an optimal k (point where adding another cluster doesn't
significantly reduce error).
Let’s apply silhouette and DBI to compare k-means vs DBSCAN in the earlier two-moons example
quantitatively:
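The computation cell was not preserved; a minimal sketch, assuming the X_moons, km_labels, and db_labels variables from the two-moons reconstruction above:

from sklearn.metrics import silhouette_score, davies_bouldin_score

# Compare the two clusterings with internal indices (higher silhouette, lower DBI = "better" geometrically)
for name, labels in [("K-Means", km_labels), ("DBSCAN", db_labels)]:
    print(name,
          "silhouette:", round(silhouette_score(X_moons, labels), 3),
          "Davies-Bouldin:", round(davies_bouldin_score(X_moons, labels), 3))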
In this hypothetical, K-means had a higher silhouette (point labeling was more separated in distance than
DBSCAN clusters perhaps) and lower DBI (indicating better cluster separation in its sense). Does that mean
K-means was better? Not really – we know K-means actually "split" one true cluster incorrectly (so external
metrics would judge it poorly), but silhouette/DBI being internal, evaluated on distance, might sometimes
favor a method even if clusters don't align with the actual truth we expect.
Caution: Internal metrics aren’t perfect. For example, silhouette assumes convex clusters as ideal. In the
two moons, silhouette for DBSCAN was lower possibly because the clusters are close in some parts and far
in others (and not convex). Yet DBSCAN clearly gave the more correct grouping. So use these metrics as
guides, not absolutes. Visual inspection and domain sense often play a role in unsupervised learning
evaluation.
Use of evaluation:
• If you don't know how many clusters k to use, try a range and see where the silhouette is highest or DBI is lowest (and also consider interpretability – what makes sense for your application).
• If clusters are very clear, the metrics will reflect it; if they are muddled, the metrics help confirm that (you might get low silhouettes across the board).
• Watch out for degenerate solutions: e.g., the silhouette is undefined for a single cluster, and a very high DBI can mean the clusters are not well separated at all.
We’ve now covered unsupervised techniques for clustering and reducing dimensionality, and how to assess
clustering quality. The next section introduces a different paradigm: reinforcement learning, where learning
happens via interactions and feedback rather than from static datasets.
4. Reinforcement Learning
Reinforcement Learning (RL) is a learning paradigm inspired by behavioral psychology, where an agent
learns to make a sequence of decisions by interacting with an environment. The agent observes the state
of the environment, takes an action, and receives a reward (a scalar feedback signal). The goal is to learn a
policy (mapping from states to actions) that maximizes the cumulative reward over time 6 .
Key RL terminology:
• State (s): A representation of the current situation the agent is in.
• Action (a): A choice the agent can make in a state.
• Reward (r): Feedback from the environment for taking an action in a state (can be positive, zero, or negative).
• Episode: A sequence of states, actions, and rewards, typically terminating in a terminal state (like the end of a game).
• Policy (π): The agent's strategy, π(s) = action chosen in state s (can be deterministic or stochastic).
• Value (V) of a state: The expected cumulative reward (often discounted) from that state under a given policy.
• Q-value (Q) of a state-action pair: The expected cumulative reward from taking a certain action in a given state and following the policy thereafter.
The agent's objective is to learn an optimal policy that yields the highest long-term rewards. Unlike
supervised learning, the agent often does not get told the correct action – it must explore and discover
which actions yield more reward. Also, rewards may be delayed (an action now may yield a reward much
later), making credit assignment non-trivial.
A common framework is the Markov Decision Process (MDP) where states, actions, reward function, and
transition probabilities are defined. Many RL algorithms revolve around solving MDPs.
4.1 Q-Learning
Q-Learning is a model-free RL algorithm (meaning it doesn't require knowing the environment's dynamics;
it learns from experience) that learns the action-value function Q(s,a) directly 37 . The Q-value Q(s,a)
represents how good it is to take action a in state s (in terms of expected future rewards).
It uses the Bellman equation for Q-values:
$$ Q_{\text{new}}(s,a) \leftarrow Q(s,a) + \alpha \Big[ r(s,a) + \gamma \max_{a'} Q(s', a') - Q(s,a) \Big] $$
Here:
• s = current state, a = action taken, r(s,a) = reward received, s' = next state after the action.
• $\alpha$ = learning rate (how much new information overrides old).
• $\gamma$ = discount factor (how much future rewards are worth relative to immediate rewards, between 0 and 1).
• $\max_{a'} Q(s', a')$ = best estimated future value from the next state (assuming optimal actions onward).
This update is applied each time the agent experiences a transition (s,a,r,s'). It's basically moving Q(s,a)
towards the sampled value r + γ max Q(next_state, ·) which is an estimate of what Q(s,a) should
be if we assume we are one step closer to optimal.
Over many episodes of interaction, Q-values converge to the optimal Q-function (given sufficient
exploration). The learned Q can then define an optimal policy: in any state s, choose the action a with
highest Q(s,a) (greedy policy). In practice, during learning, one uses an exploration strategy (like $
\epsilon$-greedy: with probability $\epsilon$ choose a random action to explore, otherwise choose current
best action) to ensure all state-action pairs are tried.
Characteristics: Q-learning is off-policy (it learns the optimal policy regardless of how the agent is
currently exploring; the update uses the max over next actions, not following the agent's current policy
strictly). It can handle stochastic transitions and rewards, and will find an optimal policy for any finite MDP
given infinite exploration and a decaying learning rate 38 .
Let's illustrate Q-learning with a simple example: imagine a grid world where an agent must reach a goal
cell for reward +1, and gets 0 otherwise, moving up/down/left/right with no obstacles. (This is a common
didactic example.)
import numpy as np
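# (Reconstructed setup sketch -- the original cell was lost in the export.)
# A 4x4 grid world: states 0..15, goal at the bottom-right corner (state 15),
# reward +1 for reaching the goal and 0 otherwise.
# Action encoding (an assumption): 0 = up, 1 = down, 2 = right, 3 = left.
n_rows, n_cols = 4, 4
n_states = n_rows * n_cols
n_actions = 4
goal_state = n_states - 1

alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Deterministic transition: move within the grid, staying put at walls."""
    row, col = divmod(s, n_cols)
    if a == 0:   row = max(row - 1, 0)            # up
    elif a == 1: row = min(row + 1, n_rows - 1)   # down
    elif a == 2: col = min(col + 1, n_cols - 1)   # right
    elif a == 3: col = max(col - 1, 0)            # left
    s2 = row * n_cols + col
    r = 1.0 if s2 == goal_state else 0.0
    return s2, r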
# Q-learning loop
for episode in range(1000):
    s = 0            # start at the top-left corner (state 0)
    done = False
    while not done:
        # epsilon-greedy action
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = np.argmax(Q[s])
        s2, r = step(s, a)
        # update Q
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        if s == goal_state:
            done = True
Running this, we'd expect the agent learns to go to the goal (which is at bottom-right). Likely the optimal
move from state 0 is "down" or "right" (since goal is bottom-right, an optimal policy is to always move right
or down). The Q-values for state 0 might end up something like [something, high, high,
something] where the highest corresponds to moving either right or down depending on how ties broke.
This simple example demonstrates how Q-learning iteratively improves its estimates. In early episodes Q is
all zeros; by exploring, it finds the reward eventually and propagates value back along paths leading to the
goal. After sufficient episodes, Q converges and the derived policy consistently leads to the goal.
Q-learning has been pivotal in many RL successes (often combined with function approximation like deep
neural networks – the famous "Deep Q Network (DQN)" that learned to play Atari games used Q-learning
with a deep network to approximate Q).
4.2 Value Iteration
Value Iteration is a classical dynamic programming method to compute the optimal value function for an
MDP (assuming you have a model of the environment). Unlike Q-learning, this is not learning from real
experience but rather computing solution given known probabilities of transitions and rewards.
It uses the Bellman Optimality equation for state values: $$ V_{k+1}(s) = \max_a \sum_{s'} P(s' | s, a)
[ R(s,a,s') + \gamma V_k(s') ] $$ 39
• Start with an initial value function $V_0(s)$ (e.g., zero for all states).
• At each iteration, update $V(s)$ for all states using the above formula (take the action that
maximizes expected future reward plus immediate reward, using current $V$ estimates for next
states).
• Repeat until values converge (change is below a threshold).
Once converged, you have $V^*(s)$, the optimal state values. You can then derive the optimal policy: for each state, choose the action that achieves the max in the equation above (i.e., $\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$) 40 .
Value iteration is guaranteed to converge to optimal values for discounted finite MDPs. It's effectively
performing a contraction mapping update (Bellman operator).
Policy Iteration is another approach where you iterate between evaluating a given policy (calculate V^π)
and improving the policy (greedify it with respect to current V), until no change.
Example: For a simple grid world like above (if we had the transition probabilities table), value iteration
could compute the optimal values exactly. It's like solving a system of equations via iteration.
Pseudo-code:
# Reuses the deterministic grid world (step, n_states, n_actions, goal_state, gamma) defined above
states = range(n_states)
actions = range(n_actions)
V = np.zeros(n_states)
theta = 1e-6

while True:
    delta = 0
    for s in states:
        if s == goal_state:
            continue
        max_val = float('-inf')
        for a in actions:
            s2, r = step(s, a)   # deterministic transitions, so no sum over s' is needed
            val = r + gamma * V[s2]
            if val > max_val:
                max_val = val
        delta = max(delta, abs(max_val - V[s]))
        V[s] = max_val
    if delta < theta:
        break

# Derive the greedy policy from the converged values
policy = {}
for s in states:
    best_a, best_val = None, float('-inf')
    for a in actions:
        s2, r = step(s, a)
        val = r + gamma * V[s2]
        if val > best_val:
            best_val, best_a = val, a
    policy[s] = best_a
This would yield the same result as Q-learning given the model, but via computation rather than sampling.
When to use Value Iteration: It requires knowing the environment model (transition probabilities P(s'|s,a)
and rewards). It's useful in planning problems or small MDPs where you can enumerate states and actions.
For large state spaces, it's intractable to update every state (that's where function approximation or other
methods come in). Still, understanding value iteration is foundational to understanding how optimal
policies are defined.
Reinforcement learning has achieved fame with examples like:
• Teaching agents to play games (Atari video games, Go – AlphaGo used advanced RL, OpenAI Five for Dota, etc.).
• Robotics control (learning to make a robot walk, or to control a helicopter).
• Recommendation systems and personalized decisions, which can sometimes be framed as RL (learning from sequential interaction with a user).
• Resource management and operations research problems (e.g., RL for job scheduling in computing clusters).
It is a very powerful framework, though it is more complex to tune (many hyperparameters) and often requires a large number of trial-and-error interactions with the environment to get right.
We have given an overview of RL basics and simple algorithms. A full treatment is beyond the scope of this
tutorial (it typically involves deeper study of Markov decision processes, the exploration–exploitation dilemma,
function approximation, etc.), but this provides the foundation to understand how agents can learn by
themselves with minimal guidance.
5. Datasets for Practice
• Iris Dataset: Perhaps the most famous small dataset. 150 samples of iris flowers with 4 features
(sepal length/width, petal length/width) and 3 species. Great for demonstrating classification
algorithms (it's relatively easy: linear models, trees, etc. can get >95% on it), as well as clustering
(unsupervised) and even PCA (4D to 2D visualization).
• Use in practice: Quick tests of classifiers, visual examples (pair plots, etc.). We saw it in logistic
regression and K-NN sections.
• Titanic Dataset: Passenger data from the Titanic disaster (commonly from Kaggle). Features like
age, sex, passenger class, etc., and target is survival (yes/no). Useful for demonstrating data cleaning
(it has missing ages), feature engineering (titles from names, family size), and binary classification (a
slightly imbalanced dataset).
• Use: Teaching logistic regression or decision trees on a mix of numeric and categorical features.
Good for exercises in data preprocessing and model evaluation (e.g., computing precision/recall to
judge models properly, since even a simple rule-based baseline such as predicting survival from sex
alone can reach roughly 78% accuracy).
• Wine Dataset: Chemical composition of 178 wines from 3 cultivars. 13 continuous features. It's a
nice multi-class classification set (as we used with decision tree, SVM). It’s not very large, so good for
algorithms that scale poorly, and can also be used to demonstrate PCA (13D to 2D) or clustering
(maybe the classes correspond to clusters in feature space).
• Use: Classification (e.g., try logistic regression vs. KNN vs. SVM on it), or clustering vs actual labels
comparisons.
• Digits Dataset: 1797 8x8 images of handwritten digits (0 through 9). We saw it in the PCA/t-SNE context.
It's high-dimensional (64 features) but not too large. It’s a classic for classification (a smaller, easier
precursor to MNIST). Many algorithms can achieve >95% accuracy here as well. Also good for clustering
(unsupervised) – do the images cluster by digit without labels? (Often yes, as t-SNE showed.)
• Use: Could train a simple neural network or SVM on it in an exercise, or use it for a model-comparison
exercise (try multiple classifiers, compare accuracy and confusion matrices). Also useful for
demonstrating cross-validation, since there is enough data to hold out a test set or do k-fold.
• Boston Housing Dataset: 506 instances, 13 features (various demographic and housing statistics for
suburbs of Boston in the 1970s); the target is the median home value. We used it to demonstrate regression.
It's a bit outdated (and has some ethical concerns), but still widely referenced. It's good for showing
linear vs. non-linear regression, feature importance (e.g., LASSO selecting features), etc.
• Use: Could be an exercise to perform regression, evaluate with RMSE, maybe try polynomial
regression or tree-based regression and compare.
Other interesting datasets that an instructor might mention for exercises:
• Breast Cancer Wisconsin Dataset: binary classification (benign/malignant) with 30 features. We used it in the classification metrics section. Good for classification examples (especially demonstrating model evaluation on imbalanced data – malignant cases are the minority).
• MNIST (handwritten digits, 28x28 images): too large for inclusion here but a classic for image classification.
• COCO, ImageNet, etc.: for advanced image tasks (likely beyond an 8-hour intro class, but worth mentioning as real-world scale).
• CIFAR-10: image classification of small images in 10 classes.
• 20 Newsgroups: text classification into 20 categories; good for demonstrating text feature extraction (CountVectorizer/TfidfVectorizer) and Naive Bayes or SVM on text.
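For instance, a minimal sketch of text classification on 20 Newsgroups (the two categories and the TfidfVectorizer + MultinomialNB pipeline are just one reasonable setup; fetching the data downloads it on first use):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Two categories keep the example small and fast
cats = ['sci.space', 'rec.sport.hockey']
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

text_clf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
text_clf.fit(train.data, train.target)
print("Test accuracy:", text_clf.score(test.data, test.target))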
Using real datasets grounds the theory:
• It highlights practical issues like missing values (Titanic ages, etc.), categorical variables (embark town on Titanic), and feature scaling needs (some models need scaled data).
• It demonstrates that not all models perform equally on all data (e.g., k-NN might do worse on high-dimensional data than a neural net).
• It provides a playground for exercises, e.g., "Using the Titanic dataset, build a classifier to predict survival and tune it."
Often, the best way to solidify understanding is to try things out on these datasets. So in the exercises
section next, we will leverage some of them.
6. Exercises
Now that we've covered a broad range of topics, it's time for some hands-on practice. Below are a set of
exercises corresponding to major sections. Attempt to solve them before peeking at the answers. The
solutions (in code or explanation) are provided for verification and learning.
6.1 Supervised Learning Exercises
Exercise 6.1.1 (Logistic Regression & Evaluation): Using the Titanic dataset (you can load a preprocessed
version or use sns.load_dataset('titanic') if available), train a logistic regression model to predict
survival. Then:
• Compute the confusion matrix, precision, recall, and F1-score on the test set.
• Interpret the precision and recall in context (e.g., what does a certain precision value mean for predicting survivors?).
Solution Outline: First, load and preprocess the Titanic data (handle missing ages, encode categorical variables).
Fit LogisticRegression, then use confusion_matrix, precision_score, etc., from sklearn.metrics. You might get
something like precision ~0.75 and recall ~0.70 (depending on how you preprocess and set the threshold). That
would mean "out of all people the model predicted as survived, 75% actually survived" and "the
model catches 70% of actual survivors".
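A minimal sketch of one possible solution (the feature subset and median imputation of age are illustrative choices, not the only reasonable preprocessing):

import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

titanic = sns.load_dataset("titanic")
# Simple preprocessing: a few features, median-impute age, one-hot encode sex
df = titanic[["survived", "pclass", "sex", "age", "fare"]].copy()
df["age"] = df["age"].fillna(df["age"].median())
X = pd.get_dummies(df.drop(columns="survived"), columns=["sex"], drop_first=True)
y = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))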
Exercise 6.1.2 (Compare Classifiers): On the Wine dataset, compare at least three classifiers (e.g., Logistic
Regression, Decision Tree, k-NN, SVM) using 5-fold cross-validation. Which model performs best in terms of
accuracy? Does one model significantly outperform others?
Solution Outline: Use cross_val_score for each model on load_wine() . Likely, all will do fairly well
(>90%). SVM or RandomForest might slightly outperform Logistic or k-NN. But differences might not be
huge due to small dataset. It's a chance to show how to use cross-validation for fair comparison.
Example code (with the needed imports):

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

wine = load_wine()
models = [
    ("LogReg", LogisticRegression(max_iter=1000)),
    ("DecisionTree", DecisionTreeClassifier()),
    ("KNN", KNeighborsClassifier(n_neighbors=5)),
    ("SVM", SVC(kernel='linear'))
]
for name, clf in models:
    # 5-fold cross-validation accuracy for each model
    scores = cross_val_score(clf, wine.data, wine.target, cv=5)
    print(name, "accuracy:", scores.mean())
After running, you might see e.g. LogReg ~0.97, Tree ~0.86 (could be lower due to overfitting on small
folds), KNN ~0.93, SVM ~0.98. (Note that k-NN is sensitive to feature scaling, so its score can drop
considerably if the features are not standardized first.) Conclude which is best (SVM in this hypothetical)
and note any trade-offs (e.g., the tree is lower but offers interpretability).
Exercise 6.1.3 (Hyperparameter Tuning): Use the Digits dataset and a RandomForestClassifier. Perform a
grid search to find the best n_estimators (e.g., try 50, 100, 200) and max_depth (None, 5, 10). Use 5-
fold CV on the training set to select hyperparams. What combination gives the highest validation accuracy?
What is that accuracy, and how does the model perform on a held-out test set?
Pseudo-code (a runnable sketch, assuming a held-out test split and a grid over the values given in the exercise):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
gs.fit(X_train, y_train)
print("Best params:", gs.best_params_, "CV accuracy:", gs.best_score_)

best_rf = gs.best_estimator_
print("Test accuracy:", best_rf.score(X_test, y_test))
6.2 Unsupervised Learning Exercises
Exercise 6.2.1 (Clustering Evaluation): The Iris dataset (ignoring labels) can be clustered. Apply K-Means
with k=3 to the iris features. Then compare the clusters to the true species labels:
• Compute the confusion matrix between predicted cluster assignments and true labels. (You may need to permute cluster indices to match labels, since clustering labels have no inherent order.)
• Compute the silhouette score of the clustering.
• Briefly discuss how well K-Means performed in recovering the known classes.
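A minimal solution sketch (n_init and random_state are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, silhouette_score

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)

# Cluster ids are arbitrary, so the columns of this matrix may be permuted
# relative to the true species (rows)
print(confusion_matrix(iris.target, km.labels_))
print("Silhouette score:", silhouette_score(iris.data, km.labels_))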
Exercise 6.2.2 (Visualizing Clusters): Use PCA to reduce the Wine dataset (13 features) to 2 principal
components. Plot the data points in 2D and color by the wine cultivar (target label). Can the three classes be
visually separated by the first two PCs? Then try running a clustering (say Agglomerative with 3 clusters) on
the PCA-reduced data and see if those clusters align with the true labels.
Solution Outline: Compute the PCA projection and plot it. The three wine classes should show some grouping,
though with overlaps. If Agglomerative clustering is run on the 2D data, the adjusted Rand index (ARI) or a
confusion matrix can check alignment with the true labels; it may lump two of the three cultivars together if
they are not clearly separated in 2D. The exercise teaches that PCA can help visualize structure but
might not perfectly separate classes.
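A minimal sketch (standardizing the features before PCA is an added choice here, not required by the exercise):

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

wine = load_wine()
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(wine.data))

# Scatter plot of the first two PCs, colored by the true cultivar
plt.scatter(X2[:, 0], X2[:, 1], c=wine.target, cmap="viridis")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()

clusters = AgglomerativeClustering(n_clusters=3).fit_predict(X2)
print("ARI vs. true labels:", adjusted_rand_score(wine.target, clusters))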
Exercise 6.2.3 (Anomaly Detection with DBSCAN): Generate a dataset of 2D points consisting of two
obvious clusters and some random noise points far away. Use DBSCAN to cluster with appropriate eps so
that it identifies the two clusters and marks noise as outliers. How many noise points did it find? Plot the
result showing noise vs clustered points.
Solution Outline: You could sample two Gaussian blobs plus 5 random outliers placed far away. DBSCAN with a
suitably chosen eps will isolate those 5 points as noise, so ideally the number of noise points is 5 (a few
fringe points from the blobs may also be flagged). The plot will show the outliers with the label -1. This
illustrates DBSCAN’s ability to find noise.
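A sketch of one way to do this (the blob parameters, the hand-placed outliers, and eps=0.8 are illustrative; eps may need tuning for other data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Two well-separated Gaussian blobs plus 5 hand-placed far-away outliers
X, _ = make_blobs(n_samples=200, centers=[(0, 0), (5, 5)], cluster_std=0.6, random_state=0)
outliers = np.array([[12.0, -8.0], [-9.0, 12.0], [15.0, 15.0], [-10.0, -10.0], [10.0, -10.0]])
X_all = np.vstack([X, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_all)
print("Noise points found:", np.sum(labels == -1))  # ideally 5; fringe points may add a few

plt.scatter(X_all[:, 0], X_all[:, 1], c=labels, cmap="viridis")
plt.title("DBSCAN clusters (label -1 = noise)")
plt.show()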
6.3 Reinforcement Learning Exercise
Exercise 6.3.1 (Q-Learning on Grid): Consider a 3x3 grid world. The top-left cell is the start (S) and the
bottom-right cell is the goal (G). The reward is +1 at G and 0 otherwise. Moves: up/down/left/right (no
wrap-around; hitting a wall leaves the state unchanged). Implement a simple Q-learning loop (as pseudo-code
or actual code) to find an optimal policy. Roughly how many episodes does it take until the agent consistently
reaches the goal? Provide the learned policy (sequence of moves from start) and the Q-values for the start state.
Solution Outline: This is similar to the earlier example but on a 3x3 grid. It typically learns in a few hundred
episodes. The optimal policy from S is any shortest path, e.g., "right, right, down, down" (indexing from the
top-left), and Q(start, right) and Q(start, down) become the highest values. One could show either a table or an
arrow diagram. This exercise is more conceptual if coding RL feels too advanced; one can simply reason through
it or write pseudo-code (the answer should highlight the process and result, as in the earlier Q-learning code cell).
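A compact sketch of one possible implementation (the hyperparameters alpha, gamma, eps and the 500-episode budget are illustrative):

import numpy as np

n_rows = n_cols = 3
n_states, n_actions = n_rows * n_cols, 4      # actions: 0=up, 1=down, 2=left, 3=right
goal = n_states - 1                           # bottom-right cell
gamma, alpha, eps = 0.9, 0.5, 0.2

def step(s, a):
    r, c = divmod(s, n_cols)
    if a == 0:   r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, n_rows - 1)
    elif a == 2: c = max(c - 1, 0)
    else:        c = min(c + 1, n_cols - 1)
    s2 = r * n_cols + c
    return s2, (1.0 if s2 == goal else 0.0), s2 == goal

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        s = s2

print("Q-values at start state:", Q[0])
print("Greedy move at start:", ["up", "down", "left", "right"][int(np.argmax(Q[0]))])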
These exercises provide hands-on engagement with key concepts. By working through them, one should
gain a deeper understanding of applying ML algorithms and interpreting their results.
7. Best Practices for ML Projects
• Data Preprocessing & Cleaning: Real-world data is messy. Always examine your data (use
descriptive stats, visualize distributions). Handle missing values (drop, impute, or use models that
handle them). Remove or fix outliers if they are errors. Ensure data types are correct and text is
normalized if needed. This step prevents garbage input to your model (remember: Garbage in,
garbage out).
• Feature Engineering: Good features can make a simple model powerful. This could include:
• Scaling/normalizing features (especially for methods like SVM, k-NN, neural nets).
• Encoding categorical variables properly (one-hot, ordinal encoding if order matters, etc.).
• Creating new features from existing ones (e.g., combining date fields into day of week; text to word
counts).
• Dimensionality reduction or feature selection to reduce noise and redundancy.
• Considering transformations (log, sqrt, etc.) if a feature has skewed distribution.
Keep in mind the context – domain knowledge can guide what features might be relevant. Feature
engineering is often where human insight adds value beyond automated algorithms 41 .
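A minimal sketch of these preprocessing steps combined in a scikit-learn pipeline (the column names here are hypothetical, chosen only to illustrate numeric vs. categorical handling):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric = ["age", "fare"]            # hypothetical column names for illustration
categorical = ["sex", "embarked"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train)  # X_train would be a DataFrame containing the columns above

Wrapping preprocessing and model in one pipeline also prevents leakage, since the imputers and scalers are fit only on the training folds.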
• Train/Test Split (and Cross-Validation): Always set aside a test set that your model doesn't see
during training, to evaluate generalization. Use cross-validation on the training set for model
selection and hyperparameter tuning. This avoids overfitting to the test data (a common mistake is
tuning hyperparameters on the test set – this leaks information and makes the measured test performance
overly optimistic).
• Use stratified splitting for classification to maintain class proportions.
• For time series, use time-based splits (train on past, test on future) instead of random.
• Cross-validation (e.g., 5-fold) gives a more stable estimate by averaging performance over folds 42 .
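A minimal sketch of this workflow (the breast cancer dataset and the logistic-regression pipeline are just placeholders for your own data and model):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# Hold out a test set first; stratify keeps class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation on the training data only, for model selection
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
# Only after all tuning: model.fit(X_train, y_train); model.score(X_test, y_test)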
• Model Selection and Tuning: Start with simple models as baselines (e.g., linear or logistic
regression, decision stump, etc.). Then try more complex ones. Use grid search or random search to
tune hyperparameters systematically. Keep track of what combinations you've tried (it's easy to
forget). Automate what you can (e.g., using GridSearchCV). Also consider ensemble strategies if
multiple models seem to have complementary strengths.
• Sometimes an ensemble of many individually overfit models (as in bagging) can, somewhat
counterintuitively, reduce overfitting through variance reduction – this is exactly what random forests exploit.
• Feature Importance and Interpretability: Whenever possible, examine which features are
important to your model. Models like random forests provide importances, linear models have
coefficients. This can yield insights (and catch issues, like a leaked feature that makes no sense but
has high importance). Interpretability is crucial in fields like healthcare or finance. If using a complex
model, you might use techniques like SHAP values or LIME to explain individual predictions.
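A small sketch of inspecting feature importances (the breast cancer data is a placeholder; impurity-based importances are one convenient but imperfect measure):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Top 5 features by impurity-based importance
top = np.argsort(rf.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {rf.feature_importances_[i]:.3f}")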
• Evaluation Metrics aligned with Objective: We covered many metrics; choose the one that
matches the problem needs. For example, if false negatives are worse than false positives, focus on
recall (sensitivity). If data is imbalanced, accuracy is not informative – use precision/recall or AUC, etc.
In ranking/recommendation, you might use MAP or NDCG. Always think about what the numbers
mean for the application (e.g., an AUC of 0.9 might sound good, but if it's a cancer test, what is the
recall at an acceptable precision?).
• Does it perform reasonably across different segments of data? (Check performance by subgroup –
e.g., does a model do much worse for a certain demographic? That might indicate bias.)
• If possible, have a small human-verified dataset to compare.
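As an illustration of the "recall at an acceptable precision" idea, a minimal sketch using a precision-recall curve (the 95% precision target is arbitrary, and the breast cancer data is a placeholder; the positive class here is label 1):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

prec, rec, thr = precision_recall_curve(y_te, probs)
ok = prec[:-1] >= 0.95   # thresholds reaching at least 95% precision
print("Best recall at >=95% precision:",
      rec[:-1][ok].max() if ok.any() else "not reachable")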
• Iterate and Experiment: The ML workflow is iterative. You might find your model underperforms –
then analyze error cases. That analysis might suggest new features or data collection. Or maybe you
realize some labels are wrong – fix them if you can. It's a loop of improvement.
• Data Splitting for Final Model: After doing all comparisons and choosing the best approach via
cross-validation on train, you typically train that final model on the entire training set (potentially
even include the validation folds now) to maximize data usage, then do a final evaluation on the
untouched test set for reporting results. Only at this final step should you evaluate on test (to avoid
biasing earlier decisions).
• Deployment considerations: (Beyond scope of this class, but noteworthy) If a model will be
deployed:
• Ensure the pipeline can handle missing or unexpected values at prediction time.
• Monitor model performance over time (data drift can degrade a model).
• Consider how to update the model with new data (will you retrain periodically? Does the model
automatically adapt online?).
• Ethical and Bias considerations: Be mindful of biases in data. If the training data reflects historical
biases (e.g., along gender or race), the model may learn them. Fairness in ML is an active area.
Sometimes feature selection (omitting certain attributes) or bias mitigation techniques are needed.
Always think of the societal impact of errors (e.g., false rejection in loan applications affecting certain
groups more).
In summary, a successful machine learning project is not just about choosing a fancy algorithm. It's about a
rigorous process of understanding the problem, preparing data, selecting appropriate models, tuning
them, and critically evaluating them in context 9 43 . By adhering to these best practices, you set yourself
up to build models that are not only accurate but also robust and reliable.
Congratulations on making it through this comprehensive journey of Machine Learning! We covered a lot
of ground, from fundamental concepts to specific algorithms in supervised, unsupervised, and
reinforcement learning, with practical examples and exercises. With these foundations and hands-on
practice, you are well-equipped to tackle real-world machine learning problems, continually refine your
approach, and keep learning more advanced techniques. Good luck, and happy modeling!
1 01-ml-overview.key
https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/01_ml-overview_slides.pdf
16 k-nearest neighbors algorithm - Wikipedia
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
37 38 Q-learning - Wikipedia
https://en.wikipedia.org/wiki/Q-learning