Chapter 1
Questions and Answers: Chapter 1
Alaa Tharwat
Q1: What is the purpose of the discriminant function?
Answer: A discriminant function g(x) assigns a scalar value to each input vector x, which is used to assign it to a
specific class. Its purpose is to separate the (multidimensional) feature space in such a way that:
• Class Separation: The decision is unambiguous due to the sign (for binary classification) or the maximum among
several discriminant functions (for multi-class classification). For example, if g(x) > 0, the input x belongs to class 1;
if g(x) < 0, it belongs to class 2.
• Ranking: The magnitude of g(x) gives an estimate of the certainty (“margin”) of the decision. For instance, if g(x1) = 2
and g(x2) = 0.5, we can infer that the classification of x1 is more certain than that of x2.
• Formulation of the Decision Boundary: The equation g(x) = 0 defines a line or a hypersurface that separates
the classes. For instance, in a 2D feature space, the line defined by g(x1, x2) = 0 separates the feature space into two
regions, each corresponding to one of the classes.
Example: Consider a simple case of classifying flowers based on two features: petal length and petal width. The
discriminant function could be defined as:
g(x) = w1 · petal length + w2 · petal width + w0
where w1 and w2 are weights determined during training, and w0 is a bias term. The line g(x) = 0 would separate the
flowers into two classes: say, “setosa” and “versicolor”.
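As a minimal sketch of the flower example, the following Python snippet evaluates a linear discriminant and reads the class off its sign. The weights `W1`, `W2`, `W0` and the petal measurements are made-up values for illustration, not trained parameters, and the mapping of the positive side to “versicolor” is an arbitrary choice.

```python
# Hypothetical weights; in practice they are learned from training data.
W1, W2, W0 = 1.0, 1.0, -3.0

def g(petal_length, petal_width):
    """Linear discriminant g(x) = w1 * length + w2 * width + w0."""
    return W1 * petal_length + W2 * petal_width + W0

def classify(petal_length, petal_width):
    """Return (label, |g(x)|); the magnitude serves as a rough certainty."""
    score = g(petal_length, petal_width)
    # Points exactly on the boundary (score == 0) are assigned to setosa here.
    label = "versicolor" if score > 0 else "setosa"
    return label, abs(score)

print(classify(1.4, 0.2))  # small petals, negative side of the boundary
print(classify(4.5, 1.5))  # large petals, positive side of the boundary
```

The further a point lies from the line g(x) = 0, the larger the returned magnitude, which matches the “Ranking” property described above.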
Q2: How does the perceptron learning algorithm guarantee convergence, and what does this reveal about
the data?
Answer: The perceptron learning algorithm is guaranteed to converge after a finite number of updates if the data is
linearly separable, meaning there exists a straight line (in two dimensions) or a hyperplane (in higher dimensions) that
can perfectly divide the classes without any misclassification.
When the algorithm encounters a misclassified example, it updates the weights based on the error, moving the decision
boundary closer to the correct position. This process is repeated for each misclassified instance until all instances are
correctly classified, or it reaches a specified number of iterations.
If the data is not linearly separable, such as when the classes are intertwined (e.g., points from Class A and Class B
are mixed together), the perceptron will not converge. It will continue to adjust the weights indefinitely without finding a
stable solution.
Alaa Tharwat
Center for Applied Data Science Gütersloh (CfADS), Hochschule Bielefeld-University of Applied Sciences, 33619 Bielefeld, Germany, e-mail:
[email protected],[email protected]
In summary, if the perceptron learning algorithm converges with every training point classified correctly, the training
data can be cleanly separated by a linear boundary, which reveals important information about the structure of the data.
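The contrast between separable and non-separable data can be illustrated with a small experiment. The sketch below is an assumption-laden toy: it uses the four corners of the unit square, labels them once with AND (linearly separable) and once with XOR (not linearly separable), treats a score of exactly 0 as class −1, and caps training at 100 passes. The function name `train_perceptron` is chosen here for illustration.

```python
def sign(v):
    return 1 if v > 0 else -1

def train_perceptron(points, labels, max_epochs=100):
    """Run the perceptron rule w <- w + y * x; return (weights, converged)."""
    w = [0.0, 0.0, 0.0]  # w[0] is the bias weight, paired with a constant input 1
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            x = (1.0, x1, x2)
            if sign(sum(wi * xi for wi, xi in zip(w, x))) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w, True   # a full pass without errors: the data is separated
    return w, False          # pass limit reached without an error-free pass

corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # linearly separable
xor_labels = [-1, 1, 1, -1]    # not linearly separable

print(train_perceptron(corners, and_labels)[1])  # True
print(train_perceptron(corners, xor_labels)[1])  # False
```

On the XOR labels the weights keep changing until the pass limit stops the loop, which is the “adjust indefinitely” behavior described above.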
Q3: What is the difference between supervised learning and reinforcement learning?
Answer: Supervised learning uses labeled data. Here the model learns a direct input-output mapping with clear feedback.
In reinforcement learning, an agent makes sequential decisions and receives delayed, scalar rewards instead of labeled
outcomes. This makes reinforcement learning more exploratory and more suitable for dynamic environments like games
or robotics, where the agent has to adapt its behavior to maximize the expected long-term (cumulative) reward.
Q4: Why is the hypothesis space needed in the machine learning process, and what role does the learning
algorithm play within it?
Answer: The hypothesis space is the set of all possible functions that a model can consider when approximating the
unknown target function. It defines the range of potential solutions the learning algorithm can explore.
The learning algorithm plays a crucial role within this hypothesis space by navigating through it to identify and select
the hypothesis that best fits the training data. This process involves evaluating how well each hypothesis predicts the
outcomes based on the input features, typically by minimizing a loss function that quantifies the error between predicted
and actual values.
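A toy illustration of this division of labor, under simplifying assumptions: the hypothesis space below is a handful of threshold rules on a single feature, the loss is the fraction of misclassified training examples, and the “learning algorithm” is an exhaustive search. The data values and candidate thresholds are invented.

```python
# Training data: (feature, label) pairs; values are made up for illustration.
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1)]

# Hypothesis space H: threshold rules h_t(x) = 1 if x >= t else 0.
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5]

def training_error(t):
    """Fraction of training examples that h_t misclassifies (0-1 loss)."""
    wrong = sum(1 for x, y in data if (1 if x >= t else 0) != y)
    return wrong / len(data)

# "Learning algorithm": search H for the hypothesis with the lowest training error.
best_t = min(thresholds, key=training_error)
print(best_t, training_error(best_t))
```

Real learning algorithms rarely enumerate H; they search it more cleverly (for example, by gradient descent), but the roles of H and of the algorithm are the same.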
Q5: What is a linear algorithm, and how does it differ from a non-linear algorithm?
Answer: A linear algorithm models the relationship between input features (X) and output (Y ) using a linear equation.
For example, in a simple linear regression, the relationship can be expressed as:
y = wx + b
In higher dimensions, the equation extends to a linear combination of inputs:
y = w1 x1 + w2 x2 + . . . + wn xn + b.
In contrast, a non-linear algorithm models relationships where the output cannot be expressed as a simple linear
combination of inputs. Non-linear algorithms can include operations such as multiplication, squaring, division, or
transformation through non-linear functions like log, exp, or sin.
For example, a non-linear relationship might be expressed as:
y = w0 x0 · w1 x1 + b.
This means the output is influenced by the product of input features, capturing more complex patterns in the data.
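One practical way to see the difference is to check how much the output changes when one input changes while the other is held fixed. The sketch below compares a linear model with a product model; the weight values are arbitrary assumptions.

```python
def linear_model(x1, x2, w1=2.0, w2=3.0, b=1.0):
    return w1 * x1 + w2 * x2 + b

def product_model(x1, x2, w1=2.0, w2=3.0, b=1.0):
    return (w1 * x1) * (w2 * x2) + b

def effect_of_x1(model, x2):
    """Change in output when x1 goes from 0 to 1 with x2 held fixed."""
    return model(1.0, x2) - model(0.0, x2)

for x2 in (1.0, 2.0):
    print(x2, effect_of_x1(linear_model, x2), effect_of_x1(product_model, x2))
```

In the linear model the effect of x1 is the same for every value of x2; in the product model it depends on x2, which is exactly the kind of interaction a linear combination cannot express.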
Q6: In machine learning, what informs the choice between using a linear or non-linear model to predict the
output for a given dataset?
Answer: Several factors guide the choice between using a linear or non-linear model:
• Manual Inspection: If the output (y) increases or decreases at a roughly constant rate as an input (x) changes, a linear
model may be appropriate. A trend whose rate of change varies (for example, exponential growth) is monotonic but not linear.
• Visualization: Plotting the inputs against the output can reveal patterns. A near-straight line in the plot suggests that
a linear model might be suitable, while scattered or curved patterns indicate the need for a non-linear model.
• Start with Linear: It is often advisable to start with a linear model and evaluate its performance using error metrics
like Mean Squared Error (MSE). If the MSE is too high, indicating poor fit, it may be time to consider a non-linear
model.
• High-Dimensional Data: Complex, high-dimensional datasets often exhibit non-linear relationships. In such cases,
non-linear models may be more appropriate to capture the intricate patterns present in the data.
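The “start with linear” advice can be sketched as follows: fit a straight line by least squares and inspect the MSE. The two small datasets are invented for illustration (one generated by a line, one by a parabola), and `fit_line` handles only a single input feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b with one input feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    return w, mean_y - w * mean_x

def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the given data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_ys = [2 * x + 1 for x in xs]   # generated by a straight line
curved_ys = [x ** 2 for x in xs]      # generated by a parabola

for name, ys in (("linear", linear_ys), ("curved", curved_ys)):
    w, b = fit_line(xs, ys)
    print(name, mse(xs, ys, w, b))
```

The line fits the first dataset with essentially zero error, while on the second the best line is flat and leaves a clearly non-zero MSE, a hint that a non-linear model is worth trying.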
Q7: Machine learning uses statistical techniques to learn, but statistical results are probabilistic, operating
in a world of uncertainty and inference rather than absolutes. Does this mean domain experts are always
needed to vet the outputs of machine learning models?
Answer: Yes, machine learning models rely on statistical techniques to identify patterns in data and to generalize using
methods such as optimization and linear algebra. While machine learning can uncover valuable patterns, these patterns
may not always be meaningful or contextually relevant.
Domain experts are essential for several reasons:
• Interpreting Results: Experts provide insights into whether the identified patterns align with domain knowledge and
make sense in context.
• Validating Model Outputs: They assess the accuracy and reliability of the model’s predictions, ensuring that outputs
are valid.
• Ensuring Practical Applicability: In critical fields, such as healthcare or finance, decisions based on model outputs
can have significant consequences. Domain experts help ensure that models are applied correctly and ethically.
Q8: Why is it important that the target function in Machine Learning is unknown?
Answer: If the target function were known, learning would not be necessary. Machine Learning is required precisely
because the true relationship between inputs and outputs must be inferred from data without explicit instructions.
Q9: What happens if there is no underlying pattern in the data provided to a Machine Learning model?
Answer: If no underlying pattern exists, the Machine Learning model cannot learn meaningful relationships, and its
predictions will be no better than random guessing.
Q10: Provide an example of a non-linear relationship in a Machine Learning problem and briefly explain
it.
Answer: An example of a non-linear relationship is image recognition, where the relationship between pixel intensities
and the corresponding object category (such as distinguishing between a cat and a dog) is highly complex and cannot be
represented by a simple straight line or linear function. Neural networks are typically used to capture such non-linear
patterns through multiple layers and non-linear activation functions.
Q11: How does the training of a linear classification model work?
Answer: The training of a linear classification model involves several key steps:
1. Data Collection: First, collect the data points, where X represents the input dimensions and C represents the output
classes.
2. Weight Initialization: Initialize the weights W to small random values or to zeros. Random values are needed to break
symmetry in multi-layer networks; a single linear model can also start from zeros.
3. Training Process:
• Prediction: For each input data point, compute the linear score W^T · X + b (where b is the bias, often initialized
to 1) and predict the class from its sign, e.g., ypred = sign(W^T · X + b).
• Error Calculation: Identify which predicted values are incorrect by comparing them to the actual class labels.
• Weight Update: For each misclassified example, update the weights using the formula:
W ← W + α · (ytrue − ypred ) · X
where α is the learning rate.
• Repeat the prediction and updating steps until either all predictions are correct or a specified number of iterations
is reached.
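The steps above can be sketched as a short training loop. This is a minimal illustration under several assumptions that are not from the course material: class labels are coded as 0 and 1, the prediction is a step function of the linear score, weights and bias start at zero for reproducibility, and the four training points are invented.

```python
def predict(w, b, x):
    """Step activation on the linear score: class 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train(samples, labels, alpha=0.1, max_iters=100):
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(max_iters):
        errors = 0
        for x, y_true in zip(samples, labels):
            err = y_true - predict(w, b, x)   # 0 if correct, +1 or -1 otherwise
            if err != 0:
                # W <- W + alpha * (y_true - y_pred) * X
                w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
                b += alpha * err              # bias behaves like a weight on a constant input 1
                errors += 1
        if errors == 0:                       # every prediction correct: stop early
            break
    return w, b

samples = [(2.0, 1.0), (3.0, 2.0), (-1.0, -1.5), (-2.0, -1.0)]
labels = [1, 1, 0, 0]
w, b = train(samples, labels)
print([predict(w, b, x) for x in samples])
```

The loop stops either when a full pass produces no errors or when `max_iters` passes have been made, mirroring the two stopping conditions listed above.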
Q12: Why is the discriminant function always one dimension less than the whole space?
Answer: Strictly speaking, it is the decision boundary defined by the discriminant function, g(x) = 0, that has one
dimension less than the feature space. The single equation g(x) = 0 imposes one constraint on the n coordinates of x,
which removes one degree of freedom.
In an n-dimensional space, the decision boundary of a linear discriminant is therefore a hyperplane of dimension n − 1. This means:
• In 2D space (where n = 2), the decision boundary is a line (1D).
• In 3D space (where n = 3), the decision boundary is a plane (2D).
The boundary consists of all points at which the classifier is undecided between the classes (for example, points of equal
likelihood for each class), and crossing it changes the predicted class. Therefore, to separate the classes, the boundary
must be one dimension lower than the space of the input features.
Q13: Why is conditional probability important in machine learning?
Answer: We use conditional probability in machine learning because, in most scenarios, our target function is not
a well-defined mathematical function. There is often noise in the data, meaning that the same input can yield different
outputs.
By using conditional probability, we can model the likelihood of an output given a specific input, rather than attempting
to predict a single, precise output. This approach allows us to account for the inherent uncertainty and variability in the
data.
For example, in classification tasks, we may want to determine the probability of different classes given the input
features. This probabilistic framework enables more robust decision-making and helps in managing uncertainty effectively.
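As a small illustration, conditional probabilities can be estimated from noisy data by counting how often each output occurs for a given input. The weather observations below are invented, and relative frequency is only the simplest possible estimator.

```python
from collections import Counter, defaultdict

# Made-up noisy observations: the same input appears with different outputs.
observations = [
    ("cloudy", "rain"), ("cloudy", "rain"), ("cloudy", "dry"),
    ("sunny", "dry"), ("sunny", "dry"), ("sunny", "dry"), ("sunny", "rain"),
]

counts = defaultdict(Counter)
for x, y in observations:
    counts[x][y] += 1

def p_y_given_x(y, x):
    """Empirical estimate of P(y | x) as a relative frequency."""
    total = sum(counts[x].values())
    return counts[x][y] / total

print(p_y_given_x("rain", "cloudy"))
print(p_y_given_x("rain", "sunny"))
```

No deterministic function maps “cloudy” to a single output here, but the estimate of P(rain | cloudy) still captures useful information about the relationship.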
Q14: How does the perceptron learning algorithm change its weight vector when it finds a misclassified point?
Why is the model’s hypothesis improved by this update?
Answer: In the Perceptron Learning Algorithm (PLA), a misclassified training point (xn, yn) indicates that the current
hypothesis h(x) = sign(w^T x) has incorrectly predicted the label of xn.
The PLA updates the weight vector w using the following rule:
w(t + 1) = w(t) + yn xn
For the misclassified point, the term yn xn adjusts the weight vector in the direction that improves the prediction:
• If yn = +1 and the model predicted −1, the algorithm increases the weight vector by adding xn , effectively moving
the decision boundary to better classify this point.
• Conversely, if yn = −1 and the model predicted +1, the algorithm decreases the weight vector by subtracting xn ,
shifting the boundary in the opposite direction.
This update modifies the decision boundary w^T x = 0 so that the score of that specific point moves toward the correct
side. As this process is repeated over iterations, the hypothesis h ∈ H is refined, bringing it closer to the unknown target
function f over time.
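To see why the update helps, one can compare the quantity y_n w^T x_n before and after the update; it is negative (or zero) for a misclassified point and positive for a correctly classified one. The short derivation below uses only the update rule stated above and the fact that y_n is +1 or −1:

```latex
\begin{aligned}
y_n \, w(t+1)^T x_n &= y_n \, \bigl(w(t) + y_n x_n\bigr)^T x_n \\
                    &= y_n \, w(t)^T x_n + y_n^2 \, \lVert x_n \rVert^2 \\
                    &= y_n \, w(t)^T x_n + \lVert x_n \rVert^2 \qquad (y_n^2 = 1)
\end{aligned}
```

Since ‖x_n‖² ≥ 0, each update pushes y_n w^T x_n toward the positive (correct) side. A single update does not necessarily classify x_n correctly, and it may change the classification of other points, which is why the process is repeated.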
Q15: If the target function f is fixed but unknown, and the goal is to approximate it using a hypothesis g
from a predefined set of hypotheses H (where g ∈ H), what criteria should we use to select H to ensure it
does not limit the model’s ability to generalize to unseen data? What should we consider before selecting
the model?
Answer: When training a model, the target function f is unspecified, meaning we attempt to approximate it using a
hypothesis g from a set of possible hypotheses, known as the hypothesis set H.
The selection of H significantly impacts generalization. Here are the key considerations:
• Model Complexity: If H is too simple, the model will struggle to capture the underlying patterns in the data, leading
to underfitting. This occurs when the model fails to learn enough from the training data and performs poorly on both
training and unseen data. Conversely, if H is too complex, the model may learn the training data patterns too closely,
including noise. This results in overfitting, where the model performs well on training data but poorly on new, unseen
data.
• Balance: It is crucial to find a balance in the complexity of H. A model that is too simple may overlook important
patterns, while a complex model may become too tailored to the training data.
• Domain Knowledge: Leverage domain knowledge to inform the selection of H. Understanding the nature of the data
and the relationships between features can guide the choice of an appropriate hypothesis set.
Q16: What are the main components of a learning problem? Give an example and identify each component
within it.
Answer: The main components of a learning problem are:
• Input (X): The features used to describe each instance.
• Output (y): The value we want to predict (e.g., price, label).
• Training Data (D): A dataset containing examples of input-output pairs (x1, y1), . . . , (xn, yn).
• Target Function (f): The unknown function that maps each input x to the correct output y.
• Learning Algorithm: The method used to analyze the training data D and choose a good approximation g for the
target function f .
• Hypothesis Set (H): A set of candidate functions from which the algorithm can choose.
Example: Predicting if an email is spam
• Input (X): Features could include the presence of certain words, the length of the email, etc.
• Output (y): 0 if the email is not spam, 1 if it is spam.
• Training Data (D): A collection of labeled emails, such as (email1, 0), (email2, 1), . . ..
• Target Function ( f ): The true rule that determines whether an email is spam (this is unknown).
• Learning Algorithm: For example, logistic regression, which uses the training data D to find a rule g.
• Hypothesis Set (H): The set of all logistic models that the algorithm could choose from.
Q17: Why is the concept of a discriminant function important in classification problems, and how does it
relate to the hypothesis function in a linear model like the Perceptron?
Answer: In classification problems, a discriminant function is crucial as it defines the boundary that separates different
classes within the input space. This boundary determines the regions where the model predicts distinct output labels.
In a linear model like the Perceptron, the discriminant function is represented by the equation:
w^T x = 0
This equation describes a hyperplane that divides the input space into two half-spaces: one region where the hypothesis
function h(x) = sign(w^T x) = +1 (for example, to approve credit) and another where h(x) = −1 (for example, to deny
credit).
The weights w play a vital role in determining the orientation and position of this boundary.
The discriminant function directly influences the model’s classification behavior. An effective hypothesis function not
only classifies the training data correctly but also positions the boundary in such a way that new, unseen data are more
likely to be classified accurately.
Q18: What is the Cocktail Party Algorithm?
Answer: In the course, we discussed two unsupervised methods: clustering and the cocktail party problem. While I am
familiar with clustering, I initially had not heard of the cocktail party concept, prompting me to do some quick research.
The “Cocktail Party Effect” refers to the brain’s ability to focus on a single speaker while filtering out other voices and
background noise. This phenomenon presents a challenge rather than a specific algorithm.
During my research, I found various interesting solutions, one of which was implemented by Siddharth Sharma and
shared on Medium. He utilized an unsupervised learning method known as Independent Component Analysis (ICA),
which is a computational technique for separating a multivariate signal into its additive subcomponents.
This approach effectively addresses the cocktail party problem by isolating individual voices from overlapping audio
signals.
Q19: Can you elaborate more on Error Functions?
Answer: In Chapter 1, we defined what an error function is and its role in learning. However, I was curious about the
different types of error functions and their specific applications.
To explore this further, I searched through the table of contents of our book, “Learning from Data: A Short Course,”
but did not find additional information. Therefore, I conducted some quick research to gather more details.
Here are some common error functions, also known as loss functions:
1. Mean Square Error (MSE): The mean square error is the average of the squared differences between the target
values y_i (the outputs of the target function f) and the predictions ȳ_i of the current function h:
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_i)^2 \]
MSE is useful when the model needs to be sensitive to outliers, as larger deviations from the target result in significantly
greater penalties.
2. Mean Absolute Error (MAE): This function averages the absolute differences without squaring them:
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \bar{y}_i| \]
Since it treats all errors equally, MAE is preferable when outliers do not need to be penalized severely.
Both of these functions are commonly used in regression tasks, which will be explained in greater detail later in the
book.
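The different sensitivity to outliers can be checked numerically. The sketch below implements both error functions directly from their definitions; the target and prediction values are invented so that one set of predictions is slightly off everywhere and the other is exact except for one large miss.

```python
def mse(targets, predictions):
    """Mean of squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)

def mae(targets, predictions):
    """Mean of absolute differences between targets and predictions."""
    return sum(abs(y - p) for y, p in zip(targets, predictions)) / len(targets)

targets = [1.0, 2.0, 3.0, 4.0]
close = [1.5, 2.5, 3.5, 4.5]      # every prediction off by 0.5
outlier = [1.0, 2.0, 3.0, 8.0]    # three exact, one off by 4

print(mse(targets, close), mae(targets, close))      # 0.25 0.5
print(mse(targets, outlier), mae(targets, outlier))  # 4.0 1.0
```

Going from the first to the second set of predictions doubles the MAE but multiplies the MSE by 16, which is the heavier penalty on large deviations mentioned above.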
Q20: What’s the effect of a non-random sample on the feasibility of learning?
Answer: The goal of learning is to find a hypothesis that can generalize across the entire population. Given that we
only have a sample, can we develop a reliable hypothesis?
We have demonstrated that learning is feasible and can generate a hypothesis with a tolerance ε using Hoeffding’s
Inequality. For example, consider flipping a coin N times and recording the results as follows:
X1 = 1, X2 = 0, X3 = 1, . . .
where 1 represents heads and 0 represents tails.
Let µ = E[X] be the true probability of heads (which is unknown) and ν be the observed proportion of heads:
\[ \nu = \frac{1}{N} \sum_{i=1}^{N} X_i \]
According to Hoeffding’s Inequality:
\[ P(|\nu - \mu| > \varepsilon) \le 2e^{-2\varepsilon^2 N} \]
This inequality states that the probability of the sample estimate being off by more than ε from the true value is at most
\(2e^{-2\varepsilon^2 N}\).
While we have shown that learning is feasible and can produce a hypothesis with a certain tolerance, this holds true only
under the assumption that our sample is completely random, meaning it is free of bias. Although theoretically possible,
achieving a truly random sample is very difficult in practice.
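The coin example can be simulated to compare the bound with what actually happens under truly random sampling. The sketch below assumes a fair coin (µ = 0.5), N = 100 flips per sample, ε = 0.1, and 2000 repeated samples; all of these numbers, and the fixed seed, are arbitrary choices for illustration.

```python
import math
import random

def hoeffding_bound(eps, n):
    """Right-hand side of Hoeffding's Inequality: 2 * exp(-2 * eps^2 * n)."""
    return 2 * math.exp(-2 * eps ** 2 * n)

def violation_rate(mu, eps, n, trials, rng):
    """Fraction of simulated samples whose heads proportion misses mu by more than eps."""
    bad = 0
    for _ in range(trials):
        nu = sum(1 for _ in range(n) if rng.random() < mu) / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials

rng = random.Random(0)          # fixed seed so the run is repeatable
bound = hoeffding_bound(0.1, 100)
rate = violation_rate(mu=0.5, eps=0.1, n=100, trials=2000, rng=rng)
print(f"bound={bound:.3f}, observed rate={rate:.3f}")
```

The observed rate should come out well below the bound, since Hoeffding’s Inequality is a worst-case guarantee. The simulation draws every flip independently with the same probability; a biased sampling procedure would break exactly this assumption.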
Wiem Souai discussed in her article “Impacts of Sampling Bias on Model Performance” on Medium how various types
of sampling bias can hinder a model’s generalization. Here are some common types of sampling bias:
1. Selection Bias: Occurs when data is not collected equally from the entire population, such as gathering survey
responses only from a city while ignoring suburbs.
2. Non-Response Bias: Arises when individuals refuse to answer a poll or survey.
3. Data Source Bias: Results from collecting data from a specific platform or medium.
Collecting a random sample is a complex process. The more biased a sample is, the harder it becomes to derive a
hypothesis that generalizes well across the entire population.
Q21: I would like to ask how we can analyze a large number of features when working in a company that
manufactures windows?
Answer: There are many aspects to consider, and I am interested in identifying the main features that are important for
our machine learning model to predict the production timeline for specific types of windows.
In thinking about this, I considered that we might compare all the data to see which features vary significantly and
which remain relatively constant. By doing so, we could identify and select the top five characteristics that have the most
impact on our predictions.
Techniques like Principal Component Analysis (PCA) could help in reducing dimensionality. Note that PCA does not pick
out original features; it builds new components (linear combinations of the features) that capture most of the variance.
This allows us to focus on the most informative directions in the data while minimizing the influence of less important
variables.
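The idea of comparing how much each feature varies can be sketched as a simple variance ranking. Everything in the example is hypothetical: the feature names, the values, and the choice to min-max scale each feature before computing its variance so that measurement units do not dominate. It is a rough first filter, not PCA, and high variance alone does not prove that a feature predicts production time.

```python
from statistics import pvariance

# Hypothetical production records; feature names and values are invented.
records = {
    "width_cm":      [80, 120, 100, 150, 90],
    "height_cm":     [100, 140, 120, 160, 110],
    "num_panes":     [2, 2, 2, 2, 2],          # constant: carries no information
    "coating_steps": [1, 3, 2, 3, 1],
}

def scaled_variance(values):
    """Population variance after min-max scaling to [0, 1]; 0.0 for constant features."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    return pvariance([(v - lo) / (hi - lo) for v in values])

def top_features(data, k):
    """Names of the k features with the largest scaled variance."""
    return sorted(data, key=lambda name: scaled_variance(data[name]), reverse=True)[:k]

print(top_features(records, k=2))
```

A constant feature such as `num_panes` always ends up at the bottom of the ranking, which matches the intuition of discarding characteristics that “remain relatively constant”.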
Q22: Why does generalization remain uncertain even if a hypothesis perfectly fits the training data?
Answer: Generalization remains uncertain because multiple hypotheses may perform equally well on the training set
but can differ significantly when applied to unseen data. A perfect fit to the training data does not guarantee accurate
predictions on new examples, particularly when the true target function is unknown.
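A tiny constructed example makes this concrete. The two hypotheses below, `h1` and `h2`, are invented for illustration: both reproduce the three training points exactly, yet they disagree strongly on an input outside the training set.

```python
train_x = [1.0, 2.0, 3.0]
train_y = [2.0, 4.0, 6.0]

def h1(x):
    return 2 * x

def h2(x):
    # Equals h1 at x = 1, 2, 3 because the extra term vanishes there.
    return 2 * x + (x - 1) * (x - 2) * (x - 3)

def train_error(h):
    """Mean squared error of hypothesis h on the training set."""
    return sum((h(x) - y) ** 2 for x, y in zip(train_x, train_y)) / len(train_x)

print(train_error(h1), train_error(h2))   # 0.0 0.0
print(h1(5.0), h2(5.0))                   # 10.0 34.0
```

The training data alone cannot tell these two hypotheses apart, so a perfect fit by itself says nothing about which one is closer to the target function.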
Q23: How does probability theory reconcile the contradiction between learning from finite data and the
unknown target function?
Answer: Deterministic reasoning suggests that we cannot learn anything about unseen data without knowing the
complete target function. However, probability theory provides a workaround by allowing us to estimate the likelihood
that our model will generalize based on the training data. This probabilistic approach enables learning even in the presence
of incomplete information.
Q24: Why do we distinguish between in-sample error and out-of-sample error in learning, and what is the
significance of their difference?
Answer: In-sample error (E_in) measures how well a hypothesis fits the training data, whereas out-of-sample error (E_out)
reflects its performance on unseen data, which is ultimately what matters. The difference between these two errors, known
as the generalization gap, indicates the reliability of the learned model. A small gap suggests good generalization, while a
large gap is a warning sign of overfitting or poor learning.
Q25: What does Hoeffding’s Inequality say about the relationship between training error and true error?
Answer: Hoeffding’s Inequality provides a probabilistic bound on the difference between in-sample error (E_in) and
out-of-sample error (E_out). It indicates that with a sufficiently large sample size, E_in is likely to be close to E_out with high
probability. This gives us statistical confidence that learning from finite data can generalize effectively.
Q26: Why is the hypothesis set size (|H|) important in determining the feasibility of learning?
Answer: The size of the hypothesis set (|H|) is important because a larger number of hypotheses increases the
likelihood that at least one will fit the training data well purely by chance, rather than by accurately capturing the true
underlying pattern, leading to overfitting. In this scenario, Hoeffding’s bound must be adjusted using the union bound,
which multiplies it by |H|.
As the number of hypotheses increases, the bound on the generalization gap becomes looser, unless we compensate
by increasing the number of training examples N. Therefore, keeping the hypothesis set small, or employing
techniques like regularization, is essential for making learning feasible.
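For a finite hypothesis set, the union bound turns Hoeffding’s Inequality into P(|E_in − E_out| > ε) ≤ 2|H|e^{−2ε²N}. The sketch below evaluates this expression and solves it for the number of examples needed to keep the bound under a chosen δ. The function names and the values ε = 0.05 and δ = 0.05 are illustrative assumptions.

```python
import math

def union_hoeffding_bound(num_hypotheses, eps, n):
    """Union-bound form of Hoeffding: 2 * |H| * exp(-2 * eps^2 * N)."""
    return 2 * num_hypotheses * math.exp(-2 * eps ** 2 * n)

def samples_needed(num_hypotheses, eps, delta):
    """Smallest N for which the bound above is at most delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

for m in (1, 100, 10000):
    print(m, samples_needed(m, eps=0.05, delta=0.05))
```

The required N grows only logarithmically with |H|, so a larger hypothesis set can be compensated for with more data, but the bound becomes meaningless for an infinite set, which is why more refined measures of complexity are needed there.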
Q27: Why do we need a discriminant function and what are the requirements to use it?
Answer:
• The discriminant function divides the input space into regions, one for each class (two regions in the binary case).
• It is typically used in supervised learning, so it requires labeled data.
• In its simplest form it applies to binary classification (two classes); multi-class problems use one discriminant
function per class.
• For a linear model, the decision boundary is defined by w^T x = 0, i.e., the inner product of the weight vector w and
the input vector x equals zero.
Q28: What is the main difference between supervised learning and unsupervised learning in machine learning?
Answer: In supervised learning, the model is trained on a dataset that contains correct outputs, allowing it to learn
the input-output relationship. This enables the model to make accurate predictions for new data based on the patterns it
learned from previously recorded outputs.
In contrast, unsupervised learning uses unlabeled data, allowing the model to discover structures and patterns within the
data. The goal is for the model to understand the underlying structure in a chaotic environment and to generate meaningful
insights or solutions.
Q29: How does the reward system in Reinforcement Learning affect the training process of the model?
What is the role of the reward system in the learning process of the model?
Answer: The reward system is a fundamental feature that distinguishes Reinforcement Learning (RL) from supervised
and unsupervised learning. It aims to reinforce the interaction between the model and the environment, optimizing the
selection of correct actions.
For this reason, it is crucial that the reward structure is designed effectively to ensure that the model can be trained
quickly and efficiently. If the reward system is set incorrectly, the learning process may slow down, and the model may
repeat incorrect behaviors, leading it to focus on ineffective strategies.
Q30: What are the types of learning in machine learning, and what are their major differences?
Answer: Supervised Learning: The model is trained on a “labelled dataset”, in which every example has both an input
and a known output. Supervised learning algorithms learn to map inputs to the correct outputs. Both the training
and the validation datasets are labelled.
• Data: Labelled (input-output pairs)
• Goal: Learn a mapping from inputs to outputs
• Examples: Spam detection, image classification, stock price prediction
• Common Algorithms: Linear Regression, Decision Trees, Support Vector Machines
Unsupervised Learning: It draws inferences from unlabelled datasets, facilitating exploratory data analysis and
enabling pattern recognition and predictive modelling. For example, clustering algorithms categorise data points
according to value similarity.
• Data: Unlabelled
• Goal: Discover hidden patterns or intrinsic structures in data
• Examples: Customer segmentation, anomaly detection, topic modeling
• Common Algorithms: K-Means Clustering, Hierarchical Clustering, Principal Component Analysis
Reinforcement Learning: The model (agent) learns by interacting with an environment and receiving feedback in the
form of rewards or punishments. Unlike supervised learning, it does not receive explicit correct answers but instead learns
through trial and error to maximize cumulative reward.
• Data: Feedback from actions (rewards or penalties)
• Goal: Learn optimal actions through trial and error to maximize cumulative reward
• Examples: Game playing (e.g., AlphaGo), robotic control, recommendation systems
• Common Algorithms: Q-Learning, Deep Q-Networks, Policy Gradient Methods
Type          | Data Requirement     | Output Type       | Common Use Cases
--------------|----------------------|-------------------|------------------------------------
Supervised    | Labeled              | Prediction        | Classification, regression tasks
Unsupervised  | Unlabeled            | Pattern discovery | Clustering, anomaly detection
Reinforcement | Environment feedback | Action policies   | Robotics, games, sequential control
Table 1.1: Comparison of Machine Learning Types
Q31: In the Perceptron Learning Algorithm, why do we multiply our error by x?
I understand the logic behind the update rule:
wi = wi−1 + learning rate · (ytrue − ypred )
If our model’s prediction is higher than y, the result of (ytrue − ypred ) will be negative, leading to a decrease in wi
and less activation in the next iteration. If the prediction is lower, the weight will be increased.
But why then do we still need the x in
wi = wi−1 + learning rate · (ytrue − ypred ) · x?
Or as noted in the slides, under the assumption that this is only applied to misclassified samples:
w(t + 1) ← w(t) + xn yn
Answer:
When we multiply the error by x, we’re essentially saying “adjust each weight in proportion to how strongly its
corresponding input feature was activated.” This makes intuitive sense because:
• If a feature xi is large (e.g., xi = 5), it had a strong influence on the incorrect prediction, so its weight should be
adjusted more significantly.
• If a feature xi is small or zero (e.g., xi = 0), it had little or no influence on the prediction, so its weight shouldn’t
change much or at all.
Example:
Imagine a simplified case with just two features x1 and x2 , and we misclassify a point:
• If the point has features [x1 = 5, x2 = 0.1]
• The error (ytrue − ypred ) = 1
• Then our weight updates would be:
– w1 would be updated by 5 × 1 = 5 (a large adjustment)
– w2 would be updated by only 0.1 × 1 = 0.1 (a small adjustment)
This makes sense because feature x1 had a much stronger presence in this example, so its corresponding weight deserves
a larger correction.
What Would Happen Without Multiplying by x?
If we didn’t multiply by x and just used
w(t + 1) ← w(t) + learning rate · (ytrue − ypred )
we would:
• Adjust all weights equally regardless of the corresponding feature values.
• Ignore the geometric relationship between the input features and the decision boundary.
• Likely take much longer to converge or possibly never converge to a solution.
The multiplication by x ensures that the learning algorithm is both mathematically sound and intuitively reasonable in
how it adjusts the decision boundary.
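The geometric point can be checked with a few lines of code. The sketch below compares how the score w · x of a misclassified point changes under the two update variants; the helper names, the learning rate of 1, and the second test point with negative features are assumptions added for illustration.

```python
def score_change(x, delta_w):
    """How much the score w.x changes when w is shifted by delta_w."""
    return sum(d * xi for d, xi in zip(delta_w, x))

def update_with_x(x, error, rate=1.0):
    """Weight change of the usual rule: rate * error * x."""
    return [rate * error * xi for xi in x]

def update_without_x(x, error, rate=1.0):
    """Weight change if x is dropped: the same amount for every weight."""
    return [rate * error for _ in x]

error = 1  # y_true - y_pred > 0, so the score of the point should go up
for x in ([5.0, 0.1], [-5.0, -0.1]):
    print(x,
          score_change(x, update_with_x(x, error)),
          score_change(x, update_without_x(x, error)))
```

With the factor x, the score of the misclassified point always moves in the direction of the error, because the change equals the error times the squared length of x. Without it, the direction depends on the signs of the features: for the point with negative features the score moves the wrong way.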