Advanced AI Q&A

1. Learning from Observations


❖ MCQ Questions

1) What is Inductive Learning?


A) Learning with expert guidance
B) Learning from input-output pairs
C) Learning by imitation
D) Learning by trial and error

2) What does a hypothesis in supervised learning aim to do?


A) Memorize data
B) Predict the unknown function
C) Eliminate errors
D) Reduce noise

3) What is the primary goal of inductive learning?


A) To construct a function that is complex
B) To approximate the unknown target function
C) To collect more training data
D) To simplify the training data

4) Which principle suggests preferring simpler hypotheses in inductive learning?


A) Law of Large Numbers
B) Ockham's Razor
C) Central Limit Theorem
D) Law of Averages

5) What does it mean for a hypothesis to generalize well?


A) It fits all training data perfectly
B) It accurately predicts unseen data
C) It requires no validation
D) It adapts to new inputs using memorization
6) What type of learning utilizes labeled data?
A) Unsupervised Learning
B) Reinforcement Learning
C) Supervised Learning
D) Federated Learning

7) What is a training set in inductive learning?


A) Random examples without labels
B) A collection of input-output pairs used to create the hypothesis
C) The final output of the algorithm
D) A list of irrelevant examples

8) What does the hypothesis space consist of?


A) The best-known algorithms
B) All possible hypotheses the model can choose from
C) The dataset used for testing
D) Random data points

9) What is entropy used for in decision trees?


A) To determine the complexity of the tree
B) To measure uncertainty in the dataset
C) To assess training time
D) To select predictive algorithms

10) In a decision tree, what do leaf nodes represent?


A) Decision points
B) The dataset
C) Final outcomes or predictions
D) Data processing stages

11) What is overfitting?


A) Wrong predictions
B) Learning noise from training data
C) Achieving high accuracy
D) Generalizing well
12) Which method helps prevent overfitting?
A) Increasing training data size
B) Pruning decision trees
C) Reducing model complexity
D) Using more features

13) What does reinforcement learning depend on?


A) Supervised training
B) Punishment and reward concepts
C) Data grouping
D) Constant feedback

14) What is cross-validation used for?


A) Generating more training data
B) Reducing overfitting
C) Enhancing decision tree branches
D) Specifying training examples

15) What signifies a consistent hypothesis?


A) High complexity
B) Agreement with training data examples
C) Infinite data points
D) Lack of validation

16) Which algorithm is suitable for predicting numerical values?


A) Classification tree
B) Regression tree
C) Decision tree
D) Clustering algorithm

17) What is the goal of the PAC model in learning theory?


A) To maximize data points
B) To provide mathematical guarantees on model accuracy
C) To simplify prediction processes
D) To enhance computational complexity
18) What does 'generalization' mean in the context of machine learning?
A) Predicting class labels accurately
B) Adjusting the model for older data
C) The ability to perform well on new, unseen examples
D) Focusing on the training set only

19) What does the term "stationarity assumption" refer to?


A) Data not changing over time
B) Random sampling of data
C) All data being irrelevant
D) Data varying continuously

20) What is a common method to choose the best attribute in decision trees?
A) Random selection
B) Information Gain
C) Memory allocation
D) Speed of computation

❖ Written Questions

1) What is the process of inductive learning?


Inductive learning involves learning a function that maps inputs to outputs based on given
examples (input-output pairs), with the goal of approximating an unknown target function

2) How does generalization impact machine learning models?


Generalization is critical as it reflects how well a machine learning model performs on
unseen data; a model that generalizes well can make accurate predictions beyond the training
dataset

3) Explain the concept of 'Ockham's Razor' in the context of learning models.


Ockham's Razor is a principle favoring simpler models or hypotheses over more complex
ones, emphasizing the idea that the simplest hypothesis consistent with the data is often the
best choice

4) What are the main types of learning paradigms discussed in machine learning?
The main types include supervised learning, unsupervised learning, and reinforcement
learning, each employing different mechanisms for learning from data
5) What is overfitting, and how can it be mitigated in decision trees?
Overfitting occurs when a model learns irrelevant patterns from the training data, leading to
poor performance on unseen data. It can be mitigated through techniques such as pruning
and cross-validation

6) What does entropy measure in decision trees?


In decision trees, entropy measures the level of uncertainty or disorder in the dataset, helping
to determine how best to split the data at each decision node
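For illustration, here is a minimal Python sketch of entropy and the information gain of a split (the fish labels and the perfectly separating split are invented for this example):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

pure = ["salmon"] * 8
mixed = ["salmon"] * 4 + ["sea bass"] * 4
print(entropy(pure))    # 0 bits: no uncertainty in a pure node
print(entropy(mixed))   # 1 bit: maximal uncertainty for two classes
# A split that separates the classes perfectly removes all uncertainty:
print(information_gain(mixed, [["salmon"] * 4, ["sea bass"] * 4]))  # 1.0
```

The attribute whose split yields the largest information gain is the one chosen at each decision node.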

7) Define 'hypothesis space' in machine learning.


The hypothesis space is the set of all possible hypotheses or functions that a machine
learning algorithm can consider when attempting to model the relationship between inputs
and outputs

8) Detail the significance of reinforcement learning with an example.


Reinforcement learning enhances the performance of models through rewards and
punishments based on their actions. For instance, in game AI, players receive rewards for
winning and penalties for losing to improve their decision-making over time

9) What is the role of a training set in inductive learning?


A training set provides the necessary input-output pairs that the learning algorithm uses to
establish a hypothesis that approximates the unknown function

10) Explain how cross-validation works and its importance.


Cross-validation involves splitting the dataset into multiple subsets to train and validate a
model iteratively. It is important for assessing the model's performance and ensuring it
generalizes well to unseen data
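A minimal sketch of k-fold splitting in Python (the helper name and toy data are invented for this example; in practice a library routine would typically be used):

```python
import random

def k_fold_splits(data, k=5, seed=0):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    data = data[:]                           # copy so shuffling is local
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

# Each example is held out exactly once across the k iterations.
for train, validation in k_fold_splits(list(range(20)), k=5):
    print(len(train), len(validation))       # 16 4, five times
```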
2. Bayesian Decision Theory
❖ MCQ Questions

1) What does p(x|wj) represent in Bayesian Decision Theory?


A) Posterior probability
B) Joint probability
C) Class-conditional probability
D) Evidence

2) What does Bayesian Decision Theory primarily quantify?


A) Neural network performance
B) Tradeoffs between decisions using probabilities and costs
C) Image recognition accuracy
D) Reinforcement learning rewards

3) The Bayes decision rule chooses the class with:


A) The smallest prior probability
B) The highest loss
C) The largest posterior probability
D) The lowest likelihood

4) What is the goal of minimum-risk classification?


A) Maximize error
B) Minimize computational complexity
C) Minimize the expected loss
D) Maximize posterior probability

5) In two-category classification, a decision is made based on:


A) Mean of feature values
B) Prior only
C) Likelihood ratio compared to a threshold
D) Variance of data

6) Which distribution has the maximum entropy for given mean and variance?
A) Binomial
B) Gaussian
C) Poisson
D) Uniform
7) What does the zero-one loss function assume?
A) All decisions are correct
B) Errors are unequally penalized
C) All errors are equally costly
D) Only one state exists

8) What does the ROC curve show?


A) Tradeoff between false positive and true positive rates
B) Classification boundaries
C) Training error over epochs
D) Posterior distribution over time

9) What is the Bayes error?


A) Maximum classification error
B) Error with maximum loss
C) Minimum possible error for any classifier
D) Total misclassified data points

10) The Mahalanobis distance is used in multivariate Gaussian models to measure:


A) Euclidean similarity
B) Squared error loss
C) Distance normalized by variance and correlation
D) Risk difference between decisions

❖ Written Questions

1) How does Bayes’ theorem form the foundation of Bayesian classification? Derive the
formula and explain each component.

Bayes’ theorem is the core of Bayesian classification. It provides a systematic way to update
the probability estimate for a hypothesis (class) based on observed data (features).

Bayes' Theorem Formula:

P(wj | x) = p(x | wj) P(wj) / p(x), where p(x) = Σj p(x | wj) P(wj)

Where:
• P(wj∣x): Posterior probability of class wj given observation x
• p(x∣wj): Likelihood – probability of observing x given class wj
• P(wj): Prior probability of class wj
• p(x): Evidence – total probability of observing x across all classes

Explanation:
• The theorem transforms our prior belief P(wj) into a posterior belief P(wj∣x) after seeing
evidence x.
• The evidence term ensures normalization so that the posterior probabilities sum to 1.
This theorem enables informed classification by combining prior expectations and observed
data in a probabilistic framework.
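A quick numeric sketch of the theorem in Python (the two-class likelihoods and priors are invented numbers for illustration):

```python
def posterior(likelihoods, priors):
    """Apply Bayes' theorem: P(wj|x) = p(x|wj) P(wj) / p(x).

    likelihoods[j] is p(x|wj) for the observed x; priors[j] is P(wj).
    """
    evidence = sum(l * p for l, p in zip(likelihoods, priors))   # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Illustrative two-class fish example (sea bass vs. salmon):
post = posterior(likelihoods=[0.6, 0.2], priors=[0.3, 0.7])
print(post, sum(post))   # [0.5625, 0.4375]; posteriors sum to 1
```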

2) What is class-conditional probability, and how does it differ from prior and posterior
probabilities? Provide mathematical representation and interpretation.

Class-conditional probability p(x|wj) is the likelihood of observing feature x given that the
actual class is wj.

Mathematically: p(x∣wj) = probability density of feature x if the class is wj

Contrast with:
• Prior P(wj): How probable the class is without any feature observation.
• Posterior P(wj∣x): How probable the class is after observing x.

Interpretation:
It reflects the distribution of data within each class. In fish classification, p(x∣w1) might
represent the distribution of lightness values for sea bass.

Use:
It’s crucial in computing the posterior via Bayes’ theorem and in determining decision
boundaries.

3) Compare and contrast minimum-risk classification and minimum-error-rate
classification. Include examples where each is appropriate.

• Minimum-error-rate classification assumes all errors are equally bad (zero-one loss),
and chooses the class with the maximum posterior probability.
• Minimum-risk classification incorporates a loss function to weight errors differently
and chooses the action with the lowest expected loss.
Example:
• Spam filtering with equal costs = minimum-error-rate.
• Medical diagnosis (e.g., false negatives being more costly) = minimum-risk
classification.

4) What is conditional risk in Bayesian classification, and how does it influence the
decision rule?

Conditional risk R(αi∣x) is the expected loss when taking action αi after observing feature
vector x. It is defined as:

R(αi | x) = Σj λ(αi | wj) P(wj | x)

where λ(αi | wj) is the loss incurred for taking action αi when the true class is wj.

It influences decision-making by guiding us to choose the action (e.g., class) with the
minimum conditional risk.

This ensures that decisions are not just based on which class is most likely, but also on how
costly wrong decisions are.
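A small Python sketch of this computation (the loss matrix and posteriors are invented numbers for a hypothetical medical example):

```python
def conditional_risk(loss, posteriors):
    """R(a_i|x) = sum_j loss[i][j] * P(w_j|x) for each action a_i."""
    return [sum(l * p for l, p in zip(row, posteriors)) for row in loss]

# Hypothetical example: action 0 = "treat", action 1 = "don't treat";
# a false negative (not treating a sick patient) is penalized most heavily.
loss = [[0, 1],     # treat:       no cost if sick, small cost if healthy
        [10, 0]]    # don't treat: large cost if sick, no cost if healthy
posteriors = [0.3, 0.7]   # P(sick|x), P(healthy|x)

risks = conditional_risk(loss, posteriors)
best = min(range(len(risks)), key=risks.__getitem__)
print(risks, "-> choose action", best)   # [0.7, 3.0] -> choose action 0
```

Note that the risk-minimizing action here is to treat even though "healthy" is the more probable state, precisely because the errors are costed unequally.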

5) Discuss how the ROC curve helps evaluate the performance of a binary Bayesian
classifier.

The ROC (Receiver Operating Characteristic) curve is a plot that shows the trade-off
between:
• True Positive Rate (TPR) — sensitivity or recall
• False Positive Rate (FPR) — 1 – specificity

By adjusting the decision threshold in a Bayesian classifier (e.g., the posterior probability
cutoff), different points on the ROC curve can be generated.

Benefits:
• Useful when dealing with imbalanced datasets
• Highlights the trade-offs between catching positives and avoiding false alarms
• The Area Under the Curve (AUC) provides a single metric for overall performance:
▪ AUC = 1: perfect classifier
▪ AUC = 0.5: random guessing
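A minimal sketch of generating ROC points by sweeping the decision threshold (the scores and labels are invented for illustration):

```python
def roc_points(scores, labels, thresholds):
    """Sweep a decision threshold over classifier scores and return
    (FPR, TPR) points; scores at or above the threshold are called positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(s >= t and y for s, y in zip(scores, labels))
        fp = sum(s >= t and not y for s, y in zip(scores, labels))
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical posterior probabilities from a Bayesian classifier:
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels, thresholds=[0.0, 0.5, 0.7, 1.0]))
# [(1.0, 1.0), (0.333..., 1.0), (0.0, 0.666...), (0.0, 0.0)]
```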
6) Define Bayes risk and Bayes error. Why are these considered benchmarks for
classification performance?

• Bayes risk is the minimum possible risk (expected loss) achievable by any classifier
under known distributions.
• Bayes error is the lowest achievable probability of misclassification, found by
applying the Bayes decision rule with zero-one loss.

These benchmarks define the theoretical limit of classification performance, and all real-world classifiers aim to approximate or approach them.

7) Compare linear and quadratic discriminant analysis (LDA vs QDA) from a Bayesian
perspective. When is each appropriate?

Both LDA and QDA are derived from Bayesian classification using Gaussian likelihoods.

Linear Discriminant Analysis (LDA):


• Assumes all classes share the same covariance matrix.
• Discriminant functions are linear in x.
• Decision boundaries are hyperplanes.

Quadratic Discriminant Analysis (QDA):


• Allows each class to have its own covariance matrix.
• Discriminant functions are quadratic in x.
• Decision boundaries are curved (hyperquadrics).

When to Use:
• LDA: When the assumption of equal spread is valid; less data needed.
• QDA: When classes have significantly different variances or orientations.

Trade-off:
• LDA is simpler and requires fewer parameters.
• QDA is more flexible but prone to overfitting with limited data.
8) How does the choice of covariance matrix affect decision boundaries in Gaussian
classifiers?

The form of the covariance matrix Σ in a Gaussian classifier directly affects the shape of
the decision boundaries:

• Case 1: Σi = σ²I
▪ Covariances are equal and isotropic (spherical).
▪ Boundaries are linear and perpendicular to the line between the means.

• Case 2: Σi = Σ
▪ Covariances are equal but not isotropic.
▪ Boundaries are still linear; the hyperplane passes through x0 but is not necessarily
orthogonal to the line between the means.

• Case 3: Σi = arbitrary
▪ Each class has its own covariance matrix.
▪ Quadratic boundaries – can curve and adapt to complex data.
▪ Decision boundaries are hyperquadrics.

This choice influences flexibility, computational complexity, and the risk of overfitting.
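A short Python/NumPy sketch of the Gaussian discriminant function underlying these cases (means, covariances, and priors are invented; constant terms shared by all classes are dropped):

```python
import numpy as np

def gaussian_discriminant(x, mean, cov, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) - 0.5 ln|Sigma| + ln P(w_i).
    With arbitrary per-class covariances (Case 3) the boundaries are
    quadratic; sharing one Sigma across classes (Case 2) makes them linear."""
    d = x - mean
    maha = d @ np.linalg.inv(cov) @ d      # squared Mahalanobis distance
    return -0.5 * maha - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

# Illustrative two-class setup with different covariances (Case 3):
x = np.array([1.0, 1.0])
g1 = gaussian_discriminant(x, np.array([0.0, 0.0]), np.eye(2), prior=0.5)
g2 = gaussian_discriminant(x, np.array([2.0, 2.0]), 2 * np.eye(2), prior=0.5)
print("assign to class", 1 if g1 > g2 else 2)
```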
3. Introduction to Pattern Recognition
❖ MCQ Questions

1) What is a pattern in the context of pattern recognition?


A) A mathematical equation
B) An entity that can be given a name, such as a fingerprint image or a handwritten word
C) A type of computer algorithm
D) A physical object like a chair or table

2) Which of the following is NOT an application of pattern recognition?


A) Optical character recognition
B) Weather forecasting
C) Junk mail filtering
D) DNA sequence analysis

3) What is the primary goal of pattern recognition?


A) To create new patterns
B) To observe the environment, learn to distinguish patterns, and make decisions about their
categories
C) To replace human perception entirely
D) To develop new hardware for computers

4) In the fish sorting example, which feature was initially considered but found to be
unreliable for distinguishing sea bass from salmon?
A) Lightness
B) Width
C) Length
D) Tail shape

5) What is a key challenge when adding more features to a pattern recognition system?
A) Increased computational cost
B) The curse of dimensionality
C) Noise in measurements
D) All of the above
6) Which of the following is a step in the pattern recognition system pipeline?
A) Data acquisition
B) Feature extraction
C) Classification
D) All of the above

7) What is the purpose of post-processing in a pattern recognition system?


A) To remove noise from data
B) To evaluate confidence in decisions and exploit context
C) To collect training data
D) To select features

8) Which type of learning involves a teacher providing category labels for training data?
A) Unsupervised learning
B) Reinforcement learning
C) Supervised learning
D) None of the above

❖ Written Questions

1) Explain the concept of pattern recognition and provide two real-world applications.
Pattern recognition is the study of how machines can observe the environment, learn to
distinguish patterns of interest, and make decisions about their categories. A pattern is any
entity that can be named, such as a fingerprint, handwritten word, or speech signal.

Applications:
• Optical Character Recognition (OCR): Converting handwritten or printed text into
machine-readable form.
• Biometric Recognition: Identifying individuals using fingerprints, iris scans, or facial
features.

2) What is the role of feature extraction in a pattern recognition system? Why is it
important to select discriminative and invariant features?
Feature extraction transforms raw data into a meaningful representation (features) that
simplifies classification. Discriminative features help separate different classes (e.g.,
lightness distinguishes fish species), while invariant features remain consistent under
transformations (e.g., rotation-invariant features for object recognition).
Importance:
• Poor features lead to misclassification.
• Good features reduce computational complexity and improve generalization.

3) Compare and contrast supervised, unsupervised, and reinforcement learning. Provide
an example of each.

• Supervised Learning: Uses labeled data (e.g., spam vs. non-spam emails). Example:
Training a classifier to recognize handwritten digits.

• Unsupervised Learning: Finds patterns in unlabeled data (e.g., clustering customer
data). Example: Grouping similar news articles without predefined categories.

• Reinforcement Learning: Learns via feedback (reward/punishment). Example: A robot
learning to navigate a maze by trial and error.

4) What are the main steps in the design cycle of a pattern recognition system? Why is
evaluation a critical step?
Steps:
1) Data collection
2) Feature selection
3) Model selection
4) Training
5) Evaluation

Importance of Evaluation:
• Ensures the system works on new, unseen data (not just training data).
• Detects overfitting and measures real-world performance.

5) Discuss the tradeoff between model complexity and generalization in pattern
recognition.
In pattern recognition, there is a critical tradeoff between model complexity and
generalization.

Model complexity refers to the ability of a model to fit a wide variety of patterns, often by
using many parameters or flexible decision boundaries. A complex model can fit the training
data very accurately, potentially capturing subtle patterns or even noise.

Generalization, on the other hand, refers to how well the model performs on unseen data. A
model that generalizes well can correctly classify or make predictions on new, real-world
inputs.
4. Knowledge in Learning
❖ MCQ Questions

1) What is the goal of traditional learning?


A) Use prior knowledge
B) Build a function from data only
C) Analyze logical rules
D) Learn by memorization

2) Which of the following best describes the key difference between EBL and KBIL when
handling noisy or incomplete data?
A) EBL completely ignores background knowledge
B) KBIL adapts hypotheses based only on current examples
C) KBIL combines background knowledge and inductive logic to remain robust
D) EBL is more statistically driven than KBIL

3) Given the logical constraint: Background ∧ Hypothesis ∧ Descriptions ⊨


Classifications, which of the following must hold true?
A) Hypothesis must be consistent only with Background
B) Descriptions alone must imply Classifications
C) Classifications are a direct result of the Background and Hypothesis
D) Hypothesis and Classifications imply Descriptions

4) In Current-Best Hypothesis Search, a contradicting negative example leads to:


A) Deletion of hypothesis
B) Specialization of hypothesis
C) Memorization of example
D) Abandoning learning

5) Which of the following is a key strength of KBIL compared to pure inductive learning?
A) It reduces reliance on repeated examples
B) It prefers random generalizations
C) It excludes background knowledge from analysis
D) It relies solely on descriptions of examples
6) S-set in Version Space is:
A) Most general hypotheses
B) All negative examples
C) Most specific consistent hypotheses
D) Statistical functions

7) G-set refers to:


A) Goals of learning
B) General knowledge
C) Most general consistent hypotheses
D) Grounded examples

8) In Version Space Learning, what happens if no hypothesis exists that fits both S-set and
G-set?
A) The version space collapses
B) All hypotheses are valid
C) The hypothesis space expands
D) The algorithm ignores inconsistencies

9) Ockham's Razor prefers hypotheses that are:


A) Complex
B) Memorized
C) Small and consistent
D) Randomly selected

10) Explanation-Based Learning (EBL) is:


A) Memorizing all solutions
B) Extracting general rules from examples
C) Guessing solutions
D) Ignoring prior knowledge

11) In EBL, rules are formed based on:


A) Data only
B) Background knowledge and observations
C) Noise in the data
D) Repeated guessing
12) Which method includes incremental development based on knowledge?
A) Pure inductive learning
B) Explanation-Based Learning
C) Autonomous learning agents
D) Randomized learning

13) What constraint must hypotheses satisfy in inductive learning?


A) Hypothesis = Classifications
B) Hypothesis ∧ Descriptions ⊨ Classifications
C) Descriptions = Background
D) Classifications ⊨ Hypothesis

14) What does KBIL stand for?


A) Knowledge-Based Intelligent Learning
B) Knowledge-Based Inductive Learning
C) Knowledge-Based Information Logic
D) Knowledge-Built In Learning

15) Memoization in EBL helps by:


A) Forgetting irrelevant data
B) Saving computational results
C) Recalculating everything
D) Learning statistics

16) A system uses the rule: ∀x, u. ArithmeticUnknown(u) → Simplify(1 × (0 + u), u).
Which learning method is being applied?
A) Statistical Learning
B) KBIL
C) Explanation-Based Learning
D) Memorization

17) Functional determination means:


A) Learning by random patterns
B) One feature fully determines another
C) Statistical mapping
D) Generalizing random guesses
18) Explanation-Based Learning enhances performance by:
A) Using rules to avoid redundant work
B) Avoiding logic
C) Ignoring background
D) Adding noise

19) Which of the following is NOT an advantage of using prior knowledge in learning?
A) Reduces complexity
B) Improves generalization
C) Requires more training data
D) Narrows hypothesis space

20) What distinguishes the memoization used in EBL from traditional memorization?
A) It stores each example separately
B) It saves only raw computation results
C) It creates general rules that avoid recomputation
D) It repeats all steps exactly every time

❖ Written Questions

1) What is the difference between Current-Best Hypothesis learning and Version Space
Learning?
Current-Best Hypothesis maintains a single hypothesis and updates it based on new data.
Version Space Learning tracks all consistent hypotheses using S-set and G-set.

2) A learner is trying to simplify the expression 1 × (0 + x) to x. Show how Explanation-
Based Learning can generalize this to other expressions.
EBL uses background rules like:
0+x=x
1×x=x
From this, the learner can generalize a rule: 1 × (0 + u) → u
This rule can now be reused for similar expressions, improving efficiency.

3) How does prior knowledge enhance the learning process in intelligent systems?
It reduces the hypothesis space, speeds up learning, and enables better generalization based
on existing knowledge.
4) Explain how memoization helps avoid redundant computations in a system that
differentiates algebraic expressions like d/dx(x² + x) and d/dx(x² + 3x)?
Memoization stores the results of previously computed derivatives, such as
d/dx(x²) = 2x and d/dx(x) = 1.

When a similar expression like x² + 3x appears, the system reuses the stored results instead
of recomputing them, saving time and computational effort.
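A minimal Python sketch of this idea using a cache (the tuple encoding of expressions is a toy representation invented for this example):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def derivative(expr):
    """Differentiate a tiny symbolic expression w.r.t. x, caching results.
    Expressions are tuples: ('+', a, b), ('*', c, f), ('^', 'x', n),
    or the bare symbol 'x'."""
    if expr == 'x':
        return 1
    op = expr[0]
    if op == '+':                          # d/dx (a + b) = a' + b'
        return ('+', derivative(expr[1]), derivative(expr[2]))
    if op == '*':                          # d/dx (c * f) = c * f'
        return ('*', expr[1], derivative(expr[2]))
    if op == '^':                          # d/dx (x^n) = n * x^(n-1)
        n = expr[2]
        return ('*', n, ('^', 'x', n - 1))

# d/dx(x^2 + x): computes and caches d/dx(x^2) and d/dx(x).
print(derivative(('+', ('^', 'x', 2), 'x')))
# d/dx(x^2 + 3x): reuses the cached d/dx(x^2) and d/dx(x) results.
print(derivative(('+', ('^', 'x', 2), ('*', 3, 'x'))))
print(derivative.cache_info())             # shows the cache hits
```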

5) How is Knowledge-Based Inductive Learning (KBIL) different from Pure Inductive
Learning?
KBIL uses both data and background knowledge, while Pure Inductive Learning relies solely
on data.

6) What are Functional Determinations, and why are they useful?


They are relationships where one feature determines another. They help simplify learning by
focusing on relevant variables.

7) How does Relevance-Based Learning infer rules from observations and background
knowledge?
It identifies relevant features (e.g., nationality → language) and generalizes meaningful rules
from few observations.

8) Describe a scenario where background knowledge could mislead a KBIL system if it’s
incorrect or outdated. How can such issues be mitigated?
If the system holds incorrect knowledge such as "All birds can fly," it may wrongly classify
penguins or ostriches.

To mitigate this: Update background knowledge with exceptions or more accurate rules. Use
statistical checks to challenge outdated rules.
5. Hidden Markov Model
❖ MCQ Questions

1) Which of the following best describes a First-Order Markov Model?


A) The current state depends on all previous states and changes over time.
B) The current state is chosen randomly with no relation to past states.
C) The current state depends only on the previous state and the transition probabilities stay
the same over time.
D) The current state depends on the next state and past probabilities.

2) In a First-Order Markov Model, what does each state represent?


A) A randomly generated number
B) A hidden state
C) A physical (observable) event
D) A variable name without a value

3) What does the learning problem in HMMs aim to find?


A) The best observation sequence
B) The transition and emission probabilities
C) The most likely current state
D) The shortest path between states

4) In a First-Order Hidden Markov Model (HMM), the observation probability
distribution b_j(x) for continuous observations is typically modeled as:
A) A uniform distribution
B) A discrete probability table
C) A mixture of Gaussian distributions
D) A Poisson distribution

5) Which algorithm is used to solve the decoding problem in HMMs?


A) Forward algorithm
B) Backpropagation
C) Viterbi algorithm
D) Baum-Welch algorithm
6) In the “urn and ball” HMM example, what does the color of the selected ball
represent?
A) The hidden state
B) The transition probability
C) The observation
D) The emission matrix

7) Which of the following is NOT an application of Hidden Markov Models (HMMs)?


A) Speech recognition
B) Protein sequence modeling
C) Image compression
D) Robot navigation

8) What does the probability expression P(v(t) | w(t)) represent in a Hidden Markov
Model?
A) The probability of transitioning from state w(t) to v(t)
B) The probability of observing output v(t) given the state at time t is w(t)
C) The probability of staying in the same state for t time steps
D) The probability of the next state being w(t)

9) What is the main goal of the evaluation problem in HMMs?


A) To find hidden states
B) To calculate the probability of an observed sequence
C) To modify state transitions
D) Generating new sequences

10) The Baum-Welch algorithm for HMMs is:


A) A supervised learning method requiring known states.
B) An EM algorithm that iteratively estimates parameters.
C) Used to find the most likely state sequence.
D) Only for discrete observations.
❖ Written Questions

1) What is a First-Order Markov Model?


A model where the current state only depends on the previous state.
This means:
P(w(t) | w(t − 1), ..., w(1)) = P(w(t) | w(t − 1))

2) What is a Hidden Markov Model (HMM)?


An HMM is a model where the system’s true states are hidden, but we observe outputs that
depend probabilistically on these states. It combines a hidden stochastic process with
observable outputs generated by another probabilistic process.

3) What are the main parts of a Hidden Markov Model (HMM)?


• Hidden states (N)
• Observations (M)
• Transition probabilities (aᵢⱼ)
• Observation probabilities (bⱼₖ)
• Initial probabilities (πᵢ)

4) What is an observable Markov model?


An observable Markov model is a process where states are directly visible.

5) What are the key characteristics of the observable Markov model?


Key characteristics:
• Transitions between states can be asymmetric.
• The same state can be repeated consecutively.
• Not all states need to be visited.

6) List some common applications of Hidden Markov Models (HMMs)?


HMMs are used in speech recognition, optical character recognition, natural language
processing, bioinformatics, video analysis, robot planning, and financial modeling.

7) What is the decoding problem in Hidden Markov Models, and which algorithm is used
to solve it?
The decoding problem is to find the most likely sequence of hidden states that could have
produced a given sequence of observations. This problem is solved using the Viterbi
algorithm.
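A compact Python sketch of the Viterbi algorithm (the weather states, observations, and all probabilities are invented toy numbers):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence.
    start_p[s], trans_p[s][s2], emit_p[s][o] are the HMM parameters
    (pi, a_ij, b_jk in the notation above)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            # Best previous state to extend into s, maximizing probability.
            layer[s] = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-1][prev][1] + [s])
                for prev in states)
        V.append(layer)
    return max(V[-1].values())

states = ['Rainy', 'Sunny']
start_p = {'Rainy': 0.6, 'Sunny': 0.4}
trans_p = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
           'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emit_p = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
          'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
prob, path = viterbi(['walk', 'shop', 'clean'], states, start_p, trans_p, emit_p)
print(prob, path)   # probability of the best path and the path itself
```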
8) Briefly explain the evaluation problem in Hidden Markov Models?
The evaluation problem in HMMs involves computing the probability of an observed
sequence given a specific model.

9) If the probability of staying in a state (aᵢᵢ) is 0.8, what is the expected number of days
the system will stay in that state consecutively?
Use the formula:
E[d | wᵢ] = 1 / (1 − aᵢᵢ) = 1 / (1 − 0.8) = 5
So, the expected number of consecutive days is 5 days.
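A quick Monte Carlo check of this formula in Python (the trial count and seed are arbitrary choices):

```python
import random

def average_stay(a_ii, trials=100_000, seed=0):
    """Estimate E[d] = 1 / (1 - a_ii) by simulating self-transitions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        d = 1
        while rng.random() < a_ii:   # stay with probability a_ii each step
            d += 1
        total += d
    return total / trials

print(average_stay(0.8))   # approximately 5, matching 1 / (1 - 0.8)
```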
10) What is the learning problem in Hidden Markov Models?
The learning problem in HMMs involves adjusting the model parameters—transition,
emission, and initial probabilities—to best fit a given set of observation sequences.
6. Maximum Likelihood Estimation
❖ MCQ Questions

1) In supervised learning, what is known for each training sample?


A) The full probability distribution
B) The median and standard deviation
C) The number of features
D) The true class label

2) What type of learning is being considered in the text?


A) Unsupervised learning
B) Reinforcement learning
C) Supervised learning
D) Self-supervised learning

3) After estimating the parameters from the training data, how are they used?
A) They are completely ignored
B) They are updated continuously during testing
C) They are used as if they were the true values
D) They are not used in classification

4) Which of the following is an unbiased estimator for the covariance matrix Σ?


A) MLE of Σ
B) Sample covariance S² = (1/n) ∑ (x_i − µ̂)(x_i − µ̂)ᵀ
C) S² = (1/(n−1)) ∑ (x_i − µ̂)(x_i − µ̂)ᵀ
D) None of the above

5) Which distribution is the conjugate prior for the Bernoulli distribution?


A) Gaussian
B) Gamma
C) Beta
D) Dirichlet

6) After observing data D = {x₁, ..., xₙ} where xi ∈ {0,1}, and using a Beta(α, β) prior, the
posterior for θ is:
A) Beta(α + n, β + n)
B) Beta(α + m, β + n − m) (where m = ∑ xi)
C) Gamma(α + m, β + n)
D) Gaussian(m/n, σ²)
❖ Written Questions

1) What does Bayesian Decision Theory show?


Bayesian Decision Theory shows us how to design an optimal classifier if we know the prior
probabilities P(wi ) and the class-conditional densities P(x | wi ).

2) What are the characteristics of MLE (Maximum Likelihood Estimation)?


• The MLE is the parameter point for which the observed sample is the most likely.
• The procedure with partial derivatives may result in several local extrema. We should
check each solution individually to identify the global optimum.
• Boundary conditions must also be checked separately for extrema.
• Invariance property: if θ̂ is the MLE of θ, then for any function f(θ),
the MLE of f(θ) is f(θ̂).

3) What is the main difference between Maximum Likelihood Estimation (MLE) and
Bayesian Estimation in how they treat the model parameters?

Maximum likelihood estimation:


• Views the parameters as quantities whose values are fixed but unknown.
• Estimates these values by maximizing the probability of obtaining the samples observed.

Bayesian estimation
• Views the parameters as random variables having some known prior distribution.
• Observing new samples converts the prior to a posterior density

4) Compare Maximum Likelihood Estimation (MLE) and Bayesian Estimation


• Computational complexity: MLE uses differential calculus and gradient search;
Bayesian estimation requires multidimensional integration.
• Interpretability: MLE yields a point estimate; Bayesian estimation yields a weighted
average of models.
• Prior information: MLE assumes only the parametric model p(x|θ); Bayesian estimation
assumes both p(θ) and p(x|θ), but the resulting distribution p(x|D) may not have the
same form as p(x|θ).
5) What is a conjugate prior in Bayesian statistics?
A conjugate prior is one which, when multiplied with the probability of the observation (the
likelihood), gives a posterior probability having the same functional form as the prior.
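A tiny Python sketch of the classic Beta–Bernoulli case (the prior parameters and data are invented), matching MCQ 6 above:

```python
def beta_bernoulli_posterior(alpha, beta, data):
    """Beta(alpha, beta) prior + Bernoulli observations -> Beta posterior.
    With m successes out of n trials, the posterior is
    Beta(alpha + m, beta + n - m): the same functional form as the prior."""
    m, n = sum(data), len(data)
    return alpha + m, beta + n - m

# Toy data: 7 ones out of 10 draws, starting from a uniform Beta(1, 1) prior.
print(beta_bernoulli_posterior(1, 1, [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))  # (8, 4)
```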

6) In multi-class classification, how are estimates applied to different classes, and what are
the different sources of error that affect classification performance?

The training samples are divided into c subsets D1, ..., Dc, with the samples in Di belonging
to class wi; each density p(x|wi, Di) is then estimated separately.

The different sources of error are:


• Bayes error: due to overlapping class-conditional densities.
• Model error: due to an incorrect model.
• Estimation error: due to estimation from a finite sample, which can be reduced by
increasing the amount of training data.

7) Suppose that X is a discrete random variable with the following probability mass
function, where 0 ≤ θ ≤ 1 is a parameter:

X       0       1       2           3
P(X)    2θ/3    θ/3     2(1−θ)/3    (1−θ)/3

The following 10 independent observations were taken from such a distribution:
(3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of θ?
Solution:

L(θ) = P(X=3) P(X=0) P(X=2) P(X=1) P(X=3) P(X=2) P(X=1) P(X=0) P(X=2) P(X=1)

Substituting from the probability distribution given above (X=0 occurs twice, X=1 three
times, X=2 three times, and X=3 twice):

L(θ) = ∏ᵢ P(Xᵢ | θ) = (2θ/3)² (θ/3)³ (2(1−θ)/3)³ ((1−θ)/3)²

The likelihood function L(θ) is not easy to maximize directly, so we work with the log
likelihood:

l(θ) = log L(θ) = Σᵢ log P(Xᵢ | θ)
l(θ) = 2(log(2/3) + log θ) + 3(log(1/3) + log θ) + 3(log(2/3) + log(1 − θ)) + 2(log(1/3) + log(1 − θ))
l(θ) = C + 5 log θ + 5 log(1 − θ)

Setting the derivative of l(θ) with respect to θ to zero:

dl(θ)/dθ = 5/θ − 5/(1 − θ) = 0

Solving gives the MLE: θ̂ = 0.5
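A short numeric check of this result in Python (a simple grid search over θ; the grid granularity is arbitrary):

```python
import math

obs = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

def log_likelihood(theta):
    """Log likelihood of the observations under the pmf given above."""
    pmf = {0: 2 * theta / 3, 1: theta / 3,
           2: 2 * (1 - theta) / 3, 3: (1 - theta) / 3}
    return sum(math.log(pmf[x]) for x in obs)

# Grid search over (0, 1) confirms the analytic result theta_hat = 0.5.
best = max((t / 1000 for t in range(1, 1000)), key=log_likelihood)
print(best)   # 0.5
```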
7. Reinforcement learning
❖ MCQ Questions

1) Which component is essential in reinforcement learning?


A) Agent
B) Environment
C) Rewards
D) All of the above

2) What is the objective of a reinforcement learning agent?


A) To minimize errors
B) To maximize accuracy
C) To maximize rewards
D) To minimize computational resources

3) Which algorithm is the foundation of most reinforcement learning methods?


A) Q-learning
B) Deep Learning
C) K-means clustering
D) Random Forest

4) In reinforcement learning, what does the term "exploitation" refer to?


A) Trying new actions to gain more knowledge
B) Maximizing immediate rewards based on current knowledge
C) Balancing exploration and exploitation for optimal results
D) Trying random actions to avoid bias

5) What is the role of the reward function in reinforcement learning?


A) It defines the actions available to the agent
B) It provides feedback to the agent based on its actions
C) It specifies the termination condition of the learning process
D) It determines the size of the agent's memory

6) Which reinforcement learning algorithm uses a model to simulate the environment and
learn from it?
A) Actor-Critic
B) Model-Free learning
C) Model-Based learning
D) Q-learning
7) Which algorithm is used when the environment's dynamics are unknown in
reinforcement learning?
A) Model-Free learning
B) Model-Based learning
C) Q-learning
D) Deep Learning

8) Which algorithm is used to estimate the optimal value function directly without
explicitly learning the policy?
A) Q-learning
B) Policy gradient
C) Temporal Difference (TD) learning
D) Monte Carlo methods

9) Which reinforcement learning algorithm uses a neural network as a function


approximator?
A) Q-learning
B) Deep Q-Network (DQN)
C) Policy gradient
D) Monte Carlo methods

10) Which algorithm uses a policy network to directly approximate the policy in
reinforcement learning?
A) Q-learning
B) Policy gradient
C) Monte Carlo methods
D) Temporal Difference (TD) learning

❖ Written Questions

1) What is reinforcement learning (RL)?


Reinforcement learning is learning what to do—how to map situations to actions—so as to
maximize a numerical reward signal. The learner is not told which actions to take, as in most
forms of machine learning, but instead must discover which actions yield the most reward by
trying them.
2) Explain the different types of agents in reinforcement learning.
In Reinforcement Learning, various types of agents are used depending on how they make
decisions and learn from the environment. The main types include:

• Utility-based Agent:
This agent learns a utility function that assigns a numerical value to each state. It chooses
actions based on which future state has the highest expected utility.

• Q-learning Agent:
A Q-learning agent learns a Q-value function that estimates the expected utility of taking
a specific action in a given state. It does not require a model of the environment to make
decisions.

• Reflex Agent:
This agent operates based on a policy that maps each state directly to an action. It does
not use a utility or value function but instead responds immediately based on the current
state.

3) How does Q-learning work?


Q-learning maintains a table (called the Q-table) where each entry Q(s, a) represents the
expected future reward of taking action a in state s and following the optimal policy thereafter.

1- Initialization:
The Q-table is initialized with arbitrary values (often zeros).

2- Interaction:
The agent interacts with the environment by:
Observing the current state (s)
Selecting an action (a) (using an exploration strategy like ε-greedy)
Receiving a reward (r) and observing the next state (s’)

3- Update Rule:
The Q-value for the state-action pair is updated using the formula:
Q(s, a) ← Q(s, a) + α [r + γ * max Q(s’, a’) – Q(s, a)]

α: Learning rate (how much new information overrides old)


γ: Discount factor (importance of future rewards) (Gamma)
max Q(s’, a’): Maximum predicted reward for the next state
4- Repeat:
This process repeats for many episodes, allowing the Q-table to converge to optimal values.

4) Example of Q-learning in action: Imagine a 4x4 grid.

The agent starts at the top-left corner (0,0).
The goal is at the bottom-right corner (3,3).
The agent can move up, down, left, or right.
Each move gives a reward of -1 (to encourage faster solutions), except reaching the
goal, which gives +10.
Solution:

Q-learning Steps

1- Initialize Q-table:
Each cell (state) and action pair (up, down, left, right) starts with Q-values of 0.

2- Agent’s Move:
The agent is at (0,0).
It chooses an action (say, right).
It moves to (0,1), receives a reward of -1.

3- Update Q-value:
Suppose α = 0.5 (learning rate), γ = 0.9 (discount factor).
The update for Q((0,0), right) is:
Q((0,0), right) ← Q((0,0), right) + 0.5 * [-1 + 0.9 * max Q((0,1), all actions) - Q((0,0),
right)]
Since all Q-values are 0 at first:
Q((0,0), right) ← 0 + 0.5 * [-1 + 0 - 0] = -0.5

4- Repeat:
The agent continues exploring, updating Q-values for each state-action pair.

5- Convergence:
After many episodes, the Q-table will reflect the best action to take from each cell to reach
the goal quickly.
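A runnable Python sketch of this grid example (hyperparameters such as ε and the episode count are arbitrary choices; the very first update writes −0.5 into the Q-table, matching the worked computation above):

```python
import random

# Grid actions and their row/column effects.
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
GOAL, ALPHA, GAMMA, EPS = (3, 3), 0.5, 0.9, 0.1

# Q-table: every (state, action) pair starts at 0, as in step 1 above.
Q = {(r, c): {a: 0.0 for a in ACTIONS} for r in range(4) for c in range(4)}

def step(state, action):
    """Move on the 4x4 grid (bumping into a wall keeps the agent in place);
    each move costs -1, and entering the goal yields +10."""
    dr, dc = ACTIONS[action]
    nxt = (min(max(state[0] + dr, 0), 3), min(max(state[1] + dc, 0), 3))
    return nxt, (10 if nxt == GOAL else -1)

rng = random.Random(0)
for _ in range(500):                        # episodes
    s = (0, 0)
    while s != GOAL:
        if rng.random() < EPS:              # explore
            a = rng.choice(list(ACTIONS))
        else:                               # exploit current knowledge
            a = max(Q[s], key=Q[s].get)
        s2, r = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha [r + gamma * max Q(s',a') - Q(s,a)]
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2].values()) - Q[s][a])
        s = s2

# After convergence, the greedy action from (0,0) points toward the goal.
print(max(Q[(0, 0)], key=Q[(0, 0)].get))    # 'right' or 'down'
```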
5) How can we use the value determination algorithm to compute the expected loss from
using estimated utility U and model M, instead of the correct ones?
We estimate the reward the agent will get using an estimated utility U and model M, which
differs from the true utility of visited states.

To compute the agent's policy, for each state i, choose the action a giving the highest
expected utility:

policy(i) = argmax over a of Σj M(i, a, j) U(j)

where M(i, a, j) is the estimated probability of reaching state j from state i when taking
action a.
