Introduction To Probability
1 Introduction to Probability
Probability is a numerical measure that quantifies the likelihood of an event occurring. It is expressed
as a number between 0 and 1, inclusive: 0 means the event cannot occur, and 1 means it is certain to occur.
1.1.1 Sample Space
The sample space (Ω) is the set of all possible outcomes of a random experiment.
1.1.2 Event
An event is any subset of the sample space (Ω). It’s a specific outcome or a collection of outcomes
that we are interested in, belonging to the event space (F).
1.1.3 Probability Axioms
These are fundamental rules that all probabilities must follow:
1. Range: The probability of any event A, denoted P (A), must be between 0 and 1. 0 ≤ P (A) ≤ 1
2. Total Probability: The probability of the entire sample space Ω (meaning something in the
sample space will happen) is 1. P (Ω) = 1
3. Additivity (for Disjoint Events): If two events, A and B, are disjoint (also known as mutually
exclusive, meaning they cannot happen at the same time and have no common outcomes, i.e.,
A ∩ B = ∅), then the probability of A or B happening is the sum of their individual probabilities.
P (A ∪ B) = P (A) + P (B)
• Example: When rolling a die, let A be "rolling an odd number" (A = {1, 3, 5}) and B be
"rolling a 2" (B = {2}). A and B are disjoint. P(A) = 3/6, P(B) = 1/6. The event "rolling
an odd number or a 2" is {1, 2, 3, 5}. Its probability is 4/6. Using the axiom: P(A ∪ B) =
P(A) + P(B) = 3/6 + 1/6 = 4/6. The results match.
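The additivity axiom can be checked by direct enumeration; a minimal sketch in Python using exact fractions:

```python
from fractions import Fraction

# Sample space for one roll of a fair die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

A = {1, 3, 5}          # "rolling an odd number"
B = {2}                # "rolling a 2"
assert A & B == set()  # A and B are disjoint

P_A = sum(p[o] for o in A)          # 3/6
P_B = sum(p[o] for o in B)          # 1/6
P_union = sum(p[o] for o in A | B)  # 4/6

# Additivity axiom holds for disjoint events.
assert P_union == P_A + P_B == Fraction(2, 3)
```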
2 Random Variables
A random variable (RV) is a function that assigns a numerical value to each outcome in the
sample space of a random experiment. Random variables are typically denoted by capital letters like X,
Y, Z.
• Properties of a PMF:
– 0 ≤ P (X = x) ≤ 1 for all values of x.
– Σ_x P(X = x) = 1 (the sum of the probabilities over all possible values must equal 1).
• Examples (of continuous random variables):
– Temperature (e.g., 28.63 °C)
– Height (e.g., 175.4 cm)
– Time (e.g., time it takes for a light bulb to burn out)
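Returning to the PMF properties above, a minimal check in Python for X = the number of heads in two fair coin tosses (the same variable used in the CDF example later in these notes):

```python
from fractions import Fraction
from itertools import product

# X = number of heads in two fair coin tosses.
outcomes = list(product("HT", repeat=2))  # 4 equally likely outcomes
pmf = {}
for outcome in outcomes:
    x = outcome.count("H")
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 4)

# Property 1: 0 <= P(X = x) <= 1 for every value x.
assert all(0 <= prob <= 1 for prob in pmf.values())
# Property 2: the probabilities sum to 1.
assert sum(pmf.values()) == 1
assert pmf == {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```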
2.2.1 Probability Density Function (PDF)
For continuous random variables, we use a Probability Density Function (PDF), denoted f (x).
Important distinction: For a continuous RV, the probability of it taking on any single exact value
is zero: P (X = x) = 0. This is because there are infinitely many possible values, so the chance of hitting
one specific value is infinitesimally small.
Instead, probabilities for continuous variables are found by calculating the area under the curve
of the PDF over a given interval.
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
• Properties of a PDF:
– f (x) ≥ 0 for all x (the probability density cannot be negative).
– ∫_{−∞}^{∞} f(x) dx = 1 (the total area under the entire PDF curve must equal 1).
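As a quick numerical check of these properties, a sketch using the example density f(x) = 2x on [0, 1] that appears in the CDF example below, with a simple midpoint-rule integrator:

```python
def pdf(x):
    # Example density from these notes: f(x) = 2x on [0, 1], 0 elsewhere.
    return 2 * x if 0 <= x <= 1 else 0.0

def integrate(f, a, b, n=100_000):
    # Simple midpoint-rule numerical integration.
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

total_area = integrate(pdf, 0.0, 1.0)
assert abs(total_area - 1.0) < 1e-6   # total area under the PDF is 1

prob = integrate(pdf, 0.2, 0.5)
assert abs(prob - 0.21) < 1e-6        # P(0.2 <= X <= 0.5) = 0.21
```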
• Properties of a CDF (F(x) = P(X ≤ x)):
– F(x) is non-decreasing in x.
– F(x) → 0 as x → −∞ and F(x) → 1 as x → ∞.
– For a < b, P(a < X ≤ b) = F(b) − F(a).
• Example: CDF for Discrete Variable (Toss two coins, X=number of heads)
PMF: P (0) = 1/4, P (1) = 1/2, P (2) = 1/4.
– F (x) = 0 for x < 0
– F (0) = P (X ≤ 0) = P (X = 0) = 1/4
– F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 1/4 + 1/2 = 3/4
– F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2) = 1/4 + 1/2 + 1/4 = 1
– F (x) = 1 for x ≥ 2
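The two-coin CDF above can be computed directly from the PMF by accumulating mass; a small sketch:

```python
from fractions import Fraction

# PMF for X = number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    # F(x) = P(X <= x): accumulate PMF mass at or below x.
    return sum(p for value, p in pmf.items() if value <= x)

assert cdf(-1) == 0
assert cdf(0) == Fraction(1, 4)
assert cdf(1) == Fraction(3, 4)
assert cdf(2) == 1
assert cdf(5) == 1   # F stays at 1 beyond the largest value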
• Example: CDF for Continuous Variable (from PDF f(x) = 2x for 0 ≤ x ≤ 1)
For 0 ≤ x ≤ 1: F(x) = ∫_0^x 2t dt = [t²]_0^x = x². So, F(x) = x² for 0 ≤ x ≤ 1, F(x) = 0 for x < 0,
and F(x) = 1 for x > 1.
– Using CDF to find P (0.2 ≤ X ≤ 0.5): P (0.2 ≤ X ≤ 0.5) = F (0.5)−F (0.2) = (0.5)2 −(0.2)2 =
0.25 − 0.04 = 0.21, matching the previous result.
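This CDF calculation can be reproduced in a few lines; a minimal sketch:

```python
def F(x):
    # CDF obtained by integrating f(t) = 2t from 0 to x: F(x) = x^2 on [0, 1].
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x ** 2

# P(0.2 <= X <= 0.5) = F(0.5) - F(0.2) = 0.25 - 0.04 = 0.21
prob = F(0.5) - F(0.2)
assert abs(prob - 0.21) < 1e-12
```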
3 Measures of Distribution
These are measures that describe the central tendency and spread of a random variable.
• What it tells us: The expectation represents the ”long-run average” value if you were to repeat
the random experiment infinitely many times. It’s the ”central tendency” or ”balancing point”
of the distribution. For a fair die, an expectation of 3.5 means that, over many rolls, the average
outcome will tend towards 3.5, even though 3.5 is not an actual outcome. It’s the best single value
to summarize the distribution’s typical outcome.
What this tells us: If you were to repeatedly draw samples from this distribution, the average of
those samples would tend towards 2/3.
First, calculate E[X²]:
E[X²] = 1²·(1/6) + 2²·(1/6) + 3²·(1/6) + 4²·(1/6) + 5²·(1/6) + 6²·(1/6)
E[X²] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6 ≈ 15.1667
Now, calculate the variance: Var(X) = E[X²] − (E[X])² = 91/6 − (3.5)² = 35/12 ≈ 2.9167
Standard Deviation: σ = √(35/12) ≈ 1.7078
What this tells us: On average, the outcomes of a fair die roll deviate from the mean (3.5) by about 1.71 units.
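The whole calculation for the fair die can be verified with exact arithmetic; a minimal sketch:

```python
from fractions import Fraction
import math

values = range(1, 7)
p = Fraction(1, 6)   # each face of a fair die

E_X = sum(x * p for x in values)        # expectation: 7/2 = 3.5
E_X2 = sum(x * x * p for x in values)   # second moment: 91/6
var = E_X2 - E_X ** 2                   # Var(X) = E[X^2] - (E[X])^2 = 35/12
sigma = math.sqrt(var)                  # standard deviation

assert E_X == Fraction(7, 2)
assert E_X2 == Fraction(91, 6)
assert var == Fraction(35, 12)
assert abs(sigma - 1.7078) < 1e-3
```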
4 Key Probability Distributions
4.1 Univariate (Single) Gaussian (Normal) Distribution
The Univariate Gaussian distribution, also known as the Normal distribution, is the most im-
portant and widely used continuous probability distribution for a single random variable. It’s often
described as the ”bell curve” due to its characteristic shape.
4.1.1 Definition:
The PDF of a normal distribution for a single variable X is given by:
f(x; µ, σ²) = (1 / √(2πσ²)) · e^(−(x−µ)² / (2σ²))
Where:
• µ (mu) is the mean of the distribution, which determines its central location and the peak of the
bell curve.
• σ² (sigma squared) is the variance, which determines the spread of the curve: a larger σ² gives a
wider, flatter bell.
4.1.2 Properties:
• Bell-shaped and Symmetric: The curve is symmetrical around its mean µ.
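As a sanity check on the definition and the symmetry property, a small sketch (the standard normal, µ = 0 and σ² = 1, is an assumed example):

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    # Univariate Gaussian density f(x; mu, sigma^2).
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Symmetric about the mean: f(mu + d) == f(mu - d).
assert math.isclose(normal_pdf(1.3), normal_pdf(-1.3))
# The peak is at x = mu, with height 1 / sqrt(2*pi*sigma^2).
assert math.isclose(normal_pdf(0.0), 1 / math.sqrt(2 * math.pi))
assert normal_pdf(0.5) < normal_pdf(0.0)
```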
– What the covariance matrix tells us:
∗ Diagonal elements: Indicate the spread of each individual variable along its own axis.
∗ Off-diagonal elements (Covariance): Quantify how two variables vary together.
· Positive Covariance: Implies that if one variable increases, the other tends to
increase as well (e.g., height and weight). The elliptical contours of the distribution
would be stretched along the direction where both variables increase.
· Negative Covariance: Implies that if one variable increases, the other tends to
decrease (e.g., hours spent studying and hours spent playing video games, if they are
inversely related). The elliptical contours would be stretched along the anti-diagonal.
· Zero Covariance: Implies no linear relationship between the two variables. For
Gaussian distributions, zero covariance implies independence. The elliptical contours
would be aligned with the coordinate axes.
∗ Overall, the covariance matrix Σ defines the shape and orientation of the multi-dimensional
”bell” (an ellipsoid), showing how the variables are correlated and spread out together in
the multi-dimensional space.
• |Σ| is the determinant of the covariance matrix.
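A minimal sketch of estimating a 2×2 covariance matrix from positively correlated samples (the noise model y = x + ε here is an assumption chosen for illustration, standing in for a height/weight-style relationship):

```python
import random

random.seed(0)
n = 50_000

# Generate positively correlated pairs: y = x + noise.
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    xs.append(x)
    ys.append(x + random.gauss(0, 0.5))

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Sample covariance of two equal-length lists.
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

Sigma = [[cov(xs, xs), cov(xs, ys)],
         [cov(ys, xs), cov(ys, ys)]]

# Off-diagonal element > 0: the variables tend to rise together.
assert Sigma[0][1] > 0
# |Sigma| > 0 for a valid, non-degenerate covariance matrix.
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
assert det > 0
```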
5 Multiple Random Variables and Their Relationships
These concepts explain how different random variables interact and how their probabilities are related.
What this tells us: There is a 12% chance that X will be between 0.2 and 0.5 and Y will be
between 0.3 and 0.7 at the same time. This shows the joint likelihood of two events occurring
in specific continuous ranges.
• For continuous variables: f(x) = ∫_{−∞}^{∞} f(x, y) dy
• What it tells us: The marginal probability (or marginal PDF/PMF) gives the individual prob-
ability distribution for one variable, effectively ignoring or ”averaging out” the influence of other
variables it might be jointly distributed with. It allows us to analyze each variable’s behavior
independently of the others.
• Example (Discrete): Weather and Umbrella (continued)
If we also knew P (X=Sunny, Y=Yes) = 0.2 and P (X=Sunny, Y=No) = 0.2. P (X=Rainy) =
P (X=Rainy, Y=Yes) + P (X=Rainy, Y=No) = 0.5 + 0.1 = 0.6 What this tells us: There is an
overall 60% chance of it being rainy, regardless of whether someone carries an umbrella or not.
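Marginalization is just a sum over the other variable; a sketch using the joint PMF values from the weather/umbrella example above:

```python
# Joint PMF from the weather/umbrella example in these notes.
joint = {
    ("Rainy", "Yes"): 0.5,
    ("Rainy", "No"):  0.1,
    ("Sunny", "Yes"): 0.2,
    ("Sunny", "No"):  0.2,
}

def marginal_x(x):
    # Marginal of X: sum the joint over all values of Y.
    return sum(p for (xi, _), p in joint.items() if xi == x)

assert abs(marginal_x("Rainy") - 0.6) < 1e-12
assert abs(marginal_x("Sunny") - 0.4) < 1e-12
assert abs(sum(joint.values()) - 1.0) < 1e-12   # joint PMF sums to 1
```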
• What it tells us: The conditional probability (or conditional PDF/PMF) allows us to update our
beliefs about one variable’s likelihood after observing the value of another. It reveals dependencies
between variables: if knowing one variable’s value changes the probability of another, they are
dependent.
• Example (Discrete): Weather and Umbrella (continued)
Find the probability that it's Rainy, given that someone has an Umbrella (Yes):
P(X=Rainy | Y=Yes) = P(X=Rainy, Y=Yes) / P(Y=Yes)
Using our previous examples, P(Y=Yes) = P(Rainy, Yes) + P(Sunny, Yes) = 0.5 + 0.2 = 0.7, so
P(X=Rainy | Y=Yes) = 0.5/0.7 ≈ 0.714. What this tells us: If you know someone has
an umbrella, there's about a 71.4% chance it's rainy. This is higher than the overall 60% chance
of rain, indicating a positive association between rain and carrying an umbrella.
• Example (Continuous): Using the Uniform Joint PDF (continued)
Using f(x, y) = 1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and we found f(y) = 1 for 0 ≤ y ≤ 1.
– Find the conditional PDF for X given Y=y: f(x|y) = f(x, y) / f(y) = 1/1 = 1, for
0 ≤ x ≤ 1 (for any specific y ∈ [0, 1]); f(x|y) = 0 otherwise.
What this tells us: In this specific uniform example, knowing the value of Y (as long as it’s within
its range) does not change the distribution of X. X is still uniformly distributed between 0 and 1.
This is a characteristic of independent variables. If X and Y were dependent, f (x|y) would be a
function of y, meaning Y ’s value would indeed influence X’s distribution.
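The discrete conditional-probability calculation from the weather/umbrella example can be sketched directly from the joint PMF:

```python
# Joint PMF from the weather/umbrella example in these notes.
joint = {
    ("Rainy", "Yes"): 0.5,
    ("Rainy", "No"):  0.1,
    ("Sunny", "Yes"): 0.2,
    ("Sunny", "No"):  0.2,
}

def p_y(y):
    # Marginal P(Y = y), summing out the weather variable.
    return sum(p for (_, yi), p in joint.items() if yi == y)

def p_x_given_y(x, y):
    # Conditional probability: P(X = x | Y = y) = P(x, y) / P(y).
    return joint[(x, y)] / p_y(y)

assert abs(p_y("Yes") - 0.7) < 1e-12
assert abs(p_x_given_y("Rainy", "Yes") - 0.714) < 1e-3   # 0.5 / 0.7
```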
5.4 Independence
Two events A and B (or two random variables X and Y) are considered independent if the occurrence
of one does not affect the probability of the other.
• What it tells us: Independence signifies a lack of predictive power between variables. Knowing
the value of one variable gives you no additional information about the likelihood of the other.
• Example: Tossing 2 coins.
Let X be the outcome of the first toss (Heads/Tails) and Y be the outcome of the second toss.
The outcome of the first toss is independent of the outcome of the second toss. If P (HH) = 0.25,
P (HT) = 0.25, etc., and P (X = H) = 0.5, P (Y = H) = 0.5. Since P (X = H, Y = H) = 0.25 and
P (X = H) · P (Y = H) = 0.5 · 0.5 = 0.25, they are independent. What this tells us: Knowing the
result of the first coin toss gives you no information that changes your prediction for the second
coin toss. The two events do not influence each other’s probabilities.
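The independence condition P(X = x, Y = y) = P(X = x) · P(Y = y) from the two-coin example can be checked over every outcome pair:

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two fair coin tosses: all four outcomes equally likely.
joint = {(x, y): Fraction(1, 4) for x, y in product("HT", repeat=2)}

def marginal_first(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_second(y):
    return sum(p for (_, yi), p in joint.items() if yi == y)

# Independence: P(X = x, Y = y) == P(X = x) * P(Y = y) for every pair.
assert all(joint[(x, y)] == marginal_first(x) * marginal_second(y)
           for x, y in joint)
```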
6 Bayesian Networks: Modeling Conditional Dependencies
Bayesian Networks (BNs), also known as Bayes nets or belief networks, are powerful probabilistic
graphical models that represent a set of random variables and their conditional dependencies via
a Directed Acyclic Graph (DAG). They provide a structured way to represent and reason about
uncertainty in complex systems.
1. Nodes: Each node in the graph represents a random variable.
2. Directed Edges: Arrows connect nodes to indicate a direct probabilistic influence or causal
relationship. An arrow from A to B means A is a "parent" of B, and B is a "child" of A, implying
that B's probability distribution directly depends on A.
3. No Cycles (Acyclic): The graph must not contain any directed cycles (you cannot start at a
node and follow arrows to return to the same node). This ensures that probabilities are well-defined
and that cause-and-effect relationships don’t loop back on themselves.
4. Conditional Probability Distributions (CPDs) / Tables (CPTs): Each node has an asso-
ciated CPD (or CPT for discrete variables) that quantifies the influence of its parents.
• For a node with no parents (a ”root” node), its CPD is simply its prior probability distribution.
• For a node with parents, its CPD defines the probability of that node’s value given the values
of its parents.
• Rain (R): True/False
• Wet Grass (W): True/False
Structure (DAG):
• Rain → Wet Grass
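The Rain → Wet Grass network can be sketched with two CPTs. The notes give the structure but not the numbers, so the probability values below are hypothetical, chosen only to illustrate how the joint factorizes along the DAG:

```python
# Hypothetical CPTs (values are assumptions for illustration).
p_rain = {True: 0.3, False: 0.7}            # prior P(R) for the root node
p_wet_given_rain = {True: 0.9, False: 0.1}  # P(W = True | R)

def joint(r, w):
    # Joint factorizes along the DAG: P(R, W) = P(R) * P(W | R).
    pw = p_wet_given_rain[r]
    return p_rain[r] * (pw if w else 1 - pw)

# The factorized joint is a valid distribution.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
assert abs(total - 1.0) < 1e-12

# Marginal P(W = True), summing out R.
p_wet = joint(True, True) + joint(False, True)
assert abs(p_wet - (0.3 * 0.9 + 0.7 * 0.1)) < 1e-12
```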
7 Bayes’ Theorem
Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions that might
be related to the event. It’s a cornerstone of probabilistic inference, allowing us to update our beliefs
based on new evidence. It is frequently applied within Bayesian Networks for various inference tasks,
particularly diagnostic inference.
The formula for Bayes’ Theorem is:
P(A|B) = P(B|A) · P(A) / P(B)
Where:
• P (A|B) is the posterior probability: the probability of event A occurring given that event B
has occurred. This is often what we want to find – our updated belief about A after observing B.
• P (B|A) is the likelihood: the probability of event B occurring given that event A has occurred.
This tells us how likely the observed evidence B is, assuming our hypothesis A is true.
• P (A) is the prior probability: the initial probability of event A occurring before considering any
information about B. This is our initial belief or knowledge about A.
• P (B) is the evidence or marginal probability of B: the total probability of event B occurring,
considering all possible scenarios. This acts as a normalizing constant. It can be calculated using
the law of total probability: P (B) = P (B|A)P (A) + P (B|not A)P (not A) (for two complementary
events A and not A).
What it tells us: Bayes’ Theorem provides a formal way to reverse conditional probabilities and update
our confidence in a hypothesis (A) when new evidence (B) becomes available. It shows how our initial
belief (P (A)) is weighted by how well the evidence supports the hypothesis (P (B|A)) relative to how
likely the evidence is overall (P (B)). It’s fundamental for inference and learning from data.
Now, apply Bayes' Theorem:
P(D|T+) = P(T+|D) · P(D) / P(T+)
P(D|T+) = 0.00099 / 0.05094 ≈ 0.0194
What this tells us: Even with a positive test, the probability of actually having the disease is only about
1.94%! This illustrates the power of Bayes’ Theorem: it shows how the very low prior probability of the
disease (P (D) = 0.001) significantly tempers the impact of the positive test result, given the relatively
high false positive rate (5%) in the larger healthy population. This kind of analysis is crucial in fields
like medicine and other areas requiring probabilistic reasoning.
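The disease-testing calculation above can be reproduced in a few lines, using the numbers stated in these notes (prior 0.001, sensitivity 0.99, false positive rate 0.05):

```python
# Disease-testing numbers from the notes.
p_disease = 0.001          # prior P(D)
p_pos_given_d = 0.99       # sensitivity P(T+ | D)
p_pos_given_not_d = 0.05   # false positive rate P(T+ | not D)

# Evidence P(T+) via the law of total probability.
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Bayes' theorem: P(D | T+) = P(T+ | D) * P(D) / P(T+).
p_d_given_pos = p_pos_given_d * p_disease / p_pos

assert abs(p_pos - 0.05094) < 1e-12
assert abs(p_d_given_pos - 0.0194) < 1e-3   # only ~1.94% despite the positive test
```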
8 KL Divergence (Kullback-Leibler Divergence)
KL Divergence, also known as relative entropy, is a non-symmetric measure of the difference be-
tween two probability distributions. It quantifies how much information is lost when a probability
distribution Q is used to approximate another probability distribution P . In simpler terms, KL Diver-
gence tells us how ”different” or ”distant” one probability distribution Q is from a reference
probability distribution P .
It is commonly used in various fields, including statistics, information theory, and machine learning,
to measure how similar one probability distribution is to another, often when optimizing models to learn
target distributions.
8.1 Formula:
For discrete probability distributions P and Q with PMFs P(x) and Q(x):
DKL(P||Q) = Σ_x P(x) log(P(x)/Q(x))
For continuous probability distributions P and Q with PDFs p(x) and q(x):
DKL(P||Q) = ∫_{−∞}^{∞} p(x) log(p(x)/q(x)) dx
Note: The logarithm can be in any base. Using base 2 gives units in ”bits”, while using the natural
logarithm (base e) gives units in ”nats”. The choice of base only scales the result, not its fundamental
meaning.
Interpretation: DKL (P ||Q) represents the extra bits (or nats) required to encode samples from the
true distribution P when using an encoding scheme optimized for the approximating distribution Q,
compared to using an optimal encoding scheme for P itself. A higher DKL (P ||Q) means that Q is a
poorer approximation of P , and using Q’s encoding would be more inefficient, leading to more ”surprise”
or information loss when observing actual events from P .
• Asymmetry Explained:
– If P (x) > 0 for some x where Q(x) = 0, then DKL (P ||Q) will be infinite. This is
because if the approximating distribution Q assigns zero probability to an event that the
true distribution P considers possible, the ”cost” of encoding that event using Q’s scheme
becomes infinitely large. This means Q must cover all possibilities that P covers.
– Conversely, if Q(x) > 0 for some x where P (x) = 0, that specific term in the sum/integral
contributes 0 to DKL (P ||Q). In this case, Q might assign probability to events that P
never produces; this is penalized less severely in DKL (P ||Q) than P assigning probability
to events that Q misses.
• This asymmetry is important in machine learning. For instance, in Maximum Likelihood
Estimation (MLE), we often minimize DKL (Pdata ||Pmodel ), which means the model tries to
cover all data modes. In Variational Inference (VI), we often minimize DKL (Qapprox ||Ptrue ),
which implies that the approximation Q should not assign probability mass where the true
posterior P has none (it will tend to be narrower than P ).
8.4 Examples:
8.4.1 Example 1: Comparing Two Coins
Let’s consider two binary distributions (coin tosses):
• Distribution P (True Coin): A biased coin that lands Heads (H) with probability 0.7 and Tails
(T) with probability 0.3.
– P (H) = 0.7
– P (T ) = 0.3
• Distribution Q (Approximation Coin): A fair coin.
– Q(H) = 0.5
– Q(T ) = 0.5
Let’s calculate DKL (P ||Q) using natural logarithms.
DKL(P||Q) = P(H) ln(P(H)/Q(H)) + P(T) ln(P(T)/Q(T))
= 0.7 ln(0.7/0.5) + 0.3 ln(0.3/0.5)
= 0.7 ln(1.4) + 0.3 ln(0.6)
≈ 0.7(0.33647) + 0.3(−0.51083)
≈ 0.23553 − 0.15325 = 0.08228 nats
Now the reverse direction:
DKL(Q||P) = Q(H) ln(Q(H)/P(H)) + Q(T) ln(Q(T)/P(T))
= 0.5 ln(0.5/0.7) + 0.5 ln(0.5/0.3)
= 0.5 ln(0.71429) + 0.5 ln(1.66667)
≈ 0.5(−0.33647) + 0.5(0.51083)
≈ −0.16824 + 0.25542 = 0.08718 nats
• DKL (Q||P ) ≈ 0.08718 nats. This is the extra information we’d expect to need to represent out-
comes from the fair coin (Q) if we assumed they came from the biased coin (P).
• Notice that DKL(P||Q) ≠ DKL(Q||P), confirming the non-symmetric property. The values are
close in this simple case because the distributions aren't dramatically different, but the principle
holds.
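Both directions of the coin calculation can be verified with a small discrete-KL helper:

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * ln(P(x) / Q(x)), in nats.
    # Terms with P(x) = 0 contribute 0 by convention.
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

P = {"H": 0.7, "T": 0.3}   # biased coin (true distribution)
Q = {"H": 0.5, "T": 0.5}   # fair coin (approximation)

assert abs(kl_divergence(P, Q) - 0.08228) < 1e-4
assert abs(kl_divergence(Q, P) - 0.08718) < 1e-4
# Non-symmetric: the two directions differ.
assert kl_divergence(P, Q) != kl_divergence(Q, P)
```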
8.4.2 Example 2: Comparing Two Three-Outcome Distributions
Let the true distribution be P(A) = 0.2, P(B) = 0.5, P(C) = 0.3 and the approximation be
Q(A) = 0.4, Q(B) = 0.4, Q(C) = 0.2.
DKL(P||Q) = P(A) ln(P(A)/Q(A)) + P(B) ln(P(B)/Q(B)) + P(C) ln(P(C)/Q(C))
= 0.2 ln(0.2/0.4) + 0.5 ln(0.5/0.4) + 0.3 ln(0.3/0.2)
= 0.2 ln(0.5) + 0.5 ln(1.25) + 0.3 ln(1.5)
≈ 0.2(−0.6931) + 0.5(0.2231) + 0.3(0.4055)
≈ −0.13862 + 0.11155 + 0.12165 = 0.09458 nats
What this tells us: This positive value indicates that there is some information loss when using Q to
approximate P. Specifically, if we were to encode outcomes of P using a code optimized for Q, on average,
we would use about 0.09458 more nats than if we used a code optimized for P. The higher value compared
to the coin example reflects a greater difference between these two distributions. KL divergence provides
a single numerical value to quantify this discrepancy.