Introduction To Probability
1 Introduction to Probability
Probability is a numerical measure that quantifies the likelihood of an event occurring. It is expressed
as a number between 0 and 1, inclusive: 0 means the event cannot occur, and 1 means it is certain to occur.
1.1.1 Sample Space
The sample space (Ω) is the set of all possible outcomes of a random experiment.
1.1.2 Event
An event is any subset of the sample space (Ω). It’s a specific outcome or a collection of outcomes
that we are interested in, belonging to the event space (F).
1.1.3 Probability Axioms
These are fundamental rules that all probabilities must follow:
1. Range: The probability of any event A, denoted P (A), must be between 0 and 1. 0 ≤ P (A) ≤ 1
2. Total Probability: The probability of the entire sample space Ω (meaning something in the
sample space will happen) is 1. P (Ω) = 1
3. Additivity (for Disjoint Events): If two events, A and B, are disjoint (also known as mutually
exclusive, meaning they cannot happen at the same time and have no common outcomes, i.e.,
A ∩ B = ∅), then the probability of A or B happening is the sum of their individual probabilities.
P (A ∪ B) = P (A) + P (B)
• Example: When rolling a die, let A be "rolling an odd number" (A = {1, 3, 5}) and B be
"rolling a 2" (B = {2}). A and B are disjoint. P(A) = 3/6, P(B) = 1/6. The event "rolling
an odd number or a 2" is {1, 2, 3, 5}. Its probability is 4/6. Using the axiom: P(A ∪ B) =
P(A) + P(B) = 3/6 + 1/6 = 4/6. The results match.
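The additivity axiom can be checked by direct enumeration; a minimal sketch in Python using exact fractions:

```python
from fractions import Fraction

# Sample space for one roll of a fair die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

A = {1, 3, 5}          # "rolling an odd number"
B = {2}                # "rolling a 2"
assert A & B == set()  # A and B are disjoint

P_A = sum(p[o] for o in A)          # 3/6
P_B = sum(p[o] for o in B)          # 1/6
P_union = sum(p[o] for o in A | B)  # 4/6

# Additivity axiom holds for disjoint events.
assert P_union == P_A + P_B == Fraction(2, 3)
```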
2 Random Variables
A random variable (RV) is a function that assigns a numerical value to each outcome in the
sample space of a random experiment. Random variables are typically denoted by capital letters like X,
Y, Z.
• Properties of a PMF:
– 0 ≤ P (X = x) ≤ 1 for all values of x.
– Σ_x P(X = x) = 1 (the sum of the probabilities over all possible values must equal 1).
• Examples (of continuous random variables):
– Temperature (e.g., 28.63 °C)
– Height (e.g., 175.4 cm)
– Time (e.g., time it takes for a light bulb to burn out)
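Returning to the PMF properties above, a minimal check in Python for X = the number of heads in two fair coin tosses (the same variable used in the CDF example later in these notes):

```python
from fractions import Fraction
from itertools import product

# X = number of heads in two fair coin tosses.
outcomes = list(product("HT", repeat=2))  # 4 equally likely outcomes
pmf = {}
for outcome in outcomes:
    x = outcome.count("H")
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 4)

# Property 1: 0 <= P(X = x) <= 1 for every value x.
assert all(0 <= prob <= 1 for prob in pmf.values())
# Property 2: the probabilities sum to 1.
assert sum(pmf.values()) == 1
assert pmf == {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```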
2.2.1 Probability Density Function (PDF)
For continuous random variables, we use a Probability Density Function (PDF), denoted f (x).
Important distinction: For a continuous RV, the probability of it taking on any single exact value
is zero: P (X = x) = 0. This is because there are infinitely many possible values, so the chance of hitting
one specific value is infinitesimally small.
Instead, probabilities for continuous variables are found by calculating the area under the curve
of the PDF over a given interval.
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
• Properties of a PDF:
– f (x) ≥ 0 for all x (the probability density cannot be negative).
– ∫_{−∞}^{∞} f(x) dx = 1 (the total area under the entire PDF curve must equal 1).
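As a quick numerical check of these properties, a sketch using the example density f(x) = 2x on [0, 1] that appears in the CDF example below, with a simple midpoint-rule integrator:

```python
def pdf(x):
    # Example density from these notes: f(x) = 2x on [0, 1], 0 elsewhere.
    return 2 * x if 0 <= x <= 1 else 0.0

def integrate(f, a, b, n=100_000):
    # Simple midpoint-rule numerical integration.
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

total_area = integrate(pdf, 0.0, 1.0)
assert abs(total_area - 1.0) < 1e-6   # total area under the PDF is 1

prob = integrate(pdf, 0.2, 0.5)
assert abs(prob - 0.21) < 1e-6        # P(0.2 <= X <= 0.5) = 0.21
```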
• Properties of a CDF (F(x) = P(X ≤ x)):
– F(x) is non-decreasing in x.
– F(x) → 0 as x → −∞ and F(x) → 1 as x → ∞.
– For a < b, P(a < X ≤ b) = F(b) − F(a).
• Example: CDF for Discrete Variable (Toss two coins, X=number of heads)
PMF: P (0) = 1/4, P (1) = 1/2, P (2) = 1/4.
– F (x) = 0 for x < 0
– F (0) = P (X ≤ 0) = P (X = 0) = 1/4
– F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 1/4 + 1/2 = 3/4
– F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2) = 1/4 + 1/2 + 1/4 = 1
– F (x) = 1 for x ≥ 2
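The two-coin CDF above can be computed directly from the PMF by accumulating mass; a small sketch:

```python
from fractions import Fraction

# PMF for X = number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    # F(x) = P(X <= x): accumulate PMF mass at or below x.
    return sum(p for value, p in pmf.items() if value <= x)

assert cdf(-1) == 0
assert cdf(0) == Fraction(1, 4)
assert cdf(1) == Fraction(3, 4)
assert cdf(2) == 1
assert cdf(5) == 1   # F stays at 1 beyond the largest value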
• Example: CDF for Continuous Variable (from PDF f(x) = 2x for 0 ≤ x ≤ 1)
For 0 ≤ x ≤ 1: F(x) = ∫_0^x 2t dt = [t²]_0^x = x². So, F(x) = x² for 0 ≤ x ≤ 1, F(x) = 0 for x < 0,
and F(x) = 1 for x > 1.
– Using CDF to find P (0.2 ≤ X ≤ 0.5): P (0.2 ≤ X ≤ 0.5) = F (0.5)−F (0.2) = (0.5)2 −(0.2)2 =
0.25 − 0.04 = 0.21, matching the previous result.
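This CDF calculation can be reproduced in a few lines; a minimal sketch:

```python
def F(x):
    # CDF obtained by integrating f(t) = 2t from 0 to x: F(x) = x^2 on [0, 1].
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x ** 2

# P(0.2 <= X <= 0.5) = F(0.5) - F(0.2) = 0.25 - 0.04 = 0.21
prob = F(0.5) - F(0.2)
assert abs(prob - 0.21) < 1e-12
```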
3 Measures of Distribution
These are measures that describe the central tendency and spread of a random variable.
• What it tells us: The expectation represents the ”long-run average” value if you were to repeat
the random experiment infinitely many times. It’s the ”central tendency” or ”balancing point”
of the distribution. For a fair die, an expectation of 3.5 means that, over many rolls, the average
outcome will tend towards 3.5, even though 3.5 is not an actual outcome. It’s the best single value
to summarize the distribution’s typical outcome.
What this tells us: If you were to repeatedly draw samples from this distribution, the average of
those samples would tend towards 2/3.
First, calculate E[X²]:
E[X²] = 1²·(1/6) + 2²·(1/6) + 3²·(1/6) + 4²·(1/6) + 5²·(1/6) + 6²·(1/6)
E[X²] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6 ≈ 15.1667
Now, calculate the variance: Var(X) = E[X²] − (E[X])² = 91/6 − (3.5)² = 35/12 ≈ 2.9167
Standard Deviation: σ = √(35/12) ≈ 1.7078
What this tells us: On average, the outcomes of a fair die roll deviate from the mean (3.5) by about 1.71 units.
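The whole calculation for the fair die can be verified with exact arithmetic; a minimal sketch:

```python
from fractions import Fraction
import math

values = range(1, 7)
p = Fraction(1, 6)   # each face of a fair die

E_X = sum(x * p for x in values)        # expectation: 7/2 = 3.5
E_X2 = sum(x * x * p for x in values)   # second moment: 91/6
var = E_X2 - E_X ** 2                   # Var(X) = E[X^2] - (E[X])^2 = 35/12
sigma = math.sqrt(var)                  # standard deviation

assert E_X == Fraction(7, 2)
assert E_X2 == Fraction(91, 6)
assert var == Fraction(35, 12)
assert abs(sigma - 1.7078) < 1e-3
```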
4 Key Probability Distributions
4.1 Univariate (Single) Gaussian (Normal) Distribution
The Univariate Gaussian distribution, also known as the Normal distribution, is the most im-
portant and widely used continuous probability distribution for a single random variable. It’s often
described as the ”bell curve” due to its characteristic shape.
4.1.1 Definition:
The PDF of a normal distribution for a single variable X is given by:
f(x; µ, σ²) = (1 / √(2πσ²)) · e^(−(x−µ)² / (2σ²))
Where:
• µ (mu) is the mean of the distribution, which determines its central location and the peak of the
bell curve.
• σ² (sigma squared) is the variance, which determines the spread of the curve: a larger σ² gives a
wider, flatter bell.
4.1.2 Properties:
• Bell-shaped and Symmetric: The curve is symmetrical around its mean µ.
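As a sanity check on the definition and the symmetry property, a small sketch (the standard normal, µ = 0 and σ² = 1, is an assumed example):

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    # Univariate Gaussian density f(x; mu, sigma^2).
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Symmetric about the mean: f(mu + d) == f(mu - d).
assert math.isclose(normal_pdf(1.3), normal_pdf(-1.3))
# The peak is at x = mu, with height 1 / sqrt(2*pi*sigma^2).
assert math.isclose(normal_pdf(0.0), 1 / math.sqrt(2 * math.pi))
assert normal_pdf(0.5) < normal_pdf(0.0)
```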
– What the covariance matrix tells us:
∗ Diagonal elements: Indicate the spread of each individual variable along its own axis.
∗ Off-diagonal elements (Covariance): Quantify how two variables vary together.
· Positive Covariance: Implies that if one variable increases, the other tends to
increase as well (e.g., height and weight). The elliptical contours of the distribution
would be stretched along the direction where both variables increase.
· Negative Covariance: Implies that if one variable increases, the other tends to
decrease (e.g., hours spent studying and hours spent playing video games, if they are
inversely related). The elliptical contours would be stretched along the anti-diagonal.
· Zero Covariance: Implies no linear relationship between the two variables. For
Gaussian distributions, zero covariance implies independence. The elliptical contours
would be aligned with the coordinate axes.
∗ Overall, the covariance matrix Σ defines the shape and orientation of the multi-dimensional
”bell” (an ellipsoid), showing how the variables are correlated and spread out together in
the multi-dimensional space.
• |Σ| is the determinant of the covariance matrix.
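A minimal sketch of estimating a 2×2 covariance matrix from positively correlated samples (the noise model y = x + ε here is an assumption chosen for illustration, standing in for a height/weight-style relationship):

```python
import random

random.seed(0)
n = 50_000

# Generate positively correlated pairs: y = x + noise.
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    xs.append(x)
    ys.append(x + random.gauss(0, 0.5))

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # Sample covariance of two equal-length lists.
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

Sigma = [[cov(xs, xs), cov(xs, ys)],
         [cov(ys, xs), cov(ys, ys)]]

# Off-diagonal element > 0: the variables tend to rise together.
assert Sigma[0][1] > 0
# |Sigma| > 0 for a valid, non-degenerate covariance matrix.
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
assert det > 0
```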
5 Multiple Random Variables and Their Relationships
These concepts explain how different random variables interact and how their probabilities are related.
What this tells us: There is a 12% chance that X will be between 0.2 and 0.5 and Y will be
between 0.3 and 0.7 at the same time. This shows the joint likelihood of two events occurring
in specific continuous ranges.
• For continuous variables: f(x) = ∫_{−∞}^{∞} f(x, y) dy
• What it tells us: The marginal probability (or marginal PDF/PMF) gives the individual prob-
ability distribution for one variable, effectively ignoring or ”averaging out” the influence of other
variables it might be jointly distributed with. It allows us to analyze each variable’s behavior
independently of the others.
• Example (Discrete): Weather and Umbrella (continued)
If we also knew P (X=Sunny, Y=Yes) = 0.2 and P (X=Sunny, Y=No) = 0.2. P (X=Rainy) =
P (X=Rainy, Y=Yes) + P (X=Rainy, Y=No) = 0.5 + 0.1 = 0.6 What this tells us: There is an
overall 60% chance of it being rainy, regardless of whether someone carries an umbrella or not.
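Marginalization is just a sum over the other variable; a sketch using the joint PMF values from the weather/umbrella example above:

```python
# Joint PMF from the weather/umbrella example in these notes.
joint = {
    ("Rainy", "Yes"): 0.5,
    ("Rainy", "No"):  0.1,
    ("Sunny", "Yes"): 0.2,
    ("Sunny", "No"):  0.2,
}

def marginal_x(x):
    # Marginal of X: sum the joint over all values of Y.
    return sum(p for (xi, _), p in joint.items() if xi == x)

assert abs(marginal_x("Rainy") - 0.6) < 1e-12
assert abs(marginal_x("Sunny") - 0.4) < 1e-12
assert abs(sum(joint.values()) - 1.0) < 1e-12   # joint PMF sums to 1
```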
• What it tells us: The conditional probability (or conditional PDF/PMF) allows us to update our
beliefs about one variable’s likelihood after observing the value of another. It reveals dependencies
between variables: if knowing one variable’s value changes the probability of another, they are
dependent.
• Example (Discrete): Weather and Umbrella (continued)
Find the probability that it's Rainy, given that someone has an Umbrella (Yes):
P(X=Rainy | Y=Yes) = P(X=Rainy, Y=Yes) / P(Y=Yes)
Using our previous examples, P(Y=Yes) = P(Rainy, Yes) + P(Sunny, Yes) = 0.5 + 0.2 = 0.7, so
P(X=Rainy | Y=Yes) = 0.5/0.7 ≈ 0.714. What this tells us: If you know someone has
an umbrella, there's about a 71.4% chance it's rainy. This is higher than the overall 60% chance
of rain, indicating a positive association between rain and carrying an umbrella.
• Example (Continuous): Using the Uniform Joint PDF (continued)
Using f(x, y) = 1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and we found f(y) = 1 for 0 ≤ y ≤ 1.
– Find the conditional PDF for X given Y=y: f(x|y) = f(x, y) / f(y) = 1/1 = 1, for
0 ≤ x ≤ 1 (for any specific y ∈ [0, 1]); f(x|y) = 0 otherwise.
What this tells us: In this specific uniform example, knowing the value of Y (as long as it’s within
its range) does not change the distribution of X. X is still uniformly distributed between 0 and 1.
This is a characteristic of independent variables. If X and Y were dependent, f (x|y) would be a
function of y, meaning Y ’s value would indeed influence X’s distribution.
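The discrete conditional-probability calculation from the weather/umbrella example can be sketched directly from the joint PMF:

```python
# Joint PMF from the weather/umbrella example in these notes.
joint = {
    ("Rainy", "Yes"): 0.5,
    ("Rainy", "No"):  0.1,
    ("Sunny", "Yes"): 0.2,
    ("Sunny", "No"):  0.2,
}

def p_y(y):
    # Marginal P(Y = y), summing out the weather variable.
    return sum(p for (_, yi), p in joint.items() if yi == y)

def p_x_given_y(x, y):
    # Conditional probability: P(X = x | Y = y) = P(x, y) / P(y).
    return joint[(x, y)] / p_y(y)

assert abs(p_y("Yes") - 0.7) < 1e-12
assert abs(p_x_given_y("Rainy", "Yes") - 0.714) < 1e-3   # 0.5 / 0.7
```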
5.4 Independence
Two events A and B (or two random variables X and Y) are considered independent if the occurrence
of one does not affect the probability of the other.
• What it tells us: Independence signifies a lack of predictive power between variables. Knowing
the value of one variable gives you no additional information about the likelihood of the other.
• Example: Tossing 2 coins.
Let X be the outcome of the first toss (Heads/Tails) and Y be the outcome of the second toss.
The outcome of the first toss is independent of the outcome of the second toss. If P (HH) = 0.25,
P (HT) = 0.25, etc., and P (X = H) = 0.5, P (Y = H) = 0.5. Since P (X = H, Y = H) = 0.25 and
P (X = H) · P (Y = H) = 0.5 · 0.5 = 0.25, they are independent. What this tells us: Knowing the
result of the first coin toss gives you no information that changes your prediction for the second
coin toss. The two events do not influence each other’s probabilities.
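The independence condition P(X = x, Y = y) = P(X = x) · P(Y = y) from the two-coin example can be checked over every outcome pair:

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two fair coin tosses: all four outcomes equally likely.
joint = {(x, y): Fraction(1, 4) for x, y in product("HT", repeat=2)}

def marginal_first(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_second(y):
    return sum(p for (_, yi), p in joint.items() if yi == y)

# Independence: P(X = x, Y = y) == P(X = x) * P(Y = y) for every pair.
assert all(joint[(x, y)] == marginal_first(x) * marginal_second(y)
           for x, y in joint)
```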
6 Bayesian Networks: Modeling Conditional Dependencies
Bayesian Networks (BNs), also known as Bayes nets or belief networks, are powerful probabilistic
graphical models that represent a set of random variables and their conditional dependencies via
a Directed Acyclic Graph (DAG). They provide a structured way to represent and reason about
uncertainty in complex systems.
1. Nodes: Each node in the graph represents a random variable.
2. Directed Edges: Arrows connect nodes to indicate a direct probabilistic influence or causal
relationship. An arrow from A to B means A is a "parent" of B, and B is a "child" of A, implying
that B's probability distribution directly depends on A.
3. No Cycles (Acyclic): The graph must not contain any directed cycles (you cannot start at a
node and follow arrows to return to the same node). This ensures that probabilities are well-defined
and that cause-and-effect relationships don’t loop back on themselves.
4. Conditional Probability Distributions (CPDs) / Tables (CPTs): Each node has an asso-
ciated CPD (or CPT for discrete variables) that quantifies the influence of its parents.
• For a node with no parents (a ”root” node), its CPD is simply its prior probability distribution.
• For a node with parents, its CPD defines the probability of that node’s value given the values
of its parents.
• Rain (R): True/False
• Wet Grass (W): True/False
Structure (DAG):
• Rain → Wet Grass
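The Rain → Wet Grass network can be sketched with two CPTs. The notes give the structure but not the numbers, so the probability values below are hypothetical, chosen only to illustrate how the joint factorizes along the DAG:

```python
# Hypothetical CPTs (values are assumptions for illustration).
p_rain = {True: 0.3, False: 0.7}            # prior P(R) for the root node
p_wet_given_rain = {True: 0.9, False: 0.1}  # P(W = True | R)

def joint(r, w):
    # Joint factorizes along the DAG: P(R, W) = P(R) * P(W | R).
    pw = p_wet_given_rain[r]
    return p_rain[r] * (pw if w else 1 - pw)

# The factorized joint is a valid distribution.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
assert abs(total - 1.0) < 1e-12

# Marginal P(W = True), summing out R.
p_wet = joint(True, True) + joint(False, True)
assert abs(p_wet - (0.3 * 0.9 + 0.7 * 0.1)) < 1e-12
```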
7 Bayes’ Theorem
Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions that might
be related to the event. It’s a cornerstone of probabilistic inference, allowing us to update our beliefs
based on new evidence. It is frequently applied within Bayesian Networks for various inference tasks,
particularly diagnostic inference.
The formula for Bayes’ Theorem is:
P(A|B) = P(B|A) · P(A) / P(B)
Where:
• P (A|B) is the posterior probability: the probability of event A occurring given that event B
has occurred. This is often what we want to find – our updated belief about A after observing B.
• P (B|A) is the likelihood: the probability of event B occurring given that event A has occurred.
This tells us how likely the observed evidence B is, assuming our hypothesis A is true.
• P (A) is the prior probability: the initial probability of event A occurring before considering any
information about B. This is our initial belief or knowledge about A.
• P (B) is the evidence or marginal probability of B: the total probability of event B occurring,
considering all possible scenarios. This acts as a normalizing constant. It can be calculated using
the law of total probability: P (B) = P (B|A)P (A) + P (B|not A)P (not A) (for two complementary
events A and not A).
What it tells us: Bayes’ Theorem provides a formal way to reverse conditional probabilities and update
our confidence in a hypothesis (A) when new evidence (B) becomes available. It shows how our initial
belief (P (A)) is weighted by how well the evidence supports the hypothesis (P (B|A)) relative to how
likely the evidence is overall (P (B)). It’s fundamental for inference and learning from data.
Now, apply Bayes' Theorem:
P(D|T+) = P(T+|D) · P(D) / P(T+)
P(D|T+) = 0.00099 / 0.05094 ≈ 0.0194
What this tells us: Even with a positive test, the probability of actually having the disease is only about
1.94%! This illustrates the power of Bayes’ Theorem: it shows how the very low prior probability of the
disease (P (D) = 0.001) significantly tempers the impact of the positive test result, given the relatively
high false positive rate (5%) in the larger healthy population. This kind of analysis is crucial in fields
like medicine and other areas requiring probabilistic reasoning.
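The disease-testing calculation above can be reproduced in a few lines, using the numbers stated in these notes (prior 0.001, sensitivity 0.99, false positive rate 0.05):

```python
# Disease-testing numbers from the notes.
p_disease = 0.001          # prior P(D)
p_pos_given_d = 0.99       # sensitivity P(T+ | D)
p_pos_given_not_d = 0.05   # false positive rate P(T+ | not D)

# Evidence P(T+) via the law of total probability.
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Bayes' theorem: P(D | T+) = P(T+ | D) * P(D) / P(T+).
p_d_given_pos = p_pos_given_d * p_disease / p_pos

assert abs(p_pos - 0.05094) < 1e-12
assert abs(p_d_given_pos - 0.0194) < 1e-3   # only ~1.94% despite the positive test
```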
8 KL Divergence (Kullback-Leibler Divergence)
KL Divergence, also known as relative entropy, is a non-symmetric measure of the difference be-
tween two probability distributions. It quantifies how much information is lost when a probability
distribution Q is used to approximate another probability distribution P . In simpler terms, KL Diver-
gence tells us how ”different” or ”distant” one probability distribution Q is from a reference
probability distribution P .
It is commonly used in various fields, including statistics, information theory, and machine learning,
to measure how similar one probability distribution is to another, often when optimizing models to learn
target distributions.
8.1 Formula:
For discrete probability distributions P and Q with PMFs P(x) and Q(x):
DKL(P||Q) = Σ_x P(x) log(P(x)/Q(x))
For continuous probability distributions P and Q with PDFs p(x) and q(x):
DKL(P||Q) = ∫_{−∞}^{∞} p(x) log(p(x)/q(x)) dx
Note: The logarithm can be in any base. Using base 2 gives units in ”bits”, while using the natural
logarithm (base e) gives units in ”nats”. The choice of base only scales the result, not its fundamental
meaning.
Interpretation: DKL (P ||Q) represents the extra bits (or nats) required to encode samples from the
true distribution P when using an encoding scheme optimized for the approximating distribution Q,
compared to using an optimal encoding scheme for P itself. A higher DKL (P ||Q) means that Q is a
poorer approximation of P , and using Q’s encoding would be more inefficient, leading to more ”surprise”
or information loss when observing actual events from P .
• Asymmetry Explained:
– If P (x) > 0 for some x where Q(x) = 0, then DKL (P ||Q) will be infinite. This is
because if the approximating distribution Q assigns zero probability to an event that the
true distribution P considers possible, the ”cost” of encoding that event using Q’s scheme
becomes infinitely large. This means Q must cover all possibilities that P covers.
– Conversely, if Q(x) > 0 for some x where P (x) = 0, that specific term in the sum/integral
contributes 0 to DKL (P ||Q). In this case, Q might assign probability to events that P
never produces; this is penalized less severely in DKL (P ||Q) than P assigning probability
to events that Q misses.
• This asymmetry is important in machine learning. For instance, in Maximum Likelihood
Estimation (MLE), we often minimize DKL (Pdata ||Pmodel ), which means the model tries to
cover all data modes. In Variational Inference (VI), we often minimize DKL (Qapprox ||Ptrue ),
which implies that the approximation Q should not assign probability mass where the true
posterior P has none (it will tend to be narrower than P ).
8.4 Examples:
8.4.1 Example 1: Comparing Two Coins
Let’s consider two binary distributions (coin tosses):
• Distribution P (True Coin): A biased coin that lands Heads (H) with probability 0.7 and Tails
(T) with probability 0.3.
– P (H) = 0.7
– P (T ) = 0.3
• Distribution Q (Approximation Coin): A fair coin.
– Q(H) = 0.5
– Q(T ) = 0.5
Let’s calculate DKL (P ||Q) using natural logarithms.
DKL(P||Q) = P(H) ln(P(H)/Q(H)) + P(T) ln(P(T)/Q(T))
= 0.7 ln(0.7/0.5) + 0.3 ln(0.3/0.5)
= 0.7 ln(1.4) + 0.3 ln(0.6)
≈ 0.7(0.33647) + 0.3(−0.51083)
≈ 0.23553 − 0.15325 = 0.08228 nats
Now the reverse direction:
DKL(Q||P) = Q(H) ln(Q(H)/P(H)) + Q(T) ln(Q(T)/P(T))
= 0.5 ln(0.5/0.7) + 0.5 ln(0.5/0.3)
= 0.5 ln(0.71429) + 0.5 ln(1.66667)
≈ 0.5(−0.33647) + 0.5(0.51083)
≈ −0.16824 + 0.25542 = 0.08718 nats
• DKL (Q||P ) ≈ 0.08718 nats. This is the extra information we’d expect to need to represent out-
comes from the fair coin (Q) if we assumed they came from the biased coin (P).
• Notice that DKL(P||Q) ≠ DKL(Q||P), confirming the non-symmetric property. The values are
close in this simple case because the distributions aren't dramatically different, but the principle
holds.
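Both directions of the coin calculation can be verified with a small discrete-KL helper:

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * ln(P(x) / Q(x)), in nats.
    # Terms with P(x) = 0 contribute 0 by convention.
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

P = {"H": 0.7, "T": 0.3}   # biased coin (true distribution)
Q = {"H": 0.5, "T": 0.5}   # fair coin (approximation)

assert abs(kl_divergence(P, Q) - 0.08228) < 1e-4
assert abs(kl_divergence(Q, P) - 0.08718) < 1e-4
# Non-symmetric: the two directions differ.
assert kl_divergence(P, Q) != kl_divergence(Q, P)
```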
8.4.2 Example 2: Comparing Two Three-Outcome Distributions
Let the true distribution be P(A) = 0.2, P(B) = 0.5, P(C) = 0.3 and the approximation be
Q(A) = 0.4, Q(B) = 0.4, Q(C) = 0.2.
DKL(P||Q) = P(A) ln(P(A)/Q(A)) + P(B) ln(P(B)/Q(B)) + P(C) ln(P(C)/Q(C))
= 0.2 ln(0.2/0.4) + 0.5 ln(0.5/0.4) + 0.3 ln(0.3/0.2)
= 0.2 ln(0.5) + 0.5 ln(1.25) + 0.3 ln(1.5)
≈ 0.2(−0.6931) + 0.5(0.2231) + 0.3(0.4055)
≈ −0.13862 + 0.11155 + 0.12165 = 0.09458 nats
What this tells us: This positive value indicates that there is some information loss when using Q to
approximate P. Specifically, if we were to encode outcomes of P using a code optimized for Q, on average,
we would use about 0.09458 more nats than if we used a code optimized for P. The higher value compared
to the coin example reflects a greater difference between these two distributions. KL divergence provides
a single numerical value to quantify this discrepancy.