
Intro to Data Science

Created: January 6, 2025 2:11 PM


Data Representations — CS328-2022 Notes
Slides - Google Drive

Markov Inequality
For a non-negative random variable X

P (X ≥ a) ≤ E[X]/a ∀a > 0
• Proof


E[X] = Σ_x P(x)·x = Σ_{x<a} P(x)·x + Σ_{x≥a} P(x)·x ≥ 0 + Σ_{x≥a} P(x)·a = a·P(X ≥ a) =⇒ P(X ≥ a) ≤ E[X]/a

Chebyshev’s Inequality
P(|X − µ| > a) ≤ σ²/a²
• Proof

σ² = Σ_x P(x)(x−µ)² = Σ_{x∈(µ−a, µ+a)} P(x)(x−µ)² + Σ_{|x−µ|≥a} P(x)(x−µ)² ≥ Σ_{|x−µ|≥a} P(x)·a² = a²·P(|X−µ| ≥ a) =⇒ P(|X−µ| ≥ a) ≤ σ²/a²

Moment generating function


For a random variable X , define

M_X(t) = E[e^{tX}] = E[Σ_n t^n X^n / n!] = Σ_n (E[X^n]/n!) t^n

Then, it’s easy to see that

E[X^n] = (d^n/dt^n) M_X(t) |_{t=0}

Thus, MX generates the moments E[X n ] .


From this, we can also define the characteristic function as

φ_X(ω) = E[e^{iωX}] = M_X(iω)

Essentially, the moment generating function is a Laplace transform of the
distribution function P , i.e.

M_X(t) = L[P](−t), where L denotes the (two-sided) Laplace transform

Chernoff Bound for binomial distribution


Suppose X_i ∼ Bernoulli(p) are i.i.d. random variables. Then, obviously

E[e^{tX_i}] = pe^t + (1 − p) ,   µ_i := E[X_i] = p

Define

S = Σ_i X_i

Then, we have

µ = E[S] = np ,   E[e^{tS}] = E[Π_i e^{tX_i}] = Π_i E[e^{tX_i}] = (pe^t + 1 − p)^n   (by independence)

Then, using Markov’s inequality

P(e^{tS} > e^a) ≤ E[e^{tS}]/e^a = (pe^t + 1 − p)^n / e^a

But etS > ea ⇐⇒ tS > a . Thus, we have

P(tS > a) ≤ (1 + p(e^t − 1))^n / e^a

Also note that 1 + y ≤ e^y for all y.


Thus, we can write

P(tS > a) ≤ e^{np(e^t − 1) − a}

Now, we can set a := np(1 + d)t , giving us

P(S > np(1 + d)) ≤ e^{np(e^t − 1) − np(1+d)t} = e^{np(e^t − 1 − t − td)}

Thus, what we have is that for all t > 0 , we have

P(S/µ − 1 > d) ≤ exp(µ(e^t − 1 − t − td))

So, let's make the bound as tight as we can. How? Take the derivative with respect to t and set it to zero, giving us
e^t − 1 − d = 0 =⇒ t = ln(1 + d)
Thus, we have

P(S/µ − 1 > d) ≤ exp(µ(d − (1 + d) ln(1 + d))) = (e^d / (1 + d)^{(1+d)})^µ

But, you can use the fact that ln(1 + d) ≥ d/(1 + d/2) , and thus,

P(S/µ − 1 > d) ≤ exp(−µ d²/(2 + d))

So, finally, we have this inequality:

P((S − µ)/µ > d) ≤ [e^d / (1 + d)^{(1+d)}]^µ < exp(−µ d²/(2 + d))

This is the upper tail bound. There is a similar lower tail bound.


All in all, if you compile everything, what you get is that for a binomial variable
X with mean µ:

P(X ≥ (1+δ)µ) ≤ e^{−δ²µ/(2+δ)}
P(X ≤ (1−δ)µ) ≤ e^{−δ²µ/2}
P(|X−µ| ≥ δµ) ≤ 2e^{−δ²µ/3}
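A small numeric sanity check of the upper-tail bound (a sketch; the parameters n, p, δ and the sample count are arbitrary choices, not anything from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta = 1000, 0.3, 0.2                          # arbitrary illustration parameters
mu = n * p

samples = rng.binomial(n, p, size=200_000)
empirical = np.mean(samples >= (1 + delta) * mu)      # empirical P(X >= (1+delta)µ)
chernoff = np.exp(-delta**2 * mu / (2 + delta))       # upper-tail Chernoff bound

print(f"empirical tail = {empirical:.4f}, Chernoff bound = {chernoff:.4f}")
# The bound should always be at least as large as the empirical tail.
```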

Chernoff bound
For any random variable X , the random variable Y = e^{tX} is positive. Thus, we can use
Markov’s inequality and say

E[Y] ≥ e^{ta} P[Y ≥ e^{ta}] =⇒ P[e^{tX} ≥ e^{ta}] ≤ E[e^{tX}] e^{−ta} ∀t

For t > 0 , the LHS becomes P[X ≥ a] . Similarly, for t < 0 , it becomes P[X ≤ a] .
We can make this inequality as tight as possible by finding the minimum of the
RHS. Thus, we have

P[X ≥ a] ≤ min_{t>0} M_X(t)e^{−ta} ,   P[X ≤ a] ≤ min_{t<0} M_X(t)e^{−ta}

Clustering
• Unsupervised learning (no labels)
• Detects patterns in images, outliers in data, etc.
• Can be used to combine results of parts of data processed on different
machines
– Split X into X1 , X2 , . . . .
– Use some complicated unsupervised model on Xi , on machine Mi to
get results Yi .
– Combine Y1 , Y2 . . . using clustering
• Can be used for exploratory data analysis (similar to simple scatter plots)
when you don’t know what you are looking for in the data

Distance function
• Clustering needs a notion of distance between data points . These data
points could be vectors, nodes in graph, etc.
• Lp norm for vectors

||x||_p = [ Σ_i |x_i|^p ]^{1/p}

• L0 norm

||x||_0 = |{i : x_i ≠ 0}|

• L∞ norm

||x||_∞ = max_{i∈[n]} |x_i|

• Norms give us an easy way to define distance


For norm N : D → R , the distance between x, y ∈ D is

d(x, y) = N (x − y)

• Cosine distance

d(x, y) = 1 − (x·y)/(||x|| ||y||)

1 − d(x, y) is called the cosine similarity

4
• Jaccard distance for sets
The Jaccard similarity is given by

J(A, B) = |A ∩ B| / |A ∪ B|

The distance is

d(A, B) = 1 − J(A, B)

• Edit distance for strings


• Kullback-Leibler (KL) divergence for probability distribution

D_KL(P||Q) = Σ_x P(x) ln(P(x)/Q(x)) = −E_P[ln(Q/P)] = H(P, Q) − H(P)

Thus,

H(P, Q) = H(P ) + DKL (P ||Q)

It’s easy to see that

−D(P||Q) = E_P[ln(Q/P)] ≤ E_P[Q/P − 1] = Σ_x P·(Q/P − 1) = Σ_x (Q − P) = 1 − 1 = 0 =⇒ D(P||Q) ≥ 0
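A minimal numpy sketch of these distance functions (the KL sketch assumes both distributions are strictly positive and normalised, which is an assumption of the sketch, not of the notes):

```python
import numpy as np

def lp_distance(x, y, p=2):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def linf_distance(x, y):
    return np.max(np.abs(x - y))

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_distance(A, B):
    A, B = set(A), set(B)
    return 1.0 - len(A & B) / len(A | B)

def kl_divergence(P, Q):
    # D_KL(P || Q); assumes P, Q are strictly positive and sum to 1
    return float(np.sum(P * np.log(P / Q)))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(lp_distance(x, y), linf_distance(x, y), cosine_distance(x, y))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}),
      kl_divergence(np.array([0.5, 0.5]), np.array([0.9, 0.1])))
```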

Objective Functions
• radius of cluster
• Inertia
• Intra-cluster distance
For a cluster Ci , the value ∆i is given by one of these definitions depending
on the context
1. Diameter of cluster

∆_i = max_{x,y∈C_i} d(x, y)

2. Average distance between points

∆_i = (1 / (|C_i| choose 2)) Σ_{{x,y}⊆C_i} d(x, y)

3. Average distance of points to mean

µ_i = (Σ_{x∈C_i} x)/|C_i| ,   ∆_i = (Σ_{x∈C_i} d(x, µ_i))/|C_i|

• Dunn’s Index
Suppose the inter-cluster distance δ(Ci , Cj ) is also defined for us, depending
on the context.. then Dunn’s index is given by

DI = min_{i≠j} δ(C_i, C_j) / max_i ∆_i

• Hierarchical clustering doesn’t have an objective function

K-center
• ∆i = maxx∈Ci d(x, ci ) where Ci ⊆ X is the cluster corresponding to center
ci
• Objective function is max ∆i
• Clearly, for minimum, we must have x ∈ Ci if and only if i = argmini d(x, ci )
.
• We want to find the optimum placements of the k centers in that case.
• NP hard
• The partitioning of data points based on nearest cluster center (for k-center,
k-means, hierarchical clustering, whatever) is called Voronoi partitioning
• 2-approximation algorithm (a runnable sketch appears at the end of this section)
– Initialise c_1 randomly
– d_x := d(x, c_1) for every x
– for i in 2 . . . k :
∗ c_i := argmax_x d_x
∗ d_x := min(d_x, d(x, c_i)) for every x
• Proof of it providing a 2-approximation
First, some notation..
Let Si be the set of clusters, as created by the algorithm, on the i-th
iteration.

d(x, S_i) = min_{c∈S_i} d(x, c) ,   D_i = max_x d(x, S_i)

Also, define Si∗ as the optimal set of clusters, and Di∗ as the optimal
objective function.
– Case 1 : Exactly one c ∈ Sk per cluster C ∗ with center c∗ ∈ Sk∗ .
Take any x ∈ X .
Let the center it is assigned to in the optimal solution be c*.
Let c be the (unique) center in S_k that lies in the optimal cluster associated with c*.
So now

d(x, c*) ≤ D_k* ,   d(c, c*) ≤ D_k*

By triangle inequality

d(x, c) ≤ 2Dk∗

But, clearly

d(x, Sk ) ≤ d(x, c)

Thus, for EVERY x , we have

d(x, Sk ) ≤ 2Dk∗

So, we can write

D_k = max_x d(x, S_k) ≤ 2D_k*

– Case 2 : at least two centers c_j, c_i ∈ S_k are assigned the same optimal center c* ∈ S_k*.
WLOG assume that c_i is added in the i-th iteration and c_j was added before that.
Thus, cj ∈ Si−1 . Thus..

d(ci , Si−1 ) ≤ d(ci , cj )

But we know that Di−1 = d(ci , Si−1 ) .

Thus,

Di−1 ≤ d(ci , cj )

But then by triangle inequality,

d(ci , cj ) ≤ d(ci , c∗ ) + d(cj , c∗ ) ≤ 2Dk∗

Thus,

Di−1 ≤ 2Dk∗

This suggests that on the i th iteration itself , which is the first time
that we have 2 centers falling in the same optimal cluster, we have
already minimised the objective below twice the optimum.
And obviously Dk ≤ Di−1 . Thus,

Dk ≤ 2Dk∗
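A minimal numpy sketch of the greedy farthest-point algorithm above (Euclidean distance and a random first center are assumptions of the sketch):

```python
import numpy as np

def greedy_k_center(X, k, seed=0):
    """Farthest-point greedy: returns indices of k centers (a 2-approximation for k-center)."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]                  # c1 chosen randomly
    d = np.linalg.norm(X - X[centers[0]], axis=1)          # d_x = d(x, c1)
    for _ in range(1, k):
        ci = int(np.argmax(d))                             # farthest point from current centers
        centers.append(ci)
        d = np.minimum(d, np.linalg.norm(X - X[ci], axis=1))
    return centers, float(d.max())                         # centers and objective max_x d(x, S_k)

X = np.random.default_rng(1).normal(size=(500, 2))
centers, radius = greedy_k_center(X, k=5)
print(centers, radius)
```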

K-means
• Problem with k-center
K-center is susceptible to outliers. It WILL make the outliers their own clusters,
even in optimal clustering. This is something we don’t want.
• Objective function
Given k clusters C1 , C2 ... and their respective centers µ1 , µ2 . . . , we define

∆_i = Σ_{x∈C_i} ||x − µ_i||₂²

as the intra-cluster distance and the loss function as :

L = Σ_{i∈[k]} ∆_i

• Lloyd’s algorithm (a numpy sketch appears after the k-means++ discussion below)
– Initialise (µ_i)_{i∈[k]} somehow
– for t in 1..N :
∗ D ∈ R^{n×k}, with D[i, j] := d(x_i, µ_j)
∗ c ∈ R^n, with c[i] := min_{j∈[k]} D[i, j]  (distance to the nearest center)
∗ I := (D == c[:, new])  (one-hot assignment of each point to its nearest center)
∗ µ_j := sum(I[:, j : j + 1] ∗ X) / sum(I[:, j : j + 1])
• Initialisation
– choose randomly, with uniform probability (fails badly)
– use k-centers (again..outliers)
– k-means ++ (a combination of above 2, but works)
• K-means ++ initialisation
– choose c1 randomly
– C1 = {c1 }
– for i in 2 . . . k:
∗ S := Σ_{x∈X} d(x, C_{i−1})^q
∗ p_x := d(x, C_{i−1})^q / S
∗ choose a point c_i ∈ X randomly, with probability p_x for each point x ∈ X
∗ C_i := C_{i−1} ∪ {c_i}
Usually, q = 2 is used.
With this initialisation, k-means gives an O(log k)-approximation in expectation.
Moreover, if the data is “nicely clusterable” , an alteration gives us an
O(1) approximation. That is, you just have to initialise, and it’s already
an acceptable clustering.
There are also hybrid methods .. but k-means ++ still wins.
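A minimal numpy sketch of k-means++ seeding followed by Lloyd's iterations as described above (squared Euclidean distance, q = 2, a fixed iteration count, and no handling of empty clusters are assumptions of the sketch):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        p = d2 / d2.sum()                                   # p_x ∝ d(x, C)^2
        centers.append(X[rng.choice(len(X), p=p)])
    return np.array(centers)

def lloyd(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = kmeans_pp_init(X, k, rng)
    for _ in range(iters):
        D = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # D[i, j] = d(x_i, µ_j)
        labels = D.argmin(axis=1)                                     # Voronoi assignment
        mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # centroid update
    loss = np.sum((X - mu[labels]) ** 2)
    return mu, labels, loss

X = np.random.default_rng(2).normal(size=(300, 2))
mu, labels, loss = lloyd(X, k=3)
print(loss)
```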
• Parallelizing k-means
As you saw, the algorithm is O(nkN ) .. which isn’t nice if any of these
quantities are big. So, for large data X (which would also need a large k
for correct clustering) , we partition it into X1 , X2 . . . Xp
– Compute µi,j for each Xi and j ∈ [k ′ ] and wi,j = |Ci,j |
– run k-means on {µi,j |i ∈ [p], j ∈ [k ′ ]} with weight wi,j
Note : the weights aren’t all that problematic really. You just have to define
the D matrix as D[i, j] = wi ·d(xi , µj ) , and I := (D == c[:, new])∗w[:, new]
.
This gives us an 4γ(1 + β) + 2β = O(γβ) approximation, where β is the
approximation factor for sub-problems, and γ is one for final clustering.
Now, since k means gives a log(k) approximation, dividing the problem
such that β = γ = O(log k) gives us an O(log2 k) approximation.
• Hierarchical clustering
– We need a distance measure between clusters. Some popular ones are
∗ Smallest distance between nodes

∗ Farthest . . .
∗ Average distance between nodes u ∈ C1 and v ∈ C2
– A single point is just a cluster of cardinality 1.
– Start with singleton clusters
– Join the cluster with minimal distance while #clusters > 1
– Undo the last k − 1 joins to get k clusters finally
– The joinings give us a dendrogram or phylogram
– If we just use the closest pair (single linkage) distance, we are basically
performing Kruskal’s algorithm for MST, and so we get the MST
– If we use the average distance, we are doing the UPGMA clustering,
and we get a UPGMA tree.
– For farthest distance .. no idea
• Exact algorithm for 1D space
Suppose x1 . . . xn are points on number line, in increasing order. Suppose
the function C(i, k) gives the optimal clustering cost with k clusters, for
all the points from x1 to xi . Then we have

C(i, k) = min_{j≤i} [ C(j−1, k−1) + cost(j, i) ]

where cost(j, i) = Σ_{p=j}^{i} (x_p − µ_{j,i})² and µ_{j,i} = (1/(i−j+1)) Σ_{p=j}^{i} x_p .
One can easily solve the problem now using Dynamic Programming, in O(n²k) time if cost(j, i) is evaluated in O(1) using prefix sums.
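A minimal sketch of that dynamic program with prefix sums (0-indexed Python translation of the 1-indexed recurrence above):

```python
import numpy as np

def kmeans_1d(xs, k):
    """Exact 1-D k-means via DP: C(i, k) = min_j C(j-1, k-1) + cost(j, i)."""
    x = np.sort(np.asarray(xs, dtype=float))
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def cost(j, i):                      # squared deviation of x[j..i] (0-indexed, inclusive)
        m = i - j + 1
        tot = s1[i + 1] - s1[j]
        return (s2[i + 1] - s2[j]) - tot * tot / m

    C = np.full((n + 1, k + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for kk in range(1, k + 1):
            C[i, kk] = min(C[j, kk - 1] + cost(j, i - 1) for j in range(i))
    return C[n, k]

print(kmeans_1d([1.0, 1.1, 1.2, 5.0, 5.1, 9.0], k=3))   # ≈ 0.025: three tight clusters
```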

HW 1
Q1

Figure 1: image.png

First, notice that

Σ_{x,y∈C_i} ||x−y||² = Σ_{x,y∈C_i} ||(x−µ_i) − (y−µ_i)||² = Σ_{x,y∈C_i} [ ||x−µ_i||² + ||y−µ_i||² − 2(x−µ_i)ᵀ(y−µ_i) ]

where

µ_i = (1/|C_i|) Σ_{x∈C_i} x

But this means that

Σ_{y∈C_i} (y − µ_i) = Σ_{y∈C_i} y − Σ_{y∈C_i} µ_i = |C_i|µ_i − |C_i|µ_i = 0⃗

Thus, we can write

Σ_{x,y∈C_i} ||x−y||² = |C_i| Σ_{x∈C_i} ||x−µ_i||² + |C_i| Σ_{y∈C_i} ||y−µ_i||² − 2 Σ_{x∈C_i} (x−µ_i)ᵀ 0⃗ =⇒ ∆_i := (1/|C_i|) Σ_{x,y∈C_i} ||x − y||² = 2 Σ_{x∈C_i} ||x − µ_i||²

This is exactly twice the cost of a cluster for the k-means algorithm.
Note: here I’m assuming that by x, y ∈ Ci , we mean (x, y) ∈ Ci × Ci , that is,
the order matters. We can also assume unordered pairs, which will give us

∆_i = Σ_{x∈C_i} ||x − µ_i||²

Practically, both work exactly the same, despite the factor of 2. Using the second
one, cost(C) is the same as the k-means cost, and using the first equation, it is
twice that value.
So, we can just use the k-means++ algorithm to minimise this cost.

Q2

Figure 2: image.png

Suppose, for a given k ∈ N , we cluster optimally without and with the restriction
(that centers are data points) as (in that order) :
1. C = {C_1, C_2 . . . C_k} with centers µ_i := (1/|C_i|) Σ_{x∈C_i} x
2. C′ = {C′_1, C′_2 . . . C′_k} with centers c′_i := argmin_{y∈C′_i} Σ_{x∈C′_i} ||x − y||² = argmin_{y∈C′_i} ||y − µ′_i|| (using the “parallel axis theorem”), where µ′_i := (1/|C′_i|) Σ_{x∈C′_i} x .

The costs are given as

∆_i = Σ_{x∈C_i} ||x − µ_i||² ,   ∆′^c_i = Σ_{x∈C′_i} ||x − c′_i||² ,   cost(C) = Σ_i ∆_i ,   cost_c(C′) = Σ_i ∆′^c_i

We want to prove that costc (C′ ) ≤ 4 cost(C) .


First, let’s define

c_i = argmin_{x∈C_i} ||x − µ_i||

as the data point closest to the centroid for each cluster C_i , and let’s define

∆^c_i = Σ_{x∈C_i} ||x − c_i||² ,   cost_c(C) = Σ_i ∆^c_i

It’s clear that costc (C) ≥ costc (C′ ) , since C′ is the optimal clustering with
centers being data points, while C , with centers ci , is a clustering with centers
being data points.
Now, notice that

∆^c_i = Σ_{x∈C_i} ||(x−µ_i) − (c_i−µ_i)||² = Σ_{x∈C_i} [ ||x−µ_i||² + ||c_i−µ_i||² − 2(c_i−µ_i)ᵀ(x−µ_i) ] = ∆_i + Σ_{x∈C_i} ||c_i−µ_i||² − 0

But, we know that ||c_i − µ_i|| ≤ ||x − µ_i|| ∀x ∈ C_i by the definition of c_i . This gives us

Σ_{x∈C_i} ||c_i − µ_i||² ≤ Σ_{x∈C_i} ||x − µ_i||² = ∆_i

Thus, we have

∆^c_i ≤ 2∆_i ∀i =⇒ cost_c(C) ≤ 2 cost(C)

Combining this with costc (C) ≥ costc (C′ ) gives us

costc (C′ ) ≤ 2 cost(C) ≤ 4 cost(C)

which is what we wanted to show. In fact, we have shown a stronger bound.


Now, we also have

cost(C) ≤ cost(C′ ) ≤ costc (C)

where

cost(C′) = Σ_i [ Σ_{x∈C′_i} ||x − µ′_i||² ]  and  µ′_i = (1/|C′_i|) Σ_{x∈C′_i} x

This gives us the final inequality as

cost(C) ≤ cost(C′ ) ≤ costc (C′ ) ≤ costc (C) ≤ 2 cost(C)

Thus, the ratio is always in the range [1, 4] .

Q3

Figure 3: image.png

Markov’s inequality states that for a non-negative random variable X , we have

P[X ≥ a] ≤ E[X]/a ∀a > 0

This works for both discrete and continuous random variables.


Consider the discrete random variable X with probability mass function being
P[X = x] = δ_{x,a} = 1 if x = a, and 0 otherwise

It’s easy to see that

P[X ≥ a] = 1 ,  E[X] = 1 · a = a =⇒ P[X ≥ a] = E[X]/a

, that is, the inequality is tight.


Meanwhile, the Chebyshev inequality states that

Var[X]/b2 ≥ P [|X − E[X]| ≥ b] ∀b > 0

Since in this case E[X] = a and Var[X] = 0 , thus the Chebyshev inequality
tells us that

P [|X − a| ≥ b] = 0∀b > 0 ⇐⇒ P [|X − a| > 0] = 0

Moreover, the Chebyshev inequality is tight as well (since both sides are 0).
This suggests that Markov’s inequality being tight is enough, and we can conclude
the distribution being δx,a by just that, and Chebyshev’s inequality will also be
tight automatically.
But remember that Markov’s inequality is claimed to be true only when P [X <
0] = 0 . For example, consider the distribution

P[X = x] = ½ δ_{x,−1} + ½ δ_{x,1}

For this distribution, we have E[X] = 0 . Now, for a > 1 , we have P [X ≥ a] = 0


, giving us

P [X ≥ a] = E[X]/a

But, this isn’t enough to assume that P [X] = δx,a .


Meanwhile we have Var[X] = 1 and
P[|X − 0| ≥ b] = 0 if b > 1, and 1 if b ≤ 1

This means that for b = 1 , we have

P [|x − E[X]| ≥ b] = Var[X]/b2

, that is, the Chebyshev’s inequality is tight. But this, unlike “Markov’s inequality”
being tight, does give us useful information. Namely, that

P[|X − E[X]| = b] = 1 =⇒ P[X = E[X] + b] + P[X = E[X] − b] = 1

Q5
Suppose for a randomly chosen person whom we labelled i , we define the random
variable Xi as 1 if the person is a coffee drinker, and 0 if not. Then, suppose,
{Xi }i , are independent and identically distributed variables with probability
distribution

P[X_i = 1|p] = P_{X|p}[1] = p ,   P[X_i = 0|p] = P_{X|p}[0] = 1 − p

and then suppose we randomly ask N people whether or not they drink coffee
to get the observed values for X1 , X2 . . . XN as x1 , x2 . . . xN , we can estimate
the likelihood of p using this data and a prior.
Note : Here X_i is a random variable, because we are choosing a person at
random and then labelling that person i . Thus, Xi is associated with the label
i , and not the person. If it was in fact associated with the person, then it would
have the fixed value, namely xi (the observed value).
We assume a uniform distribution over the interval [0, 1] for the value p , that is
P[p = q] = P_p(q) = 1 if q ∈ [0, 1], and 0 otherwise

Keep in mind that this is a density function, not a probability mass function.
Note : Although there IS a fixed value p∗ which is the probability of X_i being
1, we don’t know that value. And so, depending on the data, we can only get
different estimates for this value. So, we are dealing with the random variable p
instead.
Using this, we can write

P(x⃗|p) := P_{X_1,X_2...X_N|p}(x_1, x_2 . . . x_N) = Π_{i=1}^{N} P_{X|p}(x_i) = p^α (1 − p)^β

where α = Σ_i x_i and β = N − α .

Now, using Bayes’ rule, we can write

P_{p|x⃗}(q) = (1/P(x⃗)) · P(x⃗|q) · P_p(q)

We still don’t know

P (⃗x) = PX1 ,...XN (x1 . . . xN )


but since it is independent of q , we can find it using ∫_{−∞}^{∞} P_{p|x⃗}(q) dq = 1 . This gives

∫_{−∞}^{∞} P(x⃗|q) P_p(q) dq = P(x⃗) =⇒ P(x⃗) = ∫_0^1 q^α (1 − q)^β dq

This is solved using a recursive relation. Let

C(α, β) = ∫_0^1 x^α (1 − x)^β dx

Then , using integration by parts

C(α, β) = [β/(α+1)] ∫_0^1 x^{α+1} (1 − x)^{β−1} dx = [β/(α+1)] C(α + 1, β − 1)

when β ≥ 1 . For the base case, we have C(α, 0) = 1/(1+α) .
With this, we can write

C(α, β) = [β(β−1)(β−2) . . . 1 / ((α+1)(α+2) . . . (α+β))] · C(α + β, 0) = [β!α!/(α+β)!] · 1/(α+β+1)

Thus, we finally have our posterior distribution as

P_{p|x⃗}(q) = (N + 1) · ^{N}C_α · q^α (1 − q)^β

Thus, the expected value is simply

E[p] = (N + 1) · ^{N}C_α · C(α + 1, β) = [(N + 1) · ^{N}C_α] / [(N + 2) · ^{N+1}C_{α+1}] = (α+1)/(N+2)

But we should note that this is NOT the maximum likelihood estimate, or even
the most probable value. That is still the intuitive answer α/N since at q = α/N
, we have

(d/dq) ln P_{p|x⃗}(q) = α q^{−1} − β(1 − q)^{−1} = 0 ,   (d²/dq²) ln P_{p|x⃗}(q) = −α q^{−2} − β(1 − q)^{−2} < 0

making it the maxima for ln Pp|⃗x and thus Pp|⃗x .


Now, we can predict the next observation using P_{X_{N+1}|x⃗} . We can get this as

P_{X_{N+1}|x⃗}(1) = ∫_0^1 P_{X|q}(1) P_{p|x⃗}(q) dq = E[p] = (α+1)/(N+2) ,   P_{X_{N+1}|x⃗}(0) = (β+1)/(N+2)

Since the person labelled N + 1 is chosen at random, we can state this as

Given the data x⃗ from asking N randomly chosen people, the probability that a randomly chosen person is a coffee drinker is (α+1)/(N+2).

Note that here, we are using the word “is” and not “is expected” , since this
is not an expected value, but something we know for sure, assuming that our
uniform prior distribution is actually correct.
This is in fact a very nice result called Laplace’s rule of succession.
But, we still do want to know what fraction of people are actually coffee drinkers.
This can be answered by thinking of p as a fraction of the total population that
drinks coffee. So now, p is in fact an observable, and can be treated as a random
variable. Thus, we can say
Given the data, (α + 1)/(N + 2) is the expected value of fraction of
people who drink coffee.
Given the data, α/N is the most probable value of fraction of people
who drink coffee
Relying on most probable values is the optimal strategy if the cost of you getting
the answer wrong is a constant (E[(X − g)0 ] where g is your guess for a random
variable X) and guessing the “expected” value is the optimal strategy if your
cost scales quadratically with how far you were (E[(X − g)²]) .
So, to get the most profitable estimate, I really need to know what we are trying
to do with the estimate, and what is at stake.
But suppose, we just assume E[p|⃗x] as the estimate, then we can compute the
width of the interval for 99% confidence using Chebyshev’s inequality. Namely

Var[p]/b2 ≥ P [|p − E[p]| ≥ b]

First, compute the second moment as

E[p²|x⃗] = (N+1) · ^{N}C_α ∫_0^1 q^{α+2} (1−q)^β dq = (N+1) · ^{N}C_α · C(α+2, β) = [(N+1) · ^{N}C_α] / [(N+3) · ^{N+2}C_{α+2}] = (α+1)(α+2)/((N+3)(N+2))

Now, the variance can be computed as

Var[p] = E[p²] − E[p]² = [(α+1)/(N+2)]·[(α+2)/(N+3) − (α+1)/(N+2)] = [(α+1)/((N+2)²(N+3))]·[Nα + 2N + 2α + 4 − (Nα + N + 3α + 3)] = (α+1)(β+1)/((N+2)²(N+3))

Since we want 99% confidence that the observed value falls in our interval
(E[p] − b, E[p] + b) , we want P [|p − E[p]| ≥ b] ≤ 1% = 0.01 . This can be
achieved by setting an appropriate b such that Var[p]/b2 ≤ 0.01 .

The minimum such value is

b_c = 10·√((α+1)(β+1)) / ((N+2)·√(N+3))

It is easy to see, using the AM-GM inequality that

b_c ≤ 5/√(N+3)

So, we have b_c being O(N^{−1/2}) (and Var[p] being O(N^{−1})).


Thus, we can say that p = (α+1)/(N+2) ± Θ(N^{−1/2}) .
More precisely,
The fraction of people lies in the range (pexp − bc , pexp + bc ) where

p_exp = (α+1)/(N+2) ,   b_c = 10·√((α+1)(β+1)) / ((N+2)·√(N+3))

with 99% confidence.


Now, we are yet to incorporate the fact that we know that the possible values of
p are more than or equal to 1% (since 1 million is a very big number, I can treat
the fraction of coffee drinkers obtained from 1 million as the fraction.)
To do this, we can change the prior distribution of p to
P_p(q) = 100/99 if q ∈ [0.01, 1], and 0 otherwise

Thus, our posterior changes as

P_{p|x⃗}(q) = (1/P(x⃗)) P(x⃗|q) P_p(q) = [(100/99)/P(x⃗)] · q^α (1 − q)^β

when q ∈ [0.01, 1] and 0 otherwise.


Once again, we need to normalise this, as

P_{p|x⃗}(q) = q^α (1 − q)^β / ∫_{0.01}^{1} q^α (1 − q)^β dq

Thus, the expected value, second moment and variance are given as

E[p|x⃗] = C′(α+1, β)/C′(α, β) ,   E[p²|x⃗] = C′(α+2, β)/C′(α, β) ,   Var[p|x⃗] = [C′(α+2, β)·C′(α, β) − C′(α+1, β)²]/C′(α, β)²

where

C′(α, β) = ∫_{0.01}^{1} q^α (1 − q)^β dq

We can once again, use integration by parts to find a recursive relation.

C′(α, β) = −(0.01)^{α+1} (0.99)^β/(α+1) + [β/(α+1)] C′(α + 1, β − 1)

with the base case

C′(α, 0) = (1 − 0.01^{α+1})/(α+1)

Thus, given numbers α, β , we can write a program to evaluate C′(α, β) even though there isn’t a closed form.
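For instance, a small sketch that evaluates C′(α, β) by that recursion (plain Python; for very large α, β a numerically safer route would be incomplete beta functions, but that is outside these notes):

```python
def C_prime(alpha, beta, lo=0.01):
    """C'(a, b) = ∫_lo^1 q^a (1-q)^b dq via the integration-by-parts recursion above."""
    if beta == 0:
        return (1 - lo ** (alpha + 1)) / (alpha + 1)
    boundary = -(lo ** (alpha + 1)) * ((1 - lo) ** beta) / (alpha + 1)
    return boundary + beta / (alpha + 1) * C_prime(alpha + 1, beta - 1, lo)

# posterior mean E[p | data] with the truncated-uniform prior (alpha, beta are illustrative)
alpha, beta = 30, 70
print(C_prime(alpha + 1, beta) / C_prime(alpha, beta))
```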
And then, once again, we can use the Chebyshev inequality to get a confidence
interval as

Var[p|x⃗]/b_c² = 0.01 =⇒ b_c = 10·√(Var[p|x⃗])

That is, with 99% confidence the actual value of the fraction lies in (E[p|⃗x] −
bc , E[p|⃗x] + bc ) .

Densest Subgraph
• Density of a subgraph
Suppose S ≤ G is a subgraph of G (I am using group theory notation, but
you get the gist of it), then we define the density as

D(S) = e(S, S)/|S|

where

e(A, B) = |{{a, b} ∈ E|a ∈ A, b ∈ B}|

Note that we are dealing with undirected graphs

It’s easy to see that

D(S) = (1/(2|S|)) Σ_{u∈S} d_in(u) = d_{in,avg}/2 ≤ (|S| − 1)/2

We didn’t consider |S|2 as denominator since then the value is bounded by


1/2 , and the densest graph happens to be just cliques, which suggests that
a subgraph with just two adjacent nodes is more “dense” than a bigger
graph .. which, I mean .. is TRUE .. but just rewrite the stupid “reason”
in the exam.
• An identity

D(S) = e(S, S)/|S| ≥ c/2 =⇒ 2e(S, S) ≥ c|S| =⇒ d(S) − e(S, S̄) ≥ c|S| =⇒ d(S) + d(S̄) ≥ c|S| + d(S̄) + e(S, S̄) =⇒ 2|E| ≥ c|S| + d(S̄) + e(S, S̄)

• Finding minimum c|S| + d(S̄) + e(S, S̄)


We add a “source” s and “sink” t node to the graph, connected to every
node in G .
For every edge in E , we weigh it as 1.
For every edge {s, ui } , we give it weight di where di = dG (u) .
For every edge {u, t} , we weigh it as c .
Then, the maximum flow, which is also the minimum cut is precisely
min [c|S| + d(S̄) + e(S, S̄)] .
Notice that after any “cut” , the graph will be separated in two. Let the
first section be {s} + S and other be {t} + S̄ . Since any node in S is not
connected to s by any path, thus all edges (s, ui ) such that ui ∈ S̄ have to
be cut. This incurs a cost of d(S̄) . Similarly, edges broken from t incur
c|S| . Then, all the edges between S and S̄ had to be broken, incurring a
cost e(S, S̄) . Thus the total cost is c|S| + d(S̄) + e(S, S̄) .
• Finding densest graph
We do this using binary search.
Let mincut(G′ , s, t) be a function that gives the minimum cut (S, S̄) and the
value f . This runs in O(f E) if implemented correctly using Ford-Fulkerson
. In our case f ≤ E , thus we just have O(E 2 ) .
Let build(G, c) be this algorithm :
On input graph G, and constant c ,
– Add node s such that w(s, u_i) = d_i = d(u_i) for all u_i ∈ V(G)
– Add node t such that w(u, t) = c for all u ∈ V(G)

Figure 4: image.png

– No edge between s and t .


– return created graph .
Let’s define dense(G, c) as this algorithm
– G′ = build(G, c)
– ((S, S̄), f ) = mincut(G′, s, t)
– if f > 2|E(G)| : return null
– return S
Now, we can just do binary search over c using dense . If you get null, lower the
upper bound, else raise the lower bound, until both bounds match.
Then return the last non-null value.
Note : Here “null” is not the null set ϕ .
• Density is lower than degrees
Suppose S ∗ is the densest sub-graph, then for any u ∈ S ∗ , removing it
should decrease the density, that is,

[e(S*, S*) − d_{S*}(u)] / (|S*| − 1) ≤ e(S*, S*)/|S*| =⇒ −d_{S*}(u)·|S*| ≤ −e(S*, S*) =⇒ d_{S*}(u) ≥ ρ(S*)

• Approximate greedy algorithm for finding the densest subgraph (Charikar’s algorithm); a runnable sketch appears after the proof below

Start with S_k = G (where k = |V|)
Compute D(S_k)
Compute d_i = d(u_i) ∀u_i ∈ G
for i = k . . . 2 :
– Find the vertex v ∈ S_i such that d_in(v) is minimum
– Si−1 := Si − {v}
– Compute D(Si−1 )
return argmaxSi D(Si )
• Charikar’s algorithm gives a 2-approximation , i.e. D(S_c) ≥ D(S_o)/2
• Proof
Since we start with the full graph, up to some point S* is contained in the
sub-graph. Suppose the last time that this happens is when the sub-graph
is S_i , and let u be the minimum-degree node of S_i (the one the algorithm removes next); since S_{i−1} no longer contains S* , u ∈ S* . Then, we have :

dS ∗ (u) ≤ dSi (u)

just because S ∗ ⊆ Si .
Now, we also know:
1. dSi (u) ≤ avgSi (dSi (v)) = 2ρ(Si )
2. dS ∗ (u) ≥ ρ(S ∗ ) (as proved previously)
Thus,

ρ(S ∗ ) ≤ dS ∗ (u) ≤ dSi (u) ≤ 2ρ(Si )

which means Si is at least half as dense as the densest sub-graph. Now, obvi-
ously the algorithm’s result is denser than that, giving us a 2-approximation.
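A minimal sketch of Charikar's greedy peeling in plain Python (adjacency given as a dict of sets; a heap-based version would be faster, but this keeps the logic obvious):

```python
def charikar_densest(adj):
    """adj: {node: set(neighbours)} for an undirected graph. Returns (best_density, best_set)."""
    adj = {u: set(vs) for u, vs in adj.items()}
    edges = sum(len(vs) for vs in adj.values()) // 2
    best_density, best_set = edges / len(adj), set(adj)
    while len(adj) > 1:
        v = min(adj, key=lambda u: len(adj[u]))      # vertex of minimum degree in current subgraph
        edges -= len(adj[v])
        for u in adj.pop(v):
            adj[u].discard(v)
        density = edges / len(adj)
        if density > best_density:
            best_density, best_set = density, set(adj)
    return best_density, best_set

# a 4-clique with a pendant vertex: the clique (density 6/4 = 1.5) should win
g = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
print(charikar_densest(g))
```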

Expansion and Conductance


• Expansion for S ≤ G

ϕ(S) = e(S, S̄)/ min(|S|, |S̄|)

• Why not different things in denominator ?


– sum
Since |S| + |S̄| = |V | , then subgraph with smallest expansion, is just
one produced by min-cut.

– Max
This also gives similar results to that of min-cut, as |V | ≈ max(|S|, |S̄|)
– Product
This gives similar results to that of min(|S|, |S̄|) .
Define ϕ′(S) = e(S, S̄)/(|S|·|S̄|) .

WLOG assume that |S| ≤ |S̄| . Then, clearly,


1. |S̄| ∈ [n/2, n] where n = |G|
2. ϕ(S) = e(S, S̄)/|S|
Thus,

ϕ(S)/n = e(S, S̄)/(n|S|) ≤ ϕ′(S) ≤ 2e(S, S̄)/(n|S|) = 2ϕ(S)/n =⇒ nϕ′(S) ∈ [ϕ(S), 2ϕ(S)]

• Conductance

ϕc (S) = e(S, S̄)/ min(d(S), d(S̄))

This is the fraction of edges going out from X = argmin_{A∈{S,S̄}} d(A) to X̄ .
• Conductance of graph
For a graph G , we define its conductance as the minimum conductance of
a sub-graph

ϕ_c(G) = min_{S≤G} ϕ_c(S)

This is also the sparsest cut.

Laplacian
• For a graph G with adjacency matrix A, we define the laplacian as L =
D − A where D = diag(A⃗1) .
• The ⃗1 is an eigenvector of L with λ = 0
• If there is a disconnected component, then the vector with all 1s for nodes
in that component and 0 otherwise is an eigenvector with eigenvalue 0 .
• The second eigenvector gives us an approximation of the min-cut value.

Sparsest cut and λ2
For any vector v, and an undirected graph,

vᵀLv = Σ_{(i,j)∈E} (v_i − v_j)²

Now, since L is a symmetric matrix, it has an orthonormal eigenbasis. So, we
can safely say that for any vector v = Σ_i a_i v_i , where v_i is an eigenvector of L ,
we have

vᵀLv = Σ_i λ_i a_i²

It’s easy to see that with the restrictions ||v||² = Σ_i a_i² = 1 and a_1 = 0 (no component along the all-ones eigenvector), we get

min_{||v||=1, v⊥1⃗} vᵀLv = λ_2

which happens when v = v2 .


Note : For any vector v , the value v T Lv/||v||2 is called the Rayleigh quotient
of v.
Notice that λ2 will be zero if there are disconnected components..
Essentially, what we have proved is that

λ_2 = min_{v⊥1⃗} [ Σ_{(i,j)∈E} (v_i − v_j)² ] / [ Σ_i v_i² ]

Now, since v ⊥ 1⃗ , we have

Σ_i v_i = 0 =⇒ (Σ_i v_i)² = 0 =⇒ Σ_{i,j} v_i v_j = 0 =⇒ Σ_{{i,j}} (v_i − v_j)² = n Σ_i v_i² − Σ_{i,j} v_i v_j = n Σ_i v_i²

Thus, we can write

λ_2 / n = min_v [ Σ_{(i,j)∈E} (v_i − v_j)² ] / [ Σ_{{i,j}} (v_i − v_j)² ]

It’s easy to see that when you have v_i = 1 if i ∈ S and v_i = 0 otherwise , for some S ≤ G , the RHS just becomes

ϕ′(S) := e(S, S̄)/(|S|·|S̄|)

Note that here you are not following v ⊥ 1⃗ , and thus, you CANNOT plug this v into the original expression (the one with ||v||² in the denominator); but the ratio above is unchanged if you shift v by a multiple of 1⃗ , so the minimisation still applies, and in any case ϕ′(S) is defined by the expression above.
Now, much in the past, we had proved that nϕ′(S) ∈ [ϕ(S), 2ϕ(S)] . You should go read that.
But this tells us that

nϕ′(S) ≤ 2ϕ(S) ∀S =⇒ λ_2 ≤ n·min_S ϕ′(S) ≤ 2·min_S ϕ(S)

Note that for a d-regular graph the normalised Laplacian is L/d , so λ′_2 = λ_2/d , where λ′_2 is the 2nd smallest eigenvalue of the normalised Laplacian; also ϕ_c(S) = ϕ(S)/d for regular graphs. Factoring that in, you would get, for regular graphs :

λ′_2 = λ_2/d ≤ 2·min_S ϕ(S)/d = 2·min_S ϕ_c(S) = 2ϕ_c(G) =⇒ λ′_2 ≤ 2ϕ_c(G)


Cheeger’s inequality says that

λ′_2/2 ≤ ϕ_c(G) ≤ √(2λ′_2)

We have just proved the left part of the inequality.


We dare not try to prove the other (harder) part, since in the end, this is an
engineering course.
A thing to notice is that ϕc (S) ≤ 1∀S , by its very definition. So this inequality
makes sense.

Spectral Graph clustering


If ϕ*_c is the conductance of G , then we can find a cut giving conductance O(√(ϕ*_c))
using v_2 .
For regular graphs (where all nodes have equal degree say, d), we can do these
steps
1. Find L = d · I − A , or better yet, the normalised version L = I − A/d
(not that it makes any difference in final result ..)

2. Find the eigenvector v_2(L) , also known as the Fiedler vector
3. Sort all vertices by their absolute coordinate values in v2 , from highest to
lowest
4. Try partitioning G into (S, S̄) where S holds nodes i with high v2,i , and
S̄ holds one with lower valued ones. You can simply start with all nodes
in S and keep putting lowest valued nodes in S̄ one by one and thus find
the best cut.
This is guaranteed to give conductance ϕ_c ∈ O(√(ϕ*_c)) .

Then, just recursively partition to get a hierarchical clustering.


Note : For non-regular graphs, you should use the normalised Laplacian as

L = I − D−1/2 AD−1/2
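A minimal numpy sketch of the sweep cut: compute the Fiedler vector of the normalised Laplacian, sort vertices by it, and take the prefix with the best conductance (dense linear algebra is assumed here; real implementations use sparse eigensolvers):

```python
import numpy as np

def sweep_cut(A):
    """A: symmetric 0/1 adjacency matrix of a connected graph. Returns (conductance, set S)."""
    d = A.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - Dinv_sqrt @ A @ Dinv_sqrt        # normalised Laplacian
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                                   # eigenvector of 2nd smallest eigenvalue
    order = np.argsort(fiedler)
    best, best_S = np.inf, None
    for i in range(1, len(A)):                             # sweep over prefixes of the sorted order
        S, Sbar = order[:i], order[i:]
        cut = A[np.ix_(S, Sbar)].sum()
        cond = cut / min(d[S].sum(), d[Sbar].sum())
        if cond < best:
            best, best_S = cond, set(S.tolist())
    return best, best_S

# two triangles joined by one edge: the sparsest cut is that single edge
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(sweep_cut(A))
```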

Recap of Graph clustering


• Density of sub-graph is ρ(S) = e(S, S)/|S|
• Expansion of sub-graph is ϕ(S) = e(S, S̄)/ min(|S|, |S̄|)
• Conductance of sub-graph is ϕ_c(S) = e(S, S̄)/ min(d(S), d(S̄))
• d(S) + d(S̄) = 2|E|
• e(S, S) + e(S, S̄) = d(S)
• ρ(S) ≥ c/2 ⇐⇒ 2|E| ≥ c|S| + d(S̄) + e(S, S̄)
• You can construct a G′ with V(G′) = V(G) ∪ {s, t} such that there is a
bijection between subsets S ⊆ V(G) and minimal (not minimum) s–t cuts in G′ with cost
c|S| + d(S̄) + e(S, S̄) . Finding whether there is a cut where this cost is
less than 2|E| is doable in time O(fE) = O(E²) for any positive integer c
. Thus, you can do binary search to find the maximum c where ρ(S) ≥ c/2 for some S .
• Charikar’s greedy algorithm gives a 2-approximation for densest subgraph
– Start with full graph as the sub-graph
– Keep deleting the node with lowest degree and compute density at
each stage
– return the sub-graph with highest density, out of all the stages you
saw (just keep track of the iteration, and iterate again till that stage.).
• Sparsest cut of a graph G is a cut (S, S̄) such that ϕ(S) =
e(S, S̄)/ min(|S|, |S̄|) is minimised. This also defines the expansion
ϕ(G) of the graph G .

Mixture models
We assume that the data points come from an ensemble of K different distribution,
each of them having a probability ϕ_i where ϕ ∈ R^K and Σ_i ϕ_i = 1 .
Usually, these distributions are Gaussian, namely the i-th one is N(µ_i, σ_i²) .
The probability of getting a data-point xi is thus

Σ_{k=1}^{K} [ϕ_k/√(2πσ_k²)] · exp(−(x_i − µ_k)²/(2σ_k²))

In general, the distributions could be anything, giving

Σ_{k=1}^{K} ϕ_k p(x_i|θ_k)

It’s easy to see that the likelihood is thus

Π_{i=1}^{N} Σ_{k=1}^{K} ϕ_k p(x_i|θ_k)

and thus the log likelihood is

Σ_{i=1}^{N} ln Σ_{k=1}^{K} ϕ_k p(x_i|θ_k)

Yeah .. not a very nice function to maximise, is it ?


You could try computing gradients and getting an analytical solution, or putting
this in an auto-grad system (after all these are very basic operations).
And lastly there is the approach people take in statistical mechanics : approx-
imation. You assume that the centers of the Gaussians (if you are dealing
with gaussian distribution) are very close to the points, the gaussians have unit
variances in all directions, and all modes are equally likely, and then approximate
as

Σ_{i=1}^{N} ln( Σ_{k=1}^{K} ϕ_k (1/√(2π|Σ_k|)) e^{−(x_i−µ_k)ᵀ Σ_k^{−1} (x_i−µ_k)/2} ) = NC + Σ_{i=1}^{N} ln( Σ_{k=1}^{K} e^{−||x_i−µ_k||²/2} ) ≈ NC + Σ_{i=1}^{N} ln( e^{−||x_i−c(x_i)||²/2} ) = NC − ½ Σ_{i=1}^{N} ||x_i − c(x_i)||²

where c(x_i) is the center closest to x_i .

. . . which gives the K-means objective function.. weird. This suggests that
k-means has got to do something with this.
In fact, there is a (proven) better method, which is basically k-means, but
probabilistic. It’s called Expectation maximisation.
1. You first assign the labels to the data-points, but in a probabilistic manner.
Namely,

z_{i,k} = ϕ_k p(x_i|θ_k) / Σ_k ϕ_k p(x_i|θ_k)

Btw, this is what you compute for the final classification too, except you
return a specific label by sampling with distribution ⃗zi .
2. Update the parameters, just as you would update in k-means, taking the
⃗zi stuff we created as absolute truth.

ϕ_k := (1/N) Σ_{i=1}^{N} z_{i,k} ,   µ_k := Σ_{i=1}^{N} z_{i,k} x_i / Σ_{i=1}^{N} z_{i,k} ,   Σ_k := Σ_{i=1}^{N} z_{i,k} (x_i − µ_k)(x_i − µ_k)ᵀ / Σ_{i=1}^{N} z_{i,k}
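A minimal numpy/scipy sketch of these EM updates for a 1-D Gaussian mixture (random initialisation and a fixed number of iterations are assumptions of the sketch):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    phi = np.ones(K) / K
    mu = rng.choice(x, K, replace=False)
    sigma = np.full(K, x.std())
    for _ in range(iters):
        # E-step: responsibilities z[i, k] ∝ phi_k * p(x_i | theta_k)
        z = phi * norm.pdf(x[:, None], mu, sigma)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: the updates from the notes
        Nk = z.sum(axis=0)
        phi = Nk / len(x)
        mu = (z * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((z * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return phi, mu, sigma

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 500),
                    np.random.default_rng(2).normal(5, 1, 500)])
print(em_gmm_1d(x, K=2))
```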

Vectors are always orthogonal in high dimensions


This is about how close can the centers of two gaussians be so that we can still
cluster them.
For any 2 points x, y drawn from a unit variance one dimensional Gaussian, we have

2π E[(x−y)²] = ∫∫ (x−y)² e^{−x²/2 − y²/2} dx dy = ∫∫ r² e^{−r²/2} r dr dθ − 2(∫ x e^{−x²/2} dx)² = 2π ∫_0^∞ r³ e^{−r²/2} dr − 2(0)² = 2π·2 ,  i.e. E[(x−y)²] = 2

And as for the variance, we can compute the second moment first,

E[(x−y)⁴] = E[(x²+y²)²] − 4E[x³y] − 4E[xy³] + 4E[x²y²] = E[r⁴] + 4(E[x²])² = ∫_0^∞ r⁵ e^{−r²/2} dr + 4 = 8 + 4 = 12 ,  so Var[(x−y)²] = 12 − 2² = 8

Now, for a d dimensional space, we have

E[||x − y||²] = 2d ,   Var[||x − y||²] = 8d

This tells us that

||x − y||² = 2d ± Θ(2√(2d)) = 2d ± Θ(√d)

Also, it’s easy to see that E[||x||²] = d and Var[||x||²] ≈ ⟨x⁴⟩d = 3σ⁴d = 3d (use
the cumulant generating function or something, idk.. just get this somehow)
Thus, we have

||x − y||² = 2d ± Θ(√(8d)) ,   ||x||² = d ± Θ(√(3d))

With this, we can get an upper bound on the angle between two points

||x−y||² = ||x||² + ||y||² − 2xᵀy =⇒ xᵀy/(||x||·||y||) = (||x||² + ||y||² − ||x−y||²)/(2||x||·||y||) = (±Θ(√(11d)) ± 1) / (2√(d² + 2dΘ(√d) + Θ(√d)²)) = ±Θ(d^{−1/2})

Thus, the cosine of the angle between two points is statistically of order d^{−1/2}
(with the constant factor of √11/2 shoved in your .. let’s move on .. I really hate
my prof..) .
What this result says is that, in a very high dimensional space, it’s extremely
hard to find points that are not almost orthogonal.

How far should clusters be for them to be clusters ?


Suppose you have data sampled from 2 Gaussian distributions in d dimensions
with equal probability, and Σ1 = σ12 Id , Σ2 = σ22 Id . Suppose these have centers
µ1 , µ2 , then for a point x from the first Gaussian, and y from second, we have

E[||x−y||²] = E[||(x−µ_1) − (y−µ_2) + (µ_1−µ_2)||²] = E[||x−µ_1||²] + E[||y−µ_2||²] + ||µ_1−µ_2||² − 2(0) − 2(0) + 2(0) = σ_1²d + σ_2²d + ∆²

where ∆ = ||µ1 − µ2 || .
Now, just like the last time, you have Var[||x − y||2 ] = O(σ14 d + σ24 d) .
So, we have
||x − y||² = σ_1²d + σ_2²d + ∆² ± Θ(√(σ_1⁴d + σ_2⁴d))

Similarly, for two points y, z in the second distribution, you have


||z − y||² = 2σ_2²d ± Θ(σ_2²√d)

We want ||x − y||2 > ||z − y||2 . That is, we want

∆² > (σ_2² − σ_1²)d + Θ(√d · √(σ_1⁴ + σ_2⁴))

So,
∆ = √(|σ_2² − σ_1²| d) if σ_2 ≠ σ_1 ,  and ∆ = Θ(d^{1/4} σ) if σ_1 ≈ σ_2 ≈ σ

χ2 distribution
If Xi ∼ N (0, 1) are i.i.d , then

Z := Σ_{i=1}^{d} X_i² ⇐⇒ Z ∼ χ²(d)
i=1

A nice inequality on this:

P[Z > d(1 + δ)] ≤ e^{−dδ²/18}


Then, if you define Y = √Z , we can write

P[Y > √d + β] = P[Z > d + 2√d·β + β²] = P[Z > d(1 + (2d^{−1/2}β + β²/d))] ≤ exp(−(d/18)(2d^{−1/2}β + β²/d)²) = exp(−(2β + β²/√d)²/18)

Supposing β ≫ β²/√d (i.e. β ≪ √d), we can write


P[Y > √d + β] ≤ exp(−cβ²)

for some c > 0 . More specifically, we can write


P[||x⃗|| > √d + β] ≤ exp(−Ω(β²))

where ⃗x ∈ Rd and xi ∼ N (0, 1) .

Spectral Algorithm
This is different from the spectral graph clustering algorithm. This one is just
normal clustering. Suppose you have a probability density which you have
somehow figured out is from a k component Gaussian mixture.
Now, you want to cluster the data into k clusters. This is the standard k means
application if you are a bit lazy and don’t want to do the whole EM thing.
There’s just one problem..the dimensions d are extremely huge. So, the compu-
tational cost of k means, which is O(ndkt) is very high.
So, now what ?
This is where SVD is useful. Before that, there are a few things to think about :
1. The orthographic (discarding extra dimensions) projection of a standard
Gaussian probability density in d dimensions to k dimensions also produces
a standard Gaussian density with the same variance as before.

2. The best fit line v ∈ R^d for a d dimensional density is one that maximises
E[(vᵀx)²] subject to the constraint ||v|| = 1 . When the density is a single
Gaussian, this is the line passing through the centre of that Gaussian and
the origin. If X is the data matrix, with rows as data points, then the
expectation simply becomes (1/n) vᵀXᵀXv . This is maximised when v is the first
right singular vector of X.
3. The best fit hyperplane V ∈ R^{d×k} is one that maximises E[||Vᵀx||²] where
V is ortho-normal, or if we are dealing with data, it maximises (1/n)||XV||²_F .
Clearly, this happens when V is made of the first k right singular vectors.
When the density is a mixture of k Gaussians, this is the hyper-plane
passing through the centres of all of those, as well as the origin.
So, what we should do is simply do a projection to the best fit k dimensional
hyperplane using V_k , which is the matrix with the first k columns of V from the
SVD X = UΣVᵀ . We do this as x̂_i = V_kᵀ x_i .
Then, we can just cluster using k-means to estimate µ̂_k, Σ̂_k, ϕ_k for the GMM in
the k-dimensional space.
Then, once you have all these parameters, you essentially have the Gaussian
mixture model that generated this data. You can either straight away go on to
the prediction phase, or refine further using the EM algorithm.
Note that this GMM is in a lower dimensional space. So, for prediction , you
will always have to find x̂ = V_kᵀ x first.
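A minimal numpy sketch of the projection step (the toy two-Gaussian data is an assumption of the sketch; any k-means or EM routine, e.g. the sketches earlier in these notes, can then run on the projected data):

```python
import numpy as np

def project_top_k(X, k):
    """Project rows of X onto the span of the top-k right singular vectors."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T                 # d x k, columns = top-k right singular vectors
    return X @ Vk, Vk             # X_hat (n x k) and the projection matrix

rng = np.random.default_rng(0)
d, n = 200, 500                   # two well-separated Gaussians in d = 200 dimensions
centers = rng.normal(size=(2, d)) * 3
X = np.vstack([rng.normal(size=(n, d)) + centers[0],
               rng.normal(size=(n, d)) + centers[1]])
X_hat, Vk = project_top_k(X, k=2)
print(X_hat.shape)                # (1000, 2): now cheap to run k-means / EM on
# new points are handled the same way: x_hat = x @ Vk
```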

Sampling
We are dealing with a system that has limited storage. The system receives
a data stream x1 , x2 . . . xn . Our aim is to somehow sample from the data
uniformly.
• When n is known.
In this case, you can simply generate k indices from 1 to n with equal
probability , and sample values on those indices when they occur
• Algorithm R
Initialise an array R of size k (1-indexed). (A runnable sketch appears at the end of this section.)
for i, x in enumerate(stream):
if i ≤ k :
R[i] = x
else:
j = rand([1 . . . i])
if j ≤ k : R[j] = x
else: continue

Claim : This process finally selects k samples from the stream such that
the probability of each element being sampled is k/n when n elements
have been seen.
Clearly, for n = k this claim is true, since till that point, we can just fill
whatever we see.
Now, if this is proved true till n = N , then for n = N + 1 , we have
– The probability of any of x1 . . . xN being in R just before xN +1 arrives
is, by the induction hypothesis, k/N
– The probability that for i = N + 1 , index j is chosen (and thus, the
value at that index replaced by xN +1 if j ≤ k) is 1/(N + 1).
– The probability that xi is put into R by the end of first N items AND
xi is not replaced by xN +1 is (k/N )(1 − 1/(N + 1)) = k/(N + 1) .
– The probability that one of x1 . . . xN stays on the stack by the end of
i = N + 1 iteration is thus, k/(N + 1)
– The probability that xN +1 is on R is also k/(N + 1)
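A minimal sketch of Algorithm R as described above (1-indexed stream positions, Python's standard random module assumed):

```python
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    R = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            R.append(x)                      # fill the first k slots directly
        else:
            j = rng.randint(1, i)            # uniform over 1..i
            if j <= k:
                R[j - 1] = x                 # replace a uniformly chosen slot
    return R

print(reservoir_sample(range(10_000), k=5))
```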

Hashing
To give every key a fair chance to be stored in any location, we store a lot of
hash functions in a family H = {h1, h2 . . . }, and choose one (h) at random.
The number of bits required to specify which hash function to use is log2 (|H|) .
Suppose the universe of items is U and the hash table has m indices; we never
need this number (bits needed to choose h) to go beyond |U| log₂ m , since m^{|U|}
is the number of different hash functions you could have.
The different types of Hashing families are (in higher to lower quality) :
1. Uniform hash family : P_h[h(x) = i] = 1/m
2. Universal : P_h[h(x) = h(y)] ≤ 1/m for x ≠ y
3. Near universal : P_h[h(x) = h(y)] ≤ 2/m for x ≠ y
For a hash function h , define Cx,y = 1(h(x) = h(y)) as an indicator function.
Also, for data stream of size n , define l(x) as the length of the chain (linked
list) at index h(x) .
We can write :

E_h[l(x)] = E_h[Σ_y C_{x,y}] = Σ_y P_h[h(x) = h(y)]

For universal hash families, this becomes n/m .


The space required for hash maps, apart from the space needed for storing h is
given by O(n log2 |U |) . This is because, each of the elements from the universe
can be uniquely identified by log2 |U | bits. A good hash map will store only

these bits, and not the full object in its linked lists. And of-course, we’ll have to
store n such items.
The query time is (for a good hash function) : O(n/m) . This is called the load
factor.
Another very efficient way of storing keys from U is through a bit-array. Basically,
you convert every object to a log2 |U | index. Then, you set the bit at the index
on the bit array to 1. The size of the bit array is obviously |U | . This allows for
O(1) queries.. but do you really want to do it like this ? .. with O(|U |) space ?
Ok, this is good and all, but like .. what does a hashing function look like ?
• Prime Multiplicative Hashing
– Pick a prime p > |U |
– For any a ∈ [p − 1] and b ∈ [p − 1] define h_{a,b}(x) = (ax + b) mod p (and reduce mod m when indexing the table)
– Define H = {ha,b (x)|a, b ∈ [p − 1]} as the hash family. Thus, you
have |H| = (p − 1)2 .

Bloom filter
You have a bit array B of size m , initialised to 0. Every time you want to insert
a key, you pass it through k hash functions and set the bits in B at those k
output values to 1.
Then, to check if an element was observed before, you hash again, and if all k
bits are set, return True, else return False.
There are chances of false positives here, since all bits being 1 doesn’t necessarily
mean that the element was indeed put in.
So, what is the probability of getting a false positive ?
1. The probability of a bit not being set upon insertion of an element is
approximately (1 − 1/m)k
2. The probability of it not being set upon insertion of n elements is similarly
(1 − 1/m)nk ≈ e−nk/m
3. Thus the probability of it being set is approximately 1 − e−nk/m .
4. The probability of k bits being set is (1 − e−nk/m )k .
All this is assuming that the element that sets all the k bits is not observed.
Thus, this is approximately the probability of a false positive given that the
element is not there.. that is, it is FP/(FP+TN) .. which is the False positive
rate.
So, we just need to minimise this over k (software parameter) to get the best
value, for a particular m, n (hardware limitation, and knowledge about process).
Taking a log and then setting derivative to 0,

ln(1 − e−nk/m ) + k(1 − e−nk/m )−1 (e−nk/m (n/m)) = 0

For sanity, define p = e−nk/m . Thus,

ln(1 − p) + p·(nk/m)/(1 − p) = 0 =⇒ k = −m(1 − p) ln(1 − p)/(pn)

Now, from how we defined p , we have k = −m(ln p)/n . This gives

(1 − p) ln(1 − p) = p ln p

Clearly, as p ∈ (0, 1] , and x ln x is an increasing function, this can only happen


once, and by symmetry, at p = 1/2 .
Thus, we get

k = (m/n) ln 2
Thus, the optimum FPR is

δ = 2^{−(m/n) ln 2}

So, to get a false positive rate of δ , you need

m = −n ln δ/(ln 2)2

amount of space at minimum.


For example, for a 1% error rate, you need

m = 2n ln 10/(ln 2)² ≈ 9.6n

amount of bits.. that is, 9.6 bits per element.


You might think that the O(n) space requirements makes this impractical.. but
you are forgetting that if you use something like a hash table instead, you would
still have to store O(n) bits, where the constant associated would be much,
much higher. This is because actual data isn’t just bits.. it’s well, data (32 bits
already for just an integer, as compared to 10 bits per element in a bloom filter).
Checking equality between two data elements requires them to be present in
their entirety . This is where bloom filters shine.
Bloom filters are also easy to use for set unions, intersections, etc, since you can
just OR and AND two bloom filters to get the result.
How about if you wanted to know how many times an element has occurred and
not just whether it has occurred ?

In such a scenario, instead of maintaining m bits, you should maintain m counters
(usually 4 bits).
This is called a counting bloom filter, for obvious reasons.
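A minimal sketch of a plain (non-counting) Bloom filter; the k hash functions are simulated by salting Python's built-in hash, which is an assumption of this sketch rather than something the notes prescribe:

```python
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)                       # one byte per "bit", for simplicity

    def _positions(self, item):
        return [hash((salt, item)) % self.m for salt in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

n = 1000
m = int(-n * math.log(0.01) / math.log(2) ** 2)        # ~9.6 bits per element for 1% FPR
bf = BloomFilter(m, k=round(m / n * math.log(2)))
for i in range(n):
    bf.add(f"item-{i}")
false_pos = sum(f"other-{i}" in bf for i in range(10_000)) / 10_000
print(m, false_pos)                                     # observed FPR should be around 0.01
```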

Linear Counting
Given a stream of data of size n, we want to find the number of distinct elements
S in the stream.
Once again, the traditional hash table approach isn’t fast and memory efficient
enough, since it checks equality linearly through a linked list. Like.. how tf are
you going to use that for a real time stream ?
Ok, so what can we do ?
1. Once again, maintain a bit array B of size m .
2. Have a single hash function h : [U ] → [m]
3. On observing an element x , set B[h(x)] := 1 .
4. After all elements are done, count the number of unset bits as Zm . Return
S := −m ln(Zm /m)
Just like last time, the probability of a given bit not being set after the S distinct elements have been hashed is
(1 − 1/m)^S ≈ e^{−S/m} . So, the expected number of bits that are unset is
me−S/m , which we set to Zm , our observed value, for the best estimate of S .
Thus, we have

Zm = me−S/m =⇒ S = −m ln(Zm /m)

Just like the last time, we require m = O(S) bits for some fixed error rate.
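A minimal sketch of linear counting; a salted built-in hash again stands in for the random hash function h (an assumption of the sketch):

```python
import math

def linear_count(stream, m, salt=0):
    B = bytearray(m)
    for x in stream:
        B[hash((salt, x)) % m] = 1
    Zm = m - sum(B)                      # number of unset bits
    return -m * math.log(Zm / m)

stream = [f"user-{i % 3000}" for i in range(100_000)]    # 3000 distinct elements
print(linear_count(stream, m=10_000))                    # estimate should be close to 3000
```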

Flajolet-Martin Sketch (HyperLogLog Sketch)


• Algorithm
This is another approximate algorithm for finding number of distinct
elements.
1. Define the “zeroes” of a number as number of trailing zeroes at the
right side of the binary representation of the number, namely v2 (x) ,
with the special case v2 (0) := 0.
2. Select a hash function h : U → [0, 2l − 1] randomly.
3. z := 0
4. for x in stream :
1. ẑ := v2 (h(x))
2. z := max(ẑ, z)
5. return 2z+0.5 as the number of distinct elements

The analysis for this goes like this :
First, suppose the actual number of distinct samples is S such that log2 S −
0.5 ∈ (b, a), and you’ve returned Ŝ = 2z+0.5 , then we can show that :

P[Ŝ > 4S] = P[z ≥ a + 2] ≤ 0.35 ,   P[Ŝ < S/4] = P[z ≤ b − 2] ≤ 0.35

This implies that

P [Ŝ ∈ (S/4, 4S)] ≥ 0.3

This isn’t very good by itself, but we can change this drastically by running
this algorithm in parallel with several randomly selected hash functions
simultaneously.
• Medians of estimates
For k estimates Ŝ1 , Ŝ2 . . . Ŝk ,the probability that their median Ŝ is out
of bound is given like this :

P[Ŝ > 4S] = P[|{Ŝ_i > 4S}| > k/2] ≤ e^{−Ω(k)} ,   P[Ŝ < S/4] = P[|{Ŝ_i < S/4}| > k/2] ≤ e^{−Ω(k)}

that is to say, it can go arbitrarily small depending on how refined you


want it to be.
Ok, but how do we prove the first two inequalities ?
Glad you asked
• Proof
Let Xrx be 1 if v2 (h(x)) ≥ r and 0 otherwise
Let Yr = |{x ∈ stream|Xrx }| be number of elements x with Xrx
– Part 1
It’s easy to see that

E_x[X_{r,x}] = 2^{−r} =⇒ E[Y_r] = S·2^{−r}

Now, since Yr is an integer, Yr > 0 ⇐⇒ Yr ≥ 1 . With this, we can


use Markov’s inequality to state that

P [Yr > 0] = P [Yr ≥ 1] ≤ S2−r /1 =⇒ P [Yr > 0] ≤ S/2r

But we should note that we aren’t interested in any random r . We
only want to know if Ŝ > 4S ⇐⇒ z ≥ a + 2 ⇐⇒ Ya+2 > 0 .
So we get

P [Ŝ > 4S] ≤ S/2a+2 < 2a+0.5 /2a+2 = 2−1.5 ≈ 0.35

So, we have

P [Ŝ > 4S] < 2−1.5 ≈ 0.35

– Part 2
Also, as Yr is a whole number, and Yr = 0 ⇐⇒ (Yr − E[Yr ])/E[Yr ] ≤
−1 . And Yr just happens to be the sum of a bunch of Bernoulli
variables. So, we can exploit the lower Chernoff bound

P[Y_r = 0] = P[(Y_r − µ)/µ ≤ −1] ≤ e^{−(−1)²µ/2}

where µ = E[Y_r] = S/2^r .


This gives

P[Y_r = 0] ≤ e^{−S/2^{r+1}}

Once again, we aren’t interested in just any r . We have Ŝ < S/4 ⇐⇒


z ≤ b − 2 =⇒ Yb−1 = 0 .
So, we have

P[Ŝ < S/4] ≤ P[Y_{b−1} = 0] ≤ e^{−S/2^b} ≤ e^{−2^{0.5}} ≈ 0.24 < 0.35

So, finally we have


P[Ŝ > 4S] ≤ 2^{−1.5} ,   P[Ŝ < S/4] ≤ e^{−√2}

• Medians of Means
The kind of approximate algorithms we are reading right now are a category
called ϵ, δ algorithms, the reason for which will become clear in a moment.
You can get even better estimates from the Flajolet-Martin Sketch by
taking means of estimates in batches, and then taking medians of those
batches.
To get the final estimate Ŝ is interval [S(1 − ϵ), S(1 + ϵ)] with probability
1 − δ (and thus, error rate δ) , you should

1. First, run the algorithm k = − ln δ/ϵ2 times in parallel.
2. Divide the results into k2 = − ln δ sub-groups of size k1 = 1/ϵ2 each.
3. For all of the k_2 subgroups , use the mean of their k_1 individual
estimates as the resultant estimate for the subgroup.
4. Calculate the median of the k2 means that you calculated as the final
result.
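A minimal sketch of the Flajolet-Martin estimator with the median-of-means aggregation described above; the hash functions are again simulated by salting Python's hash, an assumption of the sketch (so the estimate is only rough):

```python
import math, statistics

def trailing_zeros(v):
    return (v & -v).bit_length() - 1 if v else 0        # v2(x), with the convention v2(0) := 0

def fm_estimate(stream, salt):
    z = 0
    for x in stream:
        z = max(z, trailing_zeros(hash((salt, x)) & 0xFFFFFFFF))
    return 2 ** (z + 0.5)

def fm_median_of_means(stream, eps=0.5, delta=0.05):
    k1 = math.ceil(1 / eps ** 2)                        # sub-group size
    k2 = math.ceil(math.log(1 / delta))                 # number of sub-groups
    stream = list(stream)
    means = [statistics.mean(fm_estimate(stream, salt=g * k1 + j) for j in range(k1))
             for g in range(k2)]
    return statistics.median(means)

stream = [f"ip-{i % 5000}" for i in range(50_000)]
print(fm_median_of_means(stream))                       # rough estimate of 5000 distinct items
```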

kMV sketch
First, let’s start with this claim :
If you choose n numbers x_1, x_2 . . . x_n from [0, 1] uniformly , then E[min({x_i | i ∈ [n]})] = 1/(n+1) .
• Proof

E[min({x_i | i ∈ [n]})] = n ∫_0^1 x_1 · P(x_i ≥ x_1 ∀i ∈ [2, n]) dx_1 = n ∫_0^1 x_1 (1 − x_1)^{n−1} dx_1 = n · 1!(n−1)!/(n+1)! = 1/(n+1)

(the leading factor n accounts for which of the n numbers is the minimum)

Another claim is:


If you choose n numbers xi ∈ [0, 1] uniformly, then E[1/ min({xi }i )] = ∞ .
• Proof
Just change the x_1 in the integral to 1/x_1 , giving us

E[1/min({x_i}_i)] = n ∫_0^1 (1/x)(1 − x)^{n−1} dx = n ∫_0^∞ (1 − e^{−u})^{n−1} du   (substituting x = e^{−u})

which diverges, since the integrand tends to 1 as u → ∞ . Thus E[1/min({x_i}_i)] = ∞ .

Ok, that’s just for the minimum. What about the 2nd minimum, or 3rd, or say,
tth .
• Reciprocal of tth minimum
Assume all cases where x1 is the tth minimum. Now, the expectation of
f (x1 )It ({xi }i ) where It is the indicator of x1 being the tth minimum is

∫_0^1 f(x_1) · ^{n−1}C_{t−1} · x_1^{t−1} (1 − x_1)^{n−t} dx_1

Now, set f(x_1) = (t−1)/x_1 . This gives us

E[ ((t−1)/x_1) · I_t({x_i}_i) ] = (t−1) · ^{n−1}C_{t−1} ∫_0^1 x_1^{t−2} (1 − x_1)^{n−t} dx_1 = (n−1) · ^{n−2}C_{t−2} ∫_0^1 x_1^{t−2} (1 − x_1)^{n−t} dx_1

Yep, this is the form that a beta function integral has. How did we solve it
? Using recursion. Particularly, do D.I. and then get a recursive relation.
Then, you’ll get the thing. But for simplicity, I’ll just tell you that

∫_0^1 x^a (1 − x)^b dx = a!·b!/(a + b + 1)!

In our case, we get the value of the integral as.. you guessed it .. 1.
So, now, we can write the expected value of the sum of such functions (
f (xi )It,i ({xi }i )) over xi as simply n .
Thus, we have the expected value of t − 1 times the reciprocal of tth
minimum value as n .
Of course, you also have to care about how good of an estimate the expectation
gives. To do that, you would check the variance. Then, you would choose the
minimum t such that the variance is still lower than some particular error bound,
say ϵ2 n2 . After all the heavy math, you will get:

t = ⌈96/ϵ2 ⌉

Ok, now coming to the algorithm.


• Algorithm
On stream x1 , x2 . . . , and given error bound ϵ
Define t = ceil(96/ϵ2 ) .
Choose a hash function h : U → [0, 1] randomly.
hash the first t of xi to get S = {hi |i ∈ [t]} where hi = h(xi ) .
Make S into a max-heap.
for each x in stream,
if max element in S is more than h(x) , then pop it, and insert h(x) to S
pop S to get the maximum element v . This is the tth smallest element in
the stream.
Compute n̂ = (t−1)/v as the estimate for the number of distinct elements.

This has the run-time complexity of O(N log2 t) where N is the length of the
stream.
Now, to get the best results, you need to run this multiple times with l = log(1/δ)
different hash functions and get the mean of the results.
This algorithm actually does an (ϵ, δ) approximation, but the details are too
gory.
This is still not good enough, since you have to compute l hash functions.
So, instead, you can use a single hash function, and split the [0, 1] interval into
l buckets of equal size. Each bucket will have its own heap, containing the
k = O(ϵ^{−2}) minimum elements in that bucket. Now, the number of distinct
elements in bucket i will be around (k − 1)/v_i . The final number of distinct
elements is thus n = Σ_{i=1}^{l} (k−1)/v_i , since elements from different buckets are clearly
distinct, as they have different hash function values.
This is called Stochastic averaging kMV algorithm.

Misra Gries Sketch


This is an algorithm to get estimates for the frequencies of the k most frequent
elements in a stream.
• Initialise hash map H that can contain k elements as keys and their counts
as values easily.
• for x in stream:
– if x ∈ H.keys : H[x] := H[x] + 1
– else if |H| < k : H[x] := 1
– else :
∗ for y ∈ H.keys : H[y] := H[y] − 1
∗ Remove keys with value 0
The count fˆx = H[x] of any key x ∈ H.keys is an estimate of its true frequency.
Particularly,

f_x − N/(k+1) ≤ f̂_x ≤ f_x

where N is the length of the stream.


The right inequality is obvious. For the left inequality, note that if d is the number of
stream elements on which we decremented all keys and i is the number of increments,
then the sum of the stored counts is i − d(k + 1) , which is non-negative. Thus
i ≥ (k + 1)d . Note that we are taking k + 1 and not k because the +1 accounts
for the reduction in count of the thrown out element from what it would have
been had it not been thrown out. But, we know that N ≥ i . Thus, d ≤ N/(k + 1) .
Now, whenever the element in the stream was x , it either increased f̂_x by 1 or
coincided with a decrement of all keys, which loses one occurrence of x . The latter
cannot happen more than d times, so f̂_x ≥ f_x − d ≥ f_x − N/(k+1) .
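A minimal sketch of the Misra-Gries summary above (the toy stream is an arbitrary illustration):

```python
from collections import Counter

def misra_gries(stream, k):
    H = {}
    for x in stream:
        if x in H:
            H[x] += 1
        elif len(H) < k:
            H[x] = 1
        else:
            for y in list(H):                # decrement every stored key
                H[y] -= 1
                if H[y] == 0:
                    del H[y]
    return H

stream = ["a"] * 500 + ["b"] * 300 + [f"noise-{i}" for i in range(200)]
est, true = misra_gries(stream, k=3), Counter(stream)
print({x: (est[x], true[x]) for x in est})   # each estimate is within N/(k+1) of the true count
```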

Hypothesis testing methodology


• Overview
You want to confirm whether a particular hypothesis H1 is true based
on data. We have not collected data yet. Instead, we are planning what
we would do after we have collected the data. Suppose we decide to
collect n samples to give us the observed values of the i.i.d. random
variables X1 . . . Xn (the values of observations come from some unknown
distribution, so we use random variables to think about these values)
. The hypothesis tells us (or claims) the distribution (multiple in case
of composite hypothesis) of Xi by giving us parameters that define this
distribution, namely the distribution is p(X|θ1 ) .
• Composite hypothesis
In case of composite hypothesis, there are multiple distribution parameters.
Essentially, we are claiming that the real distribution is one amongst
{p(X|θ1 )|θ1 ∈ Θ1 } where Θ1 is a set of possible parameter values.
• Alternate, Null, type 1 error
The hypothesis being true will have drastic effects (say, a company investing
in something or someone being judged guilty in a courtroom). So, we do not
want to just say that H1 is plausible. Getting a false positive, aka a “Type
1 error” is costly. To solve this, we define a “null” hypothesis H0 which
is the “opposite” of H1 . For example, if H1 says that P (rapist|man) >
P (rapist|woman) , then H0 says that these are equal. Note that it does
not say that P (rapist|man) ≤ P (rapist|woman) which would be the logical
opposite. So, there are cases which are explainable neither by the null nor
the alternate hypothesis.
From this point on-wards, the question has changed from proving or
disproving H1 to proving or disproving H0 until we reach the NP test,
UMP tests, etc. Essentially, when you reject H0 , you also make the
decision of accepting H1 . This works well when they are logical opposites.
For now, we will think of it like that.
• Test metric
The first thing to do is to create a test metric which we can use to make
the decision about rejecting or accepting the hypothesis rather efficient
and mechanical. Such a metric is called a tests statistic T .
Next we need to choose a (or many) critical value c of T , beyond which
we will reject the null hypothesis. Basically, we will calculate the observed
value t of T on the actual observed data x1 , x2 . . . and then based on

whether t > c (or whatever form this condition takes) we will reject the
null hypothesis.
• Test function
The pair (T, c) combined give us the test function ϕ which is an indicator
function for the rejection of θ0 . The region in the X n = (X1 . . . Xn )
space on which we reject the null hypothesis, is called the rejection region.
R = {xn ∼ X n |ϕ(xn )} .
Power functions
Since “given θ0 is true” , we actually have the whole distribution for Xi
, we can find the probability of the data observed falling in the rejection
region as

β(θ0 ) = P (X n ∈ R|θ0 )

This is called the “power” (of rejection) function, and is only a function of
θ0 and ϕ . Similar to this, you can have

β(θ1 ) = P (X n ∈ R|θ1 )

If the test ϕ is designed carefully, then we should ideally have very low
β(θ0 ) and very high β(θ1 ) .
• Level, size, UMP
If (given θ0 , θ1 ) for the test case ϕ we have β(θ0 ) ≤ α , then we say that
the test is “level α” and if β(θ0 ) = α precisely, we call it “size α” .
If a test ϕ is level α and at the same time we have βϕ (θ1 ) ≥ βψ (θ1 ) for
ANY other level alpha test ψ , then we say that the test is Uniformly Most
powerful. Given θ0 , θ1 for a simple hypothesis, one can create a uniformly
most powerful test as described by the Neyman-Pearson procedure, using
nothing but likelihood functions.
The “test creation” largely hinges on choosing the appropriate test statistic
and then finding a value of c that will make the test level α , or in some
cases, size α .
Moving on to composite tests. In case Θ0 is finite, one can simply calculate
β for each parameter θ0 ∈ Θ0 and take the maximum. But if Θ0 is
continuous, say something like intervals, in that case, you use the upper
bound of β in Θ0 .Namely , a test is level α if supθ0 ∈Θ0 β(θ0 ) ≤ α . We
aren’t saying anything about the UMP test here.
• Neyman-Pearson test
Suppose

f_0(Xⁿ) = L(θ_0|Xⁿ) = P(Xⁿ|θ_0) ,   f_1(Xⁿ) = L(θ_1|Xⁿ) = P(Xⁿ|θ_1)

Define the test statistic

T(Xⁿ) = f_1(Xⁿ)/f_0(Xⁿ)

For a fixed α , define c such that

P (T (X n ) > c | θ0 ) = α

Define the test function as

ϕ(X n ) = I(T (X n ) > c)

This test is a UMP size α test.


• Proof of NP test claim
Consider any other level α test ϕ′ . We have the following :

ϕ(X n ) ≥ ϕ′ (X n )∀X n ∈ R ⇐⇒ ϕ(X n )−ϕ′ (X n ) ≥ 0 ∀ X n s.t. T (X n ) > c =⇒ (ϕ−ϕ′ )(T −c) ≥ 0∀X n ∈ R

Moreover, if T < c , then ϕ = 0 and thus ϕ − ϕ′ ≤ 0 . This gives
(ϕ − ϕ′)(T − c) ≥ 0 even in that condition.
Thus, we have

(ϕ − ϕ′ )(T − c) ≥ 0∀X n =⇒ (ϕ − ϕ′ )(f1 − cf0 ) ≥ 0∀X n .

as likelihood f0 ≥ 0 necessarily.
This gives us:
∫_{Xⁿ} (ϕ − ϕ′) f_1 dXⁿ ≥ c ∫_{Xⁿ} (ϕ − ϕ′) f_0 dXⁿ

But now remember that f0 (X n ) = P (X n |θ0 ) . Thus, we have


∫_{Xⁿ} ϕ f_0 dXⁿ = E[ϕ|θ_0] = P(Xⁿ ∈ R_ϕ|θ_0) = β_ϕ(θ_0)

Similarly for ϕ′ . This transforms the RHS into c·(β_ϕ(θ_0) − β_{ϕ′}(θ_0)) . But
remember, ϕ is size α and ϕ′ is level α . This makes the RHS non-negative,
and we simply get
∫_{Xⁿ} ϕ f_1 dXⁿ ≥ ∫_{Xⁿ} ϕ′ f_1 dXⁿ ⇐⇒ β_ϕ(θ_1) ≥ β_{ϕ′}(θ_1) ...□

• Wald test
Now, NP test is only useful if the hypothesis’ are simple (not composite).
What happens if that’s not the case ? What if your null hypothesis is
simple and alternative hypothesis is not.
Let’s consider this specific form :

H_0 : θ = θ_0 ,   H_1 : θ ≠ θ_0

Let’s also assume that you have an “estimator” function θ̂(X n ) that tells
us an estimate of θ .
This estimator is also asymptotically normal, that is to say,

(θ̂ − θ_0)/se_0 ∼_{N→∞} N(0, 1)

Here se0 is the standard deviation of the estimator, assuming that θ0 is


true.
Note that the estimator can be any complicated model you can dream
of, and it can be the simplest of stuff like an average. The point is that
there is SOME function that tries to predict the true parameters of the
distribution that generated the data. For the most part, it’s easier if you
just think of the average. Consider θ0 as the average value you expect the
distribution to have and θ̂ as the actual observed average from the data.
If θ_0 is true then θ̂ should eventually have a sharp normal distribution
centred around θ0 as N → ∞ .
The test statistic we use is simply

T = (θ̂ − θ_0)/se_0

If se0 (the theoretical standard deviation) is not known, you can use the
observed value seˆ0 .
We set the rejection criterion as ϕ = I(|Tn | ≥ zα/2 ) where zα/2 is a value
such that

lim_{N→∞} P(T_n ≥ z_{α/2} | θ_0) = α/2

With this we have β(θ0 ) → α as N → ∞ .


What about the power of the alternate hypothesis ? There are too many
θ1 to test out. So instead, we will use the assumption that
there does exist “some true θ1 ” under which Tθ is asymptotically
normal, which is not θ0
as H1 .
By asymptotically normal, we mean that

θ̂ − θ_1 ∼_{n→∞} N(0, 1/(n I_1(θ_1)))

Let’s first define

∆ = √(n I_1(θ_1)) · (θ_0 − θ_1)

so that we can write

T(Xⁿ) = √(n I_1(θ_1)) · (θ̂(Xⁿ) − θ_0) ∼_{n→∞} N(−∆, 1) =⇒ T + ∆ ∼_{n→∞} N(0, 1)

Then the probability of rejection using the Wald test function is

P(T < −z_{α/2}) + P(T > z_{α/2}) = P(T + ∆ < ∆ − z_{α/2}) + P(T + ∆ > ∆ + z_{α/2}) = Φ(∆ − z_{α/2}) + 1 − Φ(∆ + z_{α/2})

where Φ is the standard Gaussian cumulative distribution function.


• Likelihood Ratio test (LRT)
This is the most general method possible.
Here we have

H_0 : θ ∈ Θ_0 ,   H_1 : θ ∉ Θ_0

We define the test statistic

λ(Xⁿ) = sup_{θ∈Θ_0} L(θ) / sup_{θ∈Θ} L(θ) = L(θ̂_0)/L(θ̂)

where

– Θ is the full domain of θ
– θ̂ is the MLE estimate over Θ
– θ̂0 is the MLE estimate over Θ0 .
The test function is

ϕ(X n ) = I(λ(X n ) ≤ c)

• χ2 test
Consider a categorical distribution with probability vector θ of dimension
k. We do n trials to find it. For each trial, we have the class noted.
This gives us an estimate of the probabilities which we denote by θ̂ .
Here, nθ̂ ∼ Mult(θ, n) . Also, it’s a fact that you can approximate the
multinomial distribution as :

P(θ̂) ≈ Π_i (1/√(2πθ_i/n)) · exp(−½ Σ_i (θ̂_i − θ_i)²/(θ_i/n))

as discussed over here : https://math.stackexchange.com/questions/2859095/approximating-a-multinomial-as-p-xi-1-ldots-xi-n-propto-exp-left-fracn .
This means we can use

T = Σ_i Z_i²

as a test metric, where

Z_i = (θ̂_i − θ_i)/√(θ_i/n) = (nθ̂_i − nθ_i)/√(nθ_i) = (f̂_i − f_i)/√(f_i)

is the “Z-score” for each class. Of course this is different from the usual
estimation, but for the sake of notation, we’ll use this definition.
Now, Z_i , if you haven’t guessed already, has a standard normal distribution
(when n → ∞) . This means that T has the χ²_{k−1} distribution with k − 1 degrees
of freedom (since Σ_i θ_i = 1 says that Z_k is determined by Z_1 . . . Z_{k−1} ) .
That is, it is the sum of squares of k − 1 standard normal random variables.
That is why this test is called the χ² test.
So, to find the critical value χ²_{k−1,α} after which our hypothesis becomes
unlikely to be true, we have to solve

P(T(θ̂) > χ²_{k−1,α} | θ) = P(T(θ) > χ²_{k−1,α} | θ̂) = α

The LHS is the power function with c = χ²_{k−1,α} , and for convenience, we’ll
just call it c from now on.
This tells us that we can claim with 1 − α confidence that the true value
lies “close” to the estimate, namely that T(θ̂, θ) ≤ c .
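A minimal sketch of this χ² goodness-of-fit test on categorical counts, with scipy supplying the critical value (the dice-roll data is an assumption of the sketch, not anything from the notes):

```python
import numpy as np
from scipy.stats import chi2

def chi_square_gof(observed_counts, theta, alpha=0.05):
    observed = np.asarray(observed_counts, dtype=float)
    n = observed.sum()
    expected = n * np.asarray(theta)                     # f_i = n * theta_i
    T = np.sum((observed - expected) ** 2 / expected)    # sum of the Z_i^2
    critical = chi2.ppf(1 - alpha, df=len(theta) - 1)    # chi^2_{k-1, alpha}
    return T, critical, T > critical                     # reject H0 if T exceeds the critical value

counts = [180, 190, 200, 210, 150, 70]                   # 1000 rolls of a suspicious die
print(chi_square_gof(counts, theta=[1 / 6] * 6))
```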
