Intro To Data Science
Markov Inequality
For a non-negative random variable X
P (X ≥ a) ≤ E[X]/a ∀a > 0
• Proof

E[X] = Σ_x P(x)·x = Σ_{x<a} P(x)·x + Σ_{x≥a} P(x)·x ≥ 0 + Σ_{x≥a} P(x)·a = a·P(X ≥ a) ⟹ P(X ≥ a) ≤ E[X]/a
Chebyshev’s Inequality
P(|X − µ| ≥ a) ≤ σ²/a²
• Proof

σ² = Σ_x P(x)(x−µ)² = Σ_{x∈(µ−a,µ+a)} P(x)(x−µ)² + Σ_{|x−µ|≥a} P(x)(x−µ)² ≥ Σ_{|x−µ|≥a} P(x)·a² = a²·P(|X−µ| ≥ a) ⟹ P(|X−µ| ≥ a) ≤ σ²/a²
Moment Generating Function

M_X(t) = E[e^{tX}] = E[Σ_n tⁿXⁿ/n!] = Σ_n tⁿ E[Xⁿ]/n!

E[Xⁿ] = (dⁿ/dtⁿ) M_X(t) |_{t=0}

Essentially, the moment generating function is a Laplace transform of the
distribution function P, i.e.

M_X(t) = 𝓛[P](−t)
Chernoff Bound

For any random variable X, the random variable Y = e^{tX} is positive. Thus, we can use
Markov's inequality and say

P(S > a) = P(e^{tS} > e^{ta}) ≤ E[e^{tS}]/e^{ta}

Define

S = Σ_i X_i

where the X_i are i.i.d. Bernoulli(p). Then, we have

µ = E[S] = np,  E[e^{tS}] = E[Π_i e^{tX_i}] = Π_i E[e^{tX_i}] = (pe^t + 1 − p)ⁿ ≤ e^{np(e^t − 1)}

P(S > a) ≤ e^{np(e^t − 1) − ta}

P(S > np(1 + d)) ≤ e^{np(e^t − 1) − np(1+d)t} = e^{np(e^t − 1 − t − td)}

So, let's make the bound as tight as we can. How? Take a derivative with respect to t, giving us

e^t − 1 − d = 0 ⟹ t = ln(1 + d)

Thus, we have

P(S/µ − 1 > d) ≤ exp(µ(d − (1 + d) ln(1 + d))) = ( e^d / (1 + d)^{(1+d)} )^µ

But you can use the fact that ln(1 + d) ≥ d/(1 + d/2), and thus

P((S − µ)/µ > d) ≤ [ e^d / (1 + d)^{(1+d)} ]^µ < exp(−µ d²/(2 + d))

The standard forms of the bound are

P(X ≥ (1+δ)µ) ≤ e^{−δ²µ/(2+δ)}
P(X ≤ (1−δ)µ) ≤ e^{−δ²µ/2}
P(|X−µ| ≥ δµ) ≤ 2e^{−δ²µ/3}
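As a quick sanity check (not from the notes), the upper-tail bound above can be compared against the exact binomial tail; the parameters below are arbitrary.

```python
import math

def binom_tail(n, p, a):
    """Exact P(S >= a) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(a, n + 1))

def chernoff_upper(n, p, delta):
    """Upper-tail bound P(S >= (1 + delta)*mu) <= exp(-mu*delta^2 / (2 + delta))."""
    mu = n * p
    return math.exp(-mu * delta**2 / (2 + delta))

# arbitrary example parameters
n, p, delta = 200, 0.3, 0.5
exact = binom_tail(n, p, math.ceil((1 + delta) * n * p))
bound = chernoff_upper(n, p, delta)
assert exact <= bound  # the Chernoff bound dominates the true tail
```

The bound is loose for small deviations but decays exponentially, which is the whole point.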
Clustering
• Unsupervised learning (no labels)
• Detects patterns in images, outliers in data, etc.
• Can be used to combine results of parts of data processed on different
machines
– Split X into X1 , X2 , . . . .
– Use some complicated unsupervised model on Xi , on machine Mi to
get results Yi .
– Combine Y1 , Y2 . . . using clustering
• Can be used for exploratory data analysis (similar to simple scatter plots)
when you don’t know what you are looking for in the data
Distance function
• Clustering needs a notion of distance between data points . These data
points could be vectors, nodes in graph, etc.
• Lp norm for vectors
||x||_p = [ Σ_i |x_i|^p ]^{1/p}
• L0 norm (number of non-zero entries)
• L∞ norm (max_i |x_i|)
Any norm N induces a distance d(x, y) = N(x − y).
• Cosine distance
d(x, y) = 1 − (x · y)/(||x|| ||y||)
• Jaccard distance for sets
The Jaccard similarity is given by
|A ∩ B|
J(A, B) =
|A ∪ B|
The distance is
d(A, B) = 1 − J(A, B)
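The distance functions above can be sketched directly in plain Python (no library assumptions):

```python
import math

def lp_distance(x, y, p):
    """L_p distance: ||x - y||_p = (sum |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_distance(x, y):
    """1 - cos(angle between x and y)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    return 1 - len(a & b) / len(a | b)
```

For example, `lp_distance((0, 0), (3, 4), 2)` is the usual Euclidean distance 5.0, and orthogonal vectors have cosine distance 1.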
• KL divergence (between distributions P and Q)

D_KL(P||Q) = Σ_x P(x) ln(P(x)/Q(x)) = −E_P[ln(Q/P)] = H(P, Q) − H(P)

Thus, using ln y ≤ y − 1,

−D(P||Q) = E_P[ln(Q/P)] ≤ E_P[Q/P − 1] = Σ_x P·(Q/P − 1) = Σ_x (Q − P) = 1 − 1 = 0 ⟹ D(P||Q) ≥ 0
Objective Functions
• radius of cluster
• Inertia
• Intra-cluster distance
For a cluster Ci , the value ∆i is given by one of these definitions depending
on the context
1. Diameter of cluster

∆_i = max_{x,y∈C_i} d(x, y)
2. Average distance between points

∆_i = [ 1 / (|C_i| choose 2) ] Σ_{{x,y}⊆C_i} d(x, y)
3. Average distance to the centroid

µ_i = ( Σ_{x∈C_i} x ) / |C_i|,  ∆_i = ( Σ_{x∈C_i} d(x, µ_i) ) / |C_i|
• Dunn's Index

Suppose the inter-cluster distance δ(C_i, C_j) is also defined for us, depending
on the context. Then Dunn's index is given by

DI = min_{i≠j} δ(C_i, C_j) / max_i ∆_i
K-center
• ∆i = maxx∈Ci d(x, ci ) where Ci ⊆ X is the cluster corresponding to center
ci
• Objective function is max ∆i
• Clearly, for minimum, we must have x ∈ Ci if and only if i = argmini d(x, ci )
.
• We want to find the optimum placements of the k centers in that case.
• NP hard
• The partitioning of data points based on nearest cluster center (for k-center,
k-means, hierarchical clustering, whatever) is called Voronoi partitioning
• 2-approximation algorithm
– Initialise c_1 randomly
– d_x := d(x, c_1) for every x
– for i in 2 . . . k :
∗ c_i := argmax_x d_x
∗ d_x := min(d_x, d(x, c_i))
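A minimal sketch of this farthest-first algorithm (the tuple point representation and the choice of the first point as c_1 are mine):

```python
import math

def k_center_greedy(points, k):
    """Farthest-first traversal: a 2-approximation for the k-center
    objective. Returns the chosen centers and max_x d(x, centers)."""
    centers = [points[0]]                       # c1: any starting point
    d = [math.dist(x, centers[0]) for x in points]
    for _ in range(1, k):
        i = max(range(len(points)), key=lambda j: d[j])   # farthest point
        centers.append(points[i])
        d = [min(d[j], math.dist(points[j], points[i])) for j in range(len(points))]
    return centers, max(d)
```

On two well-separated pairs, e.g. `[(0,0), (0,1), (10,0), (10,1)]` with k = 2, it returns radius 1.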
• Proof of it providing a 2-approximation
First, some notation..
Let Si be the set of clusters, as created by the algorithm, on the i-th
iteration.
d(x, S_i) = min_{c∈S_i} d(x, c),  D_i = max_x d(x, S_i)
Also, define Si∗ as the optimal set of clusters, and Di∗ as the optimal
objective function.
– Case 1 : Exactly one c ∈ S_k per optimal cluster C∗ with center c∗ ∈ S_k∗.
Take any x ∈ X.
Let c∗ be the center x is assigned to in the optimal solution.
Let c be the (unique) center of S_k that lies in the optimal cluster associated with c∗.
By the triangle inequality,
d(x, c) ≤ d(x, c∗) + d(c∗, c) ≤ D_k∗ + D_k∗ = 2D_k∗
But, clearly,
d(x, S_k) ≤ d(x, c)
d(x, S_k) ≤ 2D_k∗
– Case 2 : Some two centers c_i, c_j ∈ S_k (with j < i) fall in the same optimal cluster.
The algorithm picked c_i as the point farthest from S_{i−1}, and c_j ∈ S_{i−1}. Thus,
D_{i−1} ≤ d(c_i, c_j)
and, since c_i and c_j share an optimal cluster with some center c∗,
d(c_i, c_j) ≤ d(c_i, c∗) + d(c∗, c_j) ≤ 2D_k∗
Thus,
D_{i−1} ≤ 2D_k∗
This says that on the i-th iteration itself, which is the first time
that we have 2 centers falling in the same optimal cluster, we have
already brought the objective below twice the optimum.
And obviously D_k ≤ D_{i−1}. Thus,
D_k ≤ 2D_k∗
K-means
• Problem with k-center
K-center is susceptible to outliers. It WILL make the outliers clusters,
even in optimal clustering. This is something we don’t want.
• Objective function
Given k clusters C1 , C2 ... and their respective centers µ1 , µ2 . . . , we define
∆_i = Σ_{x∈C_i} ||x − µ_i||²₂,  L = Σ_{i∈[k]} ∆_i
• Lloyd's algorithm
– Initialise (µ_i)_{i∈[k]} somehow
– for t in 1..N :
∗ D ∈ R^{n×k}, D[i, j] := d(x_i, µ_j)
∗ c ∈ R^n, c[i] := min_{j∈[k]} D[i, j]
∗ I := (D == c[:, new])  (one-hot assignment matrix)
∗ µ_j := sum(I[:, j : j + 1] ∗ X)/sum(I[:, j : j + 1]) for each j
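The loop above, written with NumPy (a sketch; initialisation and the fixed iteration count are placeholders):

```python
import numpy as np

def lloyd(X, mu, n_iter=50):
    """Lloyd's algorithm: assign each point to its nearest centre,
    then move each centre to the mean of its assigned points."""
    mu = mu.copy()
    for _ in range(n_iter):
        # D[i, j] = squared distance from point i to centre j
        D = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = D.argmin(axis=1)
        for j in range(len(mu)):
            pts = X[labels == j]
            if len(pts):                 # leave empty clusters in place
                mu[j] = pts.mean(axis=0)
    return mu, labels
```

Each iteration is O(nkd), matching the O(ndkt) total cost mentioned later.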
• Initialisation
– choose randomly, with uniform probability (fails badly)
– use k-centers (again..outliers)
– k-means ++ (a combination of above 2, but works)
• K-means ++ initialisation
– choose c1 randomly
– C1 = {c1 }
– for i in 2 . . . k:
∗ S := Σ_{x∈X} d(x, C_{i−1})^q
∗ p_x := d(x, C_{i−1})^q / S
∗ choose a point ci ∈ X randomly, with probability px for each
point x ∈ X
∗ Ci := Ci−1 ∪ {ci }
Usually, q = 2 is used.
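A sketch of this seeding procedure in pure Python (`random.choices` performs the weighted draw; the point representation is mine):

```python
import math
import random

def kmeans_pp_init(points, k, q=2, rng=random):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to d(x, C)^q, where C is the set of centres so far."""
    centers = [rng.choice(points)]
    d = [math.dist(x, centers[0]) for x in points]
    for _ in range(1, k):
        # weighted draw: p_x = d(x, C)^q / S
        (c,) = rng.choices(points, weights=[di ** q for di in d], k=1)
        centers.append(c)
        d = [min(di, math.dist(x, c)) for di, x in zip(d, points)]
    return centers
```

Points already chosen have weight 0, so (up to hash-free duplicates in the data) new centres tend to land in uncovered regions.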
With this initialisation, k-means gives an O(log k)-approximation guarantee in expectation.
Moreover, if the data is “nicely clusterable” , an alteration gives us an
O(1) approximation. That is, you just have to initialise, and it’s already
an acceptable clustering.
There are also hybrid methods .. but k-means ++ still wins.
• Parallelizing k-means
As you saw, the algorithm is O(nkN ) .. which isn’t nice if any of these
quantities are big. So, for large data X (which would also need a large k
for correct clustering) , we partition it into X1 , X2 . . . Xp
– Compute µi,j for each Xi and j ∈ [k ′ ] and wi,j = |Ci,j |
– run k-means on {µi,j |i ∈ [p], j ∈ [k ′ ]} with weight wi,j
Note : the weights aren’t all that problematic really. You just have to define
the D matrix as D[i, j] = wi ·d(xi , µj ) , and I := (D == c[:, new])∗w[:, new]
.
This gives us a 4γ(1 + β) + 2β = O(γβ) approximation, where β is the
approximation factor for the sub-problems, and γ is the one for the final clustering.
Now, since k-means++ gives an O(log k) approximation, dividing the problem
such that β = γ = O(log k) gives us an O(log² k) approximation.
• Hierarchical clustering
– We need a distance measure between clusters. Some popular ones are
∗ Smallest distance between nodes
∗ Farthest . . .
∗ Average distance between nodes u ∈ C1 and v ∈ C2
– A single point is just a cluster of cardinality 1.
– Start with singleton clusters
– Join the pair of clusters with minimal distance while #clusters > 1
– Undo the last k − 1 joins to get k clusters finally
– The joinings give us a dendrogram (or phylogram)
– If we just use the closest pair (single linkage) distance, we are basically
performing Kruskal's algorithm for MST, and so we get the MST
– If we use the average distance, we are doing UPGMA clustering,
and we get a UPGMA tree.
– Using the farthest distance is called complete linkage; it has no analogous
classical-structure correspondence.
• Exact algorithm for 1D space
Suppose x1 . . . xn are points on number line, in increasing order. Suppose
the function C(i, k) gives the optimal clustering cost with k clusters, for
all the points from x1 to xi . Then we have
C(i, k) = min_{k≤j≤i} [ C(j − 1, k − 1) + cost(j, i) ]

where cost(j, i) = Σ_{p=j}^{i} (x_p − µ_{j,i})² and µ_{j,i} = (1/(i − j + 1)) Σ_{p=j}^{i} x_p.
One can now solve the problem using dynamic programming; with prefix sums each
cost(j, i) is O(1), giving O(n²k) time (often quoted as O(n²) for constant k).
HW 1
Q1
Σ_{x,y∈C_i} ||x − y||² = Σ_{x,y∈C_i} ||(x − µ_i) − (y − µ_i)||² = Σ_{x,y∈C_i} [ ||x − µ_i||² + ||y − µ_i||² − 2(x − µ_i)ᵀ(y − µ_i) ]

where

µ_i = (1/|C_i|) Σ_{x∈C_i} x

Σ_{y∈C_i} (y − µ_i) = Σ_{y∈C_i} y − Σ_{y∈C_i} µ_i = |C_i|µ_i − |C_i|µ_i = 0⃗

Σ_{x,y∈C_i} ||x − y||² = |C_i| Σ_{x∈C_i} ||x − µ_i||² + |C_i| Σ_{y∈C_i} ||y − µ_i||² − 2 Σ_{x∈C_i} (x − µ_i)ᵀ·0⃗ ⟹ ∆_i := (1/|C_i|) Σ_{x,y∈C_i} ||x − y||² = 2 Σ_{x∈C_i} ||x − µ_i||²
This is exactly twice the cost of a cluster for the k-means algorithm.
Note: here I’m assuming that by x, y ∈ Ci , we mean (x, y) ∈ Ci × Ci , that is,
the order matters. We can also assume unordered pairs, which will give us
∆_i = Σ_{x∈C_i} ||x − µ_i||²
Practically, both work exactly the same, despite the factor of 2. Using the second
one, cost(C) is the same as the k-means cost, and using the first equation, it is
twice that value.
So, we can just use the k-means++ algorithm to minimise this cost.
Q2
Suppose, for a given k ∈ N, we cluster optimally without and with the restriction
(that centers are data points) as (in that order):

1. C = {C_1, C_2 . . . C_k} with centers µ_i := (1/|C_i|) Σ_{x∈C_i} x
2. C′ = {C′_1, C′_2 . . . C′_k} with centers c′_i, each a data point in C′_i

The costs are given as

∆_i = Σ_{x∈C_i} ||x − µ_i||²,  ∆′_i = Σ_{x∈C′_i} ||x − c′_i||²,  cost(C) = Σ_i ∆_i,  cost_c(C′) = Σ_i ∆′_i
Now define

c_i = argmin_{x∈C_i} ||x − µ_i||

as the data point closest to the centroid for each cluster C_i, and let's define

∆^c_i = Σ_{x∈C_i} ||x − c_i||²,  cost_c(C) = Σ_i ∆^c_i

It's clear that cost_c(C) ≥ cost_c(C′), since C′ is the optimal clustering with
centers being data points, while C, with centers c_i, is just some clustering with
centers being data points.
Now, notice that

∆^c_i = Σ_{x∈C_i} ||(x − µ_i) − (c_i − µ_i)||² = Σ_{x∈C_i} [ ||x − µ_i||² + ||c_i − µ_i||² − 2(c_i − µ_i)ᵀ(x − µ_i) ] = ∆_i + Σ_{x∈C_i} ||c_i − µ_i||² − 0

and, by the choice of c_i,

Σ_{x∈C_i} ||c_i − µ_i||² ≤ Σ_{x∈C_i} ||x − µ_i||² = ∆_i

so ∆^c_i ≤ 2∆_i.
Thus, we have

cost(C) ≤ cost_c(C′) ≤ cost_c(C) ≤ 2·cost(C)

The first inequality holds because the centroid µ′_i = (1/|C′_i|) Σ_{x∈C′_i} x
minimises a cluster's sum of squared distances, so cost(C), the unrestricted optimum,
is at most Σ_i Σ_{x∈C′_i} ||x − µ′_i||² ≤ cost_c(C′). Hence the restricted optimum
is a 2-approximation of the unrestricted one.
Q3
P[X ≥ a] ≤ E[X]/a  ∀a > 0

If X is the constant a (i.e. P[X = x] = δ_{x,a}), then

P[X ≥ a] = 1 = a/a = E[X]/a

so Markov's inequality is tight. Since in this case E[X] = a and Var[X] = 0, the Chebyshev inequality
tells us that P[|X − a| ≥ b] ≤ 0 for every b > 0.
Moreover, the Chebyshev inequality is tight as well (since both sides are 0).
This suggests that Markov's inequality being tight is enough, and we can conclude
that the distribution is δ_{x,a} by just that, and Chebyshev's inequality will then also be
tight automatically.
But remember that Markov's inequality is claimed to be true only when P[X <
0] = 0. For example, consider the distribution

P[X = x] = (1/2)δ_{x,−1} + (1/2)δ_{x,1}

Here µ = 0 and σ² = 1, and

P[|X − µ| ≥ 1] = 1 = σ²/1²

that is, Chebyshev's inequality is tight. But this, unlike "Markov's inequality"
being tight, does give us useful information about how the mass is spread around the mean.
Q5
Suppose for a randomly chosen person whom we labelled i , we define the random
variable Xi as 1 if the person is a coffee drinker, and 0 if not. Then, suppose,
{Xi }i , are independent and identically distributed variables with probability
distribution
P[X_i = 1|p] = P_{X|p}[1] = p,  P[X_i = 0|p] = P_{X|p}[0] = 1 − p
and then suppose we randomly ask N people whether or not they drink coffee
to get the observed values for X1 , X2 . . . XN as x1 , x2 . . . xN , we can estimate
the likelihood of p using this data and a prior.
Note : Here X_i is a random variable, because we are choosing a person at
random and then labelling that person i. Thus, X_i is associated with the label
i, and not the person. If it were instead associated with the person, it would
have a fixed value, namely x_i (the observed value).
We assume a uniform distribution over the interval [0, 1] for the value p , that is
(
1 q ∈ [0, 1]
P [p = q] = Pp (q) =
0 otherwise
Keep in mind that this is a density function, not a probability mass function.
Note : Although there IS a fixed value p∗, which is the probability of X_i being
1, we don't know that value. So, depending on the data, we can only get
different estimates for this value. So, we are dealing with the random variable p
instead.
Using this, we can write
P(x⃗|p) := P_{X_1,X_2...X_N|p}(x_1, x_2 . . . x_N) = Π_{i=1}^{N} P_{X|p}(x_i) = p^α (1 − p)^β

where α = Σ_i x_i and β = N − α.
P_{p|x⃗}(q) = P(x⃗|q) P_p(q) / P(x⃗)

∫_{−∞}^{∞} P(x⃗|q) P_p(q) dq = P(x⃗) ⟹ P(x⃗) = ∫_0^1 q^α (1 − q)^β dq
This is solved using a recursive relation. Let

C(α, β) = ∫_0^1 x^α (1 − x)^β dx

Integrating by parts (the boundary term vanishes),

C(α, β) = [ x^{α+1}(1 − x)^β/(α + 1) ]_0^1 + (β/(α + 1)) C(α + 1, β − 1) = (β/(α + 1)) C(α + 1, β − 1)

E[p] = (N + 1)·(N choose α)·C(α + 1, β) = [ (N + 1)·(N choose α) ] / [ (N + 2)·(N+1 choose α+1) ] = (α + 1)/(N + 2)
But we should note that this is NOT the maximum likelihood estimate, or even
the most probable value. That is still the intuitive answer α/N since at q = α/N
, we have
(d/dq) ln P_{p|x⃗}(q) = α/q − β/(1 − q) = 0,  (d²/dq²) ln P_{p|x⃗}(q) = −α/q² − β/(1 − q)² < 0
P_{X_{N+1}|x⃗}(1) = ∫_0^1 P_{X|q}(1) P_{p|x⃗}(q) dq = E[p] = (α + 1)/(N + 2),  P_{X_{N+1}|x⃗}(0) = (β + 1)/(N + 2)

Given the data x⃗ from asking N randomly chosen people, the probability
that a randomly chosen person is a coffee drinker is (α + 1)/(N + 2).
Note that here, we are using the word “is” and not “is expected” , since this
is not an expected value, but something we know for sure, assuming that our
uniform prior distribution is actually correct.
This is in fact a very nice result called Laplace’s rule of succession.
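A quick simulation of the rule (the true probability `p_true` is of course a made-up value):

```python
import random

random.seed(42)
p_true, N = 0.6, 2000           # hypothetical ground truth and sample size
xs = [1 if random.random() < p_true else 0 for _ in range(N)]
alpha = sum(xs)

posterior_mean = (alpha + 1) / (N + 2)  # Laplace's rule of succession
map_estimate = alpha / N                # most probable value

assert abs(posterior_mean - p_true) < 0.05
```

For large N the two estimates coincide; the +1/+2 smoothing only matters when the data is scarce.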
But, we still do want to know what fraction of people are actually coffee drinkers.
This can be answered by thinking of p as a fraction of the total population that
drinks coffee. So now, p is in fact an observable, and can be treated as a random
variable. Thus, we can say
Given the data, (α + 1)/(N + 2) is the expected value of fraction of
people who drink coffee.
Given the data, α/N is the most probable value of fraction of people
who drink coffee
Relying on most probable values is the optimal strategy if the cost of getting
the answer wrong is a constant (E[(X − g)⁰], i.e. 0–1 loss, where g is your guess for a random
variable X), and guessing the "expected" value is the optimal strategy if your
cost scales quadratically with how far off you were (E[(X − g)²]).
So, to get the most profitable estimate, I really need to know what we are trying
to do with the estimate, and what is at stake.
But suppose we just use E[p|x⃗] as the estimate; then we can compute the
width of the interval for 99% confidence using Chebyshev's inequality. Namely,

E[p²|x⃗] = (N + 1)·(N choose α) ∫_0^1 q^{α+2}(1 − q)^β dq = (N + 1)·(N choose α)·C(α + 2, β) = (α + 1)(α + 2) / [ (N + 3)(N + 2) ]
Since we want 99% confidence that the observed value falls in our interval
(E[p] − b, E[p] + b) , we want P [|p − E[p]| ≥ b] ≤ 1% = 0.01 . This can be
achieved by setting an appropriate b such that Var[p]/b2 ≤ 0.01 .
The minimum such value is

b_c = 10 √[ (α + 1)(β + 1) ] / [ (N + 2) √(N + 3) ]

and since (α + 1)(β + 1) ≤ (N + 2)²/4,

b_c ≤ 5/√(N + 3)

So the estimate and interval half-width are

p_exp = (α + 1)/(N + 2),  b_c = 10 √[ (α + 1)(β + 1) ] / [ (N + 2) √(N + 3) ]
If the prior is instead uniform on [0.01, 1] (so the density there is 100/99), the posterior becomes

P_{p|x⃗}(q) = P(x⃗|q) P_p(q) / P(x⃗) = (100/99) q^α (1 − q)^β / P(x⃗)

P_{p|x⃗}(q) = q^α (1 − q)^β / ∫_{0.01}^1 q^α (1 − q)^β dq
Thus, the expected value, second moment and variance are given as

E[p|x⃗] = C′(α + 1, β)/C′(α, β),  E[p²|x⃗] = C′(α + 2, β)/C′(α, β),  Var[p|x⃗] = [ C′(α + 2, β)C′(α, β) − C′(α + 1, β)² ] / C′(α, β)²

where

C′(α, β) = ∫_{0.01}^1 q^α (1 − q)^β dq

C′(α, β) = −(0.01)^{α+1}(0.99)^β/(α + 1) + (β/(α + 1)) C′(α + 1, β − 1)

C′(α, 0) = (1 − 0.01^{α+1})/(α + 1)
That is, with 99% confidence the actual value of the fraction lies in (E[p|⃗x] −
bc , E[p|⃗x] + bc ) .
Densest Subgraph
• Density of a subgraph

Suppose S ≤ G is a subgraph of G (group-theory notation, but you get
the gist of it); then we define the density as

D(S) = e(S, S)/|S|

where e(S, S) is the number of edges with both endpoints in S. It's easy to see that

D(S) = (1/(2|S|)) Σ_{u∈S} d_in(u) = d_in,avg/2 ≤ (|S| − 1)/2

Also, if S is a densest subgraph, every u ∈ S must have d_in(u) ≥ D(S):
otherwise removing u would give density (e(S, S) − d_in(u))/(|S| − 1) > e(S, S)/|S|,
contradicting optimality.
Start with S_k = G
Compute D(S_k)
Compute d_i = d(u_i) ∀u_i ∈ G
for i = k . . . 2 :
– Find the vertex v ∈ S_i such that d_in(v) is minimum
– S_{i−1} := S_i − {v}
– Compute D(S_{i−1})
return argmax_{S_i} D(S_i)
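A sketch of this peeling loop (adjacency sets, undirected simple graph assumed; the naive minimum-degree scan keeps it short rather than fast):

```python
def densest_peel(nodes, edges):
    """Greedy peeling: repeatedly remove a minimum-degree vertex,
    returning the intermediate subgraph of maximum density e(S)/|S|."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    S, m = set(nodes), len(edges)
    best_S, best_density = set(S), m / len(S)
    while len(S) > 1:
        v = min(S, key=lambda u: len(adj[u]))   # minimum-degree vertex
        m -= len(adj[v])
        for w in adj[v]:
            adj[w].discard(v)
        adj[v].clear()
        S.discard(v)
        if m / len(S) > best_density:
            best_S, best_density = set(S), m / len(S)
    return best_S, best_density
```

On a K4 with a pendant vertex attached, peeling the pendant first exposes the K4, whose density 6/4 = 1.5 beats the whole graph's 7/5.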
• Charikar's algorithm gives a 2-approximation, i.e. D(S_c) ≥ D(S_o)/2
• Proof

Since we start with the full graph, the optimal subgraph S∗ is contained in the
current subgraph up to some point. Suppose the last time this happens is when the
subgraph is S_i, and let u be the minimum-degree vertex that the algorithm removes
from S_i (so u ∈ S∗). Then we have

d_{S_i}(u) ≥ d_{S∗}(u)

just because S∗ ⊆ S_i.

Now, we also know:
1. d_{S_i}(u) ≤ avg_{v∈S_i}(d_{S_i}(v)) = 2D(S_i)
2. d_{S∗}(u) ≥ D(S∗) (as proved previously)

Thus,

D(S∗) ≤ d_{S∗}(u) ≤ d_{S_i}(u) ≤ 2D(S_i)

which means S_i is at least half as dense as the densest subgraph. Now, obviously
the algorithm's result is at least as dense as S_i, giving us a 2-approximation.
– Max
This also gives similar results to that of min-cut, as |V | ≈ max(|S|, |S̄|)
– Product
This gives similar results to that of min(|S|, |S̄|) .
Define ϕ′(S) = e(S, S̄)/(|S||S̄|).
• Conductance: ϕ_c(S) = e(S, S̄)/min(d(S), d(S̄)), where d(S) is the total degree (volume) of S.
Laplacian
• For a graph G with adjacency matrix A, we define the Laplacian as L =
D − A where D = diag(A1⃗).
• The vector 1⃗ is an eigenvector of L with λ = 0
• If there is a disconnected component, then the vector with all 1s for nodes
in that component and 0 otherwise is an eigenvector with eigenvalue 0 .
• The second eigenvector gives us an approximation of the min-cut value.
Sparsest cut and λ2
For any vector v, and an undirected graph,
X
v T Lv = (vi − vj )2
(i,j)∈E
Writing v = Σ_i a_i u_i in the eigenbasis of L,

vᵀLv = Σ_i λ_i a_i²

It's easy to see that with the restrictions ||v||² = Σ_i a_i² = 1 and a_1 = 0 (i.e. v ⊥ 1⃗), we get

min vᵀLv = λ_2
λ_2 = min_{v⊥1⃗} [ Σ_{(i,j)∈E} (v_i − v_j)² / Σ_i v_i² ]
Σ_i v_i = 0 ⟹ (Σ_i v_i)² = 0 ⟹ Σ_{i,j} v_i v_j = 0 ⟹ Σ_{{i,j}} (v_i − v_j)² = n Σ_i v_i² − Σ_{i,j} v_i v_j = n Σ_i v_i²

λ_2 = n · min_v [ Σ_{(i,j)∈E} (v_i − v_j)² / Σ_{{i,j}} (v_i − v_j)² ]
v
It's easy to see that when you have v_i = 1 if i ∈ S and v_i = 0 otherwise, for
some S ≤ G, the ratio on the right just becomes

ϕ′(S) := e(S, S̄)/(|S||S̄|)

so λ_2 ≤ n ϕ′(S) for every S.
Note that here, v does not satisfy v ⊥ 1⃗, and thus you CANNOT use
the original expression (the one with ||v||² in the denominator) to calculate ϕ′(S);
instead, we defined ϕ′(S) using the pair-sum expression above, which is invariant
to shifting v by a constant.
Now, much in the past, we had proved that nϕ′(S) ∈ [ϕ(S), 2ϕ(S)]. You should
go read that. But this tells us that

nϕ′(S) ≤ 2ϕ(S) ∀S ⟹ n min_S ϕ′(S) ≤ 2 min_S ϕ(S) ⟹ λ_2 ≤ n min_S ϕ′(S) ≤ 2 min_S ϕ(S)
Note that for the Normalized Laplacian there is an extra factor of the degree in the
denominator; for a d-regular graph (with n = |G|) you would get

λ′_2 = λ_2/d ≤ (n/d) min_S ϕ′(S) ≤ 2 min_S ϕ(S)/d = 2 min_S ϕ_c(S) = 2ϕ_c(G) ⟹ λ′_2 ≤ 2ϕ_c(G)
1. Compute the normalized Laplacian L = I − D^{−1/2} A D^{−1/2}
2. Find the eigenvector v_2(L), also known as the Fiedler vector
3. Sort all vertices by their coordinate values in v_2, from highest to lowest
4. Try partitioning G into (S, S̄) where S holds nodes i with high v_{2,i}, and
S̄ holds the lower-valued ones. You can simply start with all nodes
in S and keep moving the lowest-valued nodes into S̄ one by one, and thus find
the best cut.

This is guaranteed to give conductance ϕ_c ∈ O(√ϕ∗_c) (Cheeger's inequality).
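A NumPy sketch of steps 1–4 (dense eigendecomposition, so only suitable for small graphs; the conductance bookkeeping follows the definitions above):

```python
import numpy as np

def spectral_sweep(A):
    """Sweep cut on the Fiedler vector of L = I - D^{-1/2} A D^{-1/2};
    returns the prefix cut of best conductance."""
    d = A.sum(axis=1)
    Dm = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - Dm @ A @ Dm
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    v2 = Dm @ vecs[:, 1]                 # Fiedler vector (rescaled)
    order = np.argsort(-v2)              # highest coordinate first
    vol = d.sum()
    best_phi, best_S = np.inf, None
    for t in range(1, len(A)):
        S, Sbar = order[:t], order[t:]
        cut = A[np.ix_(S, Sbar)].sum()   # edges crossing the cut
        phi = cut / min(d[S].sum(), vol - d[S].sum())
        if phi < best_phi:
            best_phi, best_S = phi, set(S.tolist())
    return best_S, best_phi
```

On two triangles joined by a single bridge edge, the sweep recovers one triangle with conductance 1/7.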
Mixture models
We assume that the data points come from an ensemble of K different distributions,
each having a mixing probability ϕ_i, where ϕ ∈ R^K, ϕ_i ≥ 0 and Σ_i ϕ_i = 1.
Usually, these distributions are Gaussian, namely the i-th one is N(µ_i, σ_i²).
The probability of getting a data-point xi is thus
P(x_i) = Σ_{k=1}^{K} [ ϕ_k/√(2πσ_k²) ] exp( −(x_i − µ_k)²/(2σ_k²) ) = Σ_{k=1}^{K} ϕ_k p(x_i|θ_k)
The likelihood of the data and its logarithm are

Π_{i=1}^{N} Σ_{k=1}^{K} ϕ_k p(x_i|θ_k),  Σ_{i=1}^{N} ln Σ_{k=1}^{K} ϕ_k p(x_i|θ_k)
Taking equal weights ϕ_k, Σ_k = I, and approximating each inner sum by its
largest term (with c(x_i) the nearest center to x_i),

Σ_{i=1}^{N} ln( Σ_{k=1}^{K} ϕ_k [1/√(2π|Σ_k|)] e^{−(x_i−µ_k)ᵀΣ_k^{−1}(x_i−µ_k)/2} ) = NC + Σ_{i=1}^{N} ln( Σ_{k=1}^{K} e^{−||x_i−µ_k||²/2} ) ≈ NC + Σ_{i=1}^{N} ln( e^{−||x_i−c(x_i)||²/2} ) = NC − (1/2) Σ_{i=1}^{N} ||x_i − c(x_i)||²
. . . which gives the K-means objective function.. weird. This suggests that
k-means has got to do something with this.
In fact, there is a (proven) better method, which is basically k-means, but
probabilistic. It’s called Expectation maximisation.
1. You first assign the labels to the data-points, but in a probabilistic manner.
Namely,
z_{i,k} = ϕ_k p(x_i|θ_k) / Σ_{k′} ϕ_{k′} p(x_i|θ_{k′})
Btw, this is what you compute for the final classification too, except you
return a specific label by sampling with distribution ⃗zi .
2. Update the parameters, just as you would update in k-means, taking the
⃗zi stuff we created as absolute truth.
ϕ_k := (1/N) Σ_{i=1}^{N} z_{i,k},  µ_k := Σ_{i=1}^{N} z_{i,k} x_i / Σ_{i=1}^{N} z_{i,k},  Σ_k := Σ_{i=1}^{N} z_{i,k} (x_i − µ_k)(x_i − µ_k)ᵀ / Σ_{i=1}^{N} z_{i,k}
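A 1-D sketch of these two steps (quantile initialisation is my choice, not from the notes; a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100):
    """EM for a 1-D Gaussian mixture: the E-step computes soft labels
    z[i, k], the M-step re-estimates (phi, mu, sigma2) from them."""
    phi = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # spread-out initial means
    sigma2 = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: z[i, k] ∝ phi_k * N(x_i | mu_k, sigma2_k)
        p = phi * np.exp(-((x[:, None] - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        z = p / p.sum(axis=1, keepdims=True)
        # M-step: treat the soft labels as ground truth
        Nk = z.sum(axis=0)
        phi = Nk / len(x)
        mu = (z * x[:, None]).sum(axis=0) / Nk
        sigma2 = (z * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return phi, mu, sigma2
```

With hard (0/1) labels instead of soft ones, the M-step reduces to exactly the Lloyd update, which is the connection the notes point at.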
For two independent standard Gaussian points x, y ∈ R^d, each coordinate pair gives
E[(x_i − y_i)²] = E[x_i²] + E[y_i²] − 2E[x_i]E[y_i] = 2, so E[||x − y||²] = 2d
(the notes compute the d = 2 case in polar coordinates, 2πE[||x−y||²] = ∫∫ ||x−y||² e^{−x²/2−y²/2} dx dy,
with the cross terms vanishing). As for the variance, the Gaussian fourth moments give
Var[||x − y||²] = O(d).

Also, it's easy to see that E[||x||²] = d and the fourth moment is ⟨x⁴⟩ = 3σ⁴ = 3 per
coordinate (use the cumulant generating function or something, idk.. just get this somehow),
so Var[||x||²] = O(d) as well. Thus, we have

||x − y||² = 2d ± θ(√(8d)),  ||x||² = d ± θ(√(3d))
With this, we can get an upper bound on the angle between two points

||x − y||² = ||x||² + ||y||² − 2xᵀy ⟹ xᵀy/(||x|| ||y||) = (||x||² + ||y||² − ||x − y||²)/(2||x|| ||y||) = ±θ(√(11d)) / [ 2√(d² + 2dθ(√d) + θ(√d)²) ] = ±θ(1/√d)
(with the constant factor of 11/2 absorbed into the θ .. let's move on.)

What this result says is that, in a very high dimensional space, it's extremely
hard to find points that are not almost orthogonal.
E[||x−y||²] = E[||(x−µ_1) − (y−µ_2) + (µ_1−µ_2)||²] = E[||x−µ_1||²] + E[||y−µ_2||²] + ||µ_1−µ_2||² − 2(0) − 2(0) + 2(0) = σ_1²d + σ_2²d + ∆²

where ∆ = ||µ_1 − µ_2||.
Now, just like the last time, you have Var[||x − y||2 ] = O(σ14 d + σ24 d) .
So, we have
||x − y||² = σ_1²d + σ_2²d + ∆² + θ(√(σ_1⁴d + σ_2⁴d))

and, for two points y, z drawn from the second Gaussian,

||z − y||² = 2σ_2²d + θ(σ_2²√d)

For inter-cluster distances to separate from intra-cluster ones, we need

∆² = (σ_2² − σ_1²)d + θ(√d · √(σ_1⁴ + σ_2⁴))
So,
∆ = √(|σ_2² − σ_1²| d)  if σ_2 ≠ σ_1
∆ = θ(d^{1/4} σ)  if σ_1 ≈ σ_2 ≈ σ
χ2 distribution
If Xi ∼ N (0, 1) are i.i.d , then
Z := Σ_{i=1}^{d} X_i²  ⟺  Z ∼ χ²(d)
P[Z > d(1 + δ)] ≤ e^{−dδ²/18}
√
Then, if you define Y = √Z, we can write

P[Y > √d + β] = P[Z > d + 2√d β + β²] = P[Z > d(1 + (2β/√d + β²/d))] ≤ exp( −(d/18)(2β/√d + β²/d)² )

Supposing β²/√d ≪ β (i.e. β ≪ √d), we can write

P[Y > √d + β] ≤ exp(−cβ²)

and, since ||x⃗|| for a standard Gaussian x⃗ is exactly such a Y,

P[||x⃗|| > √d + β] ≤ exp(−cβ²)
Spectral Algorithm
This is different from the spectral graph clustering algorithm. This one is just
normal clustering. Suppose you have a probability density which you have
somehow figured out is from a k component Gaussian mixture.
Now, you want to cluster the data into k clusters. This is the standard k means
application if you are a bit lazy and don’t want to do the whole EM thing.
There's just one problem.. the dimension d is extremely large. So, the compu-
tational cost of k-means, which is O(ndkt), is very high.
So, now what ?
This is where SVD is useful. Before that, there are a few things to think about :
1. The orthographic (discarding extra dimensions) projection of a standard
Gaussian probability density in d dimensions to k dimensions also produces
a standard Gaussian density with the same variance as before.
2. The best fit line v ∈ R^d for a d dimensional density is one that maximises
E[(vᵀx)²] subject to the constraint ||v|| = 1. When the density is a single
Gaussian, this is the line passing through the centre of that Gaussian and
the origin. If X is the data matrix, with rows as data points, then the
expectation simply becomes (1/n) vᵀXᵀXv. This is maximised when v is the first
right singular vector of X.
3. The best fit hyperplane V ∈ R^{d×k} is one that maximises E[||Vᵀx||²] where
V is orthonormal, or, if we are dealing with data, it maximises (1/n)||XV||²_F.
Clearly, this happens when V is made of the first k right singular vectors.
When the density is a mixture of k Gaussians, this is the hyper-plane
passing through the centres of all of those, as well as the origin.
So, what we should do is simply do a projection to the best fit k dimensional
hyperplane using V_k, which is the matrix with the first k columns of V from the
SVD X = UΣVᵀ. We do this as x̂_i = V_kᵀ x_i.

Then, we can just cluster using k-means to estimate µ̂_k, Σ̂_k, ϕ_k for the GMM in
the k dimensional space.
Then, once you have all these parameters, you essentially have the Gaussian
mixture model that generated this data. You can either straight away go on to
the prediction phase, or refine further using the EM algorithm.
Note that this GMM lives in the lower dimensional space. So, for prediction, you
will always have to compute x̂ = V_kᵀ x first.
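The projection step can be sketched with NumPy (k is known from the mixture; everything else is standard SVD):

```python
import numpy as np

def project_to_k(X, k):
    """Project each row of X onto the span of the top-k right singular
    vectors: x_hat = V_k^T x (rows of X are data points)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T          # first k right singular vectors as columns
    return X @ Vk, Vk
```

For exactly rank-k data the projection loses nothing: `Xh @ Vk.T` reconstructs X.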
Sampling
We are dealing with a system that has limited storage. The system receives
a data stream x1 , x2 . . . xn . Our aim is to somehow sample from the data
uniformly.
• When n is known.
In this case, you can simply generate k indices from 1 to n with equal
probability , and sample values on those indices when they occur
• Algorithm R
Initialise an array R of size k , with starting index as 1. for i, x in enum (
stream ) :
j = rand([1 . . . i])
if j ≤ k :
R[j] = x
else:
continue
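Algorithm R as runnable code (1-indexed stream position, matching the pseudocode above):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using O(k) memory."""
    R = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            R.append(x)          # fill the reservoir first
        else:
            j = rng.randint(1, i)
            if j <= k:
                R[j - 1] = x     # replace a uniformly chosen slot
    return R
```

Each element of the stream ends up in R with probability exactly k/n, which is the claim proved next.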
Claim : This process finally selects k samples from the stream such that
the probability of each element being sampled is k/n when n elements
have been seen.
Clearly, for n = k this claim is true, since till that point, we can just fill
whatever we see.
Now, if this is proved true till n = N , then for n = N + 1 , we have
– The probability of any of x1 . . . xN being in R just before xN +1 arrives
is, by the induction hypothesis, k/N
– The probability that for i = N + 1 , index j is chosen (and thus, the
value at that index replaced by xN +1 if j ≤ k) is 1/(N + 1).
– The probability that xi is put into R by the end of first N items AND
xi is not replaced by xN +1 is (k/N )(1 − 1/(N + 1)) = k/(N + 1) .
– The probability that one of x_1 . . . x_N stays in R by the end of the
i = N + 1 iteration is thus k/(N + 1)
– The probability that xN +1 is on R is also k/(N + 1)
Hashing
To give every key a fair chance to be stored in any location, we store a lot of
hash functions in a family H = {h1, h2 . . . }, and choose one (h) at random.
The number of bits required to specify which hash function to use is log2 (|H|) .
Suppose the universe of items is U and the hash table has m indices, we never
have this number (bits needed to choose h) to go beyond |U | log2 m since m|U |
is the number of different hash functions you could have.
The different types of Hashing families are (in higher to lower quality) :
1. Uniform Hash family : Ph [h(x) = i] = 1/m
2. Universal : Ph [h(x) = h(y)] = 1/m
3. Near Universal: Ph [h(x) = h(y)] ≤ 2/m
For a hash function h , define Cx,y = 1(h(x) = h(y)) as an indicator function.
Also, for data stream of size n , define l(x) as the length of the chain (linked
list) at index h(x) .
We can write :
E_h[l(x)] = E_h[ Σ_y C_{x,y} ] = Σ_y P_h[h(x) = h(y)]
these bits, and not the full object in its linked lists. And of-course, we’ll have to
store n such items.
The query time is (for a good hash function) : O(n/m) . This is called the load
factor.
Another very efficient way of storing keys from U is through a bit-array. Basically,
you convert every object to a log2 |U | index. Then, you set the bit at the index
on the bit array to 1. The size of the bit array is obviously |U | . This allows for
O(1) queries.. but do you really want to do it like this ? .. with O(|U |) space ?
Ok, this is good and all, but like .. what does a hashing function look like ?
• Prime Multiplicative Hashing
– Pick a prime p > |U|
– For any a ∈ [p − 1] and b ∈ [p − 1], define h_{a,b}(x) = (ax + b) mod p
(followed by mod m to land in a table of size m)
– Define H = {h_{a,b}|a, b ∈ [p − 1]} as the hash family. Thus, you
have |H| = (p − 1)².
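A sketch of drawing one function from this family (the prime below is a placeholder; any prime larger than every key works):

```python
import random

P = 2_147_483_647  # a Mersenne prime, assumed larger than any key

def make_hash(m, rng=random):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m at random from the family."""
    a = rng.randint(1, P - 1)
    b = rng.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m
```

Choosing (a, b) once at startup is what gives the per-pair collision guarantee; re-drawing per query would destroy it.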
Bloom filter
You have a bit array B of size m , initialised to 0. Every time you want to insert
a key, you pass it through k hash functions and set the bits in B at those k
output values to 1.
Then, to check if an element was observed before, you hash again, and if all k
bits are set, return True, else return False.
There are chances of false positives here, since all bits being 1 doesn’t necessarily
mean that the element was indeed put in.
So, what is the probability of getting a false positive ?
1. The probability of a bit not being set upon insertion of an element is
approximately (1 − 1/m)k
2. The probability of it not being set upon insertion of n elements is similarly
(1 − 1/m)nk ≈ e−nk/m
3. Thus the probability of it being set is approximately 1 − e−nk/m .
4. The probability of k bits being set is (1 − e−nk/m )k .
All this is assuming that the element that sets all the k bits is not observed.
Thus, this is approximately the probability of a false positive given that the
element is not there.. that is, it is FP/(FP+TN) .. which is the False positive
rate.
So, we just need to minimise this over k (software parameter) to get the best
value, for a particular m, n (hardware limitation, and knowledge about process).
Taking a log and then setting derivative to 0,
For sanity, define p = e^{−nk/m}. Thus,

ln(1 − p) + (pnk/m)/(1 − p) = 0 ⟹ k = −m(1 − p) ln(1 − p)/(pn)

and substituting k = −(m/n) ln p gives

(1 − p) ln(1 − p) = p ln p ⟹ p = 1/2

k = (m/n) ln 2

Thus, the optimum FPR is

δ = 2^{−(m/n) ln 2}  i.e.  m = −n ln δ/(ln 2)²
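A small Bloom filter using the optimal k = (m/n) ln 2 derived above (salted SHA-256 stands in for the k independent hash functions, an implementation choice of this sketch):

```python
import hashlib
import math

class BloomFilter:
    """Bit array of size m with k hash functions, where n is the
    expected number of insertions."""
    def __init__(self, m, n):
        self.m = m
        self.k = max(1, round(m / n * math.log(2)))
        self.bits = [0] * m

    def _indices(self, item):
        for i in range(self.k):          # k salted hashes of the item
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indices(item))
```

Membership tests can produce false positives but never false negatives.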
In such a scenario, instead of maintaining m bits, you should maintain m counters
(usually 4 bits).
This is called a counting bloom filter, for obvious reasons.
Linear Counting
Given a stream of data of size n, we want to find the number of distinct elements
S in the stream.
Once again, the traditional hash table approach isn’t fast and memory efficient
enough, since it checks equality linearly through a linked list. Like.. how tf are
you going to use that for a real time stream ?
Ok, so what can we do ?
1. Once again, maintain a bit array B of size m .
2. Have a single hash function h : [U ] → [m]
3. On observing an element x , set B[h(x)] := 1 .
4. After all elements are done, count the number of unset bits as Z_m. Return
S := −m ln(Z_m/m)

Just like last time, the probability of a bit not being set after S distinct elements is
(1 − 1/m)^S ≈ e^{−S/m}. So, the expected number of unset bits is
me^{−S/m}, which we equate to Z_m, our observed value, for the best estimate of S.
Thus, we have

Ŝ = −m ln(Z_m/m)
Just like the last time, we require m = O(S) bits for some fixed error rate.
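A sketch of the estimator (the memoised dict plays the role of a fixed hash function h: U → [m] for illustration only; a real implementation hashes in O(1) space):

```python
import math
import random

def linear_count(stream, m, rng=random):
    """Estimate the number of distinct elements as -m * ln(Z/m),
    where Z is the number of bits left unset."""
    h = {}                  # stands in for a fixed random hash function
    bits = [0] * m
    for x in stream:
        if x not in h:
            h[x] = rng.randrange(m)
        bits[h[x]] = 1
    Z = bits.count(0)
    return -m * math.log(Z / m)
```

With m comfortably larger than the distinct count, the estimate lands close to the truth.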
The analysis for this goes like this :
First, suppose the actual number of distinct samples is S such that log2 S −
0.5 ∈ (b, a), and you’ve returned Ŝ = 2z+0.5 , then we can show that :
This isn’t very good by itself, but we can change this drastically by running
this algorithm in parallel with several randomly selected hash functions
simultaneously.
• Medians of estimates
For k estimates Ŝ1 , Ŝ2 . . . Ŝk ,the probability that their median Ŝ is out
of bound is given like this :
P[Ŝ > 4S] = P[ |{i : Ŝ_i > 4S}| > k/2 ] = e^{−Ω(k)},  P[Ŝ < S/4] = P[ |{i : Ŝ_i < S/4}| > k/2 ] = e^{−Ω(k)}
But we should note that we aren’t interested in any random r . We
only want to know if Ŝ > 4S ⇐⇒ z ≥ a + 2 ⇐⇒ Ya+2 > 0 .
So we get
So, we have
– Part 2
Also, as Yr is a whole number, and Yr = 0 ⇐⇒ (Yr − E[Yr ])/E[Yr ] ≤
−1 . And Yr just happens to be the sum of a bunch of Bernoulli
variables. So, we can exploit the lower Chernoff bound
P[Y_r = 0] = P[(Y_r − µ)/µ ≤ −1] ≤ e^{−(−1)²µ/2}

⟹ P[Y_r = 0] ≤ e^{−S/2^{r+1}}

P[Ŝ < S/4] ≤ P[Y_{b−1} = 0] ≤ e^{−S/2^b} ≤ e^{−2^{0.5}} ≈ 0.24 < 0.35

and similarly P[Ŝ > 4S] ≤ 2^{−1.5} ≈ 0.35.
• Medians of Means
The kind of approximate algorithms we are reading right now are a category
called ϵ, δ algorithms, the reason for which will become clear in a moment.
You can get even better estimates from the Flajolet-Martin sketch by
taking means of estimates in batches, and then taking medians of those
batches.
To get the final estimate Ŝ is interval [S(1 − ϵ), S(1 + ϵ)] with probability
1 − δ (and thus, error rate δ) , you should
1. First, run the algorithm k = − ln δ/ϵ2 times in parallel.
2. Divide the results into k2 = − ln δ sub-groups of size k1 = 1/ϵ2 each.
3. For all of the k_2 subgroups, use the mean of their k_1 individual
estimates as the resultant estimate for the subgroup.
4. Calculate the median of the k2 means that you calculated as the final
result.
kMV sketch
First, let’s start with this claim :
If you choose n numbers x_1, x_2 . . . x_n from [0, 1] uniformly, then
E[min({x_i|i ∈ [n]})] = 1/(n + 1).
• Proof

The minimum has density n(1 − x)^{n−1} (any of the n values can be the minimum,
and the other n − 1 must all exceed it), so

E[min({x_i|i ∈ [n]})] = ∫_0^1 x · n(1 − x)^{n−1} dx = n · 1!(n−1)!/(n+1)! = 1/(n + 1)

using the Beta integral ∫_0^1 x^a(1 − x)^b dx = a!b!/(a + b + 1)!.
Ok, that’s just for the minimum. What about the 2nd minimum, or 3rd, or say,
tth .
• Reciprocal of tth minimum
Consider all cases where x₁ is the tth minimum. Now, the expectation of f (x₁ )I_t ({xᵢ }ᵢ ), where I_t is the indicator of x₁ being the tth minimum, is

∫₀¹ f (x₁ ) · ⁿ⁻¹C_{t−1} (x₁ )^{t−1} (1 − x₁ )^{n−t} dx₁
Now, set f (x₁ ) = (t − 1)/x₁. This gives us

E[((t − 1)/x₁ ) I_t ({xᵢ }ᵢ )] = (n − 1) ⁿ⁻²C_{t−2} ∫₀¹ x₁^{t−2} (1 − x₁ )^{n−t} dx₁
Yep, this is the form that a beta function integral has. How did we solve it
? Using recursion. Particularly, do D.I. and then get a recursive relation.
Then, you’ll get the thing. But for simplicity, I’ll just tell you that
∫₀¹ x^{a} (1 − x)^{b} dx = a! b!/(a + b + 1)!
In our case (a = t − 2, b = n − t), the integral comes out to (t − 2)!(n − t)!/(n − 1)!, which exactly cancels the (n − 1) ⁿ⁻²C_{t−2} prefactor, so the expectation is.. you guessed it .. 1.
So, now, we can write the expected value of the sum of such functions (Σᵢ f (xᵢ )I_{t,i} ({xᵢ }ᵢ )) over all i as simply n.
Thus, we have the expected value of (t − 1) times the reciprocal of the tth minimum value as n.
Of course, you also have to care about how good of an estimate the expectation
gives. To do that, you would check the variance. Then, you would choose the
minimum t such that the variance is still lower than some particular error bound,
say ϵ2 n2 . After all the heavy math, you will get:
t = ⌈96/ϵ2 ⌉
This has the run-time complexity of O(N log2 t) where N is the length of the
stream.
Now, to get the best results, you need to run this multiple times with l = log(1/δ)
different hash functions and get the mean of the results.
This algorithm actually does an (ϵ, δ) approximation, but the details are too
gory.
This is still not good enough, since you have to compute l hash functions. So instead, you can use a single hash function, and split the [0, 1] interval into l buckets of equal size. Each bucket has its own heap, containing the k = O(ϵ⁻²) minimum elements in that bucket. Now, the number of distinct elements in bucket i will be around (k − 1)/vᵢ. The final number of distinct elements is thus

n = Σᵢ₌₁ˡ (k − 1)/vᵢ

since elements from different buckets are clearly distinct, as they have different hash values.
This is called Stochastic averaging kMV algorithm.
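A minimal kMV sketch along these lines (single hash function, no stochastic averaging); the SHA-256-based hash standing in for a uniform hash into [0, 1) is my choice for the sketch, not something the notes prescribe:

```python
import hashlib
import heapq

def uniform_hash(x) -> float:
    """Map an element to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def kmv_estimate(stream, t: int) -> float:
    """Estimate the number of distinct elements as (t - 1) / x_(t),
    where x_(t) is the t-th smallest distinct hash value seen."""
    heap = []          # max-heap (via negation) of the t smallest hashes
    members = set()    # current heap contents, for O(1) duplicate checks
    for x in stream:
        v = uniform_hash(x)
        if v in members:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)
            members.discard(evicted)
            members.add(v)
    if len(heap) < t:
        return float(len(heap))   # fewer than t distinct elements: exact
    return (t - 1) / (-heap[0])   # -heap[0] is the t-th minimum

stream = [i % 10000 for i in range(50000)]  # 10000 distinct elements
print(kmv_estimate(stream, t=400))
```

With t = 400 the relative error is around 1/√t ≈ 5%, matching the t = O(1/ϵ²) scaling above.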
f_x − N/(k + 1) ≤ f̂_x ≤ f_x
1, or induced a decrement in values of all keys, except those not in H , which
includes x . This cannot happen more than d times.
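A frequency guarantee of this shape — f_x − N/(k + 1) ≤ f̂_x ≤ f_x over a stream of length N — is the one provided by the Misra-Gries summary with k counters (assuming that is the algorithm being analyzed here); a minimal sketch:

```python
def misra_gries(stream, k: int) -> dict:
    """Misra-Gries summary with k counters. For every key x,
    f_x - N/(k+1) <= counters.get(x, 0) <= f_x, where N = len(stream)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # A new key arrived with no free counter: decrement every
            # stored counter, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# 'a' occurs 50 times in a stream of length 100, so with k = 4 its
# estimate must lie in [50 - 100/5, 50] = [30, 50].
stream = ["a"] * 50 + ["b"] * 30 + [f"junk{i}" for i in range(20)]
print(misra_gries(stream, k=4))
```
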
whether t > c (or whatever form this condition takes) we will reject the
null hypothesis.
• Test function
The pair (T, c) together give us the test function ϕ, which is an indicator function for the rejection of θ₀. The region in the Xⁿ = (X₁ … Xₙ) space on which we reject the null hypothesis is called the rejection region, R = {xⁿ : ϕ(xⁿ ) = 1}.
Power functions
Since, given that θ₀ is true, we actually have the whole distribution for the Xᵢ, we can find the probability of the observed data falling in the rejection region as
β(θ0 ) = P (X n ∈ R|θ0 )
This is called the “power” (of rejection) function, and is only a function of
θ0 and ϕ . Similar to this, you can have
β(θ1 ) = P (X n ∈ R|θ1 )
If the test ϕ is designed carefully, then we should ideally have very low
β(θ0 ) and very high β(θ1 ) .
• Level, size, UMP
If (given θ₀, θ₁) for the test ϕ we have β(θ₀ ) ≤ α, then we say that the test is "level α", and if β(θ₀ ) = α precisely, we call it "size α".
If a test ϕ is level α and at the same time we have β_ϕ (θ₁ ) ≥ β_ψ (θ₁ ) for ANY other level-α test ψ, then we say that the test is uniformly most powerful (UMP). Given θ₀, θ₁ for a simple hypothesis, one can create a uniformly most powerful test as described by the Neyman-Pearson procedure, using nothing but likelihood functions.
The “test creation” largely hinges on choosing the appropriate test statistic
and then finding a value of c that will make the test level α , or in some
cases, size α .
Moving on to composite tests. In case Θ₀ is finite, one can simply calculate β for each parameter θ₀ ∈ Θ₀ and take the maximum. But if Θ₀ is continuous, say an interval, you instead use the supremum of β over Θ₀. Namely, a test is level α if sup_{θ₀ ∈ Θ₀} β(θ₀ ) ≤ α. We aren't saying anything about the UMP test here.
• Neyman-Pearson test
Suppose

f₀ (Xⁿ ) = L(θ₀ |Xⁿ ) = P (Xⁿ |θ₀ ),  f₁ (Xⁿ ) = L(θ₁ |Xⁿ ) = P (Xⁿ |θ₁ )

Define the likelihood-ratio statistic

T (Xⁿ ) = f₁ (Xⁿ )/f₀ (Xⁿ )

and choose c so that the test ϕ = I(T > c) has size α:

P (T (Xⁿ ) > c | θ₀ ) = α
Let ϕ′ be any other level-α test. Wherever T > c we have ϕ = 1 ≥ ϕ′, and wherever T ≤ c we have ϕ = 0 ≤ ϕ′; in both cases

(ϕ − ϕ′ )(T − c) ≥ 0 ∀ Xⁿ

Multiplying by the likelihood f₀ ≥ 0 and integrating, this gives us:

∫_{Xⁿ} (ϕ − ϕ′ ) f₁ dXⁿ ≥ c ∫_{Xⁿ} (ϕ − ϕ′ ) f₀ dXⁿ
Similarly for ϕ′. This transforms the RHS into c(β_ϕ (θ₀ ) − β_{ϕ′} (θ₀ )). But remember, ϕ is size α and ϕ′ is level α. This makes the RHS non-negative, and we simply get

∫_{Xⁿ} ϕ f₁ dXⁿ ≥ ∫_{Xⁿ} ϕ′ f₁ dXⁿ ⇐⇒ β_ϕ (θ₁ ) ≥ β_{ϕ′} (θ₁ ) ...□
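A sketch of the Neyman-Pearson statistic for one concrete simple-vs-simple case — i.i.d. N(μ₀, σ²) against N(μ₁, σ²). The Gaussian choice and the sample data are mine, purely for illustration:

```python
import math

def np_statistic(xs, mu0: float, mu1: float, sigma: float) -> float:
    """T(X^n) = f1(X^n) / f0(X^n) for i.i.d. Gaussian likelihoods,
    computed in log-space for numerical stability."""
    log_t = sum(((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
                for x in xs)
    return math.exp(log_t)

# Data whose sample mean is near mu0 = 0 gives T << 1; data near mu1 = 1
# gives T >> 1, so thresholding T separates the two hypotheses.
null_like = [0.1, -0.2, 0.0, 0.3, -0.1]   # sample mean 0.02
alt_like = [0.9, 1.2, 1.0, 0.8, 1.1]      # sample mean 1.0
print(np_statistic(null_like, 0.0, 1.0, 1.0))  # below 1
print(np_statistic(alt_like, 0.0, 1.0, 1.0))   # above 1
```

For Gaussians with common σ, T is a monotone function of the sample mean, so the NP test reduces to thresholding x̄ and c can be calibrated from the null distribution of x̄.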
• Wald test
Now, the NP test is only useful if the hypotheses are simple (not composite). What happens if that's not the case? What if your null hypothesis is simple and your alternative hypothesis is not?
Let’s consider this specific form :
H₀ : θ = θ₀  H₁ : θ ≠ θ₀
Let’s also assume that you have an “estimator” function θ̂(X n ) that tells
us an estimate of θ .
This estimator is also asymptotically normal, that is to say,

(θ̂ − θ₀ )/se₀ ∼_{n→∞} N (0, 1) under θ₀

so we use the test statistic

T = (θ̂ − θ₀ )/se₀
If se₀ (the theoretical standard deviation of the estimator) is not known, you can use the observed value ŝe₀.
We set the rejection criterion as ϕ = I(|Tₙ | ≥ z_{α/2} ), where z_{α/2} is a value such that

lim_{n→∞} P (Tₙ ≥ z_{α/2} |θ₀ ) = α/2
Under an alternative θ₁, asymptotic normality of the estimator gives

θ̂ − θ₁ ∼_{n→∞} N (0, 1/(n I₁ (θ₁ )))

Define

Δ = √(n I₁ (θ₁ )) (θ₀ − θ₁ )

so that T + Δ is asymptotically standard normal under θ₁. The power is then

P (T < −z_{α/2} ) + P (T > z_{α/2} ) = P (T + Δ < Δ − z_{α/2} ) + P (T + Δ > Δ + z_{α/2} ) = Φ(Δ − z_{α/2} ) + 1 − Φ(Δ + z_{α/2} )
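The Wald test and its asymptotic power formula can be sketched with the standard library's `NormalDist` (which supplies Φ and its inverse); the Bernoulli example at the end is a hypothetical illustration:

```python
from statistics import NormalDist

_nd = NormalDist()  # standard normal: cdf = Phi, inv_cdf = Phi^{-1}

def wald_reject(theta_hat: float, theta0: float, se0: float,
                alpha: float = 0.05) -> bool:
    """Reject H0: theta = theta0 when |T| = |theta_hat - theta0| / se0
    is at least z_{alpha/2}."""
    z = _nd.inv_cdf(1 - alpha / 2)
    return abs((theta_hat - theta0) / se0) >= z

def wald_power(delta: float, alpha: float = 0.05) -> float:
    """Asymptotic power Phi(Delta - z) + 1 - Phi(Delta + z), where
    Delta = sqrt(n * I1(theta1)) * (theta0 - theta1)."""
    z = _nd.inv_cdf(1 - alpha / 2)
    return _nd.cdf(delta - z) + 1 - _nd.cdf(delta + z)

# Hypothetical example: 60 successes in n = 100 Bernoulli trials,
# H0: p = 0.5, so se0 = sqrt(0.5 * 0.5 / 100) = 0.05 and T = 2.
print(wald_reject(0.6, 0.5, 0.05))  # True: |T| = 2 >= 1.96
print(wald_power(0.0))              # equals alpha = 0.05 at Delta = 0
```

Note that at Δ = 0 (i.e. θ₁ = θ₀) the power formula collapses to α, as it should.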
• Likelihood ratio test
Now consider composite hypotheses of the form

H₀ : θ ∈ Θ₀  H₁ : θ ∉ Θ₀

The test statistic is the likelihood ratio

λ(Xⁿ ) = L(θ̂₀ |Xⁿ )/L(θ̂|Xⁿ )

where
– Θ is the full domain of θ
– θ̂ is the MLE estimate over Θ
– θ̂0 is the MLE estimate over Θ0 .
The test function is
ϕ(X n ) = I(λ(X n ) ≤ c)
• χ2 test
Consider a categorical distribution with probability vector θ of dimension
k. We do n trials to find it. For each trial, we have the class noted.
This gives us an estimate of the probabilities which we denote by θ̂ .
Here, nθ̂ ∼ Mult(θ, n) . Also, it’s a fact that you can approximate the
multinomial distribution as :
P (θ̂) = ∏ᵢ 1/√(2πθᵢ /n) · exp(−(1/2) Σᵢ (θ̂ᵢ − θᵢ )²/(θᵢ /n))

So define

T = Σᵢ Zᵢ² ,  where  Zᵢ = (θ̂ᵢ − θᵢ )/√(θᵢ /n)
is the “Z-score” for each class. Of course this is different from the usual
estimation, but for the sake of notation, we’ll use this definition.
Now, Zᵢ, if you haven't guessed already, has a standard normal distribution (when n → ∞). This means that T, as a function of θ̂, has the χ²_{k−1} distribution, and so P (T (θ̂, θ) > χ²_{k−1,α} | θ) = α.
The LHS is the power function with c = χ²_{k−1,α}; for convenience, we'll just call it c from now on.
This tells us that we can claim with 1 − α confidence that the true value lies "close" to the estimate, namely that T (θ̂, θ) ≤ c.
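The statistic itself is easy to compute directly. A sketch testing a die for fairness (the counts are made-up; the critical value χ²₅,₀.₀₅ ≈ 11.07 is from the standard table):

```python
def chi2_statistic(counts, theta):
    """T = sum_i Z_i^2 with Z_i = (theta_hat_i - theta_i) / sqrt(theta_i/n);
    algebraically this is the familiar sum of (O - E)^2 / E."""
    n = sum(counts)
    return sum((c / n - p) ** 2 / (p / n) for c, p in zip(counts, theta))

# Hypothetical die-fairness check: k = 6 classes, so k - 1 = 5 degrees
# of freedom; reject at level 0.05 when T > chi^2_{5, 0.05} ~= 11.07.
counts = [18, 22, 16, 25, 20, 19]   # n = 120 rolls, expected 20 each
theta = [1 / 6] * 6
T = chi2_statistic(counts, theta)
print(T, T <= 11.07)  # T = 2.5, so we cannot reject fairness
```
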