TECHNICAL PUBLICATIONS - An up thrust for knowledge
Figure: Raw data → Clustering algorithm → Clusters of data (each cluster marked with its centroid)
Euclidean distance between two points $(p_1, p_2, \ldots)$ and $(q_1, q_2, \ldots)$:

$$ d(p, q) = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2} $$
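As a minimal sketch of the distance formula above, the sum of squared coordinate differences can be computed directly. The function name `euclidean_distance` is an illustrative choice, not something taken from the text, and the square root is assumed to be part of the intended formula.

```python
import math


def euclidean_distance(p, q):
    """Distance between two points given as equal-length sequences of numbers."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared differences over all k coordinates, then take the root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


if __name__ == "__main__":
    print(euclidean_distance((0, 0), (3, 4)))
```

The same function works for any number of dimensions, since it only iterates over paired coordinates.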
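For observations $x_1, \ldots, x_N$, where $C(i)$ is the cluster assigned to $x_i$:

$$ W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} \|x_i - x_j\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2 $$

where $m_k$ is the mean vector of the $k$-th cluster and $N_k$ is the number of observations in the $k$-th cluster.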
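The two forms of the within-cluster scatter can be checked numerically. The sketch below computes the pairwise form and the centroid form separately; the function names and the tiny data set are hypothetical and serve only to illustrate that both sides agree.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def scatter_pairwise(points, labels):
    """Half the sum of squared distances over all ordered pairs in the same cluster."""
    total = 0.0
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if labels[i] == labels[j]:
                total += sq_dist(xi, xj)
    return 0.5 * total


def scatter_centroid(points, labels):
    """Sum over clusters of N_k times the squared distances to the cluster mean m_k."""
    total = 0.0
    for k in set(labels):
        members = [x for x, c in zip(points, labels) if c == k]
        n_k = len(members)
        m_k = [sum(col) / n_k for col in zip(*members)]
        total += n_k * sum(sq_dist(x, m_k) for x in members)
    return total


if __name__ == "__main__":
    pts = [(0, 0), (2, 0), (10, 10), (10, 12), (10, 14)]
    lab = [0, 0, 1, 1, 1]
    print(scatter_pairwise(pts, lab), scatter_centroid(pts, lab))
```

The centroid form needs only one pass per cluster, which is why it is the one usually minimised in practice.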
No differencing:

$$ y_t = Y_t $$

First difference:

$$ y_t = Y_t - Y_{t-1} $$

Second difference:

$$ y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2} $$
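The differencing steps above can be sketched as repeated subtraction of neighbouring values. The helper name `difference` and its parameter `d` (the number of differencing passes) are illustrative assumptions.

```python
def difference(series, d=1):
    """Apply first differencing d times; d=0 returns a copy of the series."""
    result = list(series)
    for _ in range(d):
        # Each pass replaces the series with Y_t - Y_{t-1}, shortening it by one.
        result = [b - a for a, b in zip(result, result[1:])]
    return result


if __name__ == "__main__":
    Y = [1, 4, 9, 16, 25]
    print(difference(Y, 0))
    print(difference(Y, 1))
    print(difference(Y, 2))
```

Applying the function with `d=2` gives the same values as evaluating $Y_t - 2Y_{t-1} + Y_{t-2}$ directly, which is a quick way to confirm the expanded form.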
$$ y_t = \phi_0 + \phi_1 Y_{t-1} $$

$$ y_t = \phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} $$
y t – y t–1
y t – y t–1 + 1 ( Yt – 1 – Yt – 2 )
y t – y t–1 1 ( t–1 – t–2)
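For a time series $y_i$, $i = 1, \ldots, n$, with seasonal ($s_i$), trend ($t_i$) and remainder ($r_i$) components:

$$ y_i = s_i + t_i + r_i $$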
"INFORMATION RETRIEVAL by Technical Publication"
Tokenization yields the tokens: "INFORMATION", "RETRIEVAL", "by", "Technical", "Publication"
Raw text: "It is a puppy and it is extremely cute"

Bag-of-words vector:

Word        Count
it          2
They        0
puppy       1
and         1
cat         0
aardvark    0
cute        1
extremely   1
...         ...
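The bag-of-words vector above can be reproduced with a short sketch. It assumes whitespace tokenization and case-insensitive matching, neither of which the text states explicitly, and the function name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text (case-insensitive)."""
    # Assumption: tokens are separated by whitespace and compared in lowercase.
    counts = Counter(text.lower().split())
    return [counts[word.lower()] for word in vocabulary]


if __name__ == "__main__":
    vocab = ["it", "They", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
    sentence = "It is a puppy and it is extremely cute"
    for word, count in zip(vocab, bag_of_words(sentence, vocab)):
        print(word, count)
```

Words outside the vocabulary, such as "is" and "a" here, are simply ignored, and word order is discarded.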
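$$ \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t} $$

$$ W_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{N}{\mathrm{df}_t} $$

where $N$ is the total number of documents, $\mathrm{df}_t$ is the number of documents containing term $t$, and $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$.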
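A minimal sketch of the weighting formula above follows. The base of the logarithm is not recoverable from the text, so base 10 is assumed here; the function name `tf_idf_weight` and the convention of returning 0 when the term is absent are also assumptions.

```python
import math


def tf_idf_weight(tf, df, n_docs):
    """Weight (1 + log tf) * log(N / df) for one term in one document."""
    # Assumption: a term that does not occur (tf == 0 or df == 0) gets weight 0.
    if tf <= 0 or df <= 0:
        return 0.0
    # Assumption: logarithms are base 10.
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


if __name__ == "__main__":
    print(tf_idf_weight(tf=10, df=10, n_docs=1000))
```

A term that appears in every document gets weight 0 regardless of its frequency, because $\log(N / \mathrm{df}_t)$ is then $\log 1 = 0$.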
$$ k_i = \sum_{j=1}^{N} a_{ij} $$
$$ \text{Density} = \frac{\text{Number of lines } (L)}{\left(\text{Number of points} \times (\text{Number of points} - 1)\right) / 2} = \frac{2L}{g(g-1)} $$
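The density ratio above is simple enough to sketch directly. The function name `graph_density` is hypothetical, and the graph is assumed to be undirected with no self-loops, which is what the $g(g-1)/2$ count of possible lines implies.

```python
def graph_density(num_lines, num_points):
    """Ratio of observed lines L to the g(g-1)/2 possible lines among g points."""
    if num_points < 2:
        raise ValueError("density needs at least two points")
    return 2 * num_lines / (num_points * (num_points - 1))


if __name__ == "__main__":
    print(graph_density(num_lines=3, num_points=4))
```

A value of 1 means every pair of points is connected; a value of 0 means there are no lines at all.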
$$ p_k = \binom{N-1}{k} p^k (1 - p)^{N-1-k} $$
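The expression for $p_k$ has the form of a binomial probability over $N - 1$ trials. The sketch below evaluates it, reading $N$ as the number of nodes and $p$ as the probability that any given pair is linked; that reading and the function name `degree_probability` are assumptions.

```python
from math import comb


def degree_probability(k, n_nodes, p):
    """Probability of exactly k links out of n_nodes - 1 possible, each with chance p."""
    return comb(n_nodes - 1, k) * p**k * (1 - p) ** (n_nodes - 1 - k)


if __name__ == "__main__":
    n = 6
    probs = [degree_probability(k, n, 0.3) for k in range(n)]
    print(probs, sum(probs))
```

Summing `degree_probability` over all $k$ from 0 to $N - 1$ gives 1, as expected for a probability distribution.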
Figure: Emerging clusters (labelled 1 to 4) among items A, B, C, D, E, F, G, H, I, J, K
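$$ \text{Accuracy} = \frac{|\text{True negatives}| + |\text{True positives}|}{|\text{False negatives}| + |\text{False positives}| + |\text{True negatives}| + |\text{True positives}|} $$

$$ \text{Error rate} = \frac{|\text{False negatives}| + |\text{False positives}|}{|\text{False negatives}| + |\text{False positives}| + |\text{True negatives}| + |\text{True positives}|} $$

$$ \text{Sensitivity} = \frac{|\text{True positives}|}{|\text{True positives}| + |\text{False negatives}|} $$

$$ \text{Specificity} = \frac{|\text{True negatives}|}{|\text{False positives}| + |\text{True negatives}|} $$

$$ \text{Precision} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false positives}} $$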
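The ratios above can all be computed from the four confusion-matrix counts. This is a minimal sketch; the function name `classification_metrics` and the dictionary keys are illustrative, and no guard is included for denominators that are zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five ratios from counts of true/false positives and negatives."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
    }


if __name__ == "__main__":
    for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(name, round(value, 4))
```

Accuracy and error rate always sum to 1, since their numerators together cover all four counts.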
Figure: Total number of examples, split into a Training set and a Test set
With $E_i$ denoting the error on the $i$-th experiment:

$$ E = \frac{1}{K} \sum_{i=1}^{K} E_i $$
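The averaging of the per-experiment errors $E_i$ can be sketched together with one possible way of producing the K experiments. The contiguous, unshuffled fold layout and both function names are assumptions made for illustration; the error values in the usage example are made up.

```python
def k_fold_indices(n_examples, k):
    """Split range(n_examples) into k (train, test) pairs with contiguous test folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_examples) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


def average_error(errors):
    """Mean of the K per-experiment errors E_1 ... E_K."""
    if not errors:
        raise ValueError("need at least one error value")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    for train, test in k_fold_indices(10, 4):
        print("test:", test)
    # Hypothetical error values, one per experiment.
    print(average_error([0.1, 0.2, 0.3, 0.2]))
```

Every example appears in exactly one test fold, so each one is used for testing once and for training K - 1 times.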
Figure: Dataset
    Holdout method: Training, Testing
    Cross validation
    Data permitting: Training, Validation, Testing
Figure: Total number of examples, with a different block of Test examples in each of Experiment 1, Experiment 2, Experiment 3 and Experiment 4
Figure: SSE plotted against K (number of clusters) in two panels, (a) and (b), with the elbow point marked