
Data Science and Big Data Analytics-173-216

This excerpt, published by Technical Publications, covers data clustering algorithms and information retrieval techniques. It includes the mathematical formulation of clustering, tokenization and bag-of-words text representation, and evaluation metrics for classification models. It also outlines how datasets are split into training and test sets, including cross validation.

Uploaded by

faltubandagavali

TECHNICAL PUBLICATIONS - An up thrust for knowledge


[Figure: Raw data → Clustering algorithm → Clusters of data, with the centroid labelled]

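For two points $p = (p_1, p_2, \ldots)$ and $q = (q_1, q_2, \ldots)$ with $k$ coordinates, the Euclidean distance is

$$d(p, q) = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2}$$

The data set consists of the points $x_1, \ldots, x_N$, where $x_i$ denotes the $i$-th point.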
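As a minimal sketch of the distance between two points $p$ and $q$, assuming the Euclidean form (square root of the summed squared coordinate differences), the calculation could look like this. The function name `euclidean_distance` and the sample points are illustrative, not taken from the source.

```python
import math


def euclidean_distance(p, q):
    """Return the Euclidean distance between two equal-length coordinate sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Illustrative points: a 3-4-5 right triangle.
distance = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(distance)  # 5.0
```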

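$$\frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} \|x_i - x_j\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2$$

where $m_k$ is the mean vector of the $k$-th cluster and $N_k$ is the number of points in the $k$-th cluster.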
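The two forms of the within-cluster scatter (half the pairwise squared distances inside each cluster, and $N_k$ times the squared distances to the cluster mean) can be checked numerically. The sketch below is illustrative only: the function names, the four sample points and their cluster labels are assumptions, not values from the source.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def pairwise_scatter(points, labels):
    """Half the sum of squared distances over all ordered same-cluster pairs."""
    total = 0.0
    for xi, ci in zip(points, labels):
        for xj, cj in zip(points, labels):
            if ci == cj:
                total += sq_dist(xi, xj)
    return 0.5 * total


def centroid_scatter(points, labels):
    """Sum over clusters of N_k times the squared distances to the cluster mean."""
    total = 0.0
    for k in set(labels):
        members = [x for x, c in zip(points, labels) if c == k]
        n_k = len(members)
        mean = [sum(coord) / n_k for coord in zip(*members)]
        total += n_k * sum(sq_dist(x, mean) for x in members)
    return total


# Hypothetical data: two clusters of two points each.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
print(pairwise_scatter(points, labels), centroid_scatter(points, labels))
```

Both functions return the same value for any assignment of points to clusters, which is why k-means can minimise distances to centroids instead of comparing every pair of points.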
[Figure: two clusters labelled $C_1$ and $C_2$]
No differencing:

$$y_t = Y_t$$

First difference:

$$y_t = Y_t - Y_{t-1}$$

Second difference:

$$y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}$$
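The differencing of a series $Y_t$ can be sketched in a few lines. The helper name `difference` and the sample series are assumptions made for illustration.

```python
def difference(series, order=1):
    """Apply first differencing `order` times and return the shortened series."""
    values = list(series)
    for _ in range(order):
        # Each pass replaces the series by the differences of consecutive values.
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# Hypothetical series Y_t.
Y = [3, 5, 8, 12, 17]
first = difference(Y, order=1)
second = difference(Y, order=2)
print(first)   # [2, 3, 4, 5]
print(second)  # [1, 1, 1]
```

The second difference agrees with the expanded form: for the third value, 8 - 2 × 5 + 3 = 1.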

$$y_t = \phi_0 + \phi_1 Y_{t-1}$$

$$y_t = \phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2}$$

y t – y t–1

y t – y t–1 + 1 ( Yt – 1 – Yt – 2 )

y t – y t–1 1 ( t–1 – t–2)

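$$y_i = s_i + t_i + r_i, \quad i = 1, \ldots, n$$

where $s_i$, $t_i$ and $r_i$ denote the seasonal, trend and remainder components of the observation $y_i$.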
"INFORMATION RETRIEVAL by Technical Publication"

Tokenization splits this string into the tokens:

"INFORMATION" "RETRIEVAL" "by" "Technical" "Publication"


[Figure: raw text converted into a bag-of-words vector]

Raw text: "It is a puppy and it is extremely cute"

Bag-of-words vector:

Word         Count
it           2
They         0
puppy        1
and          1
cat          0
aardvark     0
cute         1
extremely    1
...          ...
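Tokenization and the bag-of-words count can be sketched together, since the second builds on the first. The sketch assumes that tokens are runs of word characters and that counting ignores case (so "It" and "it" both count towards "it"); the function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the maximal runs of word characters found in text."""
    return re.findall(r"\w+", text)


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text, ignoring case."""
    counts = Counter(token.lower() for token in tokenize(text))
    # Counter returns 0 for words that never occur.
    return {word: counts[word.lower()] for word in vocabulary}


tokens = tokenize("INFORMATION RETRIEVAL by Technical Publication")
print(tokens)

vocabulary = ["it", "They", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = bag_of_words("It is a puppy and it is extremely cute", vocabulary)
print(vector)
```

The vocabulary is fixed in advance, which is why words that do not occur in the text (They, cat, aardvark) still get an entry with count 0.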
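Let $N$ be the total number of documents, $\mathrm{df}_t$ the number of documents that contain term $t$, and $\mathrm{tf}_{t,d}$ the number of occurrences of term $t$ in document $d$. The weight of term $t$ in document $d$ is

$$W_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{N}{\mathrm{df}_t}$$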
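A minimal sketch of the term weight $W_{t,d}$ follows. Two points are assumptions rather than statements from the source: logarithms are taken to base 10, and a term that does not occur in the document (or in any document) is given weight 0. The function name and the sample numbers are illustrative.

```python
import math


def tf_idf_weight(tf, df, n_docs):
    """Return (1 + log10 tf) * log10(N / df), or 0.0 when tf or df is zero."""
    if tf == 0 or df == 0:
        # Assumed convention: absent terms carry no weight.
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


# Hypothetical counts: the term occurs 10 times in the document
# and appears in 100 of 1000 documents.
weight = tf_idf_weight(tf=10, df=100, n_docs=1000)
print(weight)
```

A term that appears in every document gets weight 0, because log(N / df) is then log 1 = 0.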
[Figure: example graph with nodes A, B, D and E]
$$k_i = \sum_{j=1}^{N} a_{ij}$$
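$$\text{Density} = \frac{\text{Number of lines } (L)}{\bigl(\text{Number of points} \times (\text{Number of points} - 1)\bigr) / 2} = \frac{2L}{g(g - 1)}$$

where $g$ is the number of points.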
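The ratio of the number of lines to the number of possible lines between $g$ points can be sketched as below, assuming an undirected graph without self-loops. The function name and the edge list are hypothetical; the edges are not taken from the source figure.

```python
def graph_density(num_points, edges):
    """Return 2L / (g (g - 1)) for an undirected graph with g points and L lines."""
    g = num_points
    if g < 2:
        raise ValueError("density needs at least two points")
    # Treat (u, v) and (v, u) as the same line.
    num_lines = len({frozenset(edge) for edge in edges})
    return 2 * num_lines / (g * (g - 1))


# Hypothetical graph on four points with four lines.
edges = [("A", "B"), ("A", "D"), ("B", "E"), ("D", "E")]
density = graph_density(4, edges)
print(density)
```

Four points allow 4 × 3 / 2 = 6 possible lines, so four lines give a density of 4/6.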
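The probability $p_k$ that a point has exactly $k$ lines, when each of its $N - 1$ possible lines is present with probability $p$, is

$$p_k = \binom{N-1}{k} p^k (1 - p)^{N-1-k}$$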
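The binomial expression for $p_k$ can be evaluated directly with the standard library. The function name and the parameter values below are illustrative assumptions.

```python
from math import comb


def degree_probability(k, n_nodes, p):
    """Return C(N-1, k) * p^k * (1 - p)^(N-1-k)."""
    return comb(n_nodes - 1, k) * p**k * (1 - p) ** (n_nodes - 1 - k)


# Hypothetical setting: N = 5 points, each possible line present with p = 0.5.
probabilities = [degree_probability(k, 5, 0.5) for k in range(5)]
print(probabilities)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```

The probabilities over k = 0, ..., N - 1 sum to 1, as expected for a distribution.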
[Figure: points A to K grouped into emerging clusters 1 to 4]
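Accuracy:

$$\text{Accuracy} = \frac{|\text{True negatives}| + |\text{True positives}|}{|\text{False negatives}| + |\text{False positives}| + |\text{True negatives}| + |\text{True positives}|}$$

Error rate:

$$\text{Error rate} = \frac{|\text{False negatives}| + |\text{False positives}|}{|\text{False negatives}| + |\text{False positives}| + |\text{True negatives}| + |\text{True positives}|}$$

Sensitivity (recall):

$$\text{Sensitivity} = \frac{|\text{True positives}|}{|\text{True positives}| + |\text{False negatives}|}$$

Specificity:

$$\text{Specificity} = \frac{|\text{True negatives}|}{|\text{False positives}| + |\text{True negatives}|}$$

Precision:

$$\text{Precision} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false positives}}$$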
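The ratios built from true and false positives and negatives can be computed from the four confusion-matrix counts. This is a sketch under stated assumptions: the function name, the dictionary keys and the sample counts are illustrative, and the code does not guard against empty denominators.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return common evaluation ratios from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts for 100 test examples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and error rate always sum to 1, because every example is either classified correctly or incorrectly.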
[Figure: the total number of examples split into a training set and a test set]

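Let $E_i$ be the error of the $i$-th experiment. The overall error estimate is the average of the $E_i$ over the $K$ experiments:

$$E = \frac{1}{K} \sum_{i=1}^{K} E_i$$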
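Averaging the per-experiment errors $E_i$ over $K$ experiments can be sketched as a K-fold loop. Everything specific here is an assumption for illustration: contiguous folds, the helper names, the toy labels, and the majority-class "model" used to produce an error for each experiment.

```python
from collections import Counter


def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, test
        start += size


def cross_validation_error(n_examples, k, experiment_error):
    """Average the error E_i returned by each of the k experiments."""
    errors = [experiment_error(train, test) for train, test in k_fold_splits(n_examples, k)]
    return sum(errors) / k


# Hypothetical labels and a majority-class predictor as the stand-in model.
labels = [0, 0, 0, 1, 0, 1, 0, 0]


def majority_error(train, test):
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] != majority for i in test) / len(test)


average = cross_validation_error(len(labels), 4, majority_error)
print(average)  # 0.25
```

Each example is used for testing exactly once and for training in the other experiments, so no data is wasted on a single fixed test set.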
[Figure: ways of splitting a dataset. Holdout method: Training and Testing. Cross validation. Data permitting: Training, Validation and Testing.]
[Figure: Experiments 1 to 4 over the total number of examples, each experiment holding out a different block of test examples]
[Figure: SSE (vertical axis, 0 to 80) plotted against K, the number of clusters (horizontal axis, 0 to 9), in two panels (a) and (b); the elbow point is marked]
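The SSE-versus-K curve behind the elbow point can be produced by running k-means for several values of K and recording the final SSE. The sketch below uses a basic Lloyd's iteration with seeded random initial centroids; the function names, the iteration count, the seed and the sample points are all assumptions, not the source's setup.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means_sse(points, k, iterations=50, seed=0):
    """Run a basic Lloyd's k-means and return the final sum of squared errors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
sse_by_k = {k: k_means_sse(points, k) for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

For data like this, the SSE drops sharply from K = 1 to K = 2 and only slightly afterwards, and that bend in the curve is what the elbow point refers to.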
