BDA Module 5

Lecture on

Big Data Analytics

Big Data Mining Algorithms:


Frequent Pattern Mining
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
– Frequent Pattern Mining :Handling Larger Datasets in Main Memory, Basic Algorithm of Park,
Chen, and Yu, The SON Algorithm, and MapReduce.
– Clustering Algorithms: CURE Algorithm. Canopy Clustering, Clustering with MapReduce
– Classification Algorithms: Parallel Decision trees, Overview SVM classifiers, Parallel SVM, K-
nearest Neighbor classifications for Big Data, One Nearest Neighbour.
6. Big Data Analytics Applications

2
Frequent Pattern Mining

• The Market-Basket Model


– A large set of items, e.g., things sold in a supermarket.
– A large set of baskets, each of which is a small set of the items, e.g., the things
one customer buys on one day.
• Simplest question: find sets of items that appear “frequently” in the
baskets.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 3


The Market-Basket Model
Input:
• A large set of items
  – e.g., things sold in a supermarket
• A large set of baskets
• Each basket is a small subset of items
  – e.g., the things one customer buys on one day

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Output:
• Want to discover association rules
  – People who bought {x,y,z} tend to buy {v,w}
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 4


Mining Association Rules
• Step 1: Finding the frequent itemsets (Hard)

• Step 2: Rule generation


– For every subset A of I, generate a rule A → I \ A
• Since I is frequent, A is also frequent
• Variant 1: Single pass to compute the rule confidence
– confidence(A,B→C,D) = support(A,B,C,D) / support(A,B)
• Variant 2:
– Observation: If A,B,C→D is below confidence, so is A,B→C,D
– Can generate “bigger” rules from smaller ones!
– Output the rules above the confidence threshold

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 5


1. Support
• Simplest question: Find sets of items that appear together “frequently”
in baskets
• Support for itemset I: Number of baskets containing all items in I
  – (Often expressed as a fraction of the total number of baskets)
• Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Support of {Beer, Bread} = 2

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 6


Example
• Items = {milk, coke, pepsi, beer, juice}
• Support threshold = 3 baskets
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}

• Frequent itemsets: {m}:5, {c}:5, {b}:6, {j}:4, and the pairs {m,b}:4, {b,c}:4, {c,j}:3

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 7


2. Confidence
• Association Rules:
If-then rules about the contents of baskets
• {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is
likely to contain j”
• Confidence of this association rule is the probability of j given I =
{i1,…,ik}

support( I  j )
conf( I → j ) =
support( I )

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 8
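These two measures are simple to compute directly. Below is a minimal Python sketch on the five-basket table above; the helper names are illustrative and not part of the slides.

baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # number of baskets containing every item of the itemset
    return sum(1 for b in baskets if set(itemset) <= b)

def confidence(I, j):
    # conf(I -> j) = support(I U {j}) / support(I)
    return support(set(I) | {j}) / support(I)

print(support({"Beer", "Bread"}))              # 2
print(confidence({"Diaper", "Milk"}, "Beer"))  # 0.666...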


3. Interest
• Not all high-confidence rules are interesting
– The rule X → milk may have high confidence for many itemsets X, because milk
is just purchased very often (independent of X) and the confidence will be high
• Interest of an association rule I → j:
difference between its confidence and the fraction of baskets that
contain j
Interest(I → j ) = conf( I → j ) − Pr[ j ]
– Interesting rules are those with high positive or negative interest values (usually
above 0.5)

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 9


Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}

• Association rule: {m, b} →c


– Confidence = 2/4 = 0.5
– Interest = 0.5 – 5/8 = –1/8 = –0.125 (magnitude 0.125)
• Item c appears in 5/8 of the baskets
• Rule is not very interesting!

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 10


A-Priori algorithm

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 11


A-priori Algorithm
• Start with C1 = all individual items and L1 = the items with count at least the support threshold s.
• The construction of the collections of larger frequent itemsets and candidates then proceeds in the same manner, pass after pass, until some pass finds no new frequent itemsets.
1. Define Ck to be all itemsets of size k, every (k−1)-subset of which is an itemset in Lk−1.
2. Find Lk by making a pass through the baskets, counting all and only the itemsets of size k that are in Ck; the itemsets with count at least s form Lk.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 12
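The two steps above translate almost directly into code. The following compact Python sketch (an illustration, not the slides' own code) builds Ck from Lk−1 and counts the candidates in one pass per level.

from itertools import combinations

def apriori(baskets, s):
    """Return {frozenset: count} of all frequent itemsets (support >= s)."""
    # L1: frequent individual items
    counts = {}
    for basket in baskets:
        for item in basket:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {i: c for i, c in counts.items() if c >= s}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Ck: itemsets of size k whose every (k-1)-subset is in L(k-1)
        items = sorted(set().union(*Lk))
        Ck = [frozenset(c) for c in combinations(items, k)
              if all(frozenset(sub) in Lk for sub in combinations(c, k - 1))]
        # one pass through the baskets to count only the candidates
        counts = {c: 0 for c in Ck}
        for basket in baskets:
            for c in Ck:
                if c <= set(basket):
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= s}
        frequent.update(Lk)
        k += 1
    return frequent

# e.g. on the earlier example with support threshold 3:
baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
apriori(baskets, 3)   # {m}:5 {c}:5 {b}:6 {j}:4 plus the pairs {m,b}:4 {b,c}:4 {c,j}:3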


Example
Supmin = 2

Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → {B,C,E}:2
L3: {B,C,E}:2
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 13
Example
• Generating association rules from frequent itemset
– For each frequent itemset I, generate all nonempty subsets of I
– For every nonempty proper subset s of I, output the rule s ⇒ (I − s) if
support(I)/support(s) >= min confidence (here 75%)
For I = {B, C, E} with support 2:
B ∧ C ⇒ E   conf = 2/2 = 100%
C ∧ E ⇒ B   conf = 2/2 = 100%
B ∧ E ⇒ C   conf = 2/3 = 66.67%
B ⇒ C ∧ E   conf = 2/3 = 66.67%
C ⇒ B ∧ E   conf = 2/3 = 66.67%
E ⇒ B ∧ C   conf = 2/3 = 66.67%
Only the first two rules meet the 75% confidence threshold.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 14
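This rule-generation step can reuse the support counts produced by A-Priori. A short Python sketch (illustrative names; assumes a dictionary like the one returned by the apriori() sketch earlier):

from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: {frozenset: support_count}; yields (lhs, rhs, confidence)."""
    for itemset, sup_I in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup_I / frequent[lhs]   # support(I) / support(s); lhs is frequent too
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf

# With min_conf = 0.75 on the {B,C,E} example above, only
# {B,C} -> {E} and {C,E} -> {B} (both 100%) survive.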


Example 2

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 15


Problem
TID Items
1 Biscuits, Bread, Cheese, Coffee, Yogurt
2 Bread, Cereal, Cheese, Coffee
3 Cheese, Chocolate, Donuts, Juice, Milk
4 Bread, Cheese, Coffee, Cereal, Juice
5 Bread, Cereal, Chocolate, Donuts, Juice
6 Milk, Tea
7 Biscuits, Bread, Cheese, Coffee, Milk
8 Eggs, Milk, Tea
9 Bread, Cereal, Cheese, Chocolate, Coffee
10 Bread, Cereal, Chocolate, Donuts, Juice
11 Bread, Cheese, Juice
12 Bread, Cheese, Coffee, Donuts, Juice
13 Biscuits, Bread, Cereal
14 Cereal, Cheese, Chocolate, Donuts, Juice
15 Chocolate, Coffee
16 Donuts
17 Donuts, Eggs, Juice
18 Biscuits, Bread, Cheese, Coffee
19 Bread, Cereal, Chocolate, Donuts, Juice
20 Cheese, Chocolate, Donuts, Juice
21 Milk, Tea, Yogurt
22 Bread, Cereal, Cheese, Coffee
23 Chocolate, Donuts, Juice, Milk, Newspaper
24 Newspaper, Pastry, Rolls
25 Rolls, Sugar, Tea
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 16
C1
Item No Item name Frequency
1 Biscuits 4
2 Bread 13
3 Cereal 10
4 Cheese 11
5 Chocolate 9
6 Coffee 9
7 Donuts 10
8 Eggs 2
9 Juice 11
10 Milk 6
11 Newspaper 2
12 Pastry 1
13 Rolls 2
14 Sugar 1
15 Tea 4
16 Yogurt 2

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 17


L1
Assume 25% support. In 25 transactions, a frequent item must
occur in at least 7 transactions. The frequent 1-itemset or L1 is
now given below. How many candidates in C2? List them.
Item Frequency
Bread 13
Cereal 10
Cheese 11
Chocolate 9
Coffee 9
Donuts 10
Juice 11

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 18


C2 and L2
The following pairs are frequent. Now find C3 and then L3
and the rules.

Frequent 2-itemset Frequency


{Bread, Cereal} 9
{Bread, Cheese} 8
{Bread, Coffee} 8
{Cheese, Coffee} 9
{Chocolate, Donuts} 7
{Chocolate, Juice} 7
{Donuts, Juice} 9

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 19


Association Rules
The full set of rules is given below. Could some rules be removed?
Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 20


Exercise 1

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 21


Exercise 1
Consider the transaction database in the table shown below. Find strong
association rule with minimum support and confidence as 50%.

Transaction ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Bread, Juice, Milk
500 Cheese, Juice, Milk

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 22


Exercise 2

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 23


Exercise 2
Find frequent itemset and strong association rule for the given
transaction database.

B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, c, b, n} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• Support threshold s = 3, confidence c = 0.75

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 24


Exercise 2: Solution
• Support threshold s = 3, confidence c = 0.75
• 1) Frequent itemsets:
  – 1-itemsets: {m}:5, {c}:6, {b}:6, {j}:4
  – 2-itemsets: {b,m}:4, {b,c}:5, {c,m}:3, {c,j}:3
  – 3-itemset: {m,c,b}:3
• 2) Generate rules from {m,c,b}:
  – b,c → m: c = 3/5
  – b,m → c: c = 3/4
  – c,m → b: c = 3/3
  – b → c,m: c = 3/6
  – m → b,c: c = 3/5
  – c → b,m: c = 3/6
  Strong rules among these (confidence ≥ 0.75): b,m → c and c,m → b.
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 25
Summary
• Frequent pattern mining algorithm
– Apriori
– Limitations
• Handling Larger Datasets in Main Memory using
– Algorithm of Park, Chen, and Yu
– The Multistage Algorithm
– The Multihash Algorithm
• The SON Algorithm and MapReduce Counting

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 26


Thank you ! ! !

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 27


Lecture on

Big Data Analytics

Big Data Mining Algorithms:


Frequent Pattern Mining
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
– Frequent Pattern Mining :Handling Larger Datasets in Main Memory, Basic Algorithm of Park,
Chen, and Yu, The SON Algorithm, and MapReduce.
– Clustering Algorithms: CURE Algorithm. Canopy Clustering, Clustering with MapReduce
– Classification Algorithms: Parallel Decision trees, Overview SVM classifiers, Parallel SVM, K-
nearest Neighbor classifications for Big Data, One Nearest Neighbour.
6. Big Data Analytics Applications

29
The Market-Basket Model
Input:

• A large set of items TID Items


1 Bread, Coke, Milk
– e.g., things sold in a supermarket 2 Beer, Bread
• A large set of baskets 3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
• Each basket is a small subset of items 5 Coke, Diaper, Milk
– e.g., the things one customer buys on one day
Output:
• Want to discover association rules
Rules Discovered:
– People who bought {x,y,z} tend to buy {v,w} {Milk} --> {Coke}
{Bread, Milk} --> {Beer}

• Solution : A-priori Algorithm

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 30


A-priori Algorithm
• Start with C1 = all individual items and L1 = the items with count at least the support threshold s.
• The construction of the collections of larger frequent itemsets and candidates then proceeds in the same manner, pass after pass, until some pass finds no new frequent itemsets.
1. Define Ck to be all itemsets of size k, every (k−1)-subset of which is an itemset in Lk−1.
2. Find Lk by making a pass through the baskets, counting all and only the itemsets of size k that are in Ck; the itemsets with count at least s form Lk.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 31


Problems with A-priori
1. High Computation cost
– The true cost of mining disk-resident data is usually the number of
disk I/Os.
– In practice, association-rule algorithms read the data in passes – all
baskets read in turn.
– We measure the cost by the number of passes an algorithm makes
over the data.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 32


Problems with A-priori
2. Main-Memory Bottleneck
– For many frequent-itemset algorithms, main-memory is the critical
resource
• As we read baskets, we need to count something, e.g., occurrences of pairs of
items
• The number of different things we can count is limited by main memory
• Swapping counts in/out is a disaster

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 33


Problems with A-priori
3. Finding Frequent Pairs
– The hardest problem often turns out to be finding the frequent
pairs of items {i1, i2}
• Why? Freq. pairs are common, freq. triples are rare(Probability of being
frequent drops exponentially with size; number of sets grows more slowly
with size)
– Let’s first concentrate on pairs, then extend to larger sets
– The approach: We always need to generate all the itemsets. But we would
only like to count (keep track) of those itemsets that in the end turn out to be
frequent.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 34


Solution : Park-Chen-Yu (PCY) Algorithm

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 35


PCY Algorithm
• PCY, also called the Direct Hashing and Pruning (DHP) algorithm, attempts to generate large itemsets efficiently and to reduce the transaction database size.

• When generating C1, the algorithm also generates all the 2-itemsets for
each transaction, hashes them to a hash table and keeps a count.

• Observation: In pass 1 of A-Priori, most memory is idle


– We store only individual item counts
– Can we use the idle memory to reduce memory required in pass 2?

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 36


PCY Algorithm
• Pass 1 of PCY: In addition to item counts, maintain a hash table with as
many buckets as fit in memory
– Keep a count for each bucket into which pairs of items are hashed (For each
bucket just keep the count, not the actual pairs that hash to the bucket!)
• Observation:
– If a bucket contains a frequent pair, then the bucket is surely frequent
– However, even without any frequent pair, a bucket can still be frequent
– So, we can’t use the hash to eliminate any member (pair) of a “frequent” bucket
– But, for a bucket with total count less than s, none of its pairs can be frequent
(pairs that hash to such a bucket can be eliminated as candidates even if both items are frequent)
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 37
PCY Algorithm
• Pass 2 of PCY: Only count pairs that hash to frequent buckets
– Replace the buckets by a bit-vector:
• 1 means the bucket count exceeded the support s (call it a frequent bucket);
• 0 means it did not
– 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of
memory
– Also, decide which items are frequent and list them for the second pass

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 38


PCY Algorithm – Pass 1
FOR (each basket) :
    FOR (each item in the basket) :
        add 1 to item’s count;
    FOR (each pair of items) :          /* new in PCY */
        hash the pair to a bucket;
        add 1 to the count for that bucket;

• Few things to note:


– Pairs of items need to be generated from the input file; they are not present in
the file.
– We are not just interested in the presence of a pair, but we need to see whether
it is present at least s (support) times.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 39


PCY Algorithm – Pass 2
Count all pairs {i, j} that meet the conditions for being a
candidate pair:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit
vector is 1 (i.e., a frequent bucket).

• Few things to note:


– Both conditions are necessary for the pair to have a chance of being frequent.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 40
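Both passes fit in a few lines of code. A minimal two-pass PCY sketch in Python follows; the hash function and bucket count are illustrative choices, not fixed by the algorithm.

from itertools import combinations

def pcy_frequent_pairs(baskets, s, n_buckets=101):
    # Pass 1: count items and hash every pair into a bucket
    item_counts, bucket_counts = {}, [0] * n_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]   # 1 bit per bucket: frequent or not

    # Pass 2: count only candidate pairs (both items frequent AND a frequent bucket)
    pair_counts = {}
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if (pair[0] in frequent_items and pair[1] in frequent_items
                    and bitmap[hash(pair) % n_buckets]):
                pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= s}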


Main-Memory: Picture of PCY

[Figure: main-memory layout. Pass 1 holds the item counts and the hash table for pair counts; Pass 2 holds the frequent items, the bitmap derived from the hash table, and the counts of candidate pairs.]

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 41


PCY
Pass 1: FOR (each basket) :
FOR (each item in the basket) :
add 1 to item’s count;
FOR (each pair of items) :
hash the pair to a bucket;
add 1 to the count for that bucket;
Pass 2: Count all pairs {i, j} that meet the conditions
for being a candidate pair:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the
bit vector is 1 (i.e., a frequent bucket).
PCY
• The major aim of the algorithm is to reduce the size of C2.
• It is therefore essential that the hash table is large enough so that
collisions are low.
• Collisions result in loss of effectiveness of the hash table.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 43


Counting Pairs in Memory
Two approaches:
• Approach 1: (Triangular Matrix Method) Count all pairs using a triangular
matrix
• Approach 2: (Triples Method) Keep a table of triples [i, j, c] = “the count of
the pair of items {i, j} is c.”
– If integers and item ids are 4 bytes, we need approximately 12 bytes for pairs
with count > 0
– Plus some additional overhead for the hashtable
Note:
• Approach 1 only requires 4 bytes per pair
• Approach 2 uses 12 bytes per pair (but only for pairs with count > 0)

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 44
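For concreteness, one standard layout of the triangular matrix is sketched below; it assumes the items have first been renumbered 1..n, which is part of the illustration rather than the slides.

def pair_index(i, j, n):
    # 0-based position of pair {i, j} (1 <= i < j <= n) in a flat
    # array of length n*(n-1)//2 that holds the triangular matrix
    return (i - 1) * (2 * n - i) // 2 + (j - i) - 1

n = 4
counts = [0] * (n * (n - 1) // 2)
counts[pair_index(2, 4, n)] += 1   # increment the count of pair {2, 4}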


Counting Pairs in Memory
• Unlike the triangular matrix, the triples method does not require us to
store anything if the count for a pair is 0.
• On the other hand, the triples method requires us to store three
integers, rather than one, for every pair that does appear in some
basket. In addition, there is the space needed for the hash table or
other data structure used to support efficient retrieval.
• The conclusion is that the triangular matrix will be better if at least 1/3
of the nC2 possible pairs actually appear in some basket, while if
significantly fewer than 1/3 of the possible pairs occur, we should
consider using the triples method.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 45


Main-Memory Details
• Buckets require a few bytes each:
– #buckets is O(main-memory size)

• On second pass, a table of (item, item, count) triples is


essential (we cannot use triangular matrix approach)
– Thus, hash table must eliminate approx. 2/3 of the candidate pairs
for PCY to beat A-Priori

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 46


Example 1
Example
Consider the transaction database in the table shown below. Find strong
association rule with minimum support and confidence as 50% using
PCY Algorithm.

Transaction ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Bread, Juice, Milk
500 Cheese, Juice, Milk

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 48


Example
• For each pair, a numeric value is obtained by first representing B by 1, C by 2, E 3, J 4, M 5
and Y 6. Now each pair can be represented by a two digit number, for example (B, E) by 13
and (C, M) by 26.
1 Bread
2 Cheese
3 Eggs
4 Juice
5 Milk
6 Yogurt
• The two-digit number is then reduced modulo 8 (divide by 8 and take the remainder). This is the bucket address.
• A count of the number of pairs hashed to each bucket is kept. Those addresses whose count reaches the support value have their bit in the bit vector set to 1, otherwise 0.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 49


Solution
Pass 1:
C1                           L1 (support ≥ 3)
Item No  Item     Count      Item No  Item    Count
1        Bread    4          1        Bread   4
2        Cheese   3          2        Cheese  3
3        Eggs     1          4        Juice   4
4        Juice    4          5        Milk    3
5        Milk     3
6        Yogurt   1

Scan 1
Transaction ID Items Pairs
100 1,2,3,4 (1, 2) (1, 3) (1, 4) (2, 3) (2, 4) (3, 4)
200 1,2,4 (1, 2) (1, 4) (2, 4)
300 1,5,6 (1, 5) (1, 6) (5, 6)
400 1,4,5 (1, 4) (1, 5) (4, 5)
500 2,4,5 (2, 4) (2, 5) (4, 5)

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 50


Solution
Pass 1: h(i,j)=ij mod 8

Bit vector Bucket number Count Pairs


0 0 0
0 1 0
0 2 0
0 3 0
0 4 0
0 5 0
0 6 0
0 7 0

51
Solution
Pass 1: h(i,j)=ij mod 8

(1,2)=> h(1,2)=12 mod 8 =4


(1,3)=> h(1,3)=13 mod 8 =5
(1,4)=> h(1,4)=14 mod 8 =6

Bit vector Bucket number Count Pairs


0 0 0
0 1 0
0 2 0
0 3 0
0 4 1 (1,2)
0 5 1 (1,3)
0 6 1 (1,4)
0 7 0

52
Solution
Pass 1: h(i,j)=ij mod 8
(2,3)=> h(2,3)=23 mod 8 =7
(2,4)=> h(2,4)=24 mod 8 =0
(3,4)=> h(3,4)=34 mod 8 =2

Bit vector Bucket number Count Pairs


0 0 1 (2,4)
0 1 0
0 2 1 (3,4)
0 3 0
0 4 1 (1,2)
0 5 1 (1,3)
0 6 1 (1,4)
0 7 1 (2,3)

53
Solution
Pass 1: h(i,j) = ij mod 8

Bit vector  Bucket  Count  Pairs
1           0       5      (2,4) (2,4) (2,4) (1,6) (5,6)
0           1       1      (2,5)
0           2       1      (3,4)
0           3       0
0           4       2      (1,2) (1,2)
1           5       3      (1,3) (4,5) (4,5)
1           6       3      (1,4) (1,4) (1,4)
1           7       3      (2,3) (1,5) (1,5)

Pass 1: Output
L1                               Bitmap (frequent buckets)
Item No  Item    Count           Bit vector  Bucket number
1        Bread   4               1           0
2        Cheese  3               1           5
4        Juice   4               1           6
5        Milk    3               1           7
54
Solution
Pass 2: Uses L1 and the bitmap
L1                               Bitmap (frequent buckets)
Item No  Item    Count           Bit vector  Bucket number
1        Bread   4               1           0
2        Cheese  3               1           5
4        Juice   4               1           6
5        Milk    3               1           7

Candidate pairs of frequent items and their buckets:
(1,2) => h(1,2) = 12 mod 8 = 4   (bucket 4 is not frequent, so not counted)
(1,4) => h(1,4) = 14 mod 8 = 6
(1,5) => h(1,5) = 15 mod 8 = 7
(2,4) => h(2,4) = 24 mod 8 = 0
(2,5) => h(2,5) = 25 mod 8 = 1   (bucket 1 is not frequent, so not counted)
(4,5) => h(4,5) = 45 mod 8 = 5
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 55
Solution
Pass 2: Uses L1 and the bitmap (as above)

Pass 2: Output
C2                     L2 (count ≥ 3)
Pair    Count          Pair    Count
(1,4)   3              (1,4)   3
(1,5)   2              (2,4)   3
(2,4)   3
(4,5)   2
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 56
Exercise
Exercise 1
Apply PCY algorithm to find frequent itemset for the given
dataset with minimum support 2.

Transaction ID Items
10 1,2,3,5
20 2,4,5
30 1,3,5
40 2,3,5
Solution
Pass 1
C1                   L1 (support ≥ 2)
Item No  Count       Item No  Count
1        2           1        2
2        3           2        3
3        3           3        3
4        1           5        4
5        4

Scan 1
TID  Items     Pairs
10   1,2,3,5   (1,2) (1,3) (1,5) (2,3) (2,5) (3,5)
20   2,4,5     (2,4) (2,5) (4,5)
30   1,3,5     (1,3) (1,5) (3,5)
40   2,3,5     (2,3) (2,5) (3,5)

Scan 1: hash-bucket generation, h(i,j) = ij mod 8
Bit vector  Bucket  Count  Pairs
0           0       1      (2,4)
1           1       3      (2,5)
0           2       0
1           3       3      (3,5)
0           4       1      (1,2)
1           5       3      (1,3) (4,5)
0           6       0
1           7       4      (1,5) (2,3)

Bitmap 1 for hash buckets
Bit vector  Bucket number
1           1
1           3
1           5
1           7
Solution
C1                   L1                   Bitmap 1 for hash buckets
Item No  Count       Item No  Count       Bit vector  Bucket number
1        2           1        2           1           1
2        3           2        3           1           3
3        3           3        3           1           5
4        1           5        4           1           7
5        4

C2                L2                C3                   L3
Pair    Count     Pair    Count     Itemset    Count     Itemset    Count
(1,3)   2         (1,3)   2         (1,3,5)    2         (1,3,5)    2
(1,5)   2         (1,5)   2         (2,3,5)    2         (2,3,5)    2
(2,3)   2         (2,3)   2
(2,5)   3         (2,5)   3
(3,5)   3         (3,5)   3
Exercise
Exercise 3
Here is a collection of twelve baskets. Each contains three of the six items 1 through 6.
{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
{3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}
Suppose the support threshold is 4. On the first pass of the PCY Algorithm we use a
hash table with 11 buckets, and the set {i, j} is hashed to bucket i×j mod 11.
(a) By any method, compute the support for each item and each pair of items.
(b) Which pairs hash to which buckets?
(c) Which buckets are frequent?
(d) Which pairs are counted on the second pass of the PCY Algo?

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 62


{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
Solution {3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}

Scan 1
T-ID Items Pairs
1 1,2,3 (1, 2) (1, 3) (2, 3)
2 2,3,4 (2, 3) (2, 4) (3, 4)
3 3,4,5 (3, 4) (3, 5) (4, 5)
4 4,5,6 (4, 5) (4, 6) (5, 6)
5 1,3,5 (1, 3) (1, 5) (3, 5)
6 2,4,6 (2, 4) (2, 6) (4, 6)
7 1,3,4 (1, 3) (1, 4) (3, 4)
8 2,4,5 (2, 4) (2, 5) (4, 5)
9 3,5,6 (3, 5) (3, 6) (5, 6)
10 1,2,4 (1, 2) (1, 4) (2, 4)
11 2,3,5 (2, 3) (2, 5) (3, 5)
12 3,4,6 (3, 4) (3, 6) (4, 6)
{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
Solution {3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}

h(i,j) = i*j mod 11

C1               L1 (support ≥ 4)
Item  Count      Item  Count
1     4          1     4
2     6          2     6
3     8          3     8
4     8          4     8
5     6          5     6
6     4          6     4

Hash buckets (before the pass)
Bit vector  Bucket  Count
0           0       0
0           1       0
0           2       0
0           3       0
0           4       0
0           5       0
0           6       0
0           7       0
0           8       0
0           9       0
0           10      0
{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
Solution {3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}

h(i,j) = i*j mod 11

C1               L1 (support ≥ 4)
Item  Count      Item  Count
1     4          1     4
2     6          2     6
3     8          3     8
4     8          4     8
5     6          5     6
6     4          6     4

Hash buckets (after the pass)
Bit vector  Bucket  Count
0           0       0
1           1       5
1           2       5
0           3       3
1           4       6
0           5       1
0           6       3
0           7       2
1           8       6
0           9       3
0           10      2
{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
Solution {3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}

L1: items 1–6 with counts 4, 6, 8, 8, 6, 4

Frequent buckets (count ≥ 4): 1 (count 5), 2 (count 5), 4 (count 6), 8 (count 6)

C2 (pairs of frequent items hashing to a frequent bucket)   L2 (support ≥ 4)
Pair    Count                                               Pair    Count
(1,2)   2                                                   (2,4)   4
(1,4)   2                                                   (3,4)   4
(2,4)   4                                                   (3,5)   4
(2,6)   1
(3,4)   4
(3,5)   4
(4,6)   3
(5,6)   2
Exercise
Exercise 3
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, c, b, n} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• Support threshold s = 3.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 68


Summary
• Frequent pattern mining algorithm
– Apriori
– Limitations
• Handling Larger Datasets in Main Memory using
– Algorithm of Park, Chen, and Yu
– The Multistage Algorithm
– The Multihash Algorithm
• The SON Algorithm and MapReduce Counting

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 69


Thank you ! ! !

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 70


Lecture on

Big Data Analytics

Big Data Mining Algorithms:


Frequent Pattern Mining
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
– Frequent Pattern Mining :Handling Larger Datasets in Main Memory, Basic Algorithm of Park,
Chen, and Yu, The SON Algorithm, and MapReduce.
– Clustering Algorithms: CURE Algorithm. Canopy Clustering, Clustering with MapReduce
– Classification Algorithms: Parallel Decision trees, Overview SVM classifiers, Parallel SVM, K-
nearest Neighbor classifications for Big Data, One Nearest Neighbour.
6. Big Data Analytics Applications

72
The Market-Basket Model
Input:

• A large set of items TID Items


1 Bread, Coke, Milk
– e.g., things sold in a supermarket 2 Beer, Bread
• A large set of baskets 3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
• Each basket is a small subset of items 5 Coke, Diaper, Milk
– e.g., the things one customer buys on one day
Output:
• Want to discover association rules
Rules Discovered:
– People who bought {x,y,z} tend to buy {v,w} {Milk} --> {Coke}
{Bread, Milk} --> {Beer}

• Solution : A-priori Algorithm


PCY
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 73
PCY Algorithm
• PCY, also called the Direct Hashing and Pruning (DHP) algorithm, attempts to generate large itemsets efficiently and to reduce the transaction database size.

• When generating C1, the algorithm also generates all the 2-itemsets for
each transaction, hashes them to a hash table and keeps a count.

• Observation: In pass 1 of A-Priori, most memory is idle


– We store only individual item counts
– Can we use the idle memory to reduce memory required in pass 2?

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 74


PCY Algorithm
Pass 1: FOR (each basket) :
FOR (each item in the basket) :
add 1 to item’s count;
FOR (each pair of items) :
hash the pair to a bucket;
add 1 to the count for that bucket;
Pass 2: Count all pairs {i, j} that meet the conditions
for being a candidate pair:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the
bit vector is 1 (i.e., a frequent bucket).
PCY Algorithm
• The major aim of the algorithm is to reduce the size of C2.
• It is therefore essential that the hash table is large enough so that
collisions are low.
• Collisions result in loss of effectiveness of the hash table.

• PCY extensions
– Multistage
– Multihash

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 76


1. Multistage Algorithm
• Limit the number of candidates to be counted
– Remember: Memory is the bottleneck
– Still need to generate all the itemsets but we only want to count/keep
track of the ones that are frequent
• Idea- uses two hash functions in 2 different passes which are
independent.
– After Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY
• i and j are frequent, and
• {i, j} hashes to a frequent bucket from Pass 1
• Requires 3 passes over the data

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 77


1. Multistage -- Main Memory
[Figure: main-memory layout across the three passes. Pass 1: item counts + first hash table for pairs. Pass 2: frequent items + Bitmap 1 + second hash table for pairs. Pass 3: frequent items + Bitmap 1 + Bitmap 2 + counts of candidate pairs.]

Pass 1: count items; hash each pair {i,j} into the first hash table.
Pass 2: hash {i,j} into the second hash table only if i and j are frequent and {i,j} hashes to a frequent bucket in Bitmap 1.
Pass 3: count {i,j} only if i and j are frequent, {i,j} hashes to a frequent bucket in Bitmap 1, and {i,j} hashes to a frequent bucket in Bitmap 2.
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 78
1. Multistage – Pass 3
• Count only those pairs {i, j} that satisfy these candidate pair
conditions:
1. Both i and j are frequent items.
2. Using the first hash function, the pair hashes to a bucket whose bit
in the first bit-vector is 1.
3. Using the second hash function, the pair hashes to a bucket whose
bit in the second bit-vector is 1.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 79


2. Multihash Algorithm

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 80


2. Multihash Algorithm
• Key idea: Use several independent hash tables on the first pass
• Risk: Halving the number of buckets doubles the average count
– We have to be sure most buckets will still not reach count s

• If so, we can get a benefit like multistage, but in only 2 passes

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 81


2. Multihash -- Main-Memory

[Figure: main-memory layout. Pass 1: item counts + two independent hash tables for pairs. Pass 2: frequent items + Bitmap 1 + Bitmap 2 + counts of candidate pairs.]

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 82


PCY: Extensions
• Either multistage or multihash can use more than two hash functions

1. In multistage, there is a point of diminishing returns, since the bit-


vectors eventually consume all of main memory.

2. In multihash, the bit-vectors together occupy exactly what one PCY bitmap does, but too many hash functions leave so few buckets per table that most bucket counts exceed s.

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 83


Frequent Itemsets in < 2 Passes

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 84


Frequent Itemsets in < 2 Passes
• A-Priori, PCY, etc., take k passes to find frequent itemsets of size k

• Can we use fewer passes?

• Use 2 or fewer passes for all sizes, but may miss some frequent itemsets
– Random sampling algorithm
– SON (Savasere, Omiecinski, and Navathe) algorithm

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 85


1. Random Sampling
• Take a random sample of the market baskets
• Run Apriori or one of its improvements in main memory
  – So we don’t pay for disk I/O each time we increase the size of itemsets
  – Reduce the support threshold proportionally to match the sample size
  – Verify that the candidate pairs are truly frequent in the entire data set by a second pass (avoid false positives)

[Figure: main memory holds a copy of the sample baskets plus space for the counts.]

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 86


2. SON Algorithm
• Repeatedly read small subsets of the baskets into main memory and run
an in-memory algorithm to find all frequent itemsets
– It’s not sampling, but processing the entire file in memory-sized chunks
• An itemset becomes a candidate if it is found to be frequent in any one
or more subsets of the baskets.
• On a second pass, count all the candidate itemsets and determine
which are frequent in the entire set

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 87


2. SON – Distributed Version
• SON lends itself to distributed data mining
• Baskets distributed among many nodes
– Compute frequent itemsets at each node
– Distribute candidates to all nodes
– Accumulate the counts of all candidates
• Phase 1: Find candidate itemsets
– Map?
– Reduce?
• Phase 2: Find true frequent itemsets
– Map?
– Reduce?
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 88
SON: Map/Reduce
• Map1 Function:
– Take the assigned subset of the baskets and find the itemsets frequent in the
subset using the algorithm
– lower the support threshold from s to ps if each Map task gets fraction p of the
total input file.
– output is a set of key-value pairs (F, 1), where F is a frequent itemset from the
sample.
• Reduce1 Function:
– Each Reduce task is assigned a set of itemsets.
– The value is ignored, and The Reduce task simply produces those keys (itemsets)
that appear one or more times.
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 89
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 90
SON: Map/Reduce
• Map2 Function:
– Take output of Reduce1 and a portion of the input data file.
– Each Map task counts the number of occurrences of each of the candidate itemsets among the
baskets in the portion of the dataset that it was assigned.
– The output is a set of key-value pairs (C, v), where C is one of the candidate sets and v is the
support for that itemset among the baskets that were input to this Map task.
• Reduce2 Function:
– Take the itemsets they are given as keys and sum the associated values.
– The result is the total support for each of the itemsets that the Reduce task was assigned to
handle.
– Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce
task outputs these itemsets with their counts.
– Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 91
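The two SON rounds can be written as plain map/reduce functions. The sketch below is an illustration: the chunking, the shuffle and the in-memory apriori() helper (e.g. the earlier A-Priori sketch) are assumed, not part of the slides.

def map1(chunk, s, p, apriori):
    # run the in-memory algorithm on one chunk with the lowered threshold p*s
    for itemset in apriori(chunk, max(1, int(p * s))):
        yield (frozenset(itemset), 1)

def reduce1(itemset, _values):
    # every locally frequent itemset becomes a global candidate (value ignored)
    yield itemset

def map2(chunk, candidates):
    # count every candidate itemset in this chunk of baskets
    for c in candidates:
        support = sum(1 for basket in chunk if c <= set(basket))
        yield (c, support)

def reduce2(itemset, partial_supports, s):
    # sum the per-chunk supports; emit only itemsets frequent in the whole dataset
    total = sum(partial_supports)
    if total >= s:
        yield (itemset, total)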
3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 92
In short

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 93


Summary
• Frequent pattern mining algorithm
– Apriori
– Limitations
• Handling Larger Datasets in Main Memory using
– Algorithm of Park, Chen, and Yu
– The Multistage Algorithm
– The Multihash Algorithm
• The SON Algorithm and MapReduce Counting

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 94


Thank you ! ! !

3/24/2024 Mrs. Archana Shirke, Fr. C. R. I. T. Vashi 95


Lecture on

Big Data Analytics

Big Data Mining Algorithms:


Clustering Algorithms
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
– Frequent Pattern Mining :Handling Larger Datasets in Main Memory, Basic Algorithm of Park,
Chen, and Yu, The SON Algorithm, and MapReduce.
– Clustering Algorithms: K-mean, CURE Algorithm. Canopy Clustering, Clustering with MapReduce
– Classification Algorithms: Parallel Decision trees, Overview SVM classifiers, Parallel SVM, K-
nearest Neighbor classifications for Big Data, One Nearest Neighbour.
6. Big Data Analytics Applications

97
What is clustering?
• Clustering is the classification of
objects into different groups

• It’s the partitioning of a data set into subsets (clusters), so that the data in each subset share some common trait, often according to some defined distance measure.

3/24/2024 Mrs. Archana Shirke 98


Types of clustering
1. Hierarchical clustering: find successive clusters using previously
established clusters
1. Agglomerative ("bottom-up"): Agglomerative algorithms begin with each
element as a separate cluster and merge them into successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with the whole set and proceed
to divide it into successively smaller clusters.
2. Partition based clustering: Partition based algorithms determine all
clusters at once. They include:
– K-means and derivatives
– Fuzzy c-means clustering

3/24/2024 Mrs. Archana Shirke 99


k-mean clustering
K-mean Algorithm
• k-Means clustering algorithm proposed by J. Hartigan and M. A. Wong
[1979].
• Given a set of n distinct objects, the k-Means clustering algorithm
partitions the objects into k number of clusters such that intracluster
similarity is high but the intercluster similarity is low.
• In this algorithm, the user has to specify k, the number of clusters; the objects are assumed to have numeric attributes, so any one of the distance metrics can be used to demarcate the clusters.
K-mean Algorithm
Assumes Euclidean space/distance
Start by picking k, the number of clusters
• Initialize clusters by picking one point per cluster
• 1) For each point, place it in the cluster whose current centroid is nearest
• 2) After all points are assigned, update the locations of centroids of the k
clusters
• 3) Reassign all points to their closest centroid
– Sometimes moves points between clusters
• Repeat 2 and 3 until convergence
– Convergence: Points don’t move between clusters and centroids stabilize

3/24/2024 Mrs. Archana Shirke 102
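A minimal NumPy sketch of these steps (random initial points, Euclidean distance; this is an illustration, not the slides' own code, and a production version would also handle empty clusters):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # initialize with one point per cluster
    for _ in range(iters):
        # 1) assign each point to the cluster whose centroid is nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2) update the centroid locations; 3) stop when they no longer move
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# e.g. for the 1-D exercise later in this lecture:
# kmeans(np.array([[2.], [4.], [10.], [12.], [3.], [20.], [30.], [11.], [25.]]), 2)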


How the K-Mean Clustering algorithm works?

3/24/2024 Mrs. Archana Shirke 103


Example: Assigning Clusters

x
x
x
x
x

x x x x x x

x … data point
… centroid Clusters after round 1
3/24/2024 Mrs. Archana Shirke 104
Example: Assigning Clusters

x
x
x
x
x

x x x x x x

x … data point
… centroid Clusters after round 2
3/24/2024 Mrs. Archana Shirke 105
Example: Assigning Clusters

x
x
x
x
x

x x x x x x

x … data point
… centroid Clusters at the end
3/24/2024 Mrs. Archana Shirke 106
Issue: Getting the k right
How to select k?
• Try different k, looking at the change in the average distance to centroid
as k increases
• Average falls rapidly until right k, then changes little

[Figure: plot of average distance to centroid (y-axis) versus k (x-axis); the best value of k is at the elbow where the curve flattens.]

3/24/2024 Mrs. Archana Shirke 107
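A small sketch of this elbow check, reusing the kmeans() sketch above; X is assumed to be a NumPy array of the data points.

import numpy as np

def avg_distance_to_centroid(X, k):
    labels, centroids = kmeans(X, k)          # kmeans() from the earlier sketch
    return np.linalg.norm(X - centroids[labels], axis=1).mean()

for k in range(1, 8):
    print(k, avg_distance_to_centroid(X, k))  # look for the "elbow" where the drop flattens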


Example
Illustration of the k-Means clustering algorithm
16 objects with two attributes A1 and A2:

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

[Figure: scatter plot of the 16 objects, A1 on the x-axis and A2 on the y-axis.]
109
Illustration of k-Means clustering algorithms
• Suppose, k=3. Three objects are chosen at random shown as circled. These three centroids are shown
below.
Initial Centroids chosen randomly

Centroid Objects
A1 A2
c1 3.8 9.9
c2 7.8 12.2
c3 6.2 18.5

• Let us consider the Euclidean distance measure (L2 Norm) as the distance measurement in our
illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3 respectively. The distance
calculations are shown.
• Assignment of each object to the respective centroid is shown in the right-most column and the
clustering so obtained
110
Illustration of k-Means clustering algorithms
A1 A2 d1 d2 d3 cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2
Initial cluster

Distance calculation 111


Illustration of k-Means clustering algorithms
The calculation of the new centroids of the three clusters, using the mean of the attribute values of A1 and A2, is shown in the table below. The clusters with the new centroids are shown in Fig 16.3.
New Objects
Centroid A1 A2
c1 4.6 7.1
c2 8.2 10.7
c3 6.6 18.6

Calculation of new centroids

Initial cluster with new centroids


CS 40003: Data Analytics 112
Illustration of k-Means clustering algorithms
We next reassign the 16 objects to three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in Fig 16.4.
Note that point p moves from cluster C2 to cluster C1.

Cluster after first iteration


CS 40003: Data Analytics 113
Illustration of k-Means clustering algorithms
• The newly obtained centroids after the second iteration are given in the table below. Note that the centroid c3 remains unchanged, whereas c2 and c1 changed a little.
• With respect to newly obtained cluster centres, 16 points are reassigned again. These are the
same clusters as before. Hence, their centroids also remain unchanged.
• Considering this as the termination criteria, the k-means algorithm stops here. Hence, the
final cluster
Cluster after Second iteration

Cluster centres after second iteration

Centroid Revised Centroids


A1 A2
c1 5.0 7.1
c2 8.1 12.0
c3 6.6 18.6

CS 40003: Data Analytics 114


Exercise

3/24/2024 Mrs. Archana Shirke 115


Exercise
Given: {2,4,10,12,3,20,30,11,25}, k=2. Randomly assign means:
m1=3,m2=4 and find clusters?

3/24/2024 Mrs. Archana Shirke 116


Solution
Data D1 from D2 from Cluster
m1=3 m2=4
2 1 2 1
4 1 0 2
10 7 6 2
12 9 8 2
3 0 1 1
20 17 16 2
30 27 26 2
11 8 7 2
25 22 21 2

C1: {2,3}, m1=2.5


C2: {4,10,12,20,30,11,25}, m2=16
3/24/2024 Mrs. Archana Shirke 117
Solution
Data D1 from D2 from Cluster
m1=2.5 m2=16
2 0.5 14 1
4 1.5 12 1
10 7.5 6 2
12 9.5 4 2
3 0.5 13 1
20 17.5 4 2
30 27.5 14 2
11 8.5 5 2
25 22.5 9 2

C1: {2,3,4}, m1=3


C2: {10,12,20,30,11,25}, m2=18
3/24/2024 Mrs. Archana Shirke 118
Solution
Data D1 from D2 from Cluster
m1=3 m2=18
2 1 16 1
4 1 14 1
10 7 8 1
12 9 6 2
3 0 15 1
20 17 2 2
30 27 12 2
11 8 7 2
25 22 7 2

C1: {2,3,4,10} m1=4.75


C2: {12,20,30,11,25}, m2=19.6
3/24/2024 Mrs. Archana Shirke 119
Solution
Data   D1 from m1=4.75   D2 from m2=19.6   Cluster
2      2.75              17.6              1
4      0.75              15.6              1
10     5.25              9.6               1
12     7.25              7.6               1
3      1.75              16.6              1
20     15.25             0.4               2
30     25.25             10.4              2
11     6.25              8.6               1
25     20.25             5.4               2

C1: {2,3,4,10,11,12}, m1=7
C2: {20,30,25}, m2=25
3/24/2024 Mrs. Archana Shirke 120
Solution
Data   D1 from m1=7   D2 from m2=25   Cluster
2      5              23              1
4      3              21              1
10     3              15              1
12     5              13              1
3      4              22              1
20     13             5               2
30     23             5               2
11     4              14              1
25     18             0               2

C1: {2,3,4,10,11,12}, m1=7
C2: {20,30,25}, m2=25
No point changes cluster and the centroids do not move, so the algorithm has converged.
3/24/2024 Mrs. Archana Shirke 122
CURE Algorithm
(Clustering Using REpresentatives)
The CURE Algorithm
• CURE is an efficient algorithm for large databases that is
more robust to outliers and identifies clusters having non-spherical
shapes and size variances

• CURE
– algorithm in the point-assignment class
– Assumes a Euclidean distance
– Allows clusters to assume any shape
– Uses a collection of representative points to represent clusters
– It handles data that is too large to fit in main memory

3/24/2024 Mrs. Archana Shirke 124


The CURE Algorithm
• It uses a small set of representative sample points as the cluster representation rather than every point in the cluster
• The worst-case time complexity is O(n² log n) and the space complexity is O(n), due to the use of k-d trees and a heap.

• Two pass algorithm


– Starting CURE
– Finishing CURE

3/24/2024 Mrs. Archana Shirke 125


Starting CURE
Pass 1:
1. Initial clusters: Pick a random sample of points and cluster it in main
memory
– Cluster these points hierarchically – group nearest points/clusters
2. Pick representative points:
– For each cluster, pick a sample of points, as far from one another as possible
3. Shrink towards centroid:
– From the sample, pick representatives by moving them α(say 20%) toward the
centroid of the cluster

3/24/2024 Mrs. Archana Shirke 126


Finishing CURE
Pass 2:
• Now, rescan the whole dataset and visit each point p in the data set
• Place p in the “closest cluster”
  – “closest”: find the closest representative to p and assign p to that representative’s cluster

3/24/2024 Mrs. Archana Shirke 127
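A small sketch of the core CURE operations: picking dispersed representatives, shrinking them toward the centroid, and assigning every point to the cluster of its nearest representative. The initial hierarchical clustering of the sample is assumed to have been done already; names and defaults are illustrative.

import numpy as np

def pick_representatives(cluster_points, c=4, alpha=0.2):
    # greedily pick c points that are as far from one another as possible
    reps = [cluster_points[0]]
    while len(reps) < min(c, len(cluster_points)):
        dists = [min(np.linalg.norm(p - r) for r in reps) for p in cluster_points]
        reps.append(cluster_points[int(np.argmax(dists))])
    # shrink each representative a fraction alpha of the way toward the centroid
    centroid = cluster_points.mean(axis=0)
    return [r + alpha * (centroid - r) for r in reps]

def assign(point, reps_per_cluster):
    # pass 2: place the point in the cluster of the closest representative
    return min(reps_per_cluster,
               key=lambda cid: min(np.linalg.norm(point - r)
                                   for r in reps_per_cluster[cid]))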


Example: Initial Clusters
[Figure: scatter plot (salary vs. age) with two groups of points, h and e, forming the initial hierarchical clusters of the sample.]

3/24/2024 Mrs. Archana Shirke 128


Example: Pick Dispersed Points
[Figure: the same salary-vs-age scatter; for each cluster, pick (say) 4 remote points as representatives.]

3/24/2024 Mrs. Archana Shirke 129


Example: Pick Dispersed Points
[Figure: the same scatter; move the chosen representative points (say) 20% toward the centroid of their cluster.]

3/24/2024 Mrs. Archana Shirke 130


CURE Summary
• CURE can effectively detect proper shape of the cluster with the help of
scattered representative point and centroid shrinking.
• CURE can reduce computation time and memory loading with random
sampling and 2 pass clustering.
• CURE can effectively remove outlier.
• The quality and effectiveness of CURE can be tuned by varying s (sample size), p (number of partitions), c (number of representative points) and α (shrink factor) to adapt to different input data sets.

3/24/2024 Mrs. Archana Shirke 131


Canopy Clustering

3/24/2024 Mrs. Archana Shirke 132


Canopy Algorithm
• This is an unsupervised pre-clustering algorithm introduced by Andrew
McCallum, Kamal Nigam and Lyle Ungar in 2000.

• It is often used as preprocessing step for the K-means or hierarchical


clustering algorithm.

• It is intended to speed up clustering operations on large data sets,


where using another algorithm directly may be impractical due to the
size of the data set.

3/24/2024 Mrs. Archana Shirke 133


Canopy Algorithm
The algorithm uses two thresholds T1(the loose distance) and T2 (the tight distance),
where T1 > T2
1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new 'canopy' containing this point.
3. For each point left in the set, assign it to the new canopy if its distance to the first
point of the canopy is less than the loose distance T1.
4. If the distance of the point is additionally less than the tight distance T2, remove it
from the original set.
5. Repeat from step 2 until there are no more data points in the set to cluster.
6. These relatively cheaply clustered canopies can be sub-clustered using a more
expensive but accurate algorithm.
3/24/2024 Mrs. Archana Shirke 134
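The steps above, written as a short Python sketch; the distance function and the two thresholds are the caller's choice, and the names are illustrative.

def canopy(points, t1, t2, dist):
    # t1 = loose distance, t2 = tight distance, with t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)                 # step 2: start a new canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:                            # step 3: within the loose distance
                members.append(p)
            if d >= t2:                           # step 4: only points inside the tight distance leave the set
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append(members)
    return canopies

# e.g. canopy(list_of_points, 3.0, 1.0, lambda a, b: abs(a - b)) for 1-D data;
# points between t2 and t1 of a center may end up in more than one canopy.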
Example

3/24/2024 Mrs. Archana Shirke 135


Example

3/24/2024 Mrs. Archana Shirke 136


Example

3/24/2024 Mrs. Archana Shirke 137


Example

3/24/2024 Mrs. Archana Shirke 138


Canopy Clustering

3/24/2024 Mrs. Archana Shirke 139


Canopy Clustering with MR

3/24/2024 Mrs. Archana Shirke 140


Summary
• K-mean clustering
• CURE Algorithm
– Starting CURE
– Finishing CURE
• Canopy Clustering
• Clustering with MapReduce

3/24/2024 Mrs. Archana Shirke 141


Thank you ! ! !

3/24/2024 Mrs. Archana Shirke 142


Lecture on

Big Data Analytics

Big Data Mining Algorithms:


Classification Algorithm
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
– Frequent Pattern Mining :Handling Larger Datasets in Main Memory, Basic Algorithm of Park,
Chen, and Yu, The SON Algorithm, and MapReduce.
– Clustering Algorithms: CURE Algorithm. Canopy Clustering, Clustering with MapReduce
– Classification Algorithms: Parallel Decision trees, Overview SVM classifiers, Parallel SVM, K-
nearest Neighbor classifications for Big Data, One Nearest Neighbour.
6. Big Data Analytics Applications

144
Introduction
• Classification is the problem of identifying to which of a set of
categories a new observation belongs, on the basis of a training set of
data containing observations (or instances) whose category membership
is known.
• It creates model from a set of training data
– individual data records (samples, objects, tuples)
– records can each be described by its attributes
– attributes arranged in a set of classes
– supervised learning - each record is assigned a class label

3/24/2024 Mrs. Archana Shirke 145


Introduction
• Model form representations
– Mathematical formulae
– Classification rules(“if, then” statements )
– Decision trees (graphical representation)

• Model utility for data classification


– predict unknown outcomes for a new (unseen) data set
• classification - outcomes always discrete or nominal values
• regression may contain continuous or ordered values

3/24/2024 Mrs. Archana Shirke 146


Decision Tree
• Works like a flow chart
• Looks like an upside down tree
• Nodes
– appear as rectangles or circles
– represent test or decision
• Lines or branches - represent outcome of a test
• Circles - terminal (leaf) nodes
• Top or starting node- root node
• Internal nodes - rectangles
Parallel Decision Tree
• Parallel Decision Trees (PDT) overcome the limitations of equivalent serial algorithms, enabling the use of very-large-scale training sets in real-world applications of machine learning and data mining.

• Benefits –
– Scalability and
– Speedup
Introduction
• Different Classification Algorithms
– Decision trees
– Logistic Regression
– Support vector machines
– K- nearest Neighbor Algorithm
– Naive Bayes/ Bayesian classifier
– Random Forest
– Gradient Boosting
– Parallel Decision Tree

3/24/2024 Mrs. Archana Shirke 149


Support Vector Machines (SVM)
Support Vector Machines (SVM)
• Supervised learning methods for classification and regression; a relatively new class of successful learning methods - they can represent non-linear functions and they have an efficient training algorithm derived from the statistical learning theory of Vapnik and Chervonenkis.

• SVM got into mainstream because of their exceptional performance in


Handwritten Digit Recognition
– 1.1% error rate which was comparable to a very carefully constructed (and complex) ANN
• SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Support Vector Machines (SVM)
• SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems.
• The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that
we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called as support vectors, and
hence algorithm is termed as Support Vector Machine.
Support Vector Machines (SVM)
• The dimensions of the hyperplane depend on the features present in
the dataset, which means if there are 2 features, then hyperplane will
be a straight line. And if there are 3 features, then hyperplane will be a
2-dimension plane.
• Always create a hyperplane that has a maximum margin, which means
the maximum distance between the data points.
• Support Vectors: The data points or vectors that are the closest to the
hyperplane and which affect the position of the hyperplane are termed
as Support Vectors. Since these vectors support the hyperplane, hence
called a Support vector.
Support Vector Machines (SVM)
• As a task of classification, it searches for optimal hyperplane(i.e.,
decision boundary) separating the tuples of one class from another.
• Although SVM-based classification (i.e., training) is extremely slow, the result is highly accurate. Further, testing an unknown data point is very fast.
• SVM is less prone to overfitting than other methods. It also yields a compact model for classification.
Support Vector Machines (SVM)
2 Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then
such data is termed as linearly separable data, and classifier is used called as
Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separated data,
which means if a dataset cannot be classified by using a straight line, then
such data is termed as non-linear data and classifier used is called as Non-
linear SVM classifier.

Maximum margin hyperplane : a key concept in SVM.


Maximum Margin Hyperplane
• Let us assume a simplistic
situation that given a training
data D ={t1,t2,....tn} with a set of
n tuples, which belong to two
classes either + or - and each
tuple is described by two
attributes say A1, A2.

A 2D data linearly seperable by hyperplanes


Maximum Margin Hyperplane
• It can be seen that there are an infinite number of separating lines that
can be drawn. Therefore, the following two questions arise:
– Whether all hyperplanes are equivalent so far the classification of
data is concerned?
– If not, which hyperplane is the best?
• We may note that so far the classification error is concerned (with
training data), all of them are with zero error.
• However, there is no guarantee that all hyperplanes perform
equally well on unseen (i.e., test) data.
Maximum Margin Hyperplane
H2
• Thus, for a good classifier it must
choose one of the infinite
number of hyperplanes, so that it
performs better not only on
training data but as well as test
data.
• To illustrate how the different
choices of hyperplane influence the
classification error, consider any
arbitrary two hyperplanes H1 and
H2 as shown in Fig.
Maximum Margin Hyperplane
H2
• Two hyperplanes H1 and H2 have
their own boundaries called decision
boundaries(b11 and b12 for H1 and b21
and b22 for H2).
• A decision boundary is a boundary
which is parallel to hyperplane and
touches the closest class in one side
of the hyperplane.
• The distance between the two
decision boundaries of hyperplane
is called the margin. So, if data is
classified using H1, then it is with
larger margin then using H2.
Maximum Margin Hyperplane
H2
• Intuitively, the classifier that
contains hyperplane with a small
margin are more susceptible to
model over fitting and tend to
classify with weak confidence on
unseen data.
• Thus during the training or learning
phase, the approach would be to
search for the hyperplane with
maximum margin.
• Such a hyperplane is called
maximum margin hyperplane and
abbreviated as MMH.
Linear SVM
Linear SVM
• A SVM which is used to classify data which are linearly separable is
called linear SVM.

• In other words, a linear SVM searches for a hyperplane with the


maximum margin.

• This is why a linear SVM is often termed as a maximal margin


classifier (MMC).
Linear SVM
• To find the MMH given a training set, we shall consider a binary
classification problem consisting of n training data.

• Each tuple is denoted by (Xi ,Yi ) where Xi = (xi 1,xi 2.......xim)


corresponds to the attribute set for the ith tuple (data in m-dimensional
space) and Yi ϵ [+,-] denotes its class label.
• given {(Xi , Yi )} we can obtain a hyperplane which separates all into
two sides of it (of course with maximum gap)
Linear SVM
• Let us consider a 2-D training tuple with attributes A1 and A2 as X = (x1,
x2), where x1 and x2 are values of attributes A1 and A2, respectively for X .
• Equation of a plane in 2-D space can be written as
w0 + w1x1 + w2x2 = 0 [e.g., ax + by + c = 0]
where w0, w1, and w2 are some constants defining the slope and intercept
of the line.
• Any point lying above such a hyperplane satisfies
w0 + w1x1 + w2x2 > 0
• Similarly, any point lying below the hyperplane satisfies
w0 + w1x1 + w2x2 < 0
Linear SVM
• In fact, the equation of a hyperplane in Rm is
w1x1 + w2x2 + ... + wmxm + b = 0
where the wi's are real numbers and b is a real constant (the intercept
term, which can be positive or negative).
• In matrix form, a hyperplane can thus be represented as
W.X + b = 0
where W = [w1, w2, ..., wm], X = [x1, x2, ..., xm], and b is a real constant.
W and b are the parameters of the classifier model to be estimated from a
given training set D; a small numerical check follows below.
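As a small illustration (not part of the original slides), the side of the hyperplane on which a point lies can be read off from the sign of W.X + b; the weights below are made-up values chosen only for this sketch:

import numpy as np

# Hypothetical hyperplane parameters, for illustration only
W = np.array([2.0, -1.0])
b = -3.0

def side_of_hyperplane(x):
    # Return +1 if the point is on the positive side of W.X + b = 0, else -1
    return 1 if np.dot(W, x) + b > 0 else -1

print(side_of_hyperplane(np.array([3.0, 1.0])))  # 2*3 - 1*1 - 3 = 2  -> +1
print(side_of_hyperplane(np.array([0.0, 0.0])))  # -3 < 0             -> -1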
Linear SVM
• 2 class problem
• Many decision
boundaries can separate
these two classes
• Which one should we
choose?
– Bad / Good decision
[Figure: Class 1 and Class 2 points with several candidate separating lines]
Linear SVM
Example of Bad Decision Boundaries
[Figure: two examples of bad decision boundaries separating Class 1 and Class 2, each passing very close to one of the classes]
Good Decision Boundary: Margin Should Be Large
• The decision boundary should be as far away from the data of both
classes as possible, i.e., we should maximize the margin
m = 2 / √(w.w) = 2 / ||w||
• Support vectors are the data points that the margin pushes up against.
• The maximum margin linear classifier is the linear classifier with the
maximum margin. This is the simplest kind of SVM (called a linear SVM).
[Figure: Class 1 and Class 2 separated by the maximum-margin boundary; the margin m is bounded by the support vectors]
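In symbols, maximizing the margin leads to the standard hard-margin optimization problem (stated here for completeness; a sketch rather than a quotation from the slide):

\max_{w,b}\; m = \frac{2}{\lVert w \rVert}
\;\Longleftrightarrow\;
\min_{w,b}\; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1,\dots,n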
Non-linear SVM
Non-linear SVM
• SVM classification for non-separable data.
• In general, if the data are linearly separable, then a separating
hyperplane exists; otherwise, no such hyperplane exists.
• The figure shows 2-D views of data that are linearly separable
and data that are not.
Non-linearly Separable Problems
• We allow an “error” ξi in classification; it is based on the output of the
discriminant function wTx + b. The sum of the ξi approximates the number of
misclassified samples.
• New objective function (standard soft-margin form):
minimize  ½||w||² + C Σi ξi   subject to   yi(wTxi + b) ≥ 1 − ξi,  ξi ≥ 0
• C is a tradeoff parameter between error and margin, chosen by the
user; a large C means a higher penalty on errors.
[Figure: overlapping Class 1 and Class 2 points; points on the wrong side of the margin incur slack ξi]
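To see the effect of C, a small scikit-learn sketch on synthetic overlapping data (the dataset and the candidate C values are illustrative assumptions, not from the slides):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping classes, so some slack is unavoidable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Larger C penalizes slack more heavily: fewer margin violations are tolerated
    print(f"C={C}: support vectors={clf.n_support_.sum()}, "
          f"train accuracy={clf.score(X, y):.3f}")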
Transformation to Feature Space
• Possible problems of the transformation:
– High computational burden due to high dimensionality, and it is hard to get a good
estimate.
• SVM solves these two issues simultaneously:
– A “kernel function” is used for efficient computation.
– Minimizing ||w||² can lead to a “good” classifier.
[Figure: a mapping φ(·) transforms points from the input space into a higher-dimensional feature space where they become linearly separable]
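A minimal sketch of why the kernel helps: for the degree-2 polynomial kernel K(x, z) = (x·z)², an explicit feature map φ gives exactly the same inner products, so φ never has to be computed (the vectors below are arbitrary examples):

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input (x1, x2)
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def kernel(x, z):
    # The same inner product computed directly in the input space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print(np.dot(phi(x), phi(z)))  # 25.0 -- inner product in feature space
print(kernel(x, z))            # 25.0 -- identical, without the explicit mapping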
Examples of Kernel Functions
• Polynomial kernel with degree d:
K(x, z) = (xTz + 1)^d
• Radial Basis Function (RBF) kernel with width σ:
K(x, z) = exp(−||x − z||² / (2σ²))
– Closely related to radial basis function neural networks.
• Sigmoid kernel with parameters κ and θ:
K(x, z) = tanh(κ xTz + θ)
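The same kernels written as a short NumPy sketch (the default parameter values d, sigma, kappa and theta below are arbitrary placeholders):

import numpy as np

def polynomial_kernel(x, z, d=2):
    return (np.dot(x, z) + 1) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=-1.0):
    return np.tanh(kappa * np.dot(x, z) + theta)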
Steps in SVM
1. Prepare data matrix {(xi, yi)}
2. Select a Kernel function
3. Select the error parameter C
4. “Train” the system (to find all αi)
5. New data can be classified using αi and Support Vectors
Fitting the SVM classifier to the training set:

from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
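Continuing the snippet, prediction and a quick accuracy check might look like this (x_test and y_test are assumed to come from an earlier train/test split that the slide does not show):

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = classifier.predict(x_test)                  # classify unseen data
print(confusion_matrix(y_test, y_pred))              # per-class errors
print("Accuracy:", accuracy_score(y_test, y_pred))   # overall accuracy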
Example
• Suppose we have five 1-D data points x1=1, x2=2, x3=4, x4=5, x5=6, with
points 1, 2, 5 in class 1 and points 3, 4 in class 2, i.e., y1=1, y2=1, y3=−1, y4=−1, y5=1.
• We use the polynomial kernel of degree 2:
– K(x, z) = (xz + 1)²
– C is set to 100.
• We first find the αi (i = 1, ..., 5) by solving the dual problem:
maximize Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)  subject to  Σi αi yi = 0 and 0 ≤ αi ≤ C
Example
• By using a solver, we get
α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833
– The support vectors are {x2=2, x4=5, x5=6}.
• The discriminant function is
f(x) = Σi αi yi K(xi, x) + b = 0.6667 x² − 5.333 x + b
• b is recovered by solving f(2)=1, or f(5)=−1, or f(6)=1, since x2, x4 and x5
lie on the decision boundaries, and all give b=9. Hence
f(x) = 0.6667 x² − 5.333 x + 9
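A short NumPy check of this example (using the α values quoted above; at the support vectors the discriminant should come out at +1 for x2=2 and x5=6, and −1 for x4=5):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 22/3, 29/6])   # i.e. 0, 2.5, 0, 7.333, 4.833
b = 9.0

def f(z):
    # Discriminant: sum_i alpha_i * y_i * K(x_i, z) + b, with K(u, v) = (u*v + 1)^2
    return np.sum(alpha * y * (x * z + 1) ** 2) + b

for z in x:
    print(z, round(f(z), 2))   # sign of f(z) gives the predicted class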
Example
[Figure: value of the discriminant function f(x) plotted over the data points 1, 2 (class 1), 4, 5 (class 2), and 6 (class 1)]
SVM: Summary
• Strengths
– Training is relatively easy
• We don’t have to deal with local minima as in ANNs
• The SVM solution is always global and unique
– Unlike ANNs, SVMs don’t suffer from the “curse of dimensionality”.
• How, even though the feature space may have very many (even infinite) dimensions?
• Because the maximum margin formulation only needs dot products, which the kernel computes in the input space.
– Less prone to overfitting
– Simple, easy to understand geometric interpretation.
• No large networks to mess around with.
SVM: Summary
• Weaknesses
– Training (and Testing) is quite slow compared to ANN
• Because of Constrained Quadratic Programming
– Essentially a binary classifier
• However, there are some tricks to evade this.
– Very sensitive to noise
• A few outlying data points can completely throw off the algorithm
– Biggest Drawback: The choice of Kernel function.
• There is no “set-in-stone” theory for choosing a kernel function for any given
problem (still in research...)
• Once a kernel function is chosen, there is only ONE modifiable parameter, the
error penalty C.
Parallel SVM
• Support Vector Machines (SVMs) suffer from a widely recognized
scalability problem in both memory use and computation time.
• To improve scalability, a parallel SVM algorithm (PSVM) has been developed,
which reduces memory use by performing a row-based,
approximate matrix factorization, and which loads only essential data onto
each machine to perform parallel computation.
• Let n denote the number of training instances, p the reduced matrix
dimension after factorization (p is significantly smaller than n), and m
the number of machines. PSVM reduces the memory requirement from
O(n²) to O(np/m), and improves the computation time to O(np²/m).
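A rough back-of-the-envelope illustration of these bounds with made-up numbers (n = 10^6 instances, p = 10^3, m = 100 machines, 8-byte floats; not measurements from the PSVM work):

# Illustrative sizes only
n, p, m = 10**6, 10**3, 100
full_matrix_tb = n * n * 8 / 1e12      # O(n^2): about 8 TB on a single machine
per_machine_mb = n * p / m * 8 / 1e6   # O(np/m): about 80 MB per machine
print(full_matrix_tb, per_machine_mb)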
Software
• A list of SVM implementations can be found at http://www.kernel-
machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class
classification
• SVMlight is one of the earliest implementations of SVM
• Several MATLAB toolboxes for SVM are also available
Exercise
BE-IT_SEM8_BDA_MAY18
Thank you
Lecture on
Big Data Analytics
Big Data Mining Algorithms:
K-NN Classification Algorithm
Mrs. Archana Shirke,
Department of Information Technology,
Fr. C. R. I. T., Vashi.
Modules
1. Introduction to Big Data
2. Introduction to Big Data Frameworks: Hadoop, NOSQL
3. MapReduce Paradigm
4. Mining Big Data Streams
5. Big Data Mining Algorithms
– Frequent Pattern Mining :Handling Larger Datasets in Main Memory, Basic Algorithm of Park,
Chen, and Yu, The SON Algorithm, and MapReduce.
– Clustering Algorithms: CURE Algorithm. Canopy Clustering, Clustering with MapReduce
– Classification Algorithms: Parallel Decision trees, Overview SVM classifiers, Parallel SVM, K-
nearest Neighbor classifications for Big Data, One Nearest Neighbor.
6. Big Data Analytics Applications
Introduction
• Classification is the problem of identifying to which of a set of
categories a new observation belongs, on the basis of a training set of
data containing observations (or instances) whose category membership
is known.
• It creates a model from a set of training data
– individual data records (samples, objects, tuples)
– records can each be described by its attributes
– attributes arranged in a set of classes
– supervised learning - each record is assigned a class label
Introduction
• Model form representations
– Mathematical formulae
– Classification rules(“if, then” statements )
– Decision trees (graphical representation)

• Model utility for data classification
– predict unknown outcomes for a new (unseen) data set
• classification – outcomes are always discrete or nominal values
• regression – outcomes may contain continuous or ordered values
Introduction
• Different Classification Algorithms
– Decision trees
– Logistic Regression
– Support vector machines
– K- nearest Neighbor Algorithm
– Naive Bayes/ Bayesian classifier
– Random Forest
– Gradient Boosting
– Parallel Decision Tree
K-Nearest Neighbor
K-Nearest Neighbor
• K-Nearest Neighbor (K-NN) is one of the simplest machine learning
algorithms, based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the
available cases, and puts the new case into the category that is most
similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data
point based on similarity. This means that when new data appears, it can
easily be assigned to a well-suited category using the K-NN
algorithm.
K-Nearest Neighbor
K-Nearest Neighbor
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from
the training set immediately; instead, it stores the dataset and, at the time
of classification, performs an action on the dataset.
• K-NN algorithm at the training phase just stores the dataset and when it
gets new data, then it classifies that data into a category that is much
similar to the new data.
K-Nearest Neighbor
Working
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
• Step-4: Among these k neighbors, count the number of the data
points in each category.
• Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
• Step-6: The model is ready (a minimal sketch of these steps follows below).
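A minimal scikit-learn sketch of these steps (the iris dataset is just a stand-in for any labelled training data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: choose K and the (Euclidean) distance metric
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)        # lazy learner: this only stores the training data

# Steps 3-5 happen at prediction time: find the 5 nearest training points
# and assign the majority class among them
print(knn.score(X_test, y_test))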
K-Nearest Neighbor
• Although Euclidean distance is the most common method used and
taught, it is not always the optimal choice; there are special cases, e.g.,
Hamming distance is used for categorical variables.
• There is no particular way to determine the best value of K, so we
need to try several values to find the best one; a commonly preferred
value for K is 5 (a simple selection sketch follows after this list).
• A very low value of K, such as K=1 or K=2, can be noisy and sensitive to the
effects of outliers. Larger values of K are more robust to noise, but they can
over-smooth the class boundaries and increase the amount of computation.
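One common way to pick K is plain cross-validation over a range of odd values (a sketch, reusing X_train and y_train from the earlier snippet; the candidate range is an assumption):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

candidate_ks = range(1, 16, 2)   # odd values help avoid ties in binary voting
scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X_train, y_train, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with CV accuracy", round(scores[best_k], 3))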
K-Nearest Neighbor
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
Disadvantages:
• We always need to determine the value of K, which can sometimes be
difficult. A low K value is sensitive to outliers, while a higher K value is more
resilient to outliers because it considers more voters when deciding the prediction.
• The computation cost is high because of calculating the distance
between the data points for all the training samples.
1-Nearest Neighbor
• The 1-NN classifier is one of the oldest methods known. The idea is
extremely simple: to classify X find its closest neighbor among the
training points and assign that label to X.
• What is good about this method?
– It is conceptually simple.
– It does not require learning
– It can be used even with few examples.
– Even for moderate k: wonderful performer.
– It works very well in low dimensions for complex decision surfaces.
Example
Example
• First, we choose the number of neighbors; here we choose K=5.
• Next, we calculate the Euclidean distance between the new point and the data points (formula below).
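For reference, the Euclidean distance between two m-dimensional points p and q is

d(p, q) = \sqrt{\sum_{i=1}^{m} (p_i - q_i)^2}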
Example
• Using the Euclidean distance, we obtain the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B.
• Since 3 of the nearest neighbors are from category A, this new data point
must belong to category A.
Example 2
Example 2
• K=3
[Figure: a new data point and its K=3 nearest neighbors]
Example 3
Example
• Classify Person P (Test1 = 3 and Test2 = 7) into the appropriate class
based on a 1-NN classifier.
Person Test 1 Test2 Class
A 7 7 Bad
B 7 4 Bad
C 3 4 Good
D 1 4 Good
Solution
Person Test 1 Test2 Class
A 7 7 Bad
B 7 4 Bad
C 3 4 Good
D 1 4 Good
• Calculate the Euclidean distance of each point from the new data point P(3,7):
• d(A,P) = √((7-3)² + (7-7)²) = 4
• d(B,P) = √((7-3)² + (4-7)²) = 5
• d(C,P) = √((3-3)² + (4-7)²) = 3
• d(D,P) = √((1-3)² + (4-7)²) = √13 ≈ 3.6
• Conclusion as per 1-NN: the single nearest neighbour is C, and the class of C is
“Good”, so P is assigned the class “Good”.
• Conclusion as per 3-NN: the 3 nearest neighbours are C, D and A. The majority
class among them is “Good”, so P is assigned the class “Good”.
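A quick NumPy check of these distances and of the 1-NN / 3-NN votes:

import numpy as np
from collections import Counter

points = {'A': (7, 7), 'B': (7, 4), 'C': (3, 4), 'D': (1, 4)}
labels = {'A': 'Bad', 'B': 'Bad', 'C': 'Good', 'D': 'Good'}
P = np.array([3, 7])

dists = {name: np.linalg.norm(np.array(xy) - P) for name, xy in points.items()}
print(dists)                            # A: 4.0, B: 5.0, C: 3.0, D: ~3.61

ranked = sorted(dists, key=dists.get)   # ['C', 'D', 'A', 'B']
print("1-NN class:", labels[ranked[0]])                                            # Good
print("3-NN class:", Counter(labels[n] for n in ranked[:3]).most_common(1)[0][0])  # Good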
Exercise
Exercise
• Find the payment class of Person P1 (Age = 36, Experience = 7) and
P2 (Age = 32, Experience = 5) using a K-NN classifier.
Person Age Experience Payment
A 41 21 High
B 35 10 High
C 25 4 Low
D 27 6 Low
E 30 8 Low
Summary
• K-NN
• 1-NN
• Example
Thank you ! ! !