
Machine Learning
(Supervised Learning and Examples)

BigData Week13
2024. 5. 30
Eunhui Kim (김은희)
[email protected]
01 Overview of Machine Learning
02 Supervised Learning
03 Examples
   • Linear Regression with diabetes
   • Decision Tree with petrol consumption
   • SVM with human body size
04 6th Assignment: 3 problems (p. 19, 32, 53)
2
Artificial Intelligence (AI) refers to machines that perform tasks which are generally considered smart.

 Machine Learning is the ability of artificial-intelligence-based machines to access data and learn from it.

 Deep Learning is a machine learning algorithm based on neural network models trained with extensive data.

https://www.byteant.com/blog/computer-vision-vs-machine-learning-vs-deep-learning-guide-to-ai-
applications/ 3
4
https://www.coursera.org/articles/data-science-vs-machine-learning
5
https://se.mathworks.com/discovery/reinforcement-learning.html
Data with Labels

6
Data without Labels

7
States and actions according to the reward

8
Top 10 machine learning algorithms:
1. Naïve Bayes Classifier Algorithm
2. K-Means Clustering Algorithm
3. Support Vector Machine Algorithm (SVM)
4. Principal Component Analysis (PCA)
5. Linear Regression Algorithm
6. Logistic Regression Algorithm
7. Decision Tree Algorithm
8. Random Forest Algorithm
9. K-Nearest Neighbors Algorithm (KNN)
10. Artificial Neural Network Algorithm

[Figure: Machine Learning taxonomy]
• Supervised Learning
  - Classification: Naïve Bayes, SVM, K-Nearest Neighbor
  - Regression: Decision Tree, Linear Regression, Logistic Regression
• Unsupervised Learning
  - Clustering: K-means, Spectral Clustering, DBSCAN
  - Dimensionality Reduction: PCA, Feature Selection, Linear Discriminant Analysis (LDA)
• Reinforcement Learning

9
https://www.geeksforgeeks.org/top-10-algorithms-every-machine-learning-engineer-should-know/
   Toy Datasets                      Function
1  Boston house prices dataset       load_boston(*[, return_X_y])
2  Iris plants dataset               load_iris(*[, return_X_y, as_frame])
3  Diabetes dataset                  load_diabetes(*[, return_X_y, as_frame])
4  Handwritten digits dataset        load_digits(*[, n_class, return_X_y, as_frame])
5  Linnerrud dataset                 load_linnerud(*[, return_X_y, as_frame])
6  Wine dataset                      load_wine(*[, return_X_y, as_frame])
7  Breast cancer Wisconsin dataset   load_breast_cancer(*[, return_X_y, as_frame])

https://scikit-learn.org/stable/datasets.html
10
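As a minimal sketch of how these loaders are used (assuming scikit-learn is installed; the diabetes loader is picked arbitrarily), return_X_y=True returns the feature matrix and target vector directly:

    # Minimal sketch: calling one of the scikit-learn toy-dataset loaders.
    from sklearn.datasets import load_diabetes

    X, y = load_diabetes(return_X_y=True)   # feature matrix and target vector
    print(X.shape, y.shape)                 # (442, 10) (442,)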
https://scikit-learn.org

11
[Figure: k-Nearest Neighbors illustration — classifying a point by distance with k = 3 vs. k = 5]

12
13
14
15
 Diabetes dataset in scikit-learn
 The diabetes dataset consists of 10 physiological variables (age, sex, BMI, blood pressure, ...) measured on 442 patients, and an indication of disease progression after one year:

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
16
 Linear Regression with Diabetes dataset

17
 Linear Regression with Diabetes dataset

18
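The code for slides 17-18 is not reproduced in this text; below is a minimal sketch of what a linear-regression baseline on the diabetes data might look like (the split ratio and random_state are assumptions, not the slide's exact settings):

    # Sketch of a linear-regression baseline on the diabetes data (not the slide's exact code).
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("Test R^2 (regression score):", model.score(X_test, y_test))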
 Linear Regression with Diabetes dataset

 0530 Assignment ~ Problem 1

As you can see, the regression score is too low.
How can you get a higher regression score?
Write code that achieves a higher regression score, referring to the given code.
Explain in at least one line why you chose the method, and
explain the results by comparing the original code with your own.
19
 Decision Tree ~ another classification idea
• We have seen linear classification and nearest neighbors; is there another approach?
• Pick an attribute and do a simple test
• Conditioned on that choice, pick another attribute and do another test
• In the leaves, assign a class by majority vote
• Do the same for the other branches

• Gives axis-aligned decision boundaries


20
 The cost function of the CART algorithm is J(k, t_k).
 CART splits the data into two subsets using a threshold t_k on feature k.
 The pair (k, t_k) is found through a process that:
 • weights each subset according to its size,
 • splits into the purest possible subsets,
 • minimizes the cost J(k, t_k).

   J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right

   m_left / m_right : number of samples in the left/right subset
   G_left / G_right : impurity of the left/right subset
   G_i = 1 − Σ_{k=1..n} (p_{i,k})² , where p_{i,k} is the ratio of class-k samples among the samples in node i
21
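A minimal sketch of these two formulas in code (the helper names gini and cart_cost are hypothetical, not part of the lecture code):

    import numpy as np

    def gini(labels):
        # G = 1 - sum_k p_k^2, where p_k is the ratio of class-k samples in the node.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def cart_cost(left_labels, right_labels):
        # J(k, t_k) = (m_left / m) * G_left + (m_right / m) * G_right
        m_left, m_right = len(left_labels), len(right_labels)
        m = m_left + m_right
        return (m_left / m) * gini(left_labels) + (m_right / m) * gini(right_labels)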
 Quantifying Uncertainty:
   (a substitute measure for the impurity G_left / G_right)
22
 Entropy

23
[Figure: decision-tree split on sepal length vs. sepal width]

! Choose the attribute that gives the highest gain
! Choose the split with a low G score (or low entropy)

G score (Gini impurity):
   1 − (5/54)² − (49/54)² ≈ 0.168
Entropy:
   −(49/54) log2(49/54) − (5/54) log2(5/54) ≈ 0.445

24
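The two numbers above can be checked with a short computation (a sketch assuming the node holds 49 samples of one class and 5 of another):

    import numpy as np

    p = np.array([49, 5]) / 54            # class ratios in the node
    gini = 1 - np.sum(p ** 2)             # ≈ 0.168
    entropy = -np.sum(p * np.log2(p))     # ≈ 0.445
    print(round(gini, 3), round(entropy, 3))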
 The cost function of the CART algorithm for regression is J(k, t_k):

   J(k, t_k) = (m_left / m) · MSE_left + (m_right / m) · MSE_right

   m_node : number of samples in the node
   MSE_node = Σ_{i ∈ node} (ŷ_node − y^(i))²
   ŷ_node = (1 / m_node) Σ_{i ∈ node} y^(i)
25
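A minimal sketch of the regression split cost (the helper names are hypothetical; MSE_node is written as the sum of squared deviations from the node mean, exactly as in the formula above):

    import numpy as np

    def node_mse(y):
        # MSE_node = sum over the node of (y_hat_node - y_i)^2, with y_hat_node the node mean.
        return np.sum((y.mean() - y) ** 2)

    def cart_regression_cost(y_left, y_right):
        # J(k, t_k) = (m_left / m) * MSE_left + (m_right / m) * MSE_right
        m_left, m_right = len(y_left), len(y_right)
        m = m_left + m_right
        return (m_left / m) * node_mse(y_left) + (m_right / m) * node_mse(y_right)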
1. Hyperparameter Tuning: Adjusting hyperparameters like max_depth
(maximum depth of the tree), min_samples_split (minimum number of samples
required to split a node), and min_samples_leaf (minimum number of samples
required to be at a leaf node).
2. Feature Selection and Engineering: Choosing relevant features or creating
new features to help the model capture important information more effectively.
3. Pruning: Reducing the size of the decision tree to prevent overfitting. This
involves cutting back parts of the tree that provide little predictive power.
4. Data Preprocessing: Implementing scaling, outlier removal, or missing value
imputation to improve model performance.
5. Ensemble Techniques: Using ensemble methods like Random Forest or
Gradient Boosting Trees, which combine multiple decision trees to improve
performance.
6. Cross-validation: Dividing the dataset into multiple parts and training/testing
the model on these to assess and enhance its generalization ability.
7. Cost Complexity Tuning: Adjusting the complexity parameter to control the
growth of the tree, preventing it from becoming overly complex.
26
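As a hedged sketch of points 1 and 6 (hyperparameter tuning with cross-validation), a grid search over a DecisionTreeRegressor might look like this; the diabetes data is used as a stand-in and the grid values are assumptions:

    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    X, y = load_diabetes(return_X_y=True)   # stand-in data; use your own regression set

    # Hypothetical grid; the exact ranges are an assumption.
    param_grid = {
        "max_depth": [3, 5, 7, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 5],
    }
    search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print(search.best_params_, (-search.best_score_) ** 0.5)   # best params and CV RMSE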
 Random Forests can be regarded as ensemble learning with decision trees
 Instead of building a single decision tree and using it to make predictions, build many slightly different trees
 Combine their predictions using majority voting
 The two main concepts behind random forests are:
  The wisdom of the crowd — a large group of experts is collectively smarter than individual experts
  Diversification — a set of uncorrelated trees
 A supervised machine learning algorithm for:
  Classification (predicts a discrete-valued output, i.e. a class)
  Regression (predicts a continuous-valued output) tasks
27
 We have a single data set, so how do we obtain slightly different trees?
1. Bagging (Bootstrap Aggregating):
• Take random subsets of data points from the training set to create N smaller data sets
• Fit a decision tree on each subset
• Results in lower variance
2. Random Subspace Method (also known as Feature Bagging):
• Fit N different decision trees by constraining each one to operate on a random subset of features
28
[Figure: the training set is resampled into N subsets (with replacement)]

 Sampling with replacement from the training set is called bagging (short for bootstrap aggregating), while sampling without replacement is called pasting
29
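In scikit-learn this corresponds to the bootstrap flag of BaggingClassifier; a minimal sketch (the iris data and the parameter values are assumptions):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # bootstrap=True  -> bagging (sampling with replacement)
    # bootstrap=False -> pasting (sampling without replacement)
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                max_samples=0.8, bootstrap=True, random_state=0)
    print(bagging.fit(X, y).score(X, y))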
[Figure: a test sample is classified by the ensemble with 75% confidence]
30
 Using all training samples (bootstrap=False, max_samples=1.0) while sampling features (bootstrap_features=True, max_features < 1.0) is referred to as the Random Subspace method.

[Figure: feature sampling on the training data]

• Feature sampling creates more diverse predictors, reducing variance while increasing bias.
31
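A hedged sketch of that configuration with BaggingClassifier (the dataset and the values not named on the slide are assumptions):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Random Subspace method: every tree sees all samples but only a random subset of features.
    random_subspace = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                        bootstrap=False, max_samples=1.0,
                                        bootstrap_features=True, max_features=0.5,
                                        random_state=0)
    print(random_subspace.fit(X, y).score(X, y))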
[Figure: a test sample is classified by the ensemble with 66% confidence]
32
[Figure: combining Tree 1, Tree 2, ..., Tree N into a Random Forest]
33
For b = 1 to B:
  (a) Draw a bootstrap sample Z* of size N from the training data.
  (b) Grow a random-forest tree on the bootstrapped data by recursively repeating the
      following steps for each terminal node of the tree, until the minimum node size
      n_min is reached:
        i. Select m variables at random from the p variables.
        ii. Pick the best variable/split-point among the m.
        iii. Split the node into two daughter nodes.
Output the ensemble of trees.

 To make a prediction at a new point x:

• For regression: average the results
• For classification: majority vote

34
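A minimal scikit-learn sketch of the same recipe (n_estimators plays the role of B, max_features the role of m; the concrete values and the diabetes data are assumptions):

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # B bootstrap samples -> n_estimators; m variables per split -> max_features.
    forest = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)
    print("Test R^2:", forest.score(X_test, y_test))   # regression: tree outputs are averaged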
35
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose] 36
[J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose] 37
 https://www.kaggle.com/datasets/harinir/petrol-consumption

 4 features and 1 target value:


① Petrol tax
② Average income
③ Paved highways
④ Population driver license
⑤ Petrol Consumption

38
 https://www.kaggle.com/datasets/harinir/petrol-consumption
 4 features and 1 target value:
① Petrol tax
② Average income
③ Paved highways
④ Population driver license
⑤ Petrol Consumption

39
40
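The petrol-consumption code the assignment refers to is not reproduced in this text; a minimal sketch of that kind of pipeline is given below (the CSV file name, split ratio, and column name are assumptions based on the Kaggle dataset):

    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical file name; download the CSV from the Kaggle link above.
    df = pd.read_csv("petrol_consumption.csv")
    X = df.drop(columns=["Petrol_Consumption"])
    y = df["Petrol_Consumption"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    pred = tree.predict(X_test)
    print("MAE :", mean_absolute_error(y_test, pred))
    print("MSE :", mean_squared_error(y_test, pred))
    print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))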
 0530 Assignment ~ Problem 2
As you can see, the metrics (Mean Absolute Error / Mean Squared Error / Root Mean Squared Error) are too high.
How can you get a lower error rate?
Write code that achieves a lower error rate, referring to the given code.
Explain in at least one line why you chose the method, and
explain the results by comparing the original code with your own.

Hint: refer to slide 25.


You can tune the model's hyperparameters, such as max_depth, max_features, etc.
- Reference:
sklearn.tree.DecisionTreeRegressor — scikit-learn 1.3.2 documentation
41
 VC Dimension
• The largest number of datapoints that a binary classifier can shatter (i.e. classify correctly under every possible labeling) is called the Vapnik-Chervonenkis (VC) dimension.
• The model does not need to shatter all sets of datapoints of size h. One set is sufficient.

 For planes in 3-D, h = 4 even though 4 co-planar points cannot be shattered.

The SVM slides are based on Geoffrey Hinton's Lecture Notes 2008, p.33 – p.46
42


• Suppose our model class is a hyperplane.
• In 2-D, we can find a plane (i.e. a line) to deal with any labeling of three points: a 2-D hyperplane shatters 3 points.

 But we cannot deal with some of the possible labelings of four points: a 2-D hyperplane (i.e. a line) does not shatter 4 points.
43
• The VC dimension of a hyperplane in 2-D is 3. In k dimensions it is k+1.
• It's just a coincidence that the VC dimension of a hyperplane is almost identical to the number of parameters it takes to define a hyperplane.
• A sine wave has infinite VC dimension and only 2 parameters! By choosing the phase and period carefully we can shatter any random collection of one-dimensional datapoints (except for nasty special cases).

   f(x) = a sin(bx)

44
 The line that maximizes the minimum margin is a good bet.
 The model class of "hyperplanes with a margin of m" has a low VC dimension if m is big.
 This maximum-margin separator is determined by a subset of the datapoints.
 Datapoints in this subset are called "support vectors".
 It will be useful computationally if only a small fraction of the datapoints are support vectors, because we use the support vectors to decide which side of the separator a test case is on.

[Figure: the support vectors are indicated by the circles around them]
45
 To find the maximum-margin separator, we have to solve the following optimization problem:

   w · x_c + b > +1  for positive cases
   w · x_c + b < −1  for negative cases
   and ||w||² is as small as possible

 This is tricky but it's a convex problem. There is only one optimum and we can find it without fiddling with learning rates or weight decay or early stopping.
 Don't worry about the optimization problem. It has been solved. It's called quadratic programming.
 It takes time proportional to N², which is really bad for very big datasets
• so for big datasets we end up doing approximate optimization!
46
 The separator is defined as the set of points for which:

   w · x + b = 0

   so if w · x_c + b > 0, say it's a positive case,
   and if w · x_c + b < 0, say it's a negative case.

47
 Use a much bigger set of features.
• This looks as if it would make the computation hopelessly slow, but in the next part of the lecture we will see how to use the "kernel" trick to make the computation fast even with huge numbers of features.

 Extend the definition of maximum margin to allow non-separating planes.
• This can be done by using "slack" variables.

48
 Slack variables are constrained to be non-negative. When they are greater than zero they allow us to cheat by putting the plane closer to the datapoint than the margin. So we need to minimize the amount of cheating. This means we have to pick a value for lambda (this sounds familiar!)

   w · x_c + b ≥ +1 − ξ_c  for positive cases
   w · x_c + b ≤ −1 + ξ_c  for negative cases
   with ξ_c ≥ 0 for all c

   and  ||w||² / 2 + λ Σ_c ξ_c  as small as possible
49
 SKLearn SVM hyperparameter C

[Figure: as C decreases, the margin increases]

50
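A minimal sketch of the effect of C with scikit-learn's SVC (the iris data and the C values are assumptions; smaller C gives a softer, wider margin with more support vectors):

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # More support vectors typically indicates a wider (softer) margin.
        print(f"C={C:>6}: support vectors per class = {clf.n_support_}")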
 All of the computations that we need to do to find the maximum-
margin separator can be expressed in terms of scalar products
between pairs of datapoints (in the high-dimensional feature
space).
 These scalar products are the only part of the computation that
depends on the dimensionality of the high-dimensional space.
• So if we had a fast way to do the scalar products we would not
have to pay a price for solving the learning problem in the
high-D space.
 The kernel trick is just a magic way of doing scalar products a
whole lot faster than is usually possible.
• It relies on choosing a way of mapping to the high-
dimensional feature space that allows fast scalar products.

51
 For many mappings from a low-D space to a high-D space, there is a simple operation on two vectors in the low-D space that can be used to compute the scalar product of their two images in the high-D space.

   K(x_a, x_b) = φ(x_a) · φ(x_b)

[Figure: letting the kernel do the work on x_a, x_b in low-D vs. doing the scalar product in the obvious way on φ(x_a), φ(x_b) in high-D]
52
 The final classification rule is quite simple:

   bias + Σ_{s ∈ SV} w_s K(x_test, x_s) > 0 ,  where SV is the set of support vectors

 All the cleverness goes into selecting the support vectors that maximize the margin and computing the weight to use on each support vector.
 We also need to choose a good kernel function and we may need to choose a lambda for dealing with non-separable cases.
53
 SKLearn SVM kernels
• poly
• rbf

   Polynomial:                      K(x, y) = (x · y + 1)^p
   Gaussian radial basis function:  K(x, y) = exp(−||x − y||² / 2σ²)
   Neural net:                      K(x, y) = tanh(k x · y − δ)

   (p, σ, k and δ are parameters that the user must choose.)

For the neural network kernel, there is one "hidden unit" per support vector, so the process of fitting the maximum-margin hyperplane decides how many hidden units to use. Also, it may violate Mercer's condition.

54
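A hedged sketch of these kernels as exposed by scikit-learn's SVC; the 'sigmoid' kernel corresponds to the neural-net kernel above, and the parameter values are assumptions:

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    models = {
        "poly":    SVC(kernel="poly", degree=3, coef0=1),   # (gamma * x.y + coef0)^degree
        "rbf":     SVC(kernel="rbf", gamma=0.5),            # exp(-gamma * ||x - y||^2)
        "sigmoid": SVC(kernel="sigmoid", coef0=-1.0),       # tanh(gamma * x.y + coef0)
    }
    for name, clf in models.items():
        print(name, clf.fit(X, y).score(X, y))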
Python Class    Big O                 Supports external-memory learning   Needs scale control   Kernel trick
LinearSVC       O(m × n)              No                                   Yes                   No
SGDClassifier   O(m × n)              Yes                                  Yes                   No
SVC             O(m² × n) ~ O(m³ × n) No                                   Yes                   Yes

55
 The dataset is from sizeKorea. https://sizekorea.kr

56
 The dataset is from sizeKorea. https://sizekorea.kr

Calculate the number of null values in each column and their sum

57
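A minimal pandas sketch of that step (the file name is a placeholder for the sizeKorea export):

    import pandas as pd

    # Placeholder file name; use the actual sizeKorea export.
    df = pd.read_csv("sizekorea_body_measurements.csv")

    null_per_column = df.isnull().sum()           # number of nulls in each column
    print(null_per_column)
    print("Total nulls:", null_per_column.sum())  # grand total across all columns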
 Translate the data headers from Korean into English using Google Translate

58
59
 As you can see, this is the code I've tried previously.

https://scikit-learn.org/stable/modules/svm.html
60
 0530 Assignment ~ Problem 3
Update the SVM algorithm for the human body size data using your own optimization methods. You may use other kinds of filters and other hyper-parameters for the SVM algorithm.
When you report the result, you should state which options you changed and compare the improvement over the original result.

61
김은희 ([email protected])
TA: Yesim Selcuk([email protected])

62
