Introduction to machine learning and regression
András Horváth
Pázmány University, 2017-2018

This presentation is available at:
http://users.itk.ppke.hu/~horan/big_data
The code is available at:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2217267761988785/1705666210822011/2819141955166097/latest.html
Machine learning, machine intelligence

What is intelligence?
The ability to acquire and apply knowledge and skills.
Intelligence is the ability to adapt to change.

Machine learning: providing computers the ability to learn without being explicitly programmed.
It involves programming, computational statistics, mathematical optimization, image processing, natural language processing, etc.
Conquests of machine learning

1952, Arthur Samuel (IBM): the first machine learning program, playing checkers.
Arthur Samuel also coined the term "machine learning".
1997, IBM Deep Blue beats Kasparov:
First match (November 1996): Kasparov–Deep Blue 4–2
Second match (May 1997): Deep Blue–Kasparov 3½–2½
2011, IBM Watson: beats human champions in Jeopardy!
Example clues and answers:
"It's a 4-letter term for a summit; the first 3 letters mean a type of simian": Apex
"4-letter word for a vantage point or a belief": View
"Music fans wax rhapsodic about this Hungarian's 'Transcendental Etudes'": Franz Liszt
"While Maltese borrows many words from Italian, it developed from a dialect of this Semitic language": Arabic
2014, Facebook's DeepFace algorithm:
Reached 97.35% accuracy on face verification; human performance is around 97%.
2016, AlphaGo: deep learning
Beat Fan Hui 5–0 and Lee Sedol 4–1
99.8% win rate against other Go programs
Types of machine learning

Supervised learning and unsupervised learning.

Supervised:
We know the perfect, desired output on the training set.
Classification and regression.

Unsupervised:
All we have is data and no labels. We have to identify rules in the structure of the data.
Clustering and association.
Classification: a classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".

Regression: a regression problem is when the output variable is a real value, such as "dollars" or "weight".

Clustering: a clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.

Association: an association rule learning problem is where you want to discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y".

Semi-supervised: some data is labeled but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.
Supervised classification

The most important conquests of deep learning come from supervised classification.
There have been three major enablers:
- New methods and technologies
- Powerful hardware for training (GPGPUs)
- Vast amounts of available data
Supervised regression

Our aim is to predict a target value (Y) based on our data (X).
We have to find a model that explains how Y can be derived from X: m(X) = Y.
A perfect fit is usually not possible, because our world is not perfect:
- Our model has flaws
- There is noise in our observations
We have to fit a model which minimizes the error on the dataset, for example the L1 or L2 error (see the formulas below).
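The slides do not spell the formulas out, but for a dataset of N samples the two errors are conventionally written as:

```latex
% L1 (mean absolute) and L2 (mean squared) error of a model m
\mathrm{L1}(m) = \frac{1}{N}\sum_{i=1}^{N} \bigl| m(x_i) - y_i \bigr|
\qquad
\mathrm{L2}(m) = \frac{1}{N}\sum_{i=1}^{N} \bigl( m(x_i) - y_i \bigr)^2
```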
Correlation and cause-effect relationship

A large correlation in the data does not necessarily mean that one variable is caused by the other.
One has to gather a large amount of data to rule out a coincidence by luck.
One has to understand the data.
Steps of machine learning

The same steps apply to supervised and unsupervised learning:
1. Gather data
2. Understand your data
3. Data preparation
4. Choose a model
5. Training and evaluation
6. Parameter tuning
7. Go back to step 1
Gathering data

Let's take an example dataset, the Boston house prices dataset:
- A typical regression problem, good for trying out different machine learning models
- There are many available solutions for this dataset
- A standard dataset for model evaluation, bundled with Python libraries
It is easy to load the data, as sketched below.
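A minimal loading sketch with scikit-learn. Note that `load_boston` shipped with scikit-learn when these slides were written but was removed in scikit-learn 1.2, so this assumes an older version:

```python
# Load the Boston house prices dataset with scikit-learn.
# load_boston was removed in scikit-learn 1.2; this sketch assumes
# an older version, matching the era of these slides.
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data    # 506 x 13 feature matrix
Y = boston.target  # 506 target values: median house price in $1000s
print(X.shape, Y.shape)      # (506, 13) (506,)
print(boston.feature_names)  # CRIM, ZN, INDUS, ...
```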
Understanding the data

The dataset contains 506 rows (data points) and 14 columns.
13 features describe the properties (X):
- per capita crime rate by town
- proportion of residential land zoned for lots over 25,000 sq. ft.
- proportion of non-retail business acres per town
- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nitric oxides concentration (parts per 10 million)
- average number of rooms per dwelling
- proportion of owner-occupied units built prior to 1940
- weighted distances to five Boston employment centers
- index of accessibility to radial highways
- full-value property-tax rate per $10,000
- pupil-teacher ratio by town
- 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
- % lower status of the population
The 14th column is the target value (Y).
Our aim is to predict Y from X.
Data preparation

Write out some values, as sketched below.
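One way to do this (a sketch assuming pandas is available, continuing from the loading code above):

```python
# Put the features and target into a DataFrame and inspect the raw values.
import pandas as pd

df = pd.DataFrame(X, columns=boston.feature_names)
df["PRICE"] = Y
print(df.head())      # first five rows
print(df.describe())  # per-column count, mean, std, min, max, quartiles
```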
Plot some values as histograms, as sketched below.
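A possible matplotlib sketch (the bin count is an arbitrary choice):

```python
# Histogram of every column, one subplot per feature.
import matplotlib.pyplot as plt

df.hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
```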
Plot some values related to the average price, as sketched below.
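A sketch plotting two features against the price; the choice of RM (rooms per dwelling) and LSTAT (% lower status) is only illustrative:

```python
# Scatter plots of selected features against the price.
import matplotlib.pyplot as plt

for feature in ["RM", "LSTAT"]:
    plt.scatter(df[feature], df["PRICE"], s=10)
    plt.xlabel(feature)
    plt.ylabel("PRICE ($1000s)")
    plt.show()
```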
Plot the correlation between the different features, as sketched below.
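A sketch of the correlation matrix shown as a heatmap:

```python
# Pairwise correlation between all columns, shown as a heatmap.
import matplotlib.pyplot as plt

corr = df.corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()
```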
Evaluation of our method: the bias-variance problem

We have to simultaneously minimize two different errors:
- Bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- Variance is error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting).
If we fit our data perfectly on the training set, the model might not be general. Our aim is to have a general model.
Let's reserve some of our data to see how general our model is (see the split sketch below).
Once we have a good model, we should re-run the training and evaluation on differently divided datasets, and the accuracy should be consistent. This is called cross-validation.
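A sketch of the hold-out split with scikit-learn (the 80/20 ratio is an arbitrary but common choice):

```python
# Reserve 20% of the data to measure how well the model generalizes.
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)
```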
Scaling features

The features in our data can differ by orders of magnitude.
Some algorithms can handle this fact, some cannot.
Even those that can learn these differences work faster and better with scaled data.
Let's transform all features so that each has zero mean and a variance of 1 (see the sketch below).
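With scikit-learn this is a single transformer; fitting it on the training split only avoids leaking test-set statistics:

```python
# Standardize every feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data only
X_test_s = scaler.transform(X_test)        # reuse the training statistics
```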
Detecting outliers

It really helps to have a look at the data:
- One can see that the prices are capped at $50K.
- These elements will not fit our model.
- They are outliers, and in real life it is not good to test the model on them.
In this case the values are also thresholded in the test data, so it is good if the model can learn this fact.
Linear regression

We assume that there is a correlation in our data (X and Y) which can be described as:
Y = wX + b
We want to minimize the error between the predictions wX + b and the targets Y, with respect to w and b.
There is an analytical formula to find the best w and b: we have to find the local extremum of the error function.
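The slides leave the derivation implicit; for the one-dimensional least-squares case, setting the derivatives of the squared error to zero gives the familiar closed form:

```latex
% Minimize E(w, b) = \sum_i (w x_i + b - y_i)^2 by solving
% \partial E / \partial w = 0 and \partial E / \partial b = 0:
w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
b = \bar{y} - w \bar{x}
```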
Linear regression

Fitting the model Y = wX + b in scikit-learn takes only a few lines of code, as sketched below.
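A sketch continuing from the scaled splits above:

```python
# Ordinary least squares on the scaled training data,
# evaluated on the held-out test set.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin = LinearRegression().fit(X_train_s, Y_train)
pred = lin.predict(X_test_s)
print("test MSE:", mean_squared_error(Y_test, pred))
print("test R^2:", lin.score(X_test_s, Y_test))
```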
Linear regression - Lasso

One can easily overfit the data: there is noise on our input data, and some variables might be more reliable than others.
How can we select the important variables?
Lasso (Least Absolute Shrinkage and Selection Operator) penalizes large values in the regressor: for the same output accuracy, a solution where the values in w are smaller and more uniform is preferred, and the L1 penalty drives unimportant weights to zero.
The parameter of the algorithm is the penalty constant.
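A sketch with scikit-learn; the alpha value is only illustrative:

```python
# Lasso: L1-penalized linear regression. A larger alpha drives more
# coefficients exactly to zero, which acts as feature selection.
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1).fit(X_train_s, Y_train)
print("nonzero coefficients:", (lasso.coef_ != 0).sum())
print("test R^2:", lasso.score(X_test_s, Y_test))
```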
Linear regression – Ridge, Elastic Net

One can easily overfit the data: there is noise on our input data, and some variables might be more reliable than others.
Ridge and ElasticNet use more complex penalization on the parameters (Tikhonov regularization): Ridge penalizes the squared weights, and ElasticNet combines the L1 and L2 penalties.
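A sketch; alpha and l1_ratio are illustrative values:

```python
# Ridge (L2 penalty) and ElasticNet (a mix of L1 and L2 penalties).
from sklearn.linear_model import Ridge, ElasticNet

ridge = Ridge(alpha=1.0).fit(X_train_s, Y_train)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train_s, Y_train)
print("Ridge test R^2:", ridge.score(X_test_s, Y_test))
print("ElasticNet test R^2:", enet.score(X_test_s, Y_test))
```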
Non-linear regression

How could we fit a more complex model?
We could fit any model according to the previous formula:
Y = m(X)
where we can differentiate the model with respect to all the parameters, and we get an equation for every parameter.
We could fit any model if it were known; unfortunately, in practice the problem is that we do not know the model.
How could we create a general model which can approximate any possible model?
Non-linear regression

We can approximate any function with piecewise linear functions.
We can divide the problem into sub-domains depending on the input values and use separate (or combined) linear regressors to approximate the original function.
Ensemble regression

We can approximate any function with piecewise linear functions.
The approximation is a weighted sum of linear models, where the weights are determined by the values of the data (see the sketch below).
http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
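The linked post explains gradient boosting; a scikit-learn sketch of such an ensemble (hyperparameters are illustrative):

```python
# Gradient boosting: a weighted sum of small regression trees,
# each one fit to the residual error of the ensemble so far.
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                learning_rate=0.1)
gbr.fit(X_train_s, Y_train)
print("test R^2:", gbr.score(X_test_s, Y_test))
```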
Neural networks

A weighted combination of linear regressors is still linear regression:
C1(w1x+b1) + C2(w2x+b2) + C3(w3x+b3) + C4(w4x+b4) + C5(w5x+b5) + C6(w6x+b6)
What does a neuron do?
A neuron applies a nonlinearity f to a weighted sum of its inputs. A feed-forward neural network stacks layers of such neurons:
f(w3 f(w2 f(w1 x + b1) + b2) + b3)
How can we define this as mathematical operations? Each layer is a matrix-vector product followed by an element-wise nonlinearity, as sketched below.
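A minimal numpy sketch of the forward pass above; the layer sizes and the sigmoid nonlinearity are illustrative assumptions:

```python
# Forward pass of a 3-layer feed-forward network:
# y = f(W3 @ f(W2 @ f(W1 @ x + b1) + b2) + b3)
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid nonlinearity

rng = np.random.default_rng(0)
sizes = [13, 32, 16, 1]              # input dim matches the 13 features
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    # apply f(W x + b) layer by layer, matching the formula above
    for W, b in zip(Ws, bs):
        x = f(W @ x + b)
    return x

print(forward(np.zeros(13)))         # prediction for a dummy input
```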
A feed-forward network is great, but how can we teach such a network?
- Backpropagation algorithm: the error gradient is propagated backwards through the layers using the chain rule
- Stochastic gradient descent: the parameters are updated with gradients computed on small random subsets of the training data
TensorFlow is here to help: it derives the gradients automatically (see the sketch below).
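A sketch of a small regression network using TensorFlow's Keras API (the original slides may have used the older graph-based API; layer sizes and optimizer settings are illustrative):

```python
# A small feed-forward regression network trained with SGD;
# TensorFlow derives the backpropagation gradients automatically.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(13,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # linear output for regression
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="mse")
model.fit(X_train_s, Y_train, epochs=50, batch_size=32,
          validation_data=(X_test_s, Y_test))
```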
Regularization:
- Using batches
- Adding dropout
An example of adding dropout is sketched below.
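Dropout layers can be inserted between the dense layers; the 0.5 rate is an illustrative value:

```python
# Dropout randomly zeroes activations during training, which
# discourages co-adaptation of neurons and reduces overfitting.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(13,)),
    tf.keras.layers.Dropout(0.5),  # drop 50% of activations while training
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])
```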