Deep Learning
Kairit Sirts
Lecture at TUT, 19.12.2016
Outline
• What can be done with deep learning?
• Deep learning demystified
• How can you get started with deep learning?
Why deep learning?
[Figure: comparison of deep learning, gradient boosting, random forest and linear models]
http://www.infoworld.com/article/3003315/big-data/deep-learning-a-brief-guide-for-practical-problem-solvers.html
What can be done with deep learning?
Handwritten digit recognition
MNIST benchmark dataset
The best reported error rate is 0.21%
Street view number recognition
• Obtained from house numbers in Google Street View images
• The best reported error rate is 1.69%
Image classification
10 object classes (the CIFAR-10 benchmark)
6000 labeled instances for each class
Best reported accuracy so far: 96.53%
Image classification
20 superclasses (the CIFAR-100 benchmark)
100 fine-grained classes
600 labeled images per class
Best reported classification accuracy: 75.72%
Detecting doodles
https://quickdraw.withgoogle.com
Google has launched other simple and fun AI experiments as well:
https://aiexperiments.withgoogle.com
Image captioning
Image captioning – not so great results
Automatic colorization of images
http://richzhang.github.io/colorization/resources/images/teaser3.jpg
Automatic colorization of images – failure cases
DeepDream
https://deepdreamgenerator.com
Word embeddings
http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png
Word embeddings
[Embedding visualization: related words cluster together – months, weekdays, numbers]
Word embeddings
• W(man) − W(woman) ≈ W(king) − W(queen)
• W(walking) − W(walked) ≈ W(swimming) − W(swam)
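These analogy relations can be checked directly with vector arithmetic. A minimal sketch using gensim and a pretrained embedding file (the file name is a placeholder for whatever word2vec-format vectors you have):

```python
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format embedding file works here.
vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

# W(king) - W(man) + W(woman) should land near W(queen).
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```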
Automatic text generation – pseudo Shakespeare
http://karpathy.github.io/2015/05/21/rnn-effectiveness
Machine translation
• Google Translate app
Learning to play Atari Arcade games
https://www.youtube.com/watch?v=cjpEIotvwFY
AlphaGo
https://www.youtube.com/watch?v=PQCrX1sQSzY
Other tasks tackled with deep neural networks
• Speech recognition
• Various tasks in robotics
• Log analysis/risk detection
• Recommendation systems
• Motion detection from videos
• Business and economics analytics
• Etc …
Deep learning demystified
How does deep learning work?
• Biological neuron vs. artificial neuron
http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7
• Biological neural network vs. artificial neural network
https://www.eeweb.com/blog/rob_riemen/deep-machine-learning-and-the-google-brain
http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7
What happens inside a neuron?
Net input: z = x₁w₁ + x₂w₂ + ⋯ + xₙwₙ = Σᵢ xᵢwᵢ (sum over i = 1…n)
Output: h = f(z)
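As a concrete illustration, here is a single artificial neuron in numpy (the inputs, weights and the choice of sigmoid are made-up example values):

```python
import numpy as np

def neuron(x, w, f):
    z = np.dot(x, w)   # net input: z = sum_i x_i * w_i
    return f(z)        # output: h = f(z)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.0, 2.0])   # made-up inputs
w = np.array([0.1, 0.4, 0.3])    # made-up weights
print(neuron(x, w, sigmoid))     # ~0.562
```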
Activation function
• Threshold: f(z) = 1 if z ≥ th, 0 if z < th
• Sigmoid: f(z) = 1 / (1 + e⁻ᶻ)
• Tanh: f(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)
• ReLU: f(z) = max(0, z)
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/neural_networks.html
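The same four activation functions as numpy code, for experimentation:

```python
import numpy as np

def step(z, th=0.0):
    return np.where(z >= th, 1.0, 0.0)   # threshold / step function

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes z into (0, 1)

def tanh(z):
    return np.tanh(z)                    # squashes z into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)            # rectified linear unit
```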
Single neuron logic gates
• Threshold activation function
https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/
XOR gate
• Cannot be done with a single neuron
• A hidden layer is necessary
The hidden layer computes OR and NAND; the output unit computes the AND of those two hidden outputs:
x₁ x₂ | OR                      | NOT AND (NAND)                 | AND → XOR
0  0  | 𝕀(0·1 + 0·1 > 0.5) = 0  | 𝕀(0·(−1) + 0·(−1) > −1.5) = 1  | 𝕀(0·1 + 1·1 > 1.5) = 0
0  1  | 𝕀(0·1 + 1·1 > 0.5) = 1  | 𝕀(0·(−1) + 1·(−1) > −1.5) = 1  | 𝕀(1·1 + 1·1 > 1.5) = 1
1  0  | 𝕀(1·1 + 0·1 > 0.5) = 1  | 𝕀(1·(−1) + 0·(−1) > −1.5) = 1  | 𝕀(1·1 + 1·1 > 1.5) = 1
1  1  | 𝕀(1·1 + 1·1 > 0.5) = 1  | 𝕀(1·(−1) + 1·(−1) > −1.5) = 0  | 𝕀(1·1 + 0·1 > 1.5) = 0
https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/
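A minimal numpy sketch of this two-layer XOR construction, using the hand-picked weights and thresholds from the table above (no training involved):

```python
import numpy as np

def step(z, th):
    return float(z > th)   # indicator I(z > th)

def xor(x1, x2):
    x = np.array([x1, x2], dtype=float)
    h_or   = step(x @ np.array([1.0, 1.0]), 0.5)     # OR unit
    h_nand = step(x @ np.array([-1.0, -1.0]), -1.5)  # NAND unit
    # The output unit ANDs the two hidden units, which gives XOR.
    return step(h_or * 1.0 + h_nand * 1.0, 1.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", int(xor(a, b)))
```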
How to assign weights?
8·9 + 9·9 + 9·9 + 9·4 = 270 weights
http://neuralnetworksanddeeplearning.com/
Backpropagation
• Standard and efficient method for training neural networks
• The general idea:
• Compute the error with a forward pass
• Propagate the error back and adjust the weights so that the error becomes smaller
ERROR → ERROR′,  where ERROR′ < ERROR
Diversion to calculus – the derivative
• y′ = f′(x)
• The derivative is the slope of the tangent line
• It is the rate of change when moving in the direction of steepest ascent
Derivatives
• When f′(x) = 0, the point is a local or global maximum or minimum, or a saddle point
• When f′(x) > 0, the function is increasing
• When f′(x) < 0, the function is decreasing
Gradients
• Generalization of derivatives to multivariate functions
• The gradient is a vector pointing in the direction of steepest ascent:
∇f(x, y) = (∂f/∂x, ∂f/∂y)
• ∂f/∂x and ∂f/∂y are partial derivatives – the derivative wrt one variable while treating all others as constants
Gradients and backpropagation
• Backpropagation is used to compute the gradients with respect to all parameters in a neural network.
• The gradients are then used in gradient descent, a general method for minimizing functions.
• We want to minimize the cost function that measures the error made by the neural network.
• To do that, we move in the direction of steepest descent given by the gradients.
Gradient descent
• An iterative algorithm
• Start with initial parameter values θ⁰
• Update the parameters iteratively until convergence:
θ^(t+1) := θ^t − α∇f(θ^t)
• α is the learning rate; it controls the step size
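A minimal sketch of this update rule on a toy function (f(θ) = θ², whose gradient is 2θ; the function, starting point and learning rate are made-up choices):

```python
def gradient_descent(grad_f, theta, alpha=0.1, n_steps=100):
    # Repeatedly apply: theta <- theta - alpha * grad_f(theta)
    for _ in range(n_steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# Minimize f(theta) = theta**2; its gradient is 2*theta.
print(gradient_descent(lambda t: 2 * t, theta=5.0))   # converges towards 0
```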
Deep learning demystified
How does backpropagation work?
Backpropagation explained
• Example from: https://mattmazur.com/2015/03/17/
• 2 inputs
• 1 hidden layer with 2 neurons
• Bias terms in both the hidden and output layers
• 2 outputs
Initial configuration
• Training values
• Initial weights: w₁, …, w₈
• Initial biases: b₁, b₂
Forward pass – first hidden unit
Forward pass – second hidden unit
Forward pass – first output unit
Forward pass – second output unit
Forward pass – error of the first output
Forward pass – output error
Backwards pass
• Consider w₅
• How much does a change in w₅ affect the total error?
• Apply the chain rule:
Chain rule
• Formula for computing the derivative of the composition of two or more functions
• F(x) ≡ f(g(x)) ≡ (f ∘ g)(x) – the composition of functions f and g
• F′(x) = f′(g(x)) · g′(x)
• Example: F(x) = e³ˣ, with g(x) = 3x and f(g(x)) = e^g(x) = e³ˣ
• F′(x) = f′(g(x)) · g′(x) = e^g(x) · (3x)′ = e³ˣ · 3 = 3e³ˣ
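A quick numerical sanity check of this example, comparing the analytic derivative 3e³ˣ against a finite-difference estimate:

```python
import math

F  = lambda x: math.exp(3 * x)        # F(x) = e^(3x)
dF = lambda x: 3 * math.exp(3 * x)    # chain rule result: F'(x) = 3e^(3x)

x, eps = 0.7, 1e-6
numeric = (F(x + eps) - F(x - eps)) / (2 * eps)   # central difference
print(dF(x), numeric)   # the two values agree to several decimals
```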
How much does error change wrt the output?
How much does output change wrt its net input?
Derivative of the sigmoid function
f(z) = 1 / (1 + e⁻ᶻ)
f′(z) = f(z)(1 − f(z))
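This identity makes backpropagation through sigmoid units cheap: the derivative is computed directly from the already-known output. A quick check:

```python
import math

f  = lambda z: 1.0 / (1.0 + math.exp(-z))   # sigmoid
df = lambda z: f(z) * (1.0 - f(z))          # derivative via the identity

z, eps = 0.3, 1e-6
print(df(z), (f(z + eps) - f(z - eps)) / (2 * eps))   # both ~0.2445
```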
How much does output change wrt its net input?
How much does net input change wrt w₅?
Putting it all together
This is known as the delta rule
• The delta rule is the gradient descent rule for updating the weights of the inputs to neurons in a single-layer neural network
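For reference, with the squared-error cost E = ½(target − out)² and out = f(net), the delta rule update for a weight wᵢ can be written as (a standard formulation; the slide's exact notation is not reproduced here):

```latex
\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}
           = \alpha \,(\mathrm{target} - \mathrm{out})\, f'(\mathrm{net})\, x_i
```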
Apply the delta rule to the output layer weights
Update the weights with gradient descent
• Set learning rate α = 0.5
θ^(t+1) := θ^t − α∇f(θ^t)
Backpropagation to hidden layer
• Continue the backwards pass to calculate new values for w₁, w₂, w₃ and w₄
BP through hidden layer
• out_h1 affects both o₁ and o₂ and thus needs to take both into account:
BP through hidden layer
• Consider one of these:
• The first term can be calculated using values computed before:
• The second term is just w₅
BP through hidden layer
• Plug the values in:
• Compute the same value for o₂:
• Compute the total:
BP through hidden layer
• Next we need ∂out_h1/∂net_h1 and ∂net_h1/∂w for each weight w
• Compute the partial derivative wrt a weight
BP through hidden layer
• Putting it together
• We can now update w₁
BP through hidden layer
• Compute the partial derivatives in the same way for w₂, w₃ and w₄
• Update w₂, w₃ and w₄
After the first update with backpropagation
Did the error decrease?
• The old error was 0.298371109
• Improvement: 0.007343335
• After 10000 updates the error will be ca 0.000035085
• The generated outputs will then be 0.015912196 for the 0.01 target and 0.984065734 for the 0.99 target
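The whole walkthrough fits in a short numpy script. The sketch below uses the concrete numbers from the linked mattmazur.com post (inputs 0.05 and 0.10, targets 0.01 and 0.99, weights w₁…w₈ initialized to 0.15…0.55, biases 0.35 and 0.60) and reproduces the error values quoted above; as in the post, the biases are left unchanged:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Concrete values taken from https://mattmazur.com/2015/03/17/
x = np.array([0.05, 0.10])             # inputs
t = np.array([0.01, 0.99])             # targets
W1 = np.array([[0.15, 0.20],           # weights into h1, h2
               [0.25, 0.30]])
W2 = np.array([[0.40, 0.45],           # weights into o1, o2
               [0.50, 0.55]])
b1, b2, alpha = 0.35, 0.60, 0.5

for step in range(10000):
    # Forward pass.
    out_h = sigmoid(W1 @ x + b1)
    out_o = sigmoid(W2 @ out_h + b2)
    error = 0.5 * np.sum((t - out_o) ** 2)
    if step == 0:
        print("initial error:", error)            # ~0.298371109
    # Backward pass: deltas are dE/dnet for each layer.
    delta_o = (out_o - t) * out_o * (1 - out_o)
    delta_h = (W2.T @ delta_o) * out_h * (1 - out_h)
    # Gradient descent updates (biases stay fixed, as in the post).
    W2 -= alpha * np.outer(delta_o, out_h)
    W1 -= alpha * np.outer(delta_h, x)

print("final error:", error)                      # ~0.000035
print("outputs:", out_o)                          # ~[0.0159, 0.9841]
```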
In conclusion
• Neural networks consist of artificial neurons organized into layers and connected to each other with learnable weights.
• Backpropagation with gradient descent is the standard method for training neural networks.
• Backpropagation can be used to compute the gradients of a neural network, regardless of the depth of the network.
• Of course, there are other important tricks and tips, but this is the basis for understanding neural networks and deep learning.
Common neural network architectures
Feed-forward network
• Simplest type of neural network
• Connections between units do not form cycles
• Information always moves in one direction; it never goes backwards
https://upload.wikimedia.org/wikipedia/en/5/54/Feed_forward_neural_net.gif
Recurrent neural network
• Connections between units form cycles
• They possess internal memory – they “remember” the past inputs
• Suitable for modeling sequential/temporal data, such as text and language data
Convolutional neural networks
• Convolutional layers have neurons arranged in 3 dimensions
• Especially suitable for processing image data
http://parse.ele.tue.nl/education/cluster2
Autoencoders
• The output layer attempts to reconstruct the input
• Used for unsupervised feature learning
• The hidden layer typically has fewer neurons, thus performing data compression
Getting started with neural networks
Courses and tutorials
• https://www.coursera.org/learn/machine-learning
• Introductory course on machine learning, provides necessary background
• https://www.coursera.org/learn/neural-networks
• Course on neural networks – assumes knowledge about machine learning
• http://ufldl.stanford.edu/tutorial/
• Tutorial on deep learning that also covers some simpler machine learning
• http://cs231n.stanford.edu/
• Course on convolutional neural networks
• https://www.udacity.com/course/deep-learning--ud730
• Course on deep learning
• There are many others … just google …
Books
• http://www.deeplearningbook.org/
• Deep Learning: A Practitioner's Approach – not released yet
• Fundamentals of Deep Learning – not released yet
• See more from:
• http://machinelearningmastery.com/deep-learning-books/
Low level libraries
• Theano - http://deeplearning.net/software/theano/
• TensorFlow - https://www.tensorflow.org/get_started/
• Python-based
• Automatic differentiation
• Can use CUDA for computing on the GPU
• Torch – http://torch.ch/
• Based on Lua
• Modular pieces that are easy to combine
• Lots of pretrained models
• See more: https://deeplearning4j.org/compare-dl4j-torch7-pylearn
Higher level libraries
• Keras - https://keras.io/
• Runs on top of Theano and TensorFlow
• Based on python
• Modular
• Supports both convolutional and recurrent networks
• Supports arbitrary connectivity
• Runs on both CPU and GPU
Keras – example code
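The code shown on the original slide is not reproduced here; below is a minimal Keras sketch in the same spirit (the layer sizes and random data are made-up; `epochs` is the Keras 2 spelling, the 2016-era API used `nb_epoch`):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Made-up data: 1000 samples, 20 features, binary labels.
X = np.random.rand(1000, 20)
y = np.random.randint(2, size=(1000, 1))

model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))   # hidden layer
model.add(Dense(1, activation='sigmoid'))               # output layer

model.compile(loss='binary_crossentropy', optimizer='sgd',
              metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=32)
```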
What else?
• Take the Machine Learning course in spring semester
• Use neural networks for your thesis work
• Potential supervisors at UT:
• Kairit Sirts (problems involving natural language)
• Mark Fishel (machine translation)
• Raul Vicente (computational neuroscience)
• Ilya Kuzovkin (computational neuroscience)
• Potential supervisors at TUT:
• Juhan Ernits
• Tanel Alumäe (speech data)
• There are possibly others
In conclusion – deep learning:
• Can be used to solve very complex problems
• Is based on artificial neural networks with many hidden layers
• Each artificial neuron is a simple computational unit
• Neural networks are trained with the gradient descent algorithm
• The backpropagation algorithm is used to compute the gradients with respect to the tunable parameters
• There are many tutorials and online courses about deep learning
• There are various software libraries that make it relatively easy to get started with deep learning