NN Unit 1
Rule-based systems rely on explicitly written logic, which is generally designed for
specific tasks and does not transfer to other tasks.
Traditional machine learning algorithms rely on hand-designed feature detection methods
with some generality, such as SIFT and HOG features, which are suitable for a certain class
of tasks. However, performance depends heavily on how those features are designed.
The emergence of neural networks has made it possible for computers to learn such
features automatically, without human intervention.
Shallow neural networks typically have limited feature extraction capability, while deep
neural networks are capable of extracting high-level, abstract features and have better
performance.
In 1958, the American psychologist Frank Rosenblatt proposed the first neuron model
that can automatically learn weights, called the perceptron. The error between the
output value o and the true value y is used to adjust the weights of the neuron
{w1, w2, …, wn}. Rosenblatt then implemented the perceptron model in hardware as
the “Mark 1 Perceptron”. The input is an image sensor with 400 pixels, and the output
has eight nodes; it could eventually recognize English letters.
The main flaw of linear models such as the perceptron is that they cannot handle even
simple linearly inseparable problems such as XOR. This directly led to a difficult period for
perceptron-related research on neural networks. It is generally considered that 1969–1982
was the first winter of artificial intelligence.
Although this was a tough period for AI, many significant studies were still
published, such as the backpropagation (BP) algorithm, which is the core foundation of
modern deep learning algorithms. In fact, the mathematical idea of the BP algorithm had
been derived as early as the 1960s, but it had not yet been applied to neural networks at that
time.
In 1974, the American scientist Paul Werbos first proposed in his doctoral dissertation that
the BP algorithm could be applied to neural networks. Unfortunately, this result did not
receive enough attention. In 1986, David Rumelhart et al. published a paper in Nature using
the BP algorithm for feature learning. Since then, the BP algorithm has gained
widespread attention.
During the second wave of the artificial intelligence renaissance, from 1982 to
1995, convolutional neural networks, recurrent neural networks, and the backpropagation
algorithm were developed. In 1986, David Rumelhart, Geoffrey Hinton, and others
applied the BP algorithm to multilayer perceptrons.
In 1989, Yann LeCun and others applied the BP algorithm to handwritten digit image
recognition and achieved great success; the resulting network is known as LeNet. The LeNet
system was successfully commercialized in zip code recognition, bank check recognition, and
many other systems.
In 1997, one of the most widely used recurrent neural network variants, Long Short-Term
Memory (LSTM), was proposed by Jürgen Schmidhuber. In the same year, the bidirectional
recurrent neural network was also proposed.
Unfortunately, the study of neural networks then entered another tough period with the rise
of traditional machine learning algorithms represented by support vector machines (SVMs),
which is known as the second winter of artificial intelligence.
Support vector machines have a rigorous theoretical foundation, require relatively few
training samples, and also have good generalization capabilities. In contrast, neural
networks at the time lacked a theoretical foundation and were hard to interpret; deep
networks were difficult to train, and their performance was unremarkable.
1.4.2 DEEP NEURAL NETWORKS ALONG WITH ITS DEVELOPMENT
TIMELINE:
We divide the development of neural networks into a shallow neural network stage and a
deep learning stage, with 2006 as the dividing point.
In 2006, Geoffrey Hinton found that multilayer neural networks can be trained better
through layer-by-layer pre-training and achieved a lower error rate than SVM on the
MNIST handwritten digit dataset, ushering in the third artificial intelligence
revival.
In 2011, Xavier Glorot proposed the Rectified Linear Unit (ReLU) activation function,
which is now one of the most widely used activation functions.
In 2012, Alex Krizhevsky proposed an eight-layer deep neural network AlexNet, which
used the ReLU activation function and Dropout technology to prevent overfitting.
Since the AlexNet model was developed, various models have been published
successively, including VGG series, GoogleNet series, ResNet series, and DenseNet
series.
In 2014, Ian Goodfellow proposed generative adversarial networks (GANs), which
learn the true distribution of samples through adversarial training in order to generate
highly realistic samples. Since then, a large number of GAN models have been
proposed.
In 2016, DeepMind applied deep neural networks to the field of reinforcement learning
and proposed the DQN algorithm, which achieved a level comparable to or even higher
than that of humans in 49 games on the Atari game platform.
In the field of Go, the AlphaGo and AlphaGo Zero programs from DeepMind
successively defeated top human Go players Lee Sedol, Ke Jie, and others.
On the multi-agent collaboration platform of the game Dota 2, the OpenAI Five
program developed by OpenAI defeated the TI8 champion team OG in a restricted
game environment, showing a large number of professional, high-level intelligent
operations. Figure 1-9 lists the major milestones of AI development between 2006
and 2019.
Timeline for deep learning development:
2. DEEP LEARNING CHARACTERISTICS
2.1 CHARACTERISTICS OF MODERN DEEP LEARNING ALGORITHMS:
Compared with traditional machine learning algorithms and shallow neural networks, modern
deep learning algorithms usually have the following characteristics, which make them the
preferred choice for many tasks:
DATA VOLUME:
Early machine learning algorithms are relatively simple and fast to train, and the size of
the required dataset is relatively small, such as the Iris flower dataset.
With the development of computer technology, the designed algorithms are more and
more complex, and the demand for data volume is also increasing. With the rise of
neural networks, especially deep learning networks, the number of network layers and
model parameters are large.
To prevent overfitting, the size of the training dataset usually needs to be huge. Although
deep learning has a high demand for large datasets, collecting data, especially collecting
labeled data, is often very expensive.
The formation of a dataset usually requires manual collection, crawling of raw data and
cleaning out invalid samples, and then annotating the data samples with human
intelligence, so subjective bias and random errors are inevitably introduced.
COMPUTING POWER:
The increase in computing power is an important factor in the third artificial intelligence
renaissance.
The real potential of deep learning was not realized until the release of AlexNet.
Traditional machine learning algorithms do not have stringent requirements on data
volume and computing power like deep learning. But deep learning relies heavily on
parallel acceleration computing devices.
Most of the current neural networks use parallel acceleration chips such as NVIDIA
GPU and Google TPU to train model parameters.
For example, the AlphaGo Zero program needs to be trained on 64 GPUs from scratch
for 40 days before surpassing all AlphaGo historical versions. At present, the deep
learning acceleration hardware devices that ordinary consumers can use are mainly from
NVIDIA GPU graphics cards.
A comparison of the GFLOPS (billions of floating-point operations per second) of
NVIDIA GPUs and x86 CPUs from 2008 to 2017 shows that the x86 CPU curve grows
relatively slowly, while the floating-point computing capacity of NVIDIA GPUs grows
exponentially, driven mainly by the growing demands of gaming and deep learning
computing.
NETWORK SCALE:
Early perceptron models and multilayer neural networks had only one to four
layers, and the number of network parameters was around tens of thousands.
With the development of deep learning and the improvement of computing capabilities,
models such as AlexNet (8 layers), VGG16 (16 layers), GoogleNet (22 layers),
ResNet50 (50 layers), and DenseNet121 (121 layers) have been proposed successively,
while the input image size has gradually increased from 28×28 to 224×224, to 299×299,
and even larger.
The increase of network scale enhances the capacity of the neural networks
correspondingly, so that the networks can learn more complex data modalities and the
model performance can be improved accordingly.
On the other hand, the increase in network scale also means that we need more
training data and computational power to avoid overfitting.
GENERAL INTELLIGENCE:
Designing a universal intelligent mechanism that can automatically learn and self-adjust
like the human brain has always been the common vision of human beings.
Deep learning is one of the algorithms closest to general intelligence. In the computer
vision field, previous methods that need to design features for specific tasks and add a
priori assumptions have been abandoned by deep learning algorithms.
At present, almost all algorithms in image recognition, object detection, and semantic
segmentation are based on end-to-end deep learning models, which present good
performance and strong adaptability.
On the Atari game platform, the DQN algorithm designed by DeepMind can reach
human-level performance in 49 games using the same algorithm, model structure, and
hyperparameter settings, showing a certain degree of general intelligence.
DQN network structure:
Image classification
• The input of the neural network is pictures, and the output value is the
probability that the current sample belongs to each category.
• Generally, the category with the highest probability is selected as the
predicted category of the sample.
Chatbot
• Mainstream task of natural language processing.
• Machines automatically learn to talk to humans, provide satisfactory
automatic responses to simple human demands, and improve customer
service efficiency and service quality.
• Neural network is trained to generate appropriate responses for input
questions.
• Chatbot is often used in consulting systems, entertainment systems, and
smart homes.
Virtual Games
• Virtual game platforms can both train and test reinforcement learning
algorithms.
• The neural network trains and learns in a complex environment on the basis
of a reward function.
• They avoid interference from irrelevant factors while also minimizing the cost
of experiments.
Fraud detection:
Fraud is a growing problem in the digital world. In 2021,
consumers reported 2.8 million cases of fraud to the Federal Trade
Commission. Identity theft and imposter scams were the two most
common fraud categories.
To help prevent fraud, companies like Signifyd use deep
learning to detect anomalies in user transactions. Those companies
deploy deep learning to collect data from a variety of sources,
including the device location, length of stride and credit card
purchasing patterns to create a unique user profile.
Mastercard has taken a similar approach, leveraging its Decision
Intelligence and AI Express platforms to more accurately detect
fraudulent credit card activity. And for companies that rely on e-
commerce, Riskified is making consumer finance easier by
reducing the number of bad orders and chargebacks for merchants.
Relevant companies: Neurala, ZeroEyes, Motional
Computer Vision:
Image classification is a common classification task. The
input of the neural network is a picture, and the output value is the
probability that the current sample belongs to each category.
Generally, the category with the highest probability is selected as the
predicted category of the sample.
Image recognition is one of the earliest successful applications of deep
learning. Classic neural network models include VGG series, Inception
series, and ResNet series.
Object detection refers to the automatic detection of the approximate
location of common objects in a picture by an algorithm. It is usually
represented by a bounding box and classifies the category information of
objects in the bounding box. Common object detection algorithms are
RCNN, Fast RCNN, Faster RCNN, Mask RCNN, SSD, and YOLO
series.
Semantic segmentation is an algorithm to automatically segment and
identify the content in a picture. We can understand semantic
segmentation as the classification of each pixel and analyze the category
information of each pixel. Common semantic segmentation models
include FCN, U-net, SegNet, and DeepLab series.
Video Understanding. As deep learning achieves better results on 2D
picture–related tasks, 3D video understanding tasks with temporal
dimension information (the third dimension is sequence of frames) are
receiving more and more attention. Common video understanding tasks
include video classification, behavior detection, and video subject
extraction. Common models are C3D, TSN, DOVF, and TS_LSTM.
Image generation learns the distribution of real pictures and samples
from the learned distribution to obtain highly realistic generated pictures.
At present, common image generation models include VAE series and
GAN series. Among them, the GAN series of algorithms have made
great progress in recent years.
Agriculture
Agriculture will remain a key source of food production in the
coming years, so people have found ways to make the process
more efficient with deep learning and AI tools. AI is used to detect
intrusive wild animals, forecast crop yields and power self-driving
machinery.
Blue River Technology has explored the possibilities of self-driven
farm products by combining machine learning, computer vision
and robotics. The results have been promising, leading to smart
machines — like a lettuce bot that knows how to single out weeds
for chemical spraying while leaving plants alone. In addition,
companies like Taranis blend computer vision and deep learning to
monitor fields and prevent crop loss due to weeds, insects and
other causes.
Relevant Companies: Blue River Technology, Taranis
Autonomous vehicles
Driving is all about taking in external factors like the cars around you,
street signs and pedestrians and reacting to them safely to get from point
A to B. While we’re still a ways away from fully autonomous vehicles,
deep learning has played a crucial role in helping the technology come to
fruition.
It allows autonomous vehicles to take into account where you want to go,
predict what the obstacles in your environment will do and create a safe
path to get you to that location.
Relevant companies: Zoox, Tesla, Waymo
Climate change
Organizations are stepping up to help people adapt to quickly accelerating
environmental change. One Concern has emerged as a climate
intelligence leader, factoring environmental events such as extreme
weather into property risk assessments.
Meanwhile, NCX has expanded the carbon-offset movement to include
smaller landowners by using AI technology to create an affordable carbon
marketplace.
Entertainment
Streaming platforms aggregate tons of data on what content you choose
to consume and what you ignore. Take Netflix as an example. The
streaming platform uses machine learning to find patterns in what its
viewers watch so that it can create a personalized experience for its users.
Relevant companies: Amazon, Netflix
VARIOUS DEEP LEARNING FRAMEWORKS:
1) Theano is one of the earliest deep learning frameworks. It was developed by
Yoshua Bengio and Ian Goodfellow. It is a Python-based computing library
oriented toward low-level operations. Theano supports both GPU and CPU
computation. Due to Theano’s low development efficiency, long model
compilation times, and its developers switching to TensorFlow, Theano is no
longer maintained.
2) Scikit-learn is a complete computing library for machine learning
algorithms. It has built-in support for common traditional machine learning
algorithms, and it has rich documentation and examples. However, scikit-
learn is not specifically designed for neural networks. It does not support
GPU acceleration, and the implementation of neural network–related layers
is also lacking.
3) Caffe was developed by Jia Yangqing in 2013. It is mainly used for
applications using convolutional neural networks and is not suitable for other
types of neural networks. Caffe’s main development language is C++, and it
also provides interfaces for other languages such as Python. It supports both
GPU and CPU. Due to its early development and high visibility in industry,
Facebook launched an upgraded version of Caffe, Caffe2, in 2017. Caffe2 has
now been integrated into the PyTorch library.
4) Torch is a very good scientific computing library, developed in the less
popular programming language Lua. Torch is highly flexible, and it is
easy to implement custom network layers, a strength that PyTorch also
inherited. However, due to the small number of Lua users, Torch never
achieved mainstream adoption.
5) Keras is a high-level framework implemented based on the underlying
operations provided by frameworks such as Theano and TensorFlow. It
provides a large number of high-level interfaces for rapid training and
testing. For common applications, developing with Keras is very efficient.
But because there is no low-level implementation, the underlying framework
needs to be abstracted, so the operation efficiency is not high, and the
flexibility is average.
Keras can be understood as a set of high-level API design specifications.
Keras itself has an official implementation of the specifications. The same
specifications are also implemented in TensorFlow, which is called the
tf.keras module, and tf.keras will be used as the unique high-level interface
to avoid interface redundancy.
6) TensorFlow is a deep learning framework released by Google in 2015. The
initial version only supported symbolic programming. Thanks to its earlier
release and Google’s influence in the field of deep learning, TensorFlow
quickly became the most popular deep learning framework. However, due to
frequent changes in the interface design, redundant functional design, and
difficulty in symbolic programming development and debugging,
TensorFlow 1.x was once criticized by the industry. In 2019, Google
launched the official version of TensorFlow 2, which runs in dynamic-graph
priority mode and avoids many of the defects of TensorFlow 1.x.
TensorFlow 2 has been widely recognized by the industry. At present,
TensorFlow and PyTorch are the two most widely used deep learning
frameworks in industry. TensorFlow has a complete solution and user base
in the industry. Thanks to its streamlined and flexible interface design,
PyTorch can quickly build and debug networks, which has received rave
reviews in academia. The release of TensorFlow 2 has made it easier for
users to learn TensorFlow and to seamlessly deploy models to production.
BIOLOGICAL NEURON VS BASIC NEURON MODEL:
Neurons are the building blocks of the nervous system. They receive and transmit signals to
different parts of the body. This is carried out in both electrical and chemical forms.
Structure of a biological Neuron:
Artificial Neuron:
An artificial neuron is a connection point in an artificial neural network. Artificial neural
networks, like the human body's biological neural network, have a layered architecture and
each network node (connection point) has the capability to process input and forward output
to other nodes in the network.
Structure of an Artificial Neuron:
Artificial Neural Network:
An Artificial Neural Network (ANN) in its basic form is a neural network based on a
feed-forward strategy: information is passed through the nodes continuously until it
reaches the output node. This is also known as the simplest type of neural network.
NEURON MODEL:
• In 1943, the psychologist Warren McCulloch and mathematical logician Walter
Pitts proposed a mathematical model of artificial neural networks to simulate the
mechanism of biological neurons.
• This research was further developed by the American neurologist Frank Rosenblatt
into the perceptron model which is also the cornerstone of modern deep learning.
• The neuron input vector x = [x1, x2, x3, …, xn]T maps to y through function fθ : x →
y, where θ represents the parameters in the function f.
• The parameters θ = {w1, w2, w3, …, wn, b} determine the state of the neuron, and
the processing logic of this neuron can be determined by fixing those parameters.
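As an illustration (not from the original notes), a minimal NumPy sketch of such a neuron, assuming a simple step activation for f:

import numpy as np

def neuron(x, w, b):
    # Weighted sum z = w1*x1 + ... + wn*xn + b, followed by an activation.
    z = np.dot(w, x) + b
    return 1.0 if z > 0 else 0.0  # a simple step activation, chosen here for illustration

# The parameters theta = {w, b} fix the neuron's processing logic.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron(x, w, b))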
3.1 REGRESSION : NEURON MODEL
Regression is a ML algorithm which is used to find the relationship between a
dependent variable and other independent variables.
1. If we consider simple linear regression, there is a single independent (input) variable.
2. When the number of input nodes n = 1 (single input), the neuron model can
be further simplified as
y = wx + b
w - slope of the straight line
b - bias (intercept) of the straight line
3. Then we can plot the change of y as a function of x , which is in the form of a
straight line.
4. In order to estimate the value of w and b, we only need to sample any two
data points (x(1), y(1)) and (x(2), y(2)) from the straight line.
If the estimation is based on the two blue rectangular data points, the estimated
blue dotted line would have a larger deviation from the true orange straight line.
• In order to reduce the estimation bias introduced by observation errors,
we can sample multiple data points.
• We should find the best straight line, so that it minimizes the
sum of errors between all sampling points and the straight line.
• Due to the existence of observation errors, there may not be a
straight line that perfectly passes through all the sampling points
D.
• Therefore, we should be able to find a good straight line close to all
sampling points with the help of the Mean Squared Error (MSE):
L = (1/n) Σ_{i=1..n} (w·x(i) + b − y(i))²
• Then we search for a set of parameters w∗ and b∗ that minimizes the total error L.
• The straight line corresponding to the minimal total error is the optimal
straight line we are looking for, that is
(w∗, b∗) = argmin_{w,b} L
• For a single-input neuron model without noise, only two samples are needed to obtain the exact
solution of the equations by the elimination method. With observation errors and many samples,
however, an exact solution generally does not exist, so we turn to numerical optimization.
• Because the computer’s calculation speed is very fast, we can use this powerful
computing power to “search” and “try” many times, thereby reducing the error L step by
step.
BRUTE-FORCE ALGORITHM:
• To find the most suitable w∗ and b∗, we can randomly sample candidate values of w and b from the real
number space and calculate the error value L of the corresponding model.
• We then pick out the smallest error L∗ from all the experiments, and its corresponding w∗ and b∗
are the optimal parameters we are looking for.
• However, this brute-force search is not suitable for optimizing neural network models with massive
data; a more efficient method is needed (a small sketch of the brute-force idea follows below).
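A minimal sketch of this random-search idea (illustrative only; the points array is assumed to hold the [x, y] samples generated later in these notes):

import numpy as np

def mse_loss(w, b, points):
    # Mean squared error of the line y = w*x + b over all sample points.
    x, y = points[:, 0], points[:, 1]
    return np.mean((w * x + b - y) ** 2)

def brute_force_search(points, trials=10000):
    best_w, best_b, best_loss = None, None, float('inf')
    for _ in range(trials):
        # Randomly sample candidate parameters from a bounded real interval.
        w = np.random.uniform(-10., 10.)
        b = np.random.uniform(-10., 10.)
        loss = mse_loss(w, b, points)
        if loss < best_loss:
            best_w, best_b, best_loss = w, b, loss
    return best_w, best_b, best_loss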
• The concept of the derivative can be used to find the maximum and minimum values of a function.
• For example, in the interval x ∈ [−10, 10], f(x) can be drawn as a blue solid line and its derivative as a
yellow dotted line.
• The gradient of a function is defined as a vector of partial derivatives of the function on each
independent variable.
z = f(x, y)
• Let ∇f = (∂z/∂x, ∂z/∂y).
• Example: f(x, y) = −(x² + y²)
• NOTE: The direction of the arrow always points to the direction where the function value
increases.The steeper the function surface, the longer the length of the arrow, and the
larger the modulus of the gradient.
• We have observed that gradient direction of the function always points to the direction in
which the function value increases.
• The opposite direction of the gradient points to the direction in which the function
value decreases:
x' = x − η · ∇f
• We iteratively update x using the above equation to obtain smaller and smaller
function values.
x' = x − η · ∇f
Here η is known as the LEARNING RATE.
It is used to scale the gradient vector.
It is generally set to a small value such as 0.01 or 0.001.
NOTE: FOR ONE-DIMENSIONAL FUNCTIONS, the above formula can be written as:
x' = x − η · (dy/dx)
The method of optimizing parameters by the formula θ' = θ − η · ∇f is called the
gradient descent algorithm.
It calculates the gradient ∇f of the function f .
It iteratively updates the parameters θ to obtain the optimal numerical solution of the
parameters θ when the function f reaches its minimum value.
NOTE: model input in deep learning is generally represented as x and the parameters
to be optimized are generally represented by θ, w, and b.
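Before returning to the regression model, a tiny numerical illustration of the one-dimensional update rule above (the function f(x) = x² and the learning rate are chosen purely for illustration, not taken from the notes):

# Gradient descent on f(x) = x**2, whose derivative is 2*x.
x = 8.0    # initial guess
lr = 0.1   # learning rate (eta)
for step in range(50):
    grad = 2 * x          # df/dx at the current x
    x = x - lr * grad     # x' = x - eta * df/dx
print(x)  # close to 0, the minimum of f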
Let’s go back to the REGRESSION MODEL, where optimization of the MSE is needed:
L = (1/n) Σ_{i=1..n} (w·x(i) + b − y(i))²
We will apply the gradient descent algorithm to calculate the optimal parameters w∗
and b∗ by minimizing the mean squared error L.
The model parameters that need to be optimized are w and b, and they are
updated using w' = w − η · ∂L/∂w and b' = b − η · ∂L/∂b
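For reference, the gradients used in these update rules can be worked out directly from L (a short derivation added here for clarity, written in LaTeX notation):

L = \frac{1}{n}\sum_{i=1}^{n}\left(w x^{(i)} + b - y^{(i)}\right)^{2}

\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\left(w x^{(i)} + b - y^{(i)}\right) x^{(i)}

\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\left(w x^{(i)} + b - y^{(i)}\right)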
FEATURES OF VANILLA, STOCHASTIC, AND MINI-BATCH
GRADIENT DESCENT ALGORITHMS:
Vanilla (Batch) Gradient Descent: slow and computationally expensive.
Stochastic Gradient Descent (SGD): faster and less computationally expensive than Vanilla GD.
Mini-Batch Gradient Descent: computation time is lower than SGD, and computation cost is lower than Vanilla Gradient Descent.
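To make the distinction concrete, a hedged sketch (not from the original notes) of how many samples each variant uses per parameter update; the function name and batch size are illustrative:

import numpy as np

def select_batch(data, variant, batch_size=32):
    # data: array of shape [n, 2] with (x, y) samples.
    n = len(data)
    if variant == 'vanilla':      # full-batch GD: all samples per update
        return data
    if variant == 'sgd':          # stochastic GD: one random sample per update
        return data[np.random.randint(n)][None, :]
    if variant == 'mini-batch':   # mini-batch GD: a small random subset per update
        idx = np.random.choice(n, batch_size, replace=False)
        return data[idx]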
import numpy as np
data = []  # A list to save data samples
for i in range(100):  # repeat 100 times
    # Randomly sample x from a uniform distribution
    x = np.random.uniform(-10., 10.)
    # Randomly sample noise from a Gaussian distribution
    eps = np.random.normal(0., 0.01)
    # Calculate model output with random errors
    y = 1.477 * x + 0.089 + eps
    data.append([x, y])  # save to data list
data = np.array(data)  # convert to 2D Numpy array
Generating Data:
The program starts by importing the necessary libraries, particularly NumPy for numerical
operations.
It initializes an empty list data to store data samples.
It generates 100 data samples by iterating over a loop 100 times:
Randomly samples x from a uniform distribution between -10 and 10.
Randomly samples ε (error term) from a Gaussian distribution with mean 0 and standard
deviation 0.01.
Calculates the model output y=1.477×x+0.089+ε with random errors.
Appends the data sample [x, y] to the data list.
Finally, it converts the data list into a 2D NumPy array.
Defines a function mse(b, w, points) to calculate the mean squared error given the current values of
w and b and the dataset. It iterates over all data points, calculates the squared error for each point,
and sums them up. It returns the mean of the squared errors.
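A sketch of what such an mse function might look like, consistent with the description above (the exact code in the original source may differ):

def mse(b, w, points):
    # Mean squared error of y = w*x + b over all sample points.
    totalError = 0
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (w * x + b)) ** 2
    return totalError / float(len(points))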
Defines a function step_gradient(b_current, w_current, points, lr) to perform one step of gradient
descent. It calculates the gradients for b and w using the partial derivatives of the loss function with
respect to b and w (derived from the mean squared error), and then updates the values of b and w
using the gradients and the learning rate (lr).
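A sketch of the step_gradient function described above (the gradients follow the derivation given earlier; the exact source code may differ):

def step_gradient(b_current, w_current, points, lr):
    b_gradient = 0
    w_gradient = 0
    n = float(len(points))
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # Partial derivatives of the MSE loss with respect to b and w.
        b_gradient += (2 / n) * ((w_current * x + b_current) - y)
        w_gradient += (2 / n) * x * ((w_current * x + b_current) - y)
    # Update parameters in the direction opposite to the gradient.
    new_b = b_current - lr * b_gradient
    new_w = w_current - lr * w_gradient
    return [new_b, new_w]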
Main Function:
Defines a main function to run the program. It generates a new dataset with 100 samples, sets the
learning rate (lr), the initial values for b and w, and the number of iterations. It then calls
gradient_descent to train the model and obtain the optimal b and w values, calculates the final loss
using the optimal parameters, and prints the results.
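A sketch of the training loop and main function described above (it assumes the data array, mse, and step_gradient defined earlier; the names mirror the description and the exact source code may differ):

def gradient_descent(points, starting_b, starting_w, lr, num_iterations):
    b, w = starting_b, starting_w
    for step in range(num_iterations):
        b, w = step_gradient(b, w, np.array(points), lr)
        if step % 50 == 0:  # print progress occasionally
            print(f"iteration:{step}, loss:{mse(b, w, points)}, w:{w}, b:{b}")
    return [b, w]

def main():
    lr = 0.01            # learning rate
    initial_b = 0        # initial value of b
    initial_w = 0        # initial value of w
    num_iterations = 1000
    [b, w] = gradient_descent(data, initial_b, initial_w, lr, num_iterations)
    loss = mse(b, w, data)
    print(f"Final loss:{loss}, w:{w}, b:{b}")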
CLASSIFICATION
• Aim: We will take 0–9 digital picture recognition as an example to explore how to use
machine learning to solve the classification problem.
• MNIST is a handwritten digit dataset; the images are generally scaled to a fixed size, such as 28×28
pixels. For simplicity, only the grayscale information is retained.
• These pictures are used as the input data x, and every picture is labeled with its digit.
• The MNIST dataset contains real handwritten pictures of the digits 0–9. Each digit has a total
of 7,000 pictures, collected from different writing styles.
• Of these, 60,000 images are used for training and 10,000 for testing.
NOTE:
• Generally, pixel values are integers ranging from 0 to 255 to express color intensity
information.
• For example, 0 represents the lowest intensity, and 255 indicates the highest intensity.
• If it is a color picture, each pixel contains the intensity information of the three channels R,
G, and B, which, respectively, represent the color intensity of colors red, green, and blue.
• Therefore, each pixel of a color picture is represented by a one-dimensional vector with
three elements, which represent the intensity of R, G, and B colors.
• A grayscale picture only needs a two-dimensional matrix with shape [h, w] or a three-
dimensional tensor with shape [h, w, 1] to represent its information.
• Consider the matrix content of a picture of the digit 8: black pixels are represented by 0, grayscale
information is represented by values from 0 to 255, and the whiter pixels in the picture correspond
to larger values in the matrix.
• For multiple pictures, we can add one more dimension in front and use a tensor of shape [b,
h, w] to represent them.
• Color pictures can be represented by a tensor with the shape of [b, h, w, c], where c
represents the number of channels, which is 3 for color pictures.
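As an illustration, loading MNIST with TensorFlow and inspecting the tensor shapes described above (a minimal sketch, assuming TensorFlow/Keras is available):

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape)   # (60000, 28, 28) -> [b, h, w] grayscale training images
print(x_test.shape)    # (10000, 28, 28)
print(x_train.min(), x_train.max())  # pixel intensities in the range 0..255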
CLASSIFICATION- BUILDING A MODEL
• We take x as the input. For a single scalar input, the model can be expressed as
y = xw + b, where x, y, w, and b are all scalars.
• For a vector input x and vector output y, the model is written in vector form:
y = Wx + b
• For multiple-output and batch training, we write the model in batch form:
Y = X@W +b
• X has shape [b, din], b is the number of samples and din is the length of each sample
• The @ symbol means matrix multiplication. Since the result of the operation X @ W is a
matrix of shape [b, dout], it cannot be directly added to the vector b.
• Therefore, the + sign in batch form needs to support broadcasting, that is, expand the vector
b into a matrix of shape [b, dout] by replicating b.
• Let’s build a neural network of this form with 3 inputs and 2 outputs.
• A grayscale image is stored using a matrix with shape [h, w], and b pictures are stored using
a tensor with shape [b, h, w].
• However, our model can only accept vectors, so we need to flatten the [h, w] matrix into a
vector of length [h ⋅ w]. Thus the length of the input features din = h ⋅ w.
• The output can be represented as a vector of length dout, where dout equals the number
of categories.
• For example, if a sample belongs to the first category, then the corresponding index is set
to 1, and the other positions are set to 0 (one-hot encoding); a code sketch of these steps follows this list.
• For classification problems, our goal is to maximize a performance metric, such as
accuracy.
• However, accuracy is a discrete, non-differentiable function of the model parameters, so
the gradient descent algorithm cannot be used to optimize it directly.
• For the error calculation of a classification problem, it is more common to use the cross-
entropy loss function instead of the mean squared error loss function introduced in the
regression problem.
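A minimal TensorFlow sketch tying these pieces together — flattening the images, one-hot encoding the labels, computing the batch form X@W + b with broadcasting, and evaluating the cross-entropy loss (shapes and variable names are illustrative, not taken from the original notes):

import tensorflow as tf

# A toy batch: b=8 grayscale images of shape [28, 28] and their integer labels.
x = tf.random.uniform([8, 28, 28])
y = tf.constant([3, 1, 4, 1, 5, 9, 2, 6])

X = tf.reshape(x, [-1, 28 * 28])      # flatten [h, w] into a vector, d_in = 784
y_onehot = tf.one_hot(y, depth=10)    # true class index set to 1, others 0

W = tf.Variable(tf.random.normal([784, 10], stddev=0.1))
b = tf.Variable(tf.zeros([10]))

logits = X @ W + b                    # b is broadcast to shape [b, d_out]
loss = tf.reduce_mean(
    tf.losses.categorical_crossentropy(y_onehot, logits, from_logits=True))
print(X.shape, y_onehot.shape, logits.shape, float(loss))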
The perception and decision-making of the human brain are far more complex than what a linear model can express.
2. Complexity:
• The preceding solution only uses a one-layer neural network model composed of a small
number of neurons.
• Compared with the 100 billion neuron interconnection structure in the human brain, its
generalization ability is obviously weaker.
• The distribution of sampling points with observation errors is plotted. The actual distribution
may be a quadratic parabolic model.
• If you use a linear model to fit the data, it is difficult to learn a good model.
• If you use a suitable polynomial function model to learn, such as a quadratic polynomial, you
can learn a suitable model.
• But when the model is too complex, such as a ten-degree polynomial, it is likely to overfit
and hurt the generalization ability of the model
So what is the solution?
NON LINEAR MODEL
• Since a linear model is not feasible, we can embed a nonlinear function in the linear model
and convert it to a nonlinear model.
o= σ(Wx+b)
Activation Function:
• The ReLU function only retains the positive part of function y = x and sets the negative part
to be zeros.
• It has a unilateral suppression characteristic. Although simple, the ReLU function has
excellent nonlinear characteristics, easy gradient calculation, and stable training process.
• It is one of the most widely used activation functions for deep learning models.
o = ReLU(Wx + b)
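A one-line NumPy illustration of the ReLU definition (added for clarity):

import numpy as np

def relu(x):
    # Keep the positive part of x and set the negative part to zero.
    return np.maximum(x, 0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0. 0. 0. 1.5 3.]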
Model Complexity:
• To increase the model complexity, we can repeatedly stack multiple transformations such
as:
h1 = ReLU(W1·x + b1)
h2 = ReLU(W2·h1 + b2)
o = W3·h2 + b3
• We take the output value h1 of the first-layer neurons as the input of the second-layer neurons.
• Then we take the output h2 of the second-layer neurons as the input of the third-layer neurons.
• We call the layer where the input node x is located the input layer.
• The output of each nonlinear module hi along with its parameters Wi and bi is called a
network layer.
• In particular, the layer in the middle of the network is called the hidden layer, and the last
layer is called the output layer.
• This network structure formed by the connection of a large number of neurons is called a
neural network.
• The number of nodes in each layer and the number of layers determine the complexity of
the neural network.
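A NumPy sketch of stacking the three transformations above into a small network (the layer sizes here are chosen purely for illustration):

import numpy as np

def relu(z):
    return np.maximum(z, 0)

d_in, d_h1, d_h2, d_out = 4, 8, 6, 2
# Randomly initialized parameters; in practice they are learned by gradient descent.
W1, b1 = np.random.randn(d_h1, d_in), np.zeros(d_h1)
W2, b2 = np.random.randn(d_h2, d_h1), np.zeros(d_h2)
W3, b3 = np.random.randn(d_out, d_h2), np.zeros(d_out)

x = np.random.randn(d_in)     # input layer
h1 = relu(W1 @ x + b1)        # first hidden layer
h2 = relu(W2 @ h1 + b2)       # second hidden layer
o = W3 @ h2 + b3              # output layer
print(o.shape)                # (2,)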
OPTIMISATION METHOD:CLASSIFICATION
• For a simple model, we can directly derive the partial derivative expressions of the loss
function, calculate the gradient at each step, and update the parameters w and b using the
gradient descent algorithm.
• As complex nonlinear functions are embedded, the number of network layers and the length
of data features also increase.
• The model becomes very complicated, and it is difficult to manually derive the gradient
expressions.
• once the network structure changes, the model function and corresponding gradient
expressions also change.
• Therefore, it is obviously not feasible to rely on the manual calculation of the gradient.
• With the help of autodifferentiation technology, deep learning frameworks build the
neural network’s computational graph while computing each layer’s output and the
corresponding loss function, and can then automatically calculate the gradient of any
parameter.
• Users only need to set up the network structure, and the gradients are calculated and
applied automatically, which is very convenient and efficient to use.
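A minimal TensorFlow sketch of this automatic differentiation mechanism (the function here is a toy example chosen for illustration):

import tensorflow as tf

w = tf.Variable(2.0)
b = tf.Variable(1.0)
x = tf.constant(3.0)

with tf.GradientTape() as tape:   # build the computational graph while computing
    y = w * x + b
    loss = tf.square(y - 10.0)
# Gradients of the loss with respect to any parameter are computed automatically.
dw, db = tape.gradient(loss, [w, b])
print(float(dw), float(db))       # -18.0, -6.0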
HANDS ON – CLASSIFICATION
• The output of the input layer serves as the input to the first hidden layer. Let this hidden vector be
h1 ∈ R^256, so the output of the first hidden layer has dimension 256.
• Since we are using the TensorFlow deep learning framework, this layer can be written as a single
line of code:
layers.Dense(256, activation='relu')
• The next step is: how do we build a multi-layer network, for example a three-layer network?
# Build a 3-layer network. The output of the 1st layer is the input of the 2nd layer.
model = keras.Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(10)])
• After building the three-layer neural network, given the input x, we can call model(x) inside a
gradient tape to get the model output o, calculate the current loss L, and obtain the gradients:
with tf.GradientTape() as tape:
    out = model(x)
    loss = tf.reduce_mean(tf.square(out - y_onehot))
grads = tape.gradient(loss, model.trainable_variables)
• Then we use the optimizer object to automatically update the model parameters θ:
optimizer.apply_gradients(zip(grads, model.trainable_variables))
• After multiple iterations, the learned model fθ can be used to predict the categorical
probability of unknown pictures.
• Because the three-layer neural network has relatively strong generalization ability and the
task of handwritten digit picture recognition is relatively simple, the training error
decreases quickly.
• We can test the model’s accuracy and other metrics after several epochs to monitor the
training progress (see the sketch below).
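A small sketch of how such an accuracy check might look (the model, a test batch x, and its integer labels y are assumed to come from the surrounding training code; names are illustrative):

import tensorflow as tf

out = model(x)                                   # logits for a test batch, shape [b, 10]
pred = tf.argmax(out, axis=1)                    # predicted class per sample
correct = tf.equal(tf.cast(pred, tf.int32), tf.cast(y, tf.int32))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print('test accuracy:', float(accuracy))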