Machine Learning and
Soft Computing
CSCC53
MCA V Sem
2020
Textbook
• Text Book: Machine Learning by Tom M. Mitchell, TMH
• Reference:
S. Russell and P. Norvig, Artificial Intelligence: A
Modern Approach, 3rd Edition, Prentice Hall, 2010.
• Papers: Selected, topic-based papers from
– Journals: Artificial Intelligence, Artificial Intelligence Programming, Machine Learning, IEEE Xplore, Data and Knowledge Engineering, Pattern Recognition, etc.
Other texts
• AI: Structures and Strategies for Complex Problem Solving, George F. Luger, Pearson Ed., 2004, 4th edition
• Artificial Intelligence, Rich, Knight, Nair, McGraw-Hill, 2012, 3rd edition
• Artificial Intelligence and Intelligent Systems, N. P. Padhy, Oxford University Press, 2005, 1st edition
• Artificial Intelligence: A Guide to Intelligent Systems, Michael Negnevitsky, Pearson Ed., 2011, 2nd edition
Artificial Intelligence
• The automation of activities that we associate with
human thinking, activities such as decision making,
problem solving and learning.
• One of the earliest and most significant papers on
machine intelligence, “Computing Machinery and
Intelligence”, was written by the British mathematician Alan
Turing [Turing, 1950] over sixty-five years ago.
• The term AI was coined by John McCarthy in 1956.
• It has stood up well to the test of time, and Turing’s
approach remains universal.
AI techniques
• Knowledge Representation: This technique addresses the problem
of capturing the full range of knowledge required for intelligent
behavior in a formal language, i.e. one suitable for computer
manipulation. Some methods are:
– Predicate Calculus
– Semantic nets (Quillian)
– Frames (Marvin Minsky; to represent common-sense knowledge)
– Conceptual Dependency (Schank; for natural language)
– Scripts (stereotyped sequences of events)
• Search: It is a problem-solving technique that systematically explores a
space of problem states, i.e. successive and alternative stages in the
problem-solving process (a minimal BFS sketch follows at the end of this slide). Some methods are:
– BFS (breadth-first search)
– DFS (depth-first search)
– Best-First Search
– Minimax Search
– Alpha-Beta Cutoff
• Machine Learning: Grew out of work in AI (ANN, GA, HMM,
Reinforcement Learning etc. )
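As a concrete illustration of the search techniques listed above, here is a minimal breadth-first search (BFS) sketch in Python; the graph, start, and goal values are invented purely for illustration and are not part of the course material.

```python
from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search: expand states level by level and return
    a path from start to goal, or None if no path exists."""
    frontier = deque([[start]])   # queue of partial paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

# Hypothetical state space used only for illustration
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["E"], "E": []}
print(bfs(graph, "A", "E"))   # e.g. ['A', 'C', 'E']
```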
AI
• Two notions of AI as defined by Prof.
Andrew Ng, Stanford Univ.
– Artificial Narrow Intelligence
– Artificial General Intelligence
"Machines will not replace physicians, but
physicians using AI will soon replace those
not using it."
-Antonio Di leva
The Lancet
Machine learning
• Machine learning is a type of artificial intelligence (AI)
that provides computers with the ability to learn without
being explicitly programmed.
• Machine learning focuses on the development of
computer programs that can teach themselves to grow
and change when exposed to new data.
Machine Learning
• An important subfield of AI as it has a huge
impact on society.
• How can we build computer systems/programs
that automatically improve with experience, and
• What are the fundamental laws that govern all
learning processes?
Machine Learning
• It is very hard to write programs that solve
problems like recognizing a face.
– We don’t know what program to write because we don’t
know how it’s done.
– Even if we had a good idea about how to do it, the
program might be horrendously complicated.
It is very hard to say what makes a handwritten digit a “2”.
Machine Learning
• Instead of writing a program by hand, we collect lots of
examples that specify the correct output for a given
input.
• A machine learning algorithm then takes these examples
and produces a program that does the job.
– The program produced by the learning algorithm may
look very different from a typical hand-written program.
It may contain millions of numbers.
– If we do it right, the program works for new cases
as well as the ones we trained it on.
Traditional Programming
Traditionally, we input data and a program to a computer to get output:
Data + Program → Computer → Output
Definition of Machine Learning
Traditionally, we input data and a program to a computer to get output:
Data + Program → Computer → Output
In machine learning, the roles of program and output are swapped:
Training: Data + Output → Computer → Program (the learned model)
Testing: new Data + Program (learned model) → Computer → Output
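To make the training/testing picture concrete, here is a minimal sketch of the supervised workflow; it assumes the scikit-learn library is available, and the toy data are invented for illustration. Training consumes data plus the known outputs and produces a "program" (the fitted model); testing applies that program to new data.

```python
from sklearn.linear_model import LogisticRegression

# Training data: each row is an input vector, y_train holds the known outputs
X_train = [[0.1, 1.2], [0.8, 0.4], [0.9, 0.1], [0.2, 0.9]]
y_train = [0, 1, 1, 0]

# Training: data + output -> computer -> program (the fitted model)
model = LogisticRegression()
model.fit(X_train, y_train)

# Testing: new data + program -> output
X_new = [[0.85, 0.2], [0.15, 1.0]]
print(model.predict(X_new))   # expected to print something like [1 0]
```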
Formal Definition of ML given by Tom M.
Mitchell (1997)
• Formally: A computer program is said to learn
from experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E.
• Informally: algorithms that improve on some task
with experience.
Machine Learning
• The world has become immeasurably data-rich
– Human genome is being sequenced
– Vast chemical databases
– Pharmaceutical databases
– Financial databases
– Medical records of patients etc.
• To make sense of the data and to extract information
from it, machine learning is the discipline to turn to.
• ML tasks
– to data mine historical medical records to learn which future
patients will respond best to which treatments,
– to build search engines that automatically customize to their
user’s interests.
Data Mining versus M/c Learning
• The process of machine learning is similar to that of
data mining but with a difference.
• Both systems search through data to look for patterns.
– However, instead of extracting data for human comprehension --
as is the case in data mining applications -- machine learning
uses that data to improve the program's own
understanding.
• Machine learning programs detect patterns in data and
adjust program actions accordingly, e.g.:
– Facebook's News Feed changes according to the user's
personal interactions with other users.
– If a user frequently tags a friend in photos, writes on his wall or
"likes" his links, the News Feed will show more of that friend's
activity in the user's News Feed due to presumed closeness.
Soft Computing
• Soft computing differs from conventional (hard)
computing in that, unlike hard computing, it is tolerant of
– imprecision,
– uncertainty,
– partial truth, and
– approximation.
• In effect, the role model for soft computing is the human
mind.
• The tools and techniques of Soft Computing (SC) are
– Fuzzy Logic (FL), Neural Networks (NN),
– Support Vector Machines (SVM),
– Evolutionary Computation (EC) / Genetic Algorithms, and
– Probabilistic Reasoning (PR)
3 kinds of M/C Learning
• Supervised
– Target labels are present
– sequence of training vectors or patterns, each with an
associated target output vector
• Unsupervised
– Target labels are missing
– A sequence of input vectors is provided but no target
vectors are specified
• Reinforcement
– An agent learns by interacting with its environment:
it tries actions and receives some sort of
evaluation (feedback) from the environment.
Supervised Machine Learning
• To design an ML-based model, we collect lots of
examples that specify the correct output for a given
input.
• A machine learning algorithm then takes these examples
and produces a program that does the job.
– The program produced by the learning algorithm may look very
different from a typical hand-written program. It may contain
millions of numbers.
– If we do it right, the program works for new cases as well as
the ones we trained it on.
Different types of Supervised m/c learning
• Classification: the training data set has target
labels as classes or discrete output values
– E.g. 0 or 1, malignant or benign, type 1/2/3/4 cancer, etc.
• Regression: the goal is to predict a continuous/real-valued
output
– It means approximating a real-valued target.
– E.g. you have some inventory and want to predict how many of
these items will sell over the next 3 months.
– Predict continuous responses, for example changes in
temperature or fluctuations in power demand. Typical
applications include electricity load forecasting and algorithmic
trading (a small regression sketch follows this list).
• Pattern association: the desired o/p is not just a yes or
no but rather a pattern.
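As an illustration of predicting a continuous value, here is a small regression sketch using NumPy (assumed available); the monthly sales numbers are invented purely for illustration.

```python
import numpy as np

# Invented monthly sales figures for one inventory item
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([12.0, 15.5, 18.2, 21.0, 24.1, 26.8])

# Fit a straight line (degree-1 polynomial) to the data
slope, intercept = np.polyfit(months, sales, 1)

# Predict (continuous) sales for the next three months
future = np.array([7, 8, 9])
print(slope * future + intercept)
```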
Some more examples of tasks that are best solved by using a
machine learning algorithm:
• Machine Translation
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a
nuclear power plant
• Prediction:
– Future stock prices
– Future currency exchange rates
Type of prediction
The different types of predictive models are summed up in
the table below:
Outcome:
– Regression: continuous value
– Classifier: class label
Examples:
– Regression: linear/non-linear regression (fitting a line/curve to the data)
– Classifier: logistic regression, SVM, Naive Bayes, backpropagation
Tasks:
– Regression: stock price prediction, electricity load forecasting, predicting changes in temperature or fluctuations in power demand, etc.
– Classifier: classifying whether an email is spam or not, whether a tumour is malignant or benign, whether a website is fraudulent or not, etc.
Type of model
The different models are summed up in
the table below:
Goal:
– Discriminative model: directly estimate P(y|x)
– Generative model: estimate P(x|y), then use it to deduce P(y|x)
What's learned:
– Discriminative model: the decision boundary
– Generative model: the probability distributions of the data
Examples:
– Discriminative model: regressions, SVMs
– Generative model: Naive Bayes, Gaussian Discriminant Analysis
Examples of Supervised learning
• Supervised learning is about Input to Output mapping.
Input → Output (Application)
– Email → Spam? (Yes/No) (spam filtering)
– Audio → text transcript (speech recognition)
– English text → German text (machine translation)
– Ad, user info → Click? (Yes/No) (online advertising)
– Image of phones → Defect? (Yes/No) (visual inspection)
– Image, radar info → position of other cars (self-driving car)
Examples of Un-supervised learning
• text mining
• Web analysis
• marketing
• E.g. targeted marketing, recommender systems, and
customer segmentation are applications of unsupervised
learning.
– Targeted marketing identifies an audience likely to buy services
or products and promotes those services or products to that
audience.
– Once these key groups are recognized, companies develop
marketing campaigns and specific products for those preferred
market segments.
Deep Learning
• Deep learning tasks:
– Image/video clip analysis: face recognition
– Computer vision: object detection, car detection
– Speech: listening to an audio clip and understanding what is said in it
• Deep-learning networks perform automatic feature
extraction without human intervention, unlike most
traditional machine-learning algorithms.
Pre-trained Models
• Pretrained models are a wonderful source of help for
people looking to learn an algorithm or try out an existing
framework.
– E.g. Classify Image Using GoogLeNet, AlexNet, VGG-16, VGG-
19, and DenseNet-201 etc.
– The pretrained networks are trained on more than a million
images and can classify images into 1000 object categories,
such as keyboard, coffee mug, pencil, and many animals.
– The training images are a subset of the ImageNet database
• Transfer Learning
Watch, Attend and Spell S/W System
• Article cutting from
Times Trends
dated March 20,
2017
EEG
• Article cutting
from Times
Trends dated
July 3, 2017
AI and Data
• IBM quits the facial-recognition business
dated: June 10, 2020
source: https://www.theverge.com/2020/6/8/21284683/ibm-no-
longer-general-purpose-facial-recognition-analysis-software
• IBM will no longer sell “general purpose” facial-recognition
technology, chief executive Arvind Krishna wrote in a letter to US
Congress.
• Amazon’s Rekognition is a tool for monitoring people of interest,
and Amazon has doubled down on providing other surveillance
technologies to governments.
• In 2018, nearly 70 civil rights and research organizations wrote a
letter to Amazon CEO Jeff Bezos demanding that Amazon stop
providing face recognition technology to governments. source:
https://www.aclu.org/letter-nationwide-coalition-amazon-ceo-jeff-
bezos-regarding-rekognition/
MIT Technology Review
On Wednesday, June 10, 2020, Amazon shocked civil rights activists and
researchers when it announced that it would place a one-year moratorium on police
use of Rekognition.
AI and data
• In modern artificial intelligence, data rules.
• A.I. software is only as smart as the data used to train it.
• If there are many more white men than black women in the
system, it will be worse at identifying the black women.
• The reason is that society itself is biased, discriminatory,
messy, and unequal, and this is embedded in datasets.
• Under-represented groups just don’t produce enough
data that AI systems can train on.
• During July 2019, Hyderabad airport became the first in
India to launch a voluntary facial recognition system called
DigiYatra, with Bengaluru second and Delhi third.
AI Definitions: Summary
Bayesian Learning
• Provides practical learning algorithms
– Naïve Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities)
• Provides foundations for machine learning
– Evaluating learning algorithms
– Guiding the design of new algorithms
– Learning from models: meta-learning
Dilemma
This person dropped their
ticket in the hallway.
Do you call out
“Excuse me, ma’am!”
or
“Excuse me, sir!”
You have to make a
guess.
Bayesian inference is a way to make guesses about what
your data mean based on sometimes very little data.
Bayesian Inference
• Bayesian inference is a way to capture
common sense.
• It helps you use what you know to make
better guesses.
Conditional probabilities
P(A | B) is the probability of A, given B.
“If I know B is the case, what is the probability that A is also the case?”
P(A | B) is not the same as P(B | A).
P(cute | puppy) is not the same as P(puppy | cute)
If I know the thing I’m holding is a puppy, what is the probability that it is cute?
If I know the thing I’m holding is cute, what is the probability that it is a puppy?
Joint probabilities
● P(A and B) or P(A,B) or P(A with B) or P(A ∩ B)
● P(A,B)=P(A)*P(B|A)
● P(A,B,C)=P(A)*P(B|A)*P(C|A and B)
● P(B,A)=P(B)* P(A|B)
e.g. What is the probability that a person is both a woman and has short
hair?
P(woman with short hair)
= P(woman) * P(short hair | woman)
= .5 * .5 = .25
● P(¬A | B) = 1 - P(A | B)
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Sunny?
P(D1=Sunny) = 0.9, P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8, P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6, P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Sunny) = ?
Answer:
● P(D2=Sunny) = P(D2=Sunny and D1=Sunny) + P(D2=Sunny and D1=Rainy)
= P(D2=Sunny | D1=Sunny) * P(D1=Sunny) + P(D2=Sunny | D1=Rainy) * P(D1=Rainy)
= 0.8 * 0.9 + 0.6 * 0.1
= 0.78
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Rainy?
P(D1=Sunny) = 0.9, P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8, P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6, P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Rainy) = ?
Answer: P(D2=Rainy) = 0.2 * 0.9 + 0.4 * 0.1 = 0.22
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 3 is Sunny?
P(D1=Sunny) = 0.9, P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8, P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6, P(D2=Rainy | D1=Rainy) = 0.4
P(D3=Sunny) = ?
Answer: solve it by applying the same marginalization again, with D3 in place of D2 and D2 in place of D1, using P(D2=Sunny) = 0.78 from the previous solution (a short sketch of the arithmetic follows).
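A few lines of Python, just re-doing the arithmetic above, confirm the Day 2 result and carry the same marginalization forward to Day 3.

```python
# Day 1 prior and (stationary) transition probabilities
p_sunny_d1 = 0.9
p_sunny_given_sunny = 0.8   # P(next day sunny | today sunny)
p_sunny_given_rainy = 0.6   # P(next day sunny | today rainy)

# Marginalize over Day 1 to get Day 2
p_sunny_d2 = p_sunny_given_sunny * p_sunny_d1 + p_sunny_given_rainy * (1 - p_sunny_d1)
print(p_sunny_d2)   # ≈ 0.78, so P(D2=Rainy) ≈ 0.22

# Same step again: marginalize over Day 2 to get Day 3
p_sunny_d3 = p_sunny_given_sunny * p_sunny_d2 + p_sunny_given_rainy * (1 - p_sunny_d2)
print(p_sunny_d3)   # ≈ 0.756
```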
Theorem of Total Probability
• If events A1, …, An are mutually exclusive and exhaustive (their probabilities sum to 1), then
P(B) = Σ_{i=1}^{n} P(B | Ai) P(Ai)
Product Rule: the probability P(A, B) of a
conjunction of two events A and B:
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
● The two joint probabilities over the same two events are always equal:
● P(B, A) = P(A, B)
● P(B) * P(A | B) = P(A) * P(B | A)
Dividing both sides by P(B) gives
P(A | B) = P(B | A) * P(A) / P(B)
Equivalently, P(A | B) = P(A, B) / P(B)
Bayes’ Theorem
P(A | B) = P(B | A) P(A) / P(B)
Properties of Bayes Rule
• Bayes’ rule revises old (unknown) probabilities, called prior
probabilities, in the light of additional information made
available by experiments or past records, to derive a set of
new probabilities known as posterior probabilities.
• Combines prior knowledge and observed data: the prior
probability of a hypothesis is multiplied by the probability
of the observed data given the hypothesis
• Probabilistic hypothesis: outputs not only a
classification, but a probability distribution over all
classes
Bayes Rule example
• There is a specific type of cancer which exists for 1% of population.
Probability of a test coming positive given that one has cancer is 0.9. And the
probability of this test coming out negative given that one doesn’t have
cancer is 0.2. What is the probability that a person has this cancer given that
he just received a positive test.
• Answer:
P(C) = 0.01, P(¬C) = 0.99
P(+ | C) = 0.9, P(- | C) = 0.1
P(+ | ¬C) = 0.2, P(- | ¬C) = 0.8
P(C | +) = ?
P(+) = P(+ | C) P(C) + P(+ | ¬C) P(¬C) = 0.9 * 0.01 + 0.2 * 0.99 = 0.207
P(C | +) = P(+ | C) P(C) / P(+) = 0.9 * 0.01 / 0.207 ≈ 0.0435, i.e. about 4.35%;
even after a positive test the probability of cancer remains low, because the
disease is rare and the test has a fairly high false-positive rate.
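The same arithmetic as a short Python sketch (only the numbers above, no library assumptions):

```python
p_c = 0.01                 # P(C): prior probability of cancer
p_pos_given_c = 0.9        # P(+ | C)
p_pos_given_not_c = 0.2    # P(+ | not C): false-positive rate

# Total probability of a positive test
p_pos = p_pos_given_c * p_c + p_pos_given_not_c * (1 - p_c)

# Bayes' rule: posterior probability of cancer given a positive test
p_c_given_pos = p_pos_given_c * p_c / p_pos
print(p_pos, p_c_given_pos)   # ≈ 0.207 and ≈ 0.0435
```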
Example 2
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+ | cancer) = 0.98, P(- | cancer) = 0.02
P(+ | ¬cancer) = 0.03, P(- | ¬cancer) = 0.97
P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
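Filling in the numbers (a quick completion, easy to check by hand): P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078 and P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 ≈ 0.0298, so P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21. Even with a positive result, the patient most probably does not have cancer.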
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with
observed data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured.
Reasoning Under Uncertainty
Many different types of errors can contribute to
uncertainty.
1. data might be missing or unavailable
2. data might be ambiguous or unreliable due to
measurement errors
3. the representation of data may be imprecise or
inconsistent
4. data may just be user's best guess (random)
5. data may be based on defaults, and defaults
may have exceptions
Approaches in Dealing with
Uncertainty
Numerically oriented (quantitative) methods:
• Bayes’ Rule
• Certainty Factors
• Dempster-Shafer theory
• Fuzzy Sets
Symbolic (qualitative) approaches:
• Non-monotonic reasoning
• Cohen’s Theory of Endorsements
• Fox’s semantic systems
Naïve Bayesian Classifier
● It is based on Bayes’ theorem with an independence
assumption between predictors.
● A naïve Bayesian model is easy to build, with no
complicated iterative parameter estimation, which makes it
particularly useful for very large datasets.
● Despite its simplicity, it often does surprisingly well and is
widely used because it often outperforms more
sophisticated classification methods.
Naïve Bayesian Classifier
● Let D=training set of tuples, each tuple is represented by
n-dimensional vector X=(x1, x2, x3….xn)
● Let there be m classes, C1, C2, ….Cm
● Given a tuple X, naïve Bayesian classifier predicts that
tuple X belongs to class Ci
iff P(Ci|X) > P(Cj|X) for 1≤j ≤ m, j ≠i
i.e. choose the class Ci that maximizes P(Ci | X); since P(X) is the same for all classes, this is equivalent to maximizing P(X | Ci) P(Ci)
Bayesian Theorem
• Given training data D, the posterior probability of a
hypothesis h, P(h | D), follows from Bayes’ theorem:
P(h | D) = P(D | h) P(h) / P(D)
• MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h)
• Practical difficulty: requires initial knowledge of
many probabilities and has a significant computational
cost
Estimating a-posteriori
probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
Basic Approach
Bayes Rule: P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
The Goal of Bayesian Learning: the most probable hypothesis given the training data
(Maximum A Posteriori hypothesis h_MAP):
h_MAP = argmax_{h∈H} P(h | D)
= argmax_{h∈H} P(D | h) P(h) / P(D)
= argmax_{h∈H} P(D | h) P(h)
Source: http://artint.info/2e/html/ArtInt2e.Ch10.S1.SS2.html#p1
MAP (Maximum A Posteriori hypothesis) Learner
For each hypothesis h in H, calculate the posterior probability
P(h | D) = P(D | h) P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability
h_MAP = argmax_{h∈H} P(h | D)
Comments:
Computationally intensive
Providing a standard for judging the performance
of learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
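A brute-force MAP learner can be sketched directly from this formula; the hypothesis space and probability values below are invented toy numbers, not part of the original slides.

```python
# Toy hypothesis space: prior P(h) and likelihood P(D | h) for observed data D
hypotheses = {
    "h1": {"prior": 0.6, "likelihood": 0.1},
    "h2": {"prior": 0.3, "likelihood": 0.5},
    "h3": {"prior": 0.1, "likelihood": 0.9},
}

# h_MAP = argmax_h P(D | h) P(h)   (P(D) is the same for every h, so it can be dropped)
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])
print(h_map)   # h2, since 0.5 * 0.3 = 0.15 is the largest product
```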
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
• E.g. P(class=N | outlook=sunny, windy=true,…)
• Idea: assign to sample X the class label C such
that P(C|X) is maximal
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class
C
• If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density
function
• Computationally easy in both cases
Day Outlook Temperature Humidity Wind Play ball
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Naive Bayesian Classifier (II)
• Given a training set, we can compute the
probabilities
Outlook Y N Humidity Y N
sunny 2/9 3/5 high 3/9 4/5
overcast 4/9 0 normal 6/9 1/5
rain 3/9 2/5
Temperature Y N Windy Y N
hot 2/9 2/5 weak 6/9 2/5
mild 4/9 2/5 strong 3/9 3/5
cool 3/9 1/5
Weather example: classifying X
• An unseen sample X = <rain, hot, high, weak>
• P(X | yes)·P(yes) =
P(rain | yes)·P(hot | yes)·P(high | yes)·P(weak | yes)·P(yes) =
3/9 · 2/9 · 3/9 · 6/9 · 9/14 ≈ 0.01058
• P(X | no)·P(no) =
P(rain | no)·P(hot | no)·P(high | no)·P(weak | no)·P(no) =
2/5 · 2/5 · 4/5 · 2/5 · 5/14 ≈ 0.01829
• Since 0.01829 > 0.01058, sample X is classified in class no (don’t play)
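The whole weather computation can be reproduced in a few lines of Python; this is only a sketch of the products above, with the relative frequencies read straight off the table.

```python
# Class priors from the 14 training days
p_yes, p_no = 9/14, 5/14

# Conditional frequencies for X = <rain, hot, high, weak>, per class
p_x_given_yes = (3/9) * (2/9) * (3/9) * (6/9)   # rain, hot, high, weak | yes
p_x_given_no = (2/5) * (2/5) * (4/5) * (2/5)    # rain, hot, high, weak | no

score_yes = p_x_given_yes * p_yes
score_no = p_x_given_no * p_no
print(score_yes, score_no)   # ≈ 0.01058 vs ≈ 0.01829
print("play" if score_yes > score_no else "don't play")   # don't play
```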