Machine Learning
What is Machine Learning?
• Machine learning - term coined around 1960.
• Composed of two words—
– machine corresponds to a computer, robot, or
other device,
– learning refers to an activity intended to acquire
or discover event patterns, which we humans are
good at.
Want Machines to Learn?
• Computers and robots can work 24/7 and need very little maintenance; they don't get tired, don't need breaks, call in sick, or go on strike.
• They are well suited to sophisticated problems that involve a variety of huge datasets or complex calculations.
• Machines driven by algorithms designed by humans are able to
learn latent rules and inherent patterns and to fulfill tasks desired
by humans.
• Learning machines are better suited than humans for tasks that
are routine, repetitive, or tedious.
Evolution of Machine Learning
• Manually defining, maintaining, and updating rules
becomes more and more expensive over time.
• Enumerating all possible patterns for a dynamic, real-time activity or event is not practically feasible.
• It is much easier and more efficient to develop learning
rules or algorithms which command computers to learn and
extract patterns, and to figure things out themselves from
abundant data.
Emerging School of Thought
• Active learning or human-in-the-loop – advocates
combining the efforts of machine learners and humans.
• The idea is that routine, boring tasks are more suitable for computers, while creative tasks are more suitable for humans.
• According to this philosophy, machines are able to learn by following rules (algorithms) designed by humans, and to carry out the repetitive and logical tasks desired by humans.
Overview of Machine Learning
• Machine learning, which mimics human intelligence, is a subfield of artificial intelligence.
• It’s closely related to linear algebra, probability
theory, statistics, and mathematical optimization.
• Machine learning models are built on statistics, probability theory, and linear algebra, and are then tuned using mathematical optimization.
Overview of Machine Learning
• Machine learning definition (Tom Mitchell):
A computer program is said to learn from
experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
Examples of Machine Learning
• “Is this cancer?”, “What is the market value of this
house?”, “Which of these people are good friends with
each other?”, “Will this rocket engine explode on take
off?”, “Will this person like this movie?”, “Who is
this?”, “What did you say?”, and “How do you fly this
thing?”
• Playing computer games, recognizing spoken words, driving autonomous vehicles, classifying/recognizing structures, and so on.
Overview of Machine Learning
• Input: numeric, textual, audio, or visual data.
• Machine learning model: explore and construct algorithms that learn from historical data and perform on new data; define a loss or cost function to optimize the goal of learning.
• Output: prediction or classification.
Categories of Machine Learning
• Depending on the nature of the learning data, machine learning tasks can be classified as:
– Supervised Learning
– Unsupervised Learning
– Reinforcement Learning
Supervised Learning
• The general rule of the learning goal is to map input to output.
• The learning data is labeled data – it comes with descriptions, targets, or desired outputs along with the indicative signals. The labels are usually provided by event logging systems or human experts.
• The learned rule is then used to label new data with unknown
outputs.
• Used in daily applications such as face and speech recognition, product or movie recommendations, and sales forecasting.
Supervised Learning
• We can further subdivide supervised learning
into regression and classification.
• Regression trains on and predicts a continuous-
valued response, for example predicting house
prices.
• Classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment or predicting loan defaults.
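For a concrete feel of the two subtypes, here is a minimal sketch (not from the slides) assuming scikit-learn is installed; the tiny house-price and loan datasets are made up purely for illustration.

# Regression vs. classification with scikit-learn (illustrative toy data).
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value, e.g. house price from area (sq. ft.)
areas = [[1000], [1500], [2000], [2500]]
prices = [200000, 270000, 340000, 410000]
reg = LinearRegression().fit(areas, prices)
print(reg.predict([[1800]]))        # a continuous-valued prediction

# Classification: predict a discrete label, e.g. loan default (1) vs. repaid (0)
incomes = [[20000], [35000], [60000], [90000]]
defaulted = [1, 1, 0, 0]
clf = LogisticRegression().fit(incomes, defaulted)
print(clf.predict([[40000]]))       # a class-label prediction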
Unsupervised Learning
• The learning goal is to understand the data, learn from it and
then produce the results.
• The learning data is unlabeled data – it contains only indicative signals without any description attached. We need to find the structure underlying the data, to discover hidden information, or to determine how to describe the data.
• Unsupervised learning can be used to detect anomalies, such
as fraud or defective equipment, or to group customers with
similar online behaviors for a marketing campaign.
Semi-supervised Learning
• Learning data is partially labeled.
• It makes use of a typically large amount of unlabeled data for training, alongside a small amount of labeled data.
• Applied in cases where it is expensive to acquire a fully labeled dataset but more practical to label a small subset.
• For example, it often requires skilled experts to label
hyperspectral remote sensing images, and lots of field
experiments to locate oil at a particular location, while
acquiring unlabeled data is relatively easy.
Reinforcement Learning
• Learning data provides feedback so that the system
adapts to dynamic conditions in order to achieve a
certain goal.
• The system evaluates its performance based on the
feedback responses and reacts accordingly.
• The best known instances include self-driving cars and the champion Go program AlphaGo.
Machine Learning Algorithms
• Logic-based learning
– They used basic rules specified by human experts, and with these rules,
systems tried to reason using formal logic, background knowledge, and
hypotheses.
• Statistical Learning
• Artificial Neural Networks
– They imitate animal brains, and consist of interconnected neurons that are also
an imitation of biological neurons. They try to model complex relationships
between inputs and outputs and to capture patterns in data.
• Genetic Algorithms
– They mimic the biological process of evolution and try to find the optimal solutions using methods such as mutation and crossover.
Data for Machine Learning
• Good thing – we have a lot of data in the world.
• Bad thing – hard to process this data.
• Challenges – diversity and noisiness of the data.
• We humans usually process data coming in through our ears and eyes. These inputs are transformed into electrical or chemical signals.
• Computers too can process electrical signals.
Data for Machine Learning
• Data for ML is represented either as numbers,
images, or text.
• Images and text are not very convenient, so they
need to be transformed into numerical values.
Training, Testing, Validation Data Sets
• Training Data Set – Practice Questions
– learn something from them and hopefully are able to apply this
knowledge to other similar questions.
– ML models derive patterns from these.
• Testing Data Set – Actual Exams
– Models are applied to them to evaluate how well they perform.
• Validation Data Set – Mock Tests
– To assess how well we will do in the actual exams and to aid revision.
– Verify how well the models will perform in a simulated setting; we then fine-tune the models accordingly in order to achieve greater hits.
Data for Machine Learning
• The model is given example input values and example output values. Or, if we are more ambitious, we can feed the program the actual inputs and let the machine process the data further on its own, just as an autonomous car does not need a lot of human input.
Practice Questions
• Question 1 -------- Ans is option A
• Question 2 -------- Ans is option B
• Question 3 -------- Ans is option A
• Question 4 -------- Ans is option B
• Question 5 -------- Ans is option A
• Even if the questions are not related to potatoes and tomatoes, you may memorize the answers to each question verbatim.
Exam Questions
• Question 1 -------- Option ???
• Question 2 -------- Option B
• Question 3 -------- Option ???
• Question 4 -------- Option ???
• Question 5 -------- Option ???
• We will score very low on the exam questions as it is rare
that the exact same questions will occur in the actual exams.
Overfitting
• The phenomenon of memorization can cause overfitting.
• Extracting too much information from the training sets and making the model work well only on them – this is low bias in machine learning.
• This does not help us generalize to new data and derive patterns from it.
• The model as a result will perform poorly on datasets that were not seen
before – high variance in machine learning.
• This occurs when the learning rules are described based on a relatively
small number of observations, instead of the underlying relationship.
• Also when the model is made excessively complex so that it fits every
training sample, such as memorizing the answers for all questions.
Underfitting
• A model is underfit if it does not perform well on the training sets and will not perform well on the testing sets either.
• It fails to capture the underlying trend of the data.
• This may occur if we do not use enough data to train the model (like failing the exam because we did not review enough material).
• Underfitting will also result if the wrong model is fit to the data (like scoring low in any exercises or exams because we took the wrong approach and learned the material the wrong way).
• This is high bias in machine learning, although its variance is low, as performance on the training and test sets is consistent, in a bad way.
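As a rough illustration of both phenomena (not part of the slides), the following sketch fits polynomials of increasing degree to a small noisy dataset using plain NumPy; the data and degrees are arbitrary assumptions.

# Underfitting vs. overfitting on noisy quadratic data (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 3 * x**2 + rng.normal(scale=0.05, size=x.size)   # true trend is quadratic

for degree in (1, 2, 9):             # too simple, about right, too complex
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.5f}")
# Degree 1 keeps a high error even on the training data (underfitting / high bias);
# degree 9 drives the training error toward zero by memorizing the samples
# (overfitting / high variance), so it will generalize poorly.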
Avoiding Overfitting & Underfitting
• High bias results in underfitting (incorrect assumptions)
• Variance measures how sensitive the model prediction is to
variations in the datasets. High variance causes Overfitting.
• Hence, try to always make both bias and variance as low as
possible.
• In practice, there is an explicit trade-off between the two, where decreasing one increases the other. This is called the bias–variance tradeoff.
Bias-Variance Tradeoff Example
We were asked to build a model to predict the probability of a candidate being
the next president based on the phone poll data. The poll was conducted by zip
codes. We randomly choose samples from one zip code, and from these, we
estimate that there's a 61% chance the candidate will win. However, it turns out
he loses the election. Where did our model go wrong? The first thing we think of
is the small size of samples from only one zip code. It is the source of high bias,
also because people in a geographic area tend to share similar demographics.
However, it results in a low variance of estimates. So, can we fix it simply by using samples from a large number of zip codes? Yes, but this might cause an increased variance of estimates at the same time. We need to find the optimal sample size, the best number of zip codes, to achieve the lowest overall bias and variance.
Avoiding Overfitting with Cross-validation
• Cross-validation data set – mock tests.
• The validation procedure helps evaluate how the models will
generalize to independent or unseen datasets in a simulated
setting.
• The original data is partitioned into three subsets, usually
60% for the training set, 20% for the validation set, and the
rest 20% for the testing set.
• This setting suffices if we have enough training samples after the partition.
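A minimal sketch of such a 60/20/20 split, assuming scikit-learn's train_test_split is available; X and y are placeholder arrays just to keep the example self-contained.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 dummy samples with 2 features
y = np.arange(50)                   # dummy targets

# First hold out 20% as the testing set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ... then split the remaining 80% into 60% training and 20% validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 30, 10, 10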
Cross-validation
Cross-validation
• Testing results from all rounds are averaged to generate a more accurate
estimate of model prediction performance.
• Cross-validation helps reduce variability and therefore limit problems like
overfitting.
Cross-validation – Exhaustive scheme
• In the exhaustive scheme, a fixed number of observations is left out in each round as testing (or validation) samples, and the remaining observations are used as training samples. This process is repeated until every possible subset of samples has been used for testing once.
• Leave-one-out cross-validation (LOOCV) leaves out exactly one observation per round; for a dataset of size n, LOOCV requires n rounds of cross-validation.
• This can become slow when n gets large.
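A short LOOCV sketch, assuming scikit-learn; the linear model and the ten-sample dataset are placeholders, and any estimator could be dropped in.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.arange(20).reshape(10, 2)    # n = 10 samples -> 10 rounds of LOOCV
y = X.sum(axis=1)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(len(scores))      # one score per left-out sample
print(scores.mean())    # averaged over all n rounds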
Cross-validation – Non-Exhaustive scheme
• This scheme does not try out all possible partitions. The most widely used type of this scheme is k-fold cross-validation.
• Common values for k are 3, 5, and 10.
• The k sets of test results are averaged for the purpose of evaluation.
k-fold cross-validation
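A minimal 5-fold cross-validation sketch, assuming scikit-learn; the random data, the logistic-regression model, and k = 5 are illustrative choices.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 3))                 # 100 dummy samples, 3 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # dummy binary labels

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged for the overall estimate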
Cross-validation – Holdout Method
• Randomly split the data into training and testing sets numerous times.
• Problem – some samples may never end up in the testing set, while others may be selected multiple times for the testing set.
Nested Cross-validation
• It is a combination of cross-validations and consists of the following two phases:
– The inner cross-validation is conducted to find the best fit, and is usually implemented as a k-fold cross-validation.
– The outer cross-validation is used for performance evaluation and statistical analysis.
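A compact nested cross-validation sketch, assuming scikit-learn; the SVC model, the C grid, and the fold counts are arbitrary assumptions used only to show the inner/outer structure.

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = (X[:, 0] > 0.5).astype(int)           # dummy labels

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # finds the best fit
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # evaluates that whole procedure

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean())               # performance estimate of the tuned pipeline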
Analogy for Cross-validation
A data scientist plans to take his car to work, and his goal is to arrive before 9 am
every day. He needs to decide the departure time and the route to take. He tries
out different combinations of these two parameters on some Mondays, Tuesdays,
and Wednesdays and records the arrival time for each trial. He then figures out
the best schedule and applies it every day. However, it doesn't work as well as expected. It turns out the scheduling model is overfit to the data points gathered in the first three days and may not work well on Thursdays and Fridays. A better solution would be to test the best combination of parameters derived from Mondays to Wednesdays on Thursdays and Fridays, and to similarly repeat this process based on different sets of learning days and testing days of the week. This analogized cross-validation ensures the selected schedule works for the whole week.
Avoiding Overfitting with Regularization
• Overfitting also occurs due to unnecessary complexity of the
model.
• Linear model (2 parameters) – span a 2-D space.
• Quadratic polynomial (3 parameters) – span a 3-D space.
• High order polynomial function (n parameters) – spans a much
larger space.
– Such models fit the training data more easily,
– but they generalize worse than linear models,
– and hence are more prone to overfitting.
Linear Function Vs. Polynomial Function
• So if we want models that are easy to obtain, their complexity has to be controlled.
• Regularization reduces complexity by imposing penalties on high-order polynomial terms.
• As a result, a less accurate and less strict rule is learned by the model during the training phase.
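To make the penalty idea concrete, here is a rough sketch using ridge regression on top of polynomial features (assuming scikit-learn); the sine-shaped data, degree 9, and the alpha values are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.1, size=15)

for alpha in (0.001, 1.0, 100.0):     # weak, moderate, and heavy penalty
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=alpha))
    model.fit(x, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"alpha={alpha}: largest coefficient magnitude = {np.abs(coefs).max():.2f}")
# A larger alpha shrinks the high-order coefficients, giving a simpler, smoother model
# at the cost of a less exact fit to the training points.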
Regularization Example
• Equip a robotic guard dog with the ability to identify strangers and friends.
• Rules that are too complicated are unlikely to generalize well to new visitors.
• The amount of regularization should be optimal:
– Too little regularization does not have any impact.
– Too much regularization results in underfitting (the model becomes over-simplified and falls short of capturing the data).
Feature Selection & Dimensionality Reduction
• High-dimensional data is
– computationally expensive,
– prone to overfitting due to high complexity,
– impossible to visualize.
• Not all the features are useful, and some may only add randomness to the results.
• Good feature selection is important for better model construction.
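As one simple illustration (not from the slides), univariate feature selection with scikit-learn can rank features and keep only the most informative ones; the synthetic data below has 10 features of which only 2 actually drive the label.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)    # only features 0 and 3 are informative

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
X_reduced = selector.transform(X)          # 200 x 2 matrix for model construction
print(X_reduced.shape)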
Preprocessing, exploration and feature engineering
• A machine learning system isn't able to recognize gibberish, so we need to help it
by cleaning the input data.
• Understand the data – first scan it and/or visualize it.
• A grid of numbers is the most convenient form to process.
• Feature engineering is the process of creating or improving features.
• They are often created based on common sense, domain knowledge, or prior
experience.
• Missing values – ignore them or try imputing (arithmetic mean, median or
mode).
• Encoding (label encoding, one-hot encoding), scaling, polynomial features, power transformations, binning (a short sketch of some of these steps follows below).
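A small sketch of a few of these steps (imputing a missing value, one-hot encoding, label encoding), using pandas and scikit-learn, both assumed available; the toy DataFrame and column names are made up.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "age":    [25, None, 40, 31],                  # contains a missing value
    "city":   ["Pune", "Delhi", "Pune", "Mumbai"],
    "bought": ["yes", "no", "yes", "no"],
})

df["age"] = df["age"].fillna(df["age"].mean())             # impute with the arithmetic mean
df = pd.get_dummies(df, columns=["city"])                  # one-hot encode a categorical column
df["bought"] = LabelEncoder().fit_transform(df["bought"])  # label-encode the target
print(df)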
Why feature scaling?
• Most ML models work with the Euclidean distance between data points.
• Raw features are often not on the same scale, so the distance between such data points can't be computed properly.
• As a result ML models will have problems in learning this type of data.
• Hence feature scaling will transform the data to the same scale.
Why feature scaling?
• If x and y are not on the same scale, then one of them will dominate the other in the distance calculation.
• E.g. – Salary is x, Age is y
• Suppose x1 = 79000, x2 = 48000, y1 = 48, y2 = 27
• The squared differences are:
– (x1 − x2)² = 31000² = 961,000,000
– (y1 − y2)² = 21² = 441
• Here y barely registers for the ML model, as its value is negligible compared to x.
• Hence y will effectively be ignored by the ML model and the result will be wrong.
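A short sketch standardizing the salary/age example above with scikit-learn's StandardScaler (assumed available), so that neither feature dominates the distance.

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[79000, 48],
                 [48000, 27]], dtype=float)    # two points as [salary, age]

scaled = StandardScaler().fit_transform(data)
print(scaled)                                  # each column now has zero mean and unit variance
print(np.sum((scaled[0] - scaled[1]) ** 2))    # salary and age now contribute comparably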
Feature scaling
• Feature scaling is recommended even when the ML model is not based on Euclidean distance (e.g., decision trees).
• It will help the algorithm converge much faster.
How to load data file(s)?
• Input data sets can be in various formats (.XLS, .TXT, .CSV, .JSON).
Load data file from csv & excel
# Import the pandas library
import pandas as pd
# Read the dataset into a DataFrame using pandas
df = pd.read_csv("E:/train.csv")            # load a CSV file
df = pd.read_excel("E:/EMP.xlsx", "Data")   # load the "Data" sheet of an Excel file (this overwrites df)
df.head(3)                                  # show the first three observations
How to convert a variable to different datatype
Converting a variable from one data type to another is an important and common procedure we perform after loading the data.
string_outcome = str(numeric_input)    # converts numeric_input to a string
integer_outcome = int(string_input)    # converts string_input to an integer
float_outcome = float(string_input)    # converts string_input to a float
How to convert character date to Date
from datetime import datetime
char_date = 'Apr 1 2015 1:20 PM' #creating example character date
date_obj = datetime.strptime(char_date, '%b %d %Y %I:%M%p')
print (date_obj)
How to transpose a Data set?
To transpose Table A into Table B on the variable Product, we can use DataFrame.pivot.
# Load Sheet1 of the Excel file transpose.xlsx
df = pd.read_excel("E:/transpose.xlsx", "Sheet1")
print(df)
result = df.pivot(index='ID', columns='Product', values='Sales')
print(result)